Install the specific skill from the multi-skill repository:

`npx skills add 404kidwiz/claude-supercode-skills --skill "performance-engineer"`
# Description
Expert in system optimization, profiling, and scalability. Specializes in eBPF, Flamegraphs, and kernel-level tuning.
# SKILL.md
name: performance-engineer
description: Expert in system optimization, profiling, and scalability. Specializes in eBPF, Flamegraphs, and kernel-level tuning.
Performance Engineer
Purpose
Provides system optimization and profiling expertise specializing in deep-dive performance analysis, load testing, and kernel-level tuning using eBPF and Flamegraphs. Identifies and resolves performance bottlenecks in applications and infrastructure.
When to Use
- Investigating high latency (P99 spikes) or low throughput
- Analyzing CPU/Memory profiles (Flamegraphs)
- Conducting Load Tests (K6, Gatling, Locust)
- Tuning Linux Kernel parameters (sysctl)
- Implementing Continuous Profiling (Parca, Pyroscope)
- Debugging "it works on my machine but is slow in prod" issues
---
2. Decision Framework
Profiling Strategy
What is the bottleneck?
│
├─ **CPU High?**
│ ├─ User Space? → **Language Profiler** (pprof, async-profiler; see the sketch below the tree)
│ └─ Kernel Space? → **perf / eBPF** (System calls, Context switches)
│
├─ **Memory High?**
│ ├─ Leak? → **Heap Dump Analysis** (Eclipse MAT, heaptrack)
│ └─ Fragmentation? → **Allocator tuning** (jemalloc, tcmalloc)
│
├─ **I/O Wait?**
│ ├─ Disk? → **iostat / biotop**
│ └─ Network? → **tcpdump / Wireshark**
│
└─ **Latency (Wait Time)?**
└─ Distributed? → **Tracing** (OpenTelemetry, Jaeger)
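For the user-space CPU branch above, a language-level profiler is usually the first stop. Below is a minimal sketch using Python's built-in `cProfile`/`pstats`; the `handle_request` and `parse_payload` functions are placeholders standing in for your hot path (pprof or async-profiler play the same role for Go and the JVM).

```python
import cProfile
import pstats


def parse_payload(raw: str) -> list[int]:
    # Placeholder for the hot function you suspect
    return [int(x) for x in raw.split(",")]


def handle_request() -> int:
    # Placeholder request handler that exercises the hot path repeatedly
    return sum(sum(parse_payload("1,2,3,4,5")) for _ in range(10_000))


if __name__ == "__main__":
    profiler = cProfile.Profile()
    profiler.enable()
    handle_request()
    profiler.disable()

    # Sort by cumulative time to surface the equivalent of the "wide towers"
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```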
Load Testing Tools
| Tool | Language | Best For |
|---|---|---|
| K6 | JS | Developer-friendly, CI/CD integration. |
| Gatling | Scala/Java | High concurrency, complex scenarios. |
| Locust | Python | Rapid prototyping, code-based tests. |
| Wrk2 | C | Raw HTTP throughput benchmarking (simple). |
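As a starting point for code-based tests, here is a minimal Locust scenario (Python); the host and the `/products` and `/cart` endpoints are placeholders for a real user journey.

```python
from locust import HttpUser, task, between


class ShopperUser(HttpUser):
    host = "https://staging.example.com"  # hypothetical target environment
    wait_time = between(1, 3)  # think time between user actions

    @task(3)
    def browse_products(self):
        self.client.get("/products")

    @task(1)
    def view_cart(self):
        self.client.get("/cart")
```

Run it with `locust -f loadtest.py --headless -u 1000 -r 50` to ramp to 1,000 virtual users at 50 users per second.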
Optimization Hierarchy
- Algorithm: O(n^2) → O(n log n). Biggest wins.
- Architecture: Caching, Async processing.
- Code/Language: Memory allocation, loop unrolling.
- System/Kernel: TCP stack tuning, CPU affinity.
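To make the algorithm level concrete, a hypothetical example of replacing a quadratic duplicate check with an O(n log n) sort-and-scan:

```python
def has_duplicates_quadratic(items: list[int]) -> bool:
    # O(n^2): compares every pair of elements
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False


def has_duplicates_sorted(items: list[int]) -> bool:
    # O(n log n): sort once, then compare adjacent elements
    ordered = sorted(items)
    return any(a == b for a, b in zip(ordered, ordered[1:]))
```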
Red Flags → Escalate to database-optimizer:
- "Slow performance" turns out to be a single SQL query missing an index
- Database locks/deadlocks causing application stalls
- Disk I/O saturation on the DB server
---
3. Core Workflows
Workflow 1: CPU Profiling with Flamegraphs
Goal: Identify which function is consuming 80% CPU.
Steps:
1. **Capture Profile (Linux perf)**

   ```bash
   # Record stack traces at 99Hz for 30 seconds
   perf record -F 99 -a -g -- sleep 30
   ```

2. **Generate Flamegraph**

   ```bash
   perf script > out.perf
   ./stackcollapse-perf.pl out.perf > out.folded
   ./flamegraph.pl out.folded > profile.svg
   ```

3. **Analysis**
   - Open `profile.svg` in a browser.
   - Look for wide towers (functions taking time).
   - Example: `json_parse` is 40% of the width → optimize JSON handling.
---
Workflow 3: Interaction to Next Paint (INP)
Goal: Improve Frontend responsiveness (Core Web Vital).
Steps:
1. **Measure**
   - Use the Chrome DevTools Performance tab.
   - Look for "Long Tasks" (red blocks > 50ms).
2. **Identify**
   - Is it hydration? Event handlers?
   - Example: a click handler forcing a synchronous layout recalculation.
3. **Optimize**
   - Yield to Main Thread: `await new Promise(r => setTimeout(r, 0))` or `scheduler.postTask()`.
   - Web Workers: Move heavy logic off-thread.
---
Workflow 5: Interaction to Next Paint (INP) Optimization
Goal: Fix "Laggy Click" (INP > 200ms) on a React button.
Steps:
1. **Identify Interaction**
   - Use React DevTools Profiler (Interaction Tracing).
   - Find the `click` handler duration.
2. **Break Up Long Tasks**

   ```javascript
   async function handleClick() {
     // 1. UI Update (Immediate)
     setLoading(true);

     // 2. Yield to main thread to let browser paint
     await new Promise(r => setTimeout(r, 0));

     // 3. Heavy Logic
     await heavyCalculation();
     setLoading(false);
   }
   ```

3. **Verify**
   - Use the Web Vitals extension. Check if INP drops below 200ms.
---
5. Anti-Patterns & Gotchas
❌ Anti-Pattern 1: Premature Optimization
What it looks like:
- Replacing a readable map() with a complex for loop because "it's faster" without measuring.
Why it fails:
- Wasted dev time.
- Code becomes unreadable.
- Usually negligible impact compared to I/O.
Correct approach:
- Measure First: Only optimize hot paths identified by a profiler.
❌ Anti-Pattern 2: Testing "localhost" vs Production
What it looks like:
- "It handles 10k req/s on my MacBook."
Why it fails:
- Network latency (0ms on localhost).
- Database dataset size (tiny on local).
- Cloud limits (CPU credits, I/O bursts).
Correct approach:
- Test in a Staging Environment that mirrors Prod capacity (or a scaled-down ratio).
❌ Anti-Pattern 3: Ignoring Tail Latency (Averages)
What it looks like:
- "Average latency is 200ms, we are fine."
Why it fails:
- P99 could be 10 seconds. 1% of users are suffering.
- In microservices, tail latencies multiply.
Correct approach:
- Always measure P50, P95, and P99. Optimize for P99.
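A small illustration of why averages mislead, using made-up latency samples and a nearest-rank percentile helper:

```python
import math
import random

random.seed(1)

# Hypothetical latency samples (ms): most requests are fast, ~2% hit a slow path
latencies = sorted(
    [random.gauss(200, 30) for _ in range(980)]
    + [random.gauss(9000, 500) for _ in range(20)]
)


def percentile(sorted_samples: list[float], p: float) -> float:
    # Nearest-rank percentile over pre-sorted samples
    k = max(0, math.ceil(p / 100 * len(sorted_samples)) - 1)
    return sorted_samples[k]


avg = sum(latencies) / len(latencies)
print(f"avg={avg:.0f}ms  p50={percentile(latencies, 50):.0f}ms  "
      f"p95={percentile(latencies, 95):.0f}ms  p99={percentile(latencies, 99):.0f}ms")
# The average looks healthy while P99 exposes the multi-second tail.
```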
---
Examples
Example 1: CPU Performance Optimization Using Flamegraphs
Scenario: Production API experiencing 80% CPU utilization causing latency spikes.
Investigation Approach:
1. Profile Collection: Used perf to capture CPU stack traces
2. Flamegraph Generation: Created visualization of CPU usage
3. Analysis: Identified hot functions consuming most CPU
4. Optimization: Targeted the top 3 functions
Key Findings:
| Function | CPU % | Optimization Action |
|----------|-------|-------------------|
| json_serialize | 35% | Switch to binary format |
| crypto_hash | 25% | Batch hashing operations |
| regex_match | 20% | Pre-compile patterns |
Results:
- CPU utilization: 80% → 35%
- P99 latency: 1.2s → 150ms
- Throughput: 500 RPS → 2,000 RPS
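For the `regex_match` finding, the fix usually amounts to compiling the pattern once and reusing it in the hot path; a Python sketch with an illustrative pattern (the service's actual language and pattern are assumptions):

```python
import re

# Before: the pattern is re-looked-up (and possibly recompiled) on every call
def extract_ids_slow(line: str) -> list[str]:
    return re.findall(r"id=(\d+)", line)

# After: compile once at import time, reuse the compiled object in the hot path
ID_PATTERN = re.compile(r"id=(\d+)")

def extract_ids_fast(line: str) -> list[str]:
    return ID_PATTERN.findall(line)
```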
Example 2: Distributed Tracing for Microservices Latency
Scenario: Distributed system with 15 services experiencing end-to-end latency issues.
Investigation Approach:
1. Trace Collection: Deployed OpenTelemetry collectors
2. Latency Analysis: Identified service with highest latency contribution
3. Dependency Analysis: Mapped service dependencies and data flows
4. Root Cause: Database connection pool exhaustion
Trace Analysis:
Service A (50ms) → Service B (200ms) → Service C (500ms) → Database (1s, connection pool exhausted)
Resolution:
- Increased connection pool size
- Implemented query optimization
- Added read replicas for heavy queries
Results:
- End-to-end P99: 2.5s → 300ms
- Database CPU: 95% → 60%
- Error rate: 5% → 0.1%
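A minimal sketch of the span instrumentation behind such a trace, using the OpenTelemetry Python SDK with a console exporter for demonstration (a real deployment would export via OTLP to the collector; service and span names are illustrative):

```python
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up a tracer provider; swap ConsoleSpanExporter for an OTLP exporter in production
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("service-c")


def query_database() -> None:
    # The span around the DB call is what exposed the wait on the connection pool
    with tracer.start_as_current_span("db.query") as span:
        span.set_attribute("db.system", "postgresql")
        time.sleep(1.0)  # stand-in for the slow query


def handle_request() -> None:
    with tracer.start_as_current_span("service-c.handle_request"):
        query_database()


if __name__ == "__main__":
    handle_request()
```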
Example 3: Load Testing for Capacity Planning
Scenario: E-commerce platform preparing for Black Friday traffic (10x normal load).
Load Testing Approach:
1. Test Design: Created realistic user journey scenarios
2. Test Execution: Gradual ramp-up to target load
3. Bottleneck Identification: Found breaking points
4. Capacity Planning: Determined required resources
Load Test Results:
| Virtual Users | RPS | P95 Latency | Error Rate |
|---------------|-----|--------------|------------|
| 1,000 | 500 | 150ms | 0.1% |
| 5,000 | 2,400 | 280ms | 0.3% |
| 10,000 | 4,800 | 550ms | 1.2% |
| 15,000 | 6,200 | 1.2s | 5.8% |
Capacity Recommendations:
- Scale to 12,000 concurrent users
- Add 3 more application servers
- Increase database read replicas to 5
- Implement rate limiting at 10,000 RPS
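One way to script the gradual ramp-up is Locust's `LoadTestShape`; the user counts below are borrowed from the table above, while the stage durations and spawn rates are assumptions:

```python
from locust import LoadTestShape


class BlackFridayRamp(LoadTestShape):
    # (end_time_seconds, target_users, spawn_rate); durations are illustrative
    stages = [
        (300, 1_000, 50),
        (600, 5_000, 100),
        (900, 10_000, 100),
        (1200, 15_000, 100),
    ]

    def tick(self):
        run_time = self.get_run_time()
        for end_time, users, spawn_rate in self.stages:
            if run_time < end_time:
                return users, spawn_rate
        return None  # stop the test after the final stage
```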
Best Practices
Profiling and Analysis
- Measure First: Always profile before optimizing
- Comprehensive Coverage: Analyze CPU, memory, I/O, and network
- Production Safe: Use low-overhead profiling in production
- Regular Baselines: Establish performance baselines for comparison
Load Testing
- Realistic Scenarios: Model actual user behavior and workflows
- Progressive Ramp-up: Start low, increase gradually
- Bottleneck Identification: Find limiting factors systematically
- Repeatability: Maintain consistent test environments
Performance Optimization
- Algorithm First: Optimize algorithms before micro-optimizations
- Caching Strategy: Implement appropriate caching layers
- Database Optimization: Indexes, queries, connection pooling
- Resource Management: Efficient allocation and pooling
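As one concrete form of a caching layer, an in-process memoization cache using Python's `functools.lru_cache` (the `load_product` lookup is hypothetical); a shared cache such as Redis plays the same role across instances:

```python
from functools import lru_cache


@lru_cache(maxsize=10_000)
def load_product(product_id: int) -> dict:
    # Placeholder for an expensive lookup (DB query, remote call, heavy parse)
    return {"id": product_id, "name": f"product-{product_id}"}


load_product(42)  # first call does the expensive work
load_product(42)  # repeat is served from the cache
print(load_product.cache_info())  # hits=1 misses=1 ...
```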
Monitoring and Observability
- Comprehensive Metrics: CPU, memory, disk, network, application
- Distributed Tracing: End-to-end visibility in microservices
- Alerting: Proactive identification of performance degradation
- Dashboarding: Real-time visibility into system health
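One way to expose the latency data these dashboards and alerts rely on is a histogram metric; a sketch using the Python `prometheus_client` library, with an illustrative metric name, label, and port:

```python
import random
import time

from prometheus_client import Histogram, start_http_server

# Latency histogram; dashboards and alerts can derive P95/P99 from its buckets
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency in seconds",
    ["endpoint"],
)


@REQUEST_LATENCY.labels(endpoint="/checkout").time()
def handle_checkout() -> None:
    time.sleep(random.uniform(0.01, 0.05))  # placeholder for real work


if __name__ == "__main__":
    start_http_server(8000)  # metrics served at http://localhost:8000/metrics
    while True:
        handle_checkout()
```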
Quality Checklist
Profiling:
- [ ] Symbols: Debug symbols available for accurate stack traces.
- [ ] Overhead: Profiler overhead verified (< 1-2% for production).
- [ ] Scope: Both CPU and Wall-clock time analyzed.
- [ ] Context: Profile includes full request lifecycle.
Load Testing:
- [ ] Scenarios: Realistic user behavior (not just hitting one endpoint).
- [ ] Warmup: System warmed up before measurement (JIT/Caches).
- [ ] Bottleneck: Identified the limiting factor (CPU, DB, Bandwidth).
- [ ] Repeatable: Tests can be run consistently.
Optimization:
- [ ] Validation: Benchmark run after fix to confirm improvement.
- [ ] Regression: Ensured optimization didn't break functionality.
- [ ] Documentation: Documented why the optimization was done.
- [ ] Monitoring: Added metrics to track optimization impact.
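A minimal before/after benchmark pattern for the validation and regression items, sketched with Python's `timeit` and hypothetical implementations:

```python
import timeit

payload = list(range(1_000))


def before() -> int:
    # Original implementation: repeated linear scans (list.index is O(n))
    return sum(payload.index(x) for x in payload)


def after() -> int:
    # Optimized implementation: build a lookup table once
    positions = {x: i for i, x in enumerate(payload)}
    return sum(positions[x] for x in payload)


# Regression check: the optimization must not change the result
assert before() == after()

for name, fn in [("before", before), ("after", after)]:
    total = timeit.timeit(fn, number=100)
    print(f"{name}: {total * 10:.3f} ms/call")
```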
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents.
Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.