# performance-engineer

A skill by 404kidwiz.
# Install this skill

```bash
npx skills add 404kidwiz/claude-supercode-skills --skill "performance-engineer"
```

This installs the specified skill from the multi-skill repository.

# Description

Expert in system optimization, profiling, and scalability. Specializes in eBPF, Flamegraphs, and kernel-level tuning.

# SKILL.md

---
name: performance-engineer
description: Expert in system optimization, profiling, and scalability. Specializes in eBPF, Flamegraphs, and kernel-level tuning.
---

Performance Engineer

Purpose

Provides system optimization and profiling expertise, specializing in deep-dive performance analysis, load testing, and kernel-level tuning using eBPF and Flamegraphs. Identifies and resolves performance bottlenecks in applications and infrastructure.

When to Use

  • Investigating high latency (P99 spikes) or low throughput
  • Analyzing CPU/Memory profiles (Flamegraphs)
  • Conducting Load Tests (K6, Gatling, Locust)
  • Tuning Linux Kernel parameters via sysctl (see the sketch after this list)
  • Implementing Continuous Profiling (Parca, Pyroscope)
  • Debugging "it works on my machine but is slow in prod" issues
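
A minimal sketch of what sysctl-level tuning can look like for a network-heavy service; every key and value here is illustrative and should be benchmarked before and after, not copied blindly:

```bash
# Illustrative sysctl tunings for a network-heavy host (example values,
# not recommendations; measure before and after applying).
sysctl -w net.core.somaxconn=4096              # larger accept backlog
sysctl -w net.ipv4.tcp_congestion_control=bbr  # BBR often helps on lossy links
sysctl -w net.core.netdev_max_backlog=5000     # NIC ingress queue depth

# Pin a latency-sensitive process to dedicated cores (CPU affinity).
taskset -c 2,3 ./my-server   # ./my-server is a placeholder binary
```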

---

2. Decision Framework

Profiling Strategy

What is the bottleneck?
│
├─ **CPU High?**
│  ├─ User Space? → **Language Profiler** (pprof, async-profiler)
│  └─ Kernel Space? → **perf / eBPF** (system calls, context switches)
│
├─ **Memory High?**
│  ├─ Leak? → **Heap Dump Analysis** (Eclipse MAT, heaptrack)
│  └─ Fragmentation? → **Allocator tuning** (jemalloc, tcmalloc)
│
├─ **I/O Wait?**
│  ├─ Disk? → **iostat / biotop**
│  └─ Network? → **tcpdump / Wireshark**
│
└─ **Latency (Wait Time)?**
   └─ Distributed? → **Tracing** (OpenTelemetry, Jaeger)
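
For the Kernel Space branch above, a minimal sketch using perf plus a standard bpftrace one-liner (assumes bpftrace is installed; the tracepoint shown is part of the stock kernel tracepoint set):

```bash
# Sample on-CPU stacks system-wide at 99 Hz for 10 seconds, then summarize.
perf record -F 99 -a -g -- sleep 10
perf report --stdio | head -40

# bpftrace one-liner: which processes issue the most syscalls?
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
```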

Load Testing Tools

| Tool | Language | Best For |
|------|----------|----------|
| K6 | JavaScript | Developer-friendly, CI/CD integration |
| Gatling | Scala/Java | High concurrency, complex scenarios |
| Locust | Python | Rapid prototyping, code-based tests |
| wrk2 | C | Raw HTTP throughput benchmarking (simple) |
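
For orientation, a minimal K6 script matching the first row of the table (the URL and threshold values are placeholders):

```javascript
// k6 load test sketch: 50 virtual users for 30 seconds.
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  vus: 50,
  duration: '30s',
  thresholds: {
    http_req_duration: ['p(99)<500'], // fail the run if P99 exceeds 500ms
  },
};

export default function () {
  const res = http.get('https://staging.example.com/api/items'); // placeholder URL
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1); // think time between iterations
}
```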

Optimization Hierarchy

  1. Algorithm: O(n^2) → O(n log n). Biggest wins (see the sketch after this list).
  2. Architecture: Caching, Async processing.
  3. Code/Language: Memory allocation, loop unrolling.
  4. System/Kernel: TCP stack tuning, CPU affinity.
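
To make tier 1 concrete, a small JavaScript sketch (data and function names are illustrative): intersecting two arrays drops from O(n·m) to O(n+m) once a Set replaces the inner scan, the same kind of win as O(n^2) → O(n log n).

```javascript
// O(n * m): nested scan; fine for tiny inputs, quadratic blow-up at scale.
function intersectSlow(a, b) {
  return a.filter((x) => b.includes(x));
}

// O(n + m): build a Set once, then probe in constant time per element.
function intersectFast(a, b) {
  const seen = new Set(b);
  return a.filter((x) => seen.has(x));
}
```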

Red Flags → Escalate to database-optimizer:
- "Slow performance" turns out to be a single SQL query missing an index
- Database locks/deadlocks causing application stalls
- Disk I/O saturation on the DB server

---

3. Core Workflows

Workflow 1: CPU Profiling with Flamegraphs

Goal: Identify which function is consuming 80% CPU.

Steps:

  1. Capture Profile (Linux perf)

    ```bash
    # Record stack traces at 99 Hz, system-wide, for 30 seconds
    perf record -F 99 -a -g -- sleep 30
    ```

  2. Generate Flamegraph

    ```bash
    perf script > out.perf
    ./stackcollapse-perf.pl out.perf > out.folded
    ./flamegraph.pl out.folded > profile.svg
    ```

  3. Analysis

    • Open profile.svg in a browser.
    • Look for wide towers (functions where the most time is spent).
    • Example: json_parse is 40% of the width → optimize JSON handling.
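
The stackcollapse-perf.pl and flamegraph.pl helpers used in step 2 come from Brendan Gregg's FlameGraph repository; a typical one-time setup:

```bash
# Fetch the FlameGraph helper scripts (run once)
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
```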

---

Workflow 2: Interaction to Next Paint (INP)

Goal: Improve Frontend responsiveness (Core Web Vital).

Steps:

  1. Measure

    • Use Chrome DevTools Performance tab.
    • Look for "Long Tasks" (Red blocks > 50ms).
  2. Identify

    • Is it hydration? Event handlers?
    • Example: A click handler forcing a synchronous layout recalculation.
  3. Optimize

    • Yield to Main Thread: await new Promise(r => setTimeout(r, 0)) or scheduler.postTask() (see the helper sketch after this list).
    • Web Workers: Move heavy logic off-thread.
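
A hedged sketch of the yield technique from the step above, with a fallback for browsers where the Scheduler API (scheduler.yield) has not shipped:

```javascript
// Yield control back to the main thread so the browser can paint and
// handle input between chunks of work.
function yieldToMain() {
  if (typeof scheduler !== 'undefined' && typeof scheduler.yield === 'function') {
    return scheduler.yield(); // Scheduler API, where available
  }
  return new Promise((resolve) => setTimeout(resolve, 0)); // broad fallback
}

// Usage: process a large list without blocking interaction.
async function processAll(items) {
  for (const item of items) {
    processItem(item); // processItem is a placeholder for your work
    await yieldToMain();
  }
}
```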

---

Workflow 3: Interaction to Next Paint (INP) Optimization

Goal: Fix "Laggy Click" (INP > 200ms) on a React button.

Steps:

  1. Identify Interaction

    • Use React DevTools Profiler (Interaction Tracing).
    • Find the click handler duration.
  2. Break Up Long Tasks
    ```javascript
    async function handleClick() {
      // 1. UI update (immediate)
      setLoading(true);

      // 2. Yield to the main thread so the browser can paint
      await new Promise((r) => setTimeout(r, 0));

      // 3. Heavy logic
      await heavyCalculation();
      setLoading(false);
    }
    ```

  3. Verify

    • Use the Web Vitals extension and check whether INP drops below 200ms.
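
The verification can also be done programmatically with the web-vitals library (assumes the npm package is installed):

```javascript
// Log INP as the user interacts; values under 200ms are considered good.
import { onINP } from 'web-vitals';

onINP((metric) => {
  console.log('INP:', metric.value, 'ms');
});
```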

---

4. Anti-Patterns & Gotchas

โŒ Anti-Pattern 1: Premature Optimization

What it looks like:
- Replacing a readable map() with a complex for loop because "it's faster" without measuring.

Why it fails:
- Wasted dev time.
- Code becomes unreadable.
- Usually negligible impact compared to I/O.

Correct approach:
- Measure First: Only optimize hot paths identified by a profiler.

โŒ Anti-Pattern 2: Testing "localhost" vs Production

What it looks like:
- "It handles 10k req/s on my MacBook."

Why it fails:
- Network latency (0ms on localhost).
- Database dataset size (tiny on local).
- Cloud limits (CPU credits, I/O bursts).

Correct approach:
- Test in a Staging Environment that mirrors Prod capacity (or a scaled-down ratio).

โŒ Anti-Pattern 3: Ignoring Tail Latency (Averages)

What it looks like:
- "Average latency is 200ms, we are fine."

Why it fails:
- P99 could be 10 seconds, meaning 1% of users are suffering.
- In microservices, tail latencies compound: if a request touches 10 services, roughly 1 - 0.99^10 ≈ 10% of requests hit at least one P99-slow call.

Correct approach:
- Always measure P50, P95, and P99. Optimize for P99.
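
As a small illustration of the correct approach, a nearest-rank percentile helper over raw latency samples (in production you would normally rely on your metrics backend for this):

```javascript
// Nearest-rank percentile: percentile(samples, 99) returns the P99 value.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

const latenciesMs = [12, 15, 18, 22, 35, 48, 120, 900]; // illustrative data
console.log('P50:', percentile(latenciesMs, 50), 'ms'); // 22
console.log('P99:', percentile(latenciesMs, 99), 'ms'); // 900
```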

---

Examples

Example 1: CPU Performance Optimization Using Flamegraphs

Scenario: Production API experiencing 80% CPU utilization causing latency spikes.

Investigation Approach:
1. Profile Collection: Used perf to capture CPU stack traces
2. Flamegraph Generation: Created visualization of CPU usage
3. Analysis: Identified hot functions consuming most CPU
4. Optimization: Targeted the top 3 functions

Key Findings:
| Function | CPU % | Optimization Action |
|----------|-------|-------------------|
| json_serialize | 35% | Switch to binary format |
| crypto_hash | 25% | Batch hashing operations |
| regex_match | 20% | Pre-compile patterns |

Results:
- CPU utilization: 80% → 35%
- P99 latency: 1.2s → 150ms
- Throughput: 500 RPS → 2,000 RPS

Example 2: Distributed Tracing for Microservices Latency

Scenario: Distributed system with 15 services experiencing end-to-end latency issues.

Investigation Approach:
1. Trace Collection: Deployed OpenTelemetry collectors
2. Latency Analysis: Identified service with highest latency contribution
3. Dependency Analysis: Mapped service dependencies and data flows
4. Root Cause: Database connection pool exhaustion

Trace Analysis:

Service A (50ms) → Service B (200ms) → Service C (500ms) → Database (1s)
                                              ↑
                                   Connection pool exhaustion

Resolution:
- Increased connection pool size
- Implemented query optimization
- Added read replicas for heavy queries
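
A hedged sketch of the first fix using node-postgres (the pool sizes are illustrative; derive real values from database capacity):

```javascript
const { Pool } = require('pg'); // node-postgres

const pool = new Pool({
  max: 50,                       // raised from the default 10 after exhaustion
  idleTimeoutMillis: 30000,      // recycle idle connections
  connectionTimeoutMillis: 2000, // fail fast instead of queueing forever
});
```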

Results:
- End-to-end P99: 2.5s → 300ms
- Database CPU: 95% → 60%
- Error rate: 5% → 0.1%

Example 3: Load Testing for Capacity Planning

Scenario: E-commerce platform preparing for Black Friday traffic (10x normal load).

Load Testing Approach:
1. Test Design: Created realistic user journey scenarios
2. Test Execution: Gradual ramp-up to target load
3. Bottleneck Identification: Found breaking points
4. Capacity Planning: Determined required resources
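
The gradual ramp-up from step 2 could be expressed as K6 stages like these (durations and targets are illustrative):

```javascript
// k6 ramp profile: climb in steps, hold at peak, then ramp down.
export const options = {
  stages: [
    { duration: '5m', target: 1000 },   // warm-up
    { duration: '10m', target: 5000 },  // steady climb
    { duration: '10m', target: 10000 }, // hold at target load
    { duration: '5m', target: 0 },      // ramp down
  ],
};
```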

Load Test Results:
| Virtual Users | RPS | P95 Latency | Error Rate |
|---------------|-----|--------------|------------|
| 1,000 | 500 | 150ms | 0.1% |
| 5,000 | 2,400 | 280ms | 0.3% |
| 10,000 | 4,800 | 550ms | 1.2% |
| 15,000 | 6,200 | 1.2s | 5.8% |

Capacity Recommendations:
- Scale to 12,000 concurrent users
- Add 3 more application servers
- Increase database read replicas to 5
- Implement rate limiting at 10,000 RPS

Best Practices

Profiling and Analysis

  • Measure First: Always profile before optimizing
  • Comprehensive Coverage: Analyze CPU, memory, I/O, and network
  • Production Safe: Use low-overhead profiling in production
  • Regular Baselines: Establish performance baselines for comparison

Load Testing

  • Realistic Scenarios: Model actual user behavior and workflows
  • Progressive Ramp-up: Start low, increase gradually
  • Bottleneck Identification: Find limiting factors systematically
  • Repeatability: Maintain consistent test environments

Performance Optimization

  • Algorithm First: Optimize algorithms before micro-optimizations
  • Caching Strategy: Implement appropriate caching layers
  • Database Optimization: Indexes, queries, connection pooling
  • Resource Management: Efficient allocation and pooling

Monitoring and Observability

  • Comprehensive Metrics: CPU, memory, disk, network, application
  • Distributed Tracing: End-to-end visibility in microservices
  • Alerting: Proactive identification of performance degradation
  • Dashboarding: Real-time visibility into system health

Quality Checklist

Profiling:
- [ ] Symbols: Debug symbols available for accurate stack traces.
- [ ] Overhead: Profiler overhead verified (< 1-2% for production).
- [ ] Scope: Both CPU and Wall-clock time analyzed.
- [ ] Context: Profile includes full request lifecycle.

Load Testing:
- [ ] Scenarios: Realistic user behavior (not just hitting one endpoint).
- [ ] Warmup: System warmed up before measurement (JIT/Caches).
- [ ] Bottleneck: Identified the limiting factor (CPU, DB, Bandwidth).
- [ ] Repeatable: Tests can be run consistently.

Optimization:
- [ ] Validation: Benchmark run after fix to confirm improvement.
- [ ] Regression: Ensured optimization didn't break functionality.
- [ ] Documentation: Documented why the optimization was done.
- [ ] Monitoring: Added metrics to track optimization impact.

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.