```bash
npx skills add 404kidwiz/claude-supercode-skills --skill "chaos-engineer"
```
Install a specific skill from a multi-skill repository.
# Description
Expert in resilience testing, fault injection, and building anti-fragile systems using controlled experiments.
# SKILL.md
---
name: chaos-engineer
description: Expert in resilience testing, fault injection, and building anti-fragile systems using controlled experiments.
---
# Chaos Engineer
## Purpose
Provides resilience testing and chaos engineering expertise, specializing in fault injection, controlled experiments, and anti-fragile system design. Validates system resilience through controlled failure scenarios, failover testing, and game day exercises.
## When to Use
- Verifying system resilience before a major launch
- Testing failover mechanisms (database, region, zone)
- Validating alert pipelines (did PagerDuty fire?)
- Conducting "Game Days" with engineering teams
- Implementing automated chaos in CI/CD (continuous verification); see the sketch after this list
- Debugging elusive distributed-system bugs (race conditions, timeouts)
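As a rough illustration of continuous verification, here is a minimal CI sketch in GitHub Actions syntax. The workflow trigger, file path, and timing are assumptions, and it presumes `kubectl` is already authenticated against a staging cluster:

```yaml
# Hypothetical continuous-verification job: apply a scoped chaos experiment
# after a push to main, let it run briefly, then clean up.
# Assumes kubectl credentials are preconfigured and that
# chaos/backend-kill.yaml exists in the repository.
name: continuous-verification
on:
  push:
    branches: [main]
jobs:
  chaos-experiment:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Inject fault
        run: kubectl apply -f chaos/backend-kill.yaml
      - name: Let the experiment run
        run: sleep 120
      - name: Remove fault
        if: always()   # clean up even if earlier steps fail
        run: kubectl delete -f chaos/backend-kill.yaml
```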
---
## 2. Decision Framework
### Experiment Design Matrix
```text
What are we testing?
│
├─ Infrastructure Layer
│   ├─ Pods/Containers? → Pod Kill / Container Crash
│   ├─ Nodes? → Node Drain / Reboot
│   └─ Network? → Latency / Packet Loss / Partition
│
├─ Application Layer
│   ├─ Dependencies? → Block Access to DB/Redis
│   ├─ Resources? → CPU/Memory Stress
│   └─ Logic? → Inject HTTP 500 / Delays
│
└─ Platform Layer
    ├─ IAM? → Revoke Keys
    └─ DNS? → Block DNS Resolution
```
### Tool Selection
| Environment | Tool | Best For |
|---|---|---|
| Kubernetes | Chaos Mesh / Litmus | Native K8s experiments (Network, Pod, IO). |
| AWS/Cloud | AWS FIS / Gremlin | Cloud-level faults (AZ outage, EC2 stop). |
| Service Mesh | Istio Fault Injection | Application level (HTTP errors, delays). |
| Java/Spring | Chaos Monkey for Spring | App-level logic attacks. |
### Blast Radius Control
| Level | Scope | Risk | Approval Needed |
|---|---|---|---|
| Local/Dev | Single container | Low | None |
| Staging | Full cluster | Medium | QA Lead |
| Production (Canary) | 1% Traffic | High | Engineering Director |
| Production (Full) | All Traffic | Critical | VP/CTO (Game Day) |
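In Kubernetes tooling, blast radius is typically limited in the experiment spec itself. A minimal Chaos Mesh sketch (namespace and label names are assumptions):

```yaml
# Hypothetical canary-scoped experiment: kill only a fixed percentage
# of matching pods instead of all of them.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: canary-pod-kill
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: fixed-percent      # alternatives: one, all, fixed, random-max-percent
  value: "1"               # affect ~1% of selected pods
  selector:
    namespaces:
      - prod
    labelSelectors:
      app: backend-service
  duration: "30s"
```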
**Red Flags** → Escalate to `sre-engineer`:
- No "Stop Button" mechanism available
- Observability gaps (Blind spots)
- Cascading failure risk identified without mitigation
- Lack of backups for stateful data experiments
---
## 4. Core Workflows
### Workflow 1: Kubernetes Pod Chaos (Chaos Mesh)
**Goal:** Verify that the frontend handles backend pod failures gracefully.
**Steps:**
1. **Define Experiment** (`backend-kill.yaml`)

   ```yaml
   apiVersion: chaos-mesh.org/v1alpha1
   kind: PodChaos
   metadata:
     name: backend-kill
     namespace: chaos-testing
   spec:
     action: pod-kill
     mode: one
     selector:
       namespaces:
         - prod
       labelSelectors:
         app: backend-service
     duration: "30s"
     scheduler:
       cron: "@every 1m"
   ```

2. **Define Hypothesis**
   - If a backend pod dies, then Kubernetes will restart it within 5 seconds, and the frontend will retry failed requests (HTTP 500s) seamlessly (< 1% error rate).
3. **Execute & Monitor**
   - Apply the manifest.
   - Watch the Grafana dashboard: "HTTP 500 Rate" vs. "Pod Restart Count".
4. **Verification**
   - Did the pod restart? Yes.
   - Did users see errors? No (retries worked).
   - Result: **PASS**.
---
### Workflow 3: Zone Outage Simulation (Game Day)
**Goal:** Verify database failover to the secondary region.
**Steps:**
1. **Preparation**
   - Notify the on-call team (Game Day).
   - Ensure primary DB writes are active.
2. **Execution (AWS FIS / manual)** (see the sketch after this list)
   - Block network traffic to Zone A subnets.
   - OR stop the RDS primary instance (simulate a crash).
3. **Measurement**
   - Measure RTO (Recovery Time Objective): how long until the secondary becomes primary? (Target: < 60s.)
   - Measure RPO (Recovery Point Objective): was any data lost? (Target: 0.)
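For the AWS FIS route, the outage can be codified as an experiment template. A minimal CloudFormation sketch, assuming hypothetical role, alarm, tag, and AZ values:

```yaml
# Hypothetical AWS FIS experiment template: stop tagged EC2 instances in one
# availability zone, with a CloudWatch alarm acting as an automatic stop button.
# The role ARN, alarm ARN, tag, and AZ below are placeholders.
Resources:
  ZoneAOutageTemplate:
    Type: AWS::FIS::ExperimentTemplate
    Properties:
      Description: Simulate a Zone A outage by stopping tagged instances
      RoleArn: arn:aws:iam::123456789012:role/fis-experiment-role
      StopConditions:
        - Source: aws:cloudwatch:alarm
          Value: arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighErrorRate
      Targets:
        ZoneAInstances:
          ResourceType: aws:ec2:instance
          SelectionMode: ALL
          ResourceTags:
            chaos-target: "true"
          Filters:
            - Path: Placement.AvailabilityZone
              Values:
                - us-east-1a
      Actions:
        StopInstances:
          ActionId: aws:ec2:stop-instances
          Targets:
            Instances: ZoneAInstances
      Tags:
        Name: zone-a-outage
```

The stop condition doubles as the "stop button": if the alarm fires, FIS halts the experiment automatically.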
---
## 5. Anti-Patterns & Gotchas
### ❌ Anti-Pattern 1: Testing in Production First
**What it looks like:**
- Running a "delete database" script in prod without testing in staging.
**Why it fails:**
- Catastrophic data loss.
- A Resume-Generating Event (RGE).
**Correct approach:**
- Dev → Staging → Canary → Prod.
- Verify the hypothesis in lower environments first.
### ❌ Anti-Pattern 2: No Observability
**What it looks like:**
- Running chaos without dashboards open.
- "I think it worked; the app is slow."
**Why it fails:**
- You don't know why it failed.
- You can't prove resilience.
**Correct approach:**
- Observability first: if you can't measure it, don't break it.
### ❌ Anti-Pattern 3: Random Chaos (Chaos Monkey Style)
**What it looks like:**
- Killing random things constantly, without purpose.
**Why it fails:**
- Causes alert fatigue.
- Doesn't test specific failure modes (e.g., network partition vs. crash).
**Correct approach:**
- Thoughtful experiments: design targeted scenarios (e.g., "What if Redis is slow?"; see the sketch below). Random chaos is for maintenance; targeted chaos is for verification.
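For instance, a targeted "slow Redis" experiment might look like this minimal Chaos Mesh sketch (namespaces, labels, and timings are assumptions):

```yaml
# Hypothetical NetworkChaos: add latency only on traffic from the
# checkout pods to Redis, for a bounded duration.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: redis-latency
  namespace: chaos-testing
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: checkout
  direction: to
  target:
    mode: all
    selector:
      namespaces:
        - staging
      labelSelectors:
        app: redis
  delay:
    latency: "200ms"
    jitter: "50ms"
  duration: "5m"
```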
---
## 7. Quality Checklist
**Planning:**
- [ ] Hypothesis: clearly defined ("If X happens, Y should occur").
- [ ] Blast radius: limited (e.g., 1 zone, 1% of users).
- [ ] Approval: stakeholders notified (or a Game Day scheduled).
**Safety:**
- [ ] Stop button: automated abort script ready.
- [ ] Rollback: a plan to restore state if needed.
- [ ] Backup: data backed up before stateful experiments.
**Execution:**
- [ ] Monitoring: dashboards visible during the experiment.
- [ ] Logging: experiment start/end times logged for correlation.
**Review:**
- [ ] Fix: action items assigned (e.g., in Jira).
- [ ] Report: findings shared with the engineering team.
# Examples
## Example 1: Kubernetes Pod Failure Recovery
Scenario: A microservices platform needs to verify that its cart service handles pod failures gracefully without impacting the user checkout flow.
Experiment Design:
1. Hypothesis: If a cart-service pod is killed, Kubernetes will reschedule it within 5 seconds, and users will see an error rate below 0.1%.
2. Chaos Injection: Use Chaos Mesh to kill random pods in the production namespace.
3. Monitoring: Track error rates, pod restart times, and user-facing failures.
Execution Results:
- Pod restart time: 3.2 seconds average (within SLA)
- Error rate during experiment: 0.02% (below 0.1% threshold)
- Circuit breakers prevented cascading failures
- Users experienced seamless failover
Lessons Learned:
- Retry logic was working but needed exponential backoff
- Added fallback response for stale cart data
- Created runbook for pod failure scenarios
## Example 2: Database Failover Validation
Scenario: A financial services company needs to verify that its multi-region database failover meets an RTO of 30 seconds and an RPO of zero data loss.
Game Day Setup:
1. Preparation: Notified all stakeholders, backed up current state
2. Primary Zone Blockage: Used AWS FIS to simulate zone failure
3. Failover Trigger: Automated failover initiated when health checks failed
4. Measurement: Tracked RTO, RPO, and application recovery
Measured Results:
| Metric | Target | Actual | Status |
|--------|--------|--------|--------|
| RTO | < 30s | 18s | ✅ PASS |
| RPO | 0 data lost | 0 data lost | ✅ PASS |
| Application recovery | < 60s | 42s | ✅ PASS |
| Data consistency | 100% | 100% | ✅ PASS |
Improvements Identified:
- DNS TTL was too high (5 minutes); reduced it to 30 seconds (see the sketch below)
- Application connection pooling needed pre-warming
- Added health check for database replication lag
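To illustrate the TTL fix, a hedged CloudFormation sketch (the zone ID, record name, and target endpoint are placeholders):

```yaml
# Hypothetical DNS record for the database endpoint with the reduced
# 30-second TTL, so clients re-resolve quickly after failover.
Resources:
  DbEndpointRecord:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: Z0000000EXAMPLE
      Name: db.example.internal.
      Type: CNAME
      TTL: "30"
      ResourceRecords:
        - primary-db.us-east-1.rds.amazonaws.com
```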
## Example 3: Third-Party API Dependency Testing
Scenario: A SaaS platform depends on a payment processor API and needs to verify graceful degradation when the API is slow or unavailable.
Fault Injection Strategy:
1. Delay Injection: Use Istio to add 5-10 second delays to payment API calls (see the sketch after this list)
2. Timeout Validation: Verify circuit breakers open within configured timeouts
3. Fallback Testing: Ensure users see appropriate error messages
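A minimal Istio fault-injection sketch for the delay scenario above (the host and namespace names are assumptions):

```yaml
# Hypothetical VirtualService: delay 50% of requests to the payment API
# by 10 seconds to exercise timeouts and circuit breakers.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-api-delay
  namespace: payments
spec:
  hosts:
    - payment-api.payments.svc.cluster.local
  http:
    - fault:
        delay:
          percentage:
            value: 50
          fixedDelay: 10s
      route:
        - destination:
            host: payment-api.payments.svc.cluster.local
```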
Test Scenarios:
- 50% of requests delayed 10s: Circuit breaker opens, fallback shown
- 100% delay: System degrades gracefully with queue-based processing
- Recovery: System reconnects properly after fault cleared
Results:
- Circuit breaker threshold: 5 consecutive failures (needed adjustment)
- Fallback UI: 94% of users completed purchase via alternative method
- Alert tuning: Reduced false positives by tuning latency thresholds
# Best Practices
## Experiment Design
- Start with Hypothesis: Define what you expect to happen before running experiments
- Limit Blast Radius: Always start with small scope and expand gradually
- Measure Steady State: Establish baseline metrics before introducing chaos
- Document Everything: Record experiment parameters, expectations, and outcomes
- Iterate and Evolve: Use findings to design more comprehensive experiments
## Safety and Controls
- Always Have a Stop Button: Can you abort the experiment immediately?
- Define Rollback Plan: How do you restore normal operations?
- Communication: Notify stakeholders before and during experiments
- Timing: Avoid experiments during critical business periods
- Escalation Path: Know when to stop and call for help
## Tool Selection
- Match Tool to Environment: Kubernetes → Chaos Mesh/Litmus, AWS → FIS
- Service Mesh Integration: Use Istio/Linkerd for application-level faults
- Cloud-Native Tools: Leverage managed chaos services where available
- Custom Tools: Build application-specific chaos when needed
- Multi-Cloud: Consider tools that work across cloud providers
## Observability Integration
- Pre-Experiment Validation: Ensure dashboards and alerts are working
- Metrics Collection: Capture before/during/after metrics
- Log Analysis: Review logs for unexpected behavior
- Distributed Tracing: Use traces to understand failure propagation
- Alert Validation: Verify alerts fire as expected during experiments
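As one way to make alert validation concrete, a hedged Prometheus alert rule sketch (metric names, job label, and threshold are assumptions):

```yaml
# Hypothetical alert used to confirm that paging fires when an experiment
# pushes the backend 5xx rate above 1%.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: chaos-alert-validation
  namespace: monitoring
spec:
  groups:
    - name: chaos
      rules:
        - alert: HighBackendErrorRate
          expr: |
            sum(rate(http_requests_total{job="backend-service", code=~"5.."}[1m]))
              / sum(rate(http_requests_total{job="backend-service"}[1m])) > 0.01
          for: 1m
          labels:
            severity: page
          annotations:
            summary: Backend 5xx rate above 1% during chaos experiment
```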
## Cultural Aspects
- Blame-Free Post-Mortems: Focus on system improvement, not finger-pointing
- Regular Game Days: Schedule chaos exercises as routine team activities
- Cross-Team Participation: Include on-call, developers, and operations
- Share Learnings: Document and share experiment results broadly
- Reward Resilience: Recognize teams that build resilient systems
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents. Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.