npx skills add aj-geddes/useful-ai-prompts --skill "root-cause-analysis"
Install specific skill from multi-skill repository
# Description
Conduct systematic root cause analysis to identify underlying problems. Use structured methodologies to prevent recurring issues and drive improvements.
# SKILL.md
name: root-cause-analysis
description: Conduct systematic root cause analysis to identify underlying problems. Use structured methodologies to prevent recurring issues and drive improvements.
Root Cause Analysis
Overview
Root cause analysis (RCA) identifies underlying reasons for failures, enabling permanent solutions rather than temporary fixes.
When to Use
- Production incidents
- Customer-impacting issues
- Repeated problems
- Unexpected failures
- Performance degradation
Instructions
1. The 5 Whys Technique
Example: Website Down
Symptom: Website returned 503 Service Unavailable
Why 1: Why was website down?
Answer: Database connection pool exhausted
Why 2: Why was connection pool exhausted?
Answer: Queries taking too long, connections not released
Why 3: Why were queries slow?
Answer: Missing index on frequently queried column
Why 4: Why was index missing?
Answer: Performance testing didn't use production-like data volume
Why 5: Why wasn't production-like data used?
Answer: Load testing environment doesn't mirror production
Root Cause: Load testing environment does not mirror production data volume
Solution: Update load testing environment with production-like data
Prevention: Establish environment parity requirements
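The chain above can also be kept as data, which makes it easier to review or attach to a report. The following is a minimal sketch, not part of the skill itself: the `Why` class and the `root_cause` helper are hypothetical names chosen for illustration, and treating the last answer as the root cause is an assumption that holds only when the chain was followed far enough.

```python
from dataclasses import dataclass


@dataclass
class Why:
    """One question/answer step in a 5 Whys chain (hypothetical helper type)."""
    question: str
    answer: str


def root_cause(chain: list[Why]) -> str:
    """Return the answer to the last 'why', the deepest cause found so far."""
    if not chain:
        raise ValueError("the chain needs at least one why")
    return chain[-1].answer


# The "Website Down" example, transcribed as data.
website_down = [
    Why("Why was website down?", "Database connection pool exhausted"),
    Why("Why was connection pool exhausted?",
        "Queries taking too long, connections not released"),
    Why("Why were queries slow?", "Missing index on frequently queried column"),
    Why("Why was index missing?",
        "Performance testing didn't use production-like data volume"),
    Why("Why wasn't production-like data used?",
        "Load testing environment doesn't mirror production"),
]

for depth, why in enumerate(website_down, start=1):
    print(f"Why {depth}: {why.question} -> {why.answer}")
print("Root cause candidate:", root_cause(website_down))
```

Five is a guideline rather than a rule: stop when the answer points at something the team can change in a process or system.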
2. Systematic RCA Process
Step 1: Gather Facts
- When did issue occur?
- Who detected it?
- How many users affected?
- What error messages?
- What system changes deployed?
- Check logs, metrics, alerts
- Determine impact scope
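Checking logs usually starts with narrowing them to the incident window. The sketch below assumes a made-up log format ("ISO timestamp, level, message" separated by spaces) and invented sample lines; `errors_in_window` is a hypothetical helper, and real log formats will need their own parsing.

```python
from datetime import datetime

# Hypothetical sample lines; real logs will look different.
LOG_LINES = [
    "2024-01-15T14:25:10 INFO request ok",
    "2024-01-15T14:31:02 ERROR could not acquire database connection",
    "2024-01-15T14:36:47 ERROR could not acquire database connection",
    "2024-01-15T15:20:00 INFO request ok",
]


def errors_in_window(lines, start, end):
    """Return ERROR lines whose timestamp falls within [start, end]."""
    matches = []
    for line in lines:
        timestamp_text, level, _message = line.split(" ", 2)
        timestamp = datetime.fromisoformat(timestamp_text)
        if level == "ERROR" and start <= timestamp <= end:
            matches.append(line)
    return matches


window_start = datetime(2024, 1, 15, 14, 30)
window_end = datetime(2024, 1, 15, 15, 15)
incident_errors = errors_in_window(LOG_LINES, window_start, window_end)
print(len(incident_errors), "errors in the incident window")
```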
Step 2: Reproduce
- Can we reproduce consistently?
- What are the exact steps?
- What environment (prod, staging)?
- Can we isolate to component?
- Set up test case
Step 3: Identify Contributing Factors
- Direct cause
- Indirect/enabling factors
- System vulnerabilities
- Procedural gaps
- Knowledge gaps
Step 4: Determine Root Cause
- Use 5 Whys technique
- Ask "why did this control fail?"
- Look for systemic issues
- Separate root cause from symptoms
Step 5: Develop Solutions
- Immediate: Fix the symptom
- Short-term: Prevent recurrence
- Long-term: Systemic fix
- Prioritize by impact/effort
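One simple way to prioritize by impact/effort is to score each candidate and sort. This is only a sketch: the 1 to 5 scales, the scores, and the `Solution` class are assumptions made for illustration, and a team may prefer a 2x2 matrix or a weighted formula instead.

```python
from dataclasses import dataclass


@dataclass
class Solution:
    name: str
    impact: int  # assumed scale: 1 (low) to 5 (high)
    effort: int  # assumed scale: 1 (low) to 5 (high)

    @property
    def score(self) -> float:
        """Higher means more impact per unit of effort."""
        return self.impact / self.effort


# Illustrative candidates with invented scores.
candidates = [
    Solution("Add query duration alerts", impact=3, effort=2),
    Solution("Rebuild load testing environment", impact=5, effort=5),
    Solution("Add index for slow query", impact=4, effort=1),
]

ranked = sorted(candidates, key=lambda s: s.score, reverse=True)
for solution in ranked:
    print(f"{solution.score:.2f}  {solution.name}")
```

A low score does not mean a solution should be dropped; long-term systemic fixes are often expensive and still worth scheduling.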
Step 6: Implement & Verify
- Implement solutions
- Test in staging
- Deploy carefully
- Verify improvement
- Monitor metrics
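Verifying improvement is easier with an explicit pass/fail rule. The sketch below compares an error rate before and after a fix; the request counts, the 1% target, and both function names are assumptions for illustration, not values prescribed by this skill.

```python
def error_rate(errors: int, requests: int) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    if requests == 0:
        return 0.0
    return errors / requests


def fix_verified(before: float, after: float, target: float) -> bool:
    """True when the rate dropped and is at or below the agreed target."""
    return after < before and after <= target


# Invented sample counts: 20% errors before the fix, 0.1% after.
before = error_rate(errors=200, requests=1_000)
after = error_rate(errors=1, requests=1_000)
print("fix verified:", fix_verified(before, after, target=0.01))
```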
Step 7: Document & Share
- Write RCA report
- Document lessons learned
- Share with team
- Update procedures
- Training if needed
3. RCA Report Template
RCA Report:
Incident: Database connection failure (2024-01-15, 14:30-15:15)
Impact:
- Duration: 45 minutes
- Users affected: 5,000 (10% of user base)
- Revenue lost: ~$2,000
- Severity: P1 (Critical)
Timeline:
14:30: Automated monitoring alert: High error rate (20%)
14:32: On-call engineer notified
14:35: Identified database connection error in logs
14:40: Restarted database connection pool
14:42: Service recovered, error rate returned to 0.1%
14:50: Incident declared resolved
15:15: Full recovery verified
Root Cause:
Poorly optimized query introduced in release 2.5.0 caused
queries to take 10x longer. Connection pool exhausted as
connections weren't released quickly.
Contributing Factors:
1. No query performance testing pre-deployment
2. Load testing environment doesn't match production volume
3. No alerting on query duration
4. Connection pool timeout set too high
Solutions:
Immediate (Done):
- Rolled back the problematic query change from release 2.5.0
Short-term (1 week):
- Added query performance alerts (>1s)
- Added index for slow query
- Set query timeout to 5 seconds
Long-term (1 month):
- Update load testing with production-like data
- Implement performance benchmarks in CI/CD
- Improve monitoring for connection pool health
- Training on query optimization
Prevention:
- Query performance regression tests
- Load testing with production data
- Connection pool metrics monitoring
- Code review of database changes
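The figures in the Impact section can be derived from the timeline rather than typed by hand, which helps keep a report internally consistent. A minimal sketch using the sample values from the template above; the helper names and the metric labels (`time_to_notify`, `time_to_recover`) are my own, not standard terms from the skill.

```python
from datetime import datetime

DAY = "2024-01-15"  # incident date from the sample report


def at(clock: str) -> datetime:
    """Parse an HH:MM timeline entry on the incident day."""
    return datetime.strptime(f"{DAY} {clock}", "%Y-%m-%d %H:%M")


def minutes_between(start: str, end: str) -> int:
    return int((at(end) - at(start)).total_seconds() // 60)


time_to_notify = minutes_between("14:30", "14:32")   # alert -> on-call notified
time_to_recover = minutes_between("14:30", "14:42")  # alert -> service recovered
total_duration = minutes_between("14:30", "15:15")   # alert -> full recovery verified

users_affected = 5_000
affected_share = 0.10
implied_user_base = round(users_affected / affected_share)

print(f"Notified after {time_to_notify} min, recovered after {time_to_recover} min")
print(f"Total duration: {total_duration} min, user base: {implied_user_base}")
```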
4. Root Cause Analysis Techniques
Fishbone Diagram:
Main problem: Slow API Response
Branches:
Code:
- Inefficient algorithm
- Missing cache
- Unnecessary queries
Data:
- Large dataset
- Missing index
- Slow database
Infrastructure:
- Low CPU capacity
- Slow network
- Disk I/O bottleneck
Process:
- No monitoring
- No load testing
- Manual deployments
People:
- Lack of knowledge
- Lack of tools
- No peer review
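A fishbone diagram is a two-level tree, so a plain dictionary is enough to store and print one. The sketch below transcribes the example above; `render` and `total_causes` are hypothetical helpers added for illustration.

```python
fishbone = {
    "problem": "Slow API Response",
    "branches": {
        "Code": ["Inefficient algorithm", "Missing cache", "Unnecessary queries"],
        "Data": ["Large dataset", "Missing index", "Slow database"],
        "Infrastructure": ["Low CPU capacity", "Slow network", "Disk I/O bottleneck"],
        "Process": ["No monitoring", "No load testing", "Manual deployments"],
        "People": ["Lack of knowledge", "Lack of tools", "No peer review"],
    },
}


def render(diagram: dict) -> str:
    """Format the diagram as an indented text outline."""
    lines = [f"Problem: {diagram['problem']}"]
    for category, causes in diagram["branches"].items():
        lines.append(f"  {category}")
        lines.extend(f"    - {cause}" for cause in causes)
    return "\n".join(lines)


def total_causes(diagram: dict) -> int:
    """Count candidate causes across all branches."""
    return sum(len(causes) for causes in diagram["branches"].values())


print(render(fishbone))
print("Candidate causes to investigate:", total_causes(fishbone))
```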
---
Systemic vs. Individual Causes:
Individual: "Developer used inefficient code"
Fix: Training
Risk: Happens again with a different person
Systemic: "No code review process"
Fix: Implement mandatory code review
Benefit: Prevents similar issues regardless of who writes the code
Prefer systemic solutions for prevention
5. Follow-Up & Prevention
After RCA:
1. Track Action Items
- Assign owner
- Set deadline
- Follow up in retrospective
2. Prevent Recurrence
- Automated tests
- Monitoring/alerts
- Procedural changes
- Training
3. Monitor Metrics
- Track similar incidents
- Verify fix effectiveness
- Monitor preventive measures
- Catch early warnings
4. Share Learnings
- Document incident
- Share with team
- Industry sharing if relevant
- Update procedures
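Tracking action items comes down to recording an owner and a deadline for each one and surfacing what is late. A minimal sketch under assumptions: the `ActionItem` class, the owner names, and the dates are all invented for illustration, and most teams would keep this in an issue tracker rather than in code.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    title: str
    owner: str
    deadline: date
    done: bool = False


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Return open items whose deadline has already passed."""
    return [item for item in items if not item.done and item.deadline < today]


# Invented sample items.
items = [
    ActionItem("Add query duration alert", "alice", date(2024, 1, 22), done=True),
    ActionItem("Add index for slow query", "bob", date(2024, 1, 22)),
    ActionItem("Refresh load testing data", "carol", date(2024, 2, 15)),
]

late = overdue(items, today=date(2024, 2, 1))
for item in late:
    print(f"OVERDUE: {item.title} (owner: {item.owner}, due {item.deadline})")
```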
---
Checklist:
[ ] Incident details documented
[ ] Timeline established
[ ] Logs reviewed
[ ] Metrics analyzed
[ ] Root cause identified (via 5 Whys)
[ ] Contributing factors listed
[ ] Immediate actions completed
[ ] Short-term solutions planned
[ ] Long-term solutions identified
[ ] Solutions prioritized
[ ] RCA report written
[ ] Team debriefing scheduled
[ ] Action items assigned
[ ] Prevention measures planned
[ ] Follow-up scheduled
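If the checklist is kept as plain text in the `[ ]` / `[x]` form shown above, open items can be listed automatically before an RCA is closed. This is a sketch only: `parse_checklist` and `remaining` are hypothetical helpers, and the sample text is a shortened, partly ticked copy of the checklist.

```python
def parse_checklist(text: str) -> list[tuple[str, bool]]:
    """Parse '[ ] label' / '[x] label' lines into (label, done) pairs."""
    items = []
    for raw_line in text.strip().splitlines():
        line = raw_line.strip()
        if line.startswith("[x] "):
            items.append((line[4:], True))
        elif line.startswith("[ ] "):
            items.append((line[4:], False))
    return items


def remaining(items: list[tuple[str, bool]]) -> list[str]:
    """Labels of items that are not yet ticked."""
    return [label for label, done in items if not done]


# Shortened sample with invented completion state.
sample = """
[x] Incident details documented
[x] Timeline established
[ ] Root cause identified (via 5 Whys)
[ ] Follow-up scheduled
"""

parsed = parse_checklist(sample)
print(f"{len(parsed) - len(remaining(parsed))}/{len(parsed)} complete")
for label in remaining(parsed):
    print("TODO:", label)
```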
Key Points
- Distinguish symptom from root cause
- Use 5 Whys technique systematically
- Look for systemic issues, not individual blame
- Focus on prevention, not just fixing
- Document thoroughly for team learning
- Assign clear ownership for solutions
- Follow up to verify effectiveness
- Use RCA to drive improvements