npx skills add miles-knowbl/orchestrator --skill "debug-assist"
Install specific skill from multi-skill repository
# Description
Systematic bug isolation for when you're stuck. Guides structured debugging using hypothesis generation, binary search, and systematic elimination. Tracks debugging state across turns. Specialized modes for memory issues, concurrency bugs, performance problems, and production incidents.
# SKILL.md
name: debug-assist
description: "Systematic bug isolation for when you're stuck. Guides structured debugging using hypothesis generation, binary search, and systematic elimination. Tracks debugging state across turns. Specialized modes for memory issues, concurrency bugs, performance problems, and production incidents."
phase: VERIFY
category: core
version: "1.0.0"
depends_on: []
tags: [debugging, diagnosis, troubleshooting, error-analysis, core-workflow]
Debug Assist
Systematic debugging guidance for isolating and fixing bugs.
When to Use
- Stuck on a bug: "I can't figure out why this is failing"
- Intermittent failures: "It works sometimes but not always"
- Production incident: "Something is broken in prod"
- Performance issue: "This is slow but I don't know why"
- When you say: "help me debug", "why isn't this working?", "I'm stuck"
Reference Requirements
MUST read before applying this skill:
| Reference | Why Required |
|---|---|
| hypothesis-generation.md | Systematic debugging approach |
Read if applicable:
| Reference | When Needed |
|---|---|
| concurrency-debugging.md | For race conditions, deadlocks |
| memory-debugging.md | For memory leaks, corruption |
| performance-debugging.md | For slowness issues |
| production-debugging.md | For prod incidents |
Verification: Document hypothesis and verification for each debugging attempt.
Required Deliverables
| Deliverable | Location | Condition |
|---|---|---|
| DEBUG-LOG.md | Project root or inline | When debugging session occurs |
| Bug fix code | src/ | When bug is fixed |
Core Concept
Debugging is systematic elimination, not random guessing.
The scientific method for bugs:
1. Observe: What exactly is happening?
2. Hypothesize: What could cause this?
3. Test: Design an experiment to confirm or eliminate
4. Repeat: Narrow down until root cause is found
The Debug Loop
1. SPECIFY THE SYMPTOM
   What exactly is wrong? Be precise.
2. REPRODUCE
   Can you make it happen reliably?
3. GENERATE HYPOTHESES
   What could cause this symptom?
4. PRIORITIZE
   Most likely? Easiest to test?
5. TEST ONE HYPOTHESIS
   Design experiment, observe result
6. UPDATE AND REPEAT
   Eliminate hypothesis, refine, continue
7. ROOT CAUSE FOUND
   Fix, verify, document
Step 1: Specify the Symptom
Be precise. Vague symptoms lead to vague debugging.
| Bad | Good |
|---|---|
| "It doesn't work" | "Clicking submit returns 500 error" |
| "It's slow" | "API response takes 8 seconds instead of 200ms" |
| "It crashes" | "Process exits with SIGKILL after 10 minutes" |
| "Data is wrong" | "User balance shows $0 after successful payment" |
Capture:
- Exact error message (copy/paste, don't paraphrase)
- Stack trace if available
- When it started (after which change?)
- Who/what is affected
- Frequency (always? sometimes? once?)
→ See references/symptom-specification.md
Step 2: Reproduce
If you can't reproduce it, you can't fix it.
| Reproducibility | Approach |
|---|---|
| Always happens | Proceed to hypothesize |
| Sometimes happens | Find the conditions that trigger it |
| Happened once | Gather logs, make best hypothesis |
To improve reproducibility:
- Simplify: Remove variables until minimal reproduction
- Isolate: Test component in isolation
- Control: Fix all variables (data, time, environment)
- Log: Add logging to capture state at failure
Minimal reproduction:
Start: Full application with 100 features
Goal: Smallest code that still shows the bug
Remove half → still fails? → remove half again
            → works? → bug is in removed half
→ See references/reproduction-techniques.md
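The halving procedure above can be sketched in a few lines. This is a minimal illustration, not part of the skill itself: it assumes a single culprit, and the `minimize` function, the `fails` callback, and the feature names are all hypothetical.

```python
from typing import Callable, List, Sequence


def minimize(features: Sequence[str],
             fails: Callable[[Sequence[str]], bool]) -> List[str]:
    """Halve the feature set repeatedly while the bug still reproduces.

    Assumes exactly one culprit: if the kept half works,
    the bug must be in the removed half.
    """
    current = list(features)
    if not fails(current):
        raise ValueError("bug does not reproduce with the full feature set")
    while len(current) > 1:
        mid = len(current) // 2
        kept, removed = current[:mid], current[mid:]
        # Still fails with half removed? Keep the smaller set.
        # Works? Then the bug is in the removed half.
        current = kept if fails(kept) else removed
    return current


# Hypothetical example: the bug appears whenever "feature_37" is enabled.
all_features = [f"feature_{i}" for i in range(100)]
culprit = minimize(all_features, lambda enabled: "feature_37" in enabled)
print(culprit)  # ['feature_37']
```

When a bug needs two features to interact, neither half fails on its own and this simple version picks the wrong half; in that case a more careful reduction strategy is needed.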
Step 3: Generate Hypotheses
List everything that could cause this symptom.
Categories to consider:
| Category | Examples |
|---|---|
| Input | Bad data, unexpected format, edge case |
| State | Stale cache, race condition, corrupted state |
| Environment | Config, permissions, dependencies |
| Code | Logic error, typo, wrong assumption |
| External | API change, network issue, third-party bug |
| Resources | Memory, disk, connections exhausted |
Hypothesis format:
H1: [What might be wrong]
Evidence for: [Why this might be it]
Evidence against: [Why this might not be it]
Test: [How to confirm or eliminate]
Example:
H1: Database connection pool exhausted
Evidence for: Error mentions timeout, high traffic lately
Evidence against: Error is immediate, not after 30s timeout
Test: Check connection pool metrics, try increasing pool size
H2: Query is missing index, timing out
Evidence for: Slow query log shows this table
Evidence against: Query worked yesterday
Test: EXPLAIN ANALYZE the query
→ See references/hypothesis-generation.md
Step 4: Prioritize Hypotheses
Test the most likely or easiest to eliminate first.
| Prioritization Factor | Higher Priority |
|---|---|
| Likelihood | More likely causes first |
| Test speed | Quick tests before slow tests |
| Risk | Eliminate dangerous causes early |
| Reversibility | Easy-to-undo tests first |
Quick wins:
- Check recent changes (git log, deploys)
- Check obvious things (config, permissions, typos)
- Check external dependencies (status pages, logs)
Don't skip: Even if you're "sure" it's not X, quick tests to eliminate X are worth it. Assumptions are dangerous.
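One way to make the prioritization explicit is to score each hypothesis. The sketch below is only an illustration: the `Hypothesis` class, the likelihood-per-minute heuristic, and the numbers are assumptions chosen for the example, and the other factors in the table (risk, reversibility) are left out for brevity.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    name: str
    likelihood: float    # 0.0 to 1.0, a subjective estimate
    test_minutes: float  # rough time needed to run the test
    status: str = "untested"


def prioritize(hypotheses: List[Hypothesis]) -> List[Hypothesis]:
    # Illustrative heuristic: likelihood per minute of test time, highest first.
    return sorted(
        hypotheses,
        key=lambda h: h.likelihood / max(h.test_minutes, 1.0),
        reverse=True,
    )


# Made-up estimates for the example hypotheses.
backlog = [
    Hypothesis("Connection pool exhausted", likelihood=0.3, test_minutes=5),
    Hypothesis("Query missing index", likelihood=0.5, test_minutes=10),
    Hypothesis("Recent deploy changed config", likelihood=0.2, test_minutes=1),
]
for h in prioritize(backlog):
    print(h.name)
```

With these estimates the one-minute config check comes first even though it is the least likely cause, which matches the advice to run quick eliminations early.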
Step 5: Test One Hypothesis
One at a time. Changing multiple things means you won't know what fixed it.
Experiment design:
Hypothesis: Connection pool is exhausted
Test: Increase pool size from 10 to 50
Expected if true: Errors stop
Expected if false: Same errors continue
Testing techniques:
| Technique | When to Use |
|---|---|
| Binary search | Narrow down in code/commits |
| Substitution | Replace component with known-good |
| Isolation | Test component alone |
| Injection | Force specific conditions |
| Logging | Observe internal state |
| Bisect | Find breaking commit |
→ See references/testing-techniques.md
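The binary search and bisect techniques share one idea: halve an ordered range until the first bad entry is found. A minimal sketch follows, assuming the first commit is known good, the last is known bad, and the bug persists once introduced. The commit names and the `is_bad` check are hypothetical; in practice `git bisect` automates this search over real history.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad commit in an ordered history.

    Assumes commits[0] is good, commits[-1] is bad, and every commit
    after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


# Hypothetical history of 16 commits, broken from c11 onward.
history = [f"c{i}" for i in range(16)]
tested = []


def is_bad(commit: str) -> bool:
    tested.append(commit)  # record how many commits had to be checked
    return int(commit[1:]) >= 11


print(first_bad(history, is_bad), len(tested))  # c11 4
```

Four checks are enough for 16 commits; the number of checks grows with the logarithm of the history length.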
Step 6: Update and Repeat
After each test:
| Result | Action |
|---|---|
| Hypothesis confirmed | You found it! Fix and verify |
| Hypothesis eliminated | Remove from list, update others |
| Inconclusive | Refine test, gather more data |
| New information | Generate new hypotheses |
Track your progress:
## Debug Log
### Symptom
API returns 500 on /users endpoint
### Hypotheses
- [x] H1: Database down - ELIMINATED (DB healthy)
- [x] H2: Auth token expired - ELIMINATED (token valid)
- [ ] H3: Query timeout - TESTING
- [ ] H4: Memory pressure - Not yet tested
### Tests Run
1. Checked DB status → healthy, 50ms ping
2. Validated auth token → token valid, not expired
3. Running: EXPLAIN ANALYZE on user query
Step 7: Root Cause Found
Don't stop at the fix. Document and prevent recurrence.
## Root Cause Analysis
**Symptom:** API returns 500 on /users endpoint
**Root Cause:** Missing index on users.organization_id.
Query did full table scan on 2M rows, timing out.
**Fix:** Added index, query now 50ms.
**Prevention:**
- Added query performance monitoring
- Alert on queries > 1 second
- Review process for schema changes
**Timeline:**
- 14:30 - Reports of errors
- 14:35 - Identified slow query in logs
- 14:45 - Confirmed missing index
- 14:50 - Index deployed, verified fix
Specialized Modes
Memory Issues
Symptoms: OOM kills, growing memory, GC pauses
Approach:
1. Confirm it's memory (monitor RSS over time)
2. Identify growth pattern (steady leak vs. spike)
3. Isolate (which operation causes growth?)
4. Inspect (heap dump, allocation profiling)
→ See references/memory-debugging.md
Concurrency Issues
Symptoms: Intermittent failures, race conditions, deadlocks
Approach:
1. Confirm it's concurrency (does single-threaded work?)
2. Identify shared state
3. Add logging around critical sections
4. Try stress testing (increase concurrency)
5. Look for: check-then-act, shared mutable state, lock ordering
→ See references/concurrency-debugging.md
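The check-then-act pattern from step 5 is easiest to recognize in a small example. The sketch below is illustrative only: the `Inventory` class and the `stress` helper are hypothetical, and the unsafe variant may or may not misbehave on a given run, which is exactly what makes such bugs intermittent.

```python
import threading


class Inventory:
    def __init__(self, stock: int) -> None:
        self.stock = stock
        self._lock = threading.Lock()

    def reserve_unsafe(self) -> bool:
        # Check-then-act: another thread can run between the check
        # and the decrement, so stock can be oversold.
        if self.stock > 0:
            self.stock -= 1
            return True
        return False

    def reserve(self) -> bool:
        # Holding the lock makes the check and the act one atomic step.
        with self._lock:
            if self.stock > 0:
                self.stock -= 1
                return True
            return False


def stress(method_name: str, stock: int = 100,
           threads: int = 8, attempts: int = 1000) -> int:
    """Hammer one method from several threads; return successful reservations."""
    inv = Inventory(stock)
    successes = []

    def worker() -> None:
        for _ in range(attempts):
            if getattr(inv, method_name)():
                successes.append(1)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return len(successes)


print(stress("reserve"))  # 100
```

Running `stress("reserve_unsafe")` repeatedly, with more threads or attempts, is the stress-testing step: a count above 100 confirms the race, while a count of exactly 100 proves nothing.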
Performance Issues
Symptoms: Slow responses, high latency, timeouts
Approach:
1. Measure (where is time spent?)
2. Profile (CPU? I/O? Network? Waiting?)
3. Identify hotspots (what takes the most time?)
4. Focus on the critical path
5. Fix highest-impact issues first
→ See references/performance-debugging.md
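Before reaching for a full profiler, coarse per-stage timing often answers "where is time spent?". A minimal sketch follows, assuming a request handler with three stages; the stage names and the `time.sleep` calls are placeholders for real work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(float)  # stage name -> accumulated seconds


@contextmanager
def timed(stage: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] += time.perf_counter() - start


def handle_request() -> None:
    with timed("parse"):
        time.sleep(0.001)
    with timed("db_query"):
        time.sleep(0.05)  # stand-in for the slow step
    with timed("render"):
        time.sleep(0.002)


for _ in range(3):
    handle_request()

# Hotspots first: fix the highest-impact stage before anything else.
hotspots = sorted(timings.items(), key=lambda kv: kv[1], reverse=True)
for stage, seconds in hotspots:
    print(f"{stage:10s} {seconds * 1000:7.1f} ms")
```

Once the slow stage is known, a profiler such as Python's `cProfile` can break that stage down further.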
Production Incidents
Symptoms: Alerts firing, users complaining
Approach:
1. Mitigate first: Restore service, then debug
2. Gather data (logs, metrics, traces)
3. Identify blast radius (what's affected?)
4. Correlate with changes (what changed recently?)
5. Fix or rollback
6. Post-mortem later
→ See references/production-debugging.md
Debug State Tracking
When debugging across multiple turns, maintain state:
## Debug Session: [Brief Description]
### Symptom
[Precise description of the bug]
### Environment
[Relevant context: versions, config, etc.]
### Hypotheses
| # | Hypothesis | Status | Evidence |
|---|------------|--------|----------|
| 1 | [Theory] | Eliminated/Testing/Confirmed | [Notes] |
| 2 | [Theory] | Not tested | |
### Tests Run
| # | Test | Result | Conclusion |
|---|------|--------|------------|
| 1 | [What you did] | [What happened] | [What this means] |
### Current Focus
[What we're investigating now]
### Next Steps
1. [Next action]
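If the session state is kept in code rather than edited by hand, the template above can be rendered from a small data structure. The `DebugSession` class below is a hypothetical sketch that covers only the symptom, hypotheses, and tests sections.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class DebugSession:
    title: str
    symptom: str
    # Each hypothesis is (theory, status, evidence).
    hypotheses: List[Tuple[str, str, str]] = field(default_factory=list)
    # Each test is (what you did, what happened, what this means).
    tests: List[Tuple[str, str, str]] = field(default_factory=list)

    def to_markdown(self) -> str:
        lines = [
            f"## Debug Session: {self.title}",
            "### Symptom",
            self.symptom,
            "### Hypotheses",
            "| # | Hypothesis | Status | Evidence |",
            "|---|------------|--------|----------|",
        ]
        for i, (theory, status, evidence) in enumerate(self.hypotheses, 1):
            lines.append(f"| {i} | {theory} | {status} | {evidence} |")
        lines += [
            "### Tests Run",
            "| # | Test | Result | Conclusion |",
            "|---|------|--------|------------|",
        ]
        for i, (test, result, conclusion) in enumerate(self.tests, 1):
            lines.append(f"| {i} | {test} | {result} | {conclusion} |")
        return "\n".join(lines)


session = DebugSession(
    title="500 on /users",
    symptom="API returns 500 on /users endpoint",
    hypotheses=[
        ("Database down", "Eliminated", "DB healthy"),
        ("Query timeout", "Testing", ""),
    ],
    tests=[("Checked DB status", "healthy, 50ms ping", "DB is not the cause")],
)
report = session.to_markdown()
print(report)
```

Re-rendering the log after each test keeps the record consistent from one turn to the next.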
Anti-Patterns
Shotgun Debugging
Symptom: Changing random things hoping something works.
Problem: You won't know what fixed it, and you might introduce new bugs.
Fix: One change at a time, test between each.
Assumption Blindness
Symptom: "It can't be X" without actually testing X.
Problem: X is frequently the problem.
Fix: Test your assumptions, even "obvious" ones.
Tunnel Vision
Symptom: Convinced it's a certain cause, ignoring evidence.
Problem: Wastes time on wrong path.
Fix: Let evidence guide you. Write down hypotheses, update based on tests.
Insufficient Logging
Symptom: "I don't know what's happening inside."
Problem: Can't debug what you can't observe.
Fix: Add logging, then reproduce.
Debugging in Production
Symptom: Making changes to prod to debug.
Problem: High risk, pressure, incomplete data.
Fix: Reproduce locally first. If impossible, add observability, don't experiment.
Relationship to Other Skills
| Skill | Relationship |
|---|---|
| code-verification | Verification catches bugs before they need debugging |
| code-validation | Validation may surface issues that need debugging |
| code-review | Review may identify potential bug sources |
| implement | Return to implement to fix identified bugs |
| test-generation | Create regression tests after fixing |
Quick Reference
When stuck:
1. Be precise about the symptom
2. Reproduce reliably (or understand why you can't)
3. List all possible causes
4. Test one at a time, most likely first
5. Track what you've tried
6. Ask for help after 30 minutes of no progress
Golden rules:
- Change one thing at a time
- Don't trust assumptions; verify them
- If you can't reproduce, add logging
- Recent changes are prime suspects
- Take breaks; fresh eyes find bugs
Mode-Specific Behavior
Debugging approach and constraints differ by orchestrator mode:
Greenfield Mode
| Aspect | Behavior |
|---|---|
| Scope | Full system - any code in the new system |
| Approach | Comprehensive hypothesis exploration |
| Patterns | Free choice of debugging techniques |
| Deliverables | Full debug log + fix + regression tests |
| Validation | Standard verification after fix |
| Constraints | Minimal - refactor freely if needed |
Brownfield-Polish Mode
| Aspect | Behavior |
|---|---|
| Scope | Gap-related code primarily |
| Approach | Extend debugging with existing system awareness |
| Patterns | Should match existing logging/testing patterns |
| Deliverables | Delta debug log + fix matching existing style |
| Validation | Existing tests + new regression tests |
| Constraints | Don't break existing functionality |
Polish considerations:
- Bug may be in existing code, not new code
- Existing workarounds may mask root cause
- Fix must not break existing functionality
- Regression testing against baseline required
Brownfield-Enterprise Mode
| Aspect | Behavior |
|---|---|
| Scope | Changed code path only |
| Approach | Surgical fix - minimal modification |
| Patterns | Must conform exactly to existing patterns |
| Deliverables | Change record with audit trail |
| Validation | Full regression + change-specific testing |
| Constraints | Requires approval - escalate if bug in existing code |
Enterprise debugging constraints:
- Do not modify code outside the change scope
- If bug is in existing code, escalate to team
- Rollback is preferred over complex fix
- All debugging steps must be documented for audit
References
- references/symptom-specification.md: How to describe bugs precisely
- references/reproduction-techniques.md: Getting bugs to happen reliably
- references/hypothesis-generation.md: Thinking of all possible causes
- references/testing-techniques.md: Binary search, isolation, bisect
- references/memory-debugging.md: Leaks, OOM, GC issues
- references/concurrency-debugging.md: Races, deadlocks, ordering
- references/performance-debugging.md: Profiling and optimization
- references/production-debugging.md: Incident response