Use when debugging errors that occur sporadically, chasing flaky tests, or investigating issues users report only occasionally.
npx skills add aj-geddes/useful-ai-prompts --skill "intermittent-issue-debugging"
Install specific skill from multi-skill repository
# Description
Debug issues that occur sporadically and are hard to reproduce. Use monitoring and systematic investigation to identify root causes of flaky behavior.
# SKILL.md
---
name: intermittent-issue-debugging
description: Debug issues that occur sporadically and are hard to reproduce. Use monitoring and systematic investigation to identify root causes of flaky behavior.
---
Intermittent Issue Debugging
Overview
Intermittent issues are the most difficult to debug because they don't occur consistently. A systematic approach and comprehensive monitoring are essential.
When to Use
- Sporadic errors in logs
- Users report occasional issues
- Flaky tests
- Race conditions suspected
- Timing-dependent bugs
- Resource exhaustion issues
Instructions
1. Capturing Intermittent Issues
```javascript
// Strategy 1: Comprehensive Logging
// Add detailed logging around suspected code
function processPayment(orderId) {
  const startTime = Date.now();
  console.log(`[${startTime}] Payment start: order=${orderId}`);
  try {
    const result = chargeCard(orderId);
    console.log(`[${Date.now()}] Payment success: ${orderId}`);
    return result;
  } catch (error) {
    const duration = Date.now() - startTime;
    console.error(`[${Date.now()}] Payment FAILED:`, {
      order: orderId,
      error: error.message,
      duration_ms: duration,
      error_type: error.constructor.name,
      stack: error.stack
    });
    throw error;
  }
}
```
```javascript
// Strategy 2: Correlation IDs
// Track a request across systems (assumes a structured logger such as pino or winston)
const orderId = 123;
const correlationId = crypto.randomUUID(); // global in Node 19+ and browsers
logger.info({
  correlationId,
  action: 'payment_start',
  orderId
});
chargeCard(orderId, { headers: { 'x-correlation-id': correlationId } });
logger.info({
  correlationId,
  action: 'payment_end',
  status: 'success'
});
// Later, grep logs by correlationId to reconstruct the full request trace
```
```javascript
// Strategy 3: Error Sampling
// Capture full error context when an error occurs (browser example)
window.addEventListener('error', (event) => {
  const errorData = {
    message: event.message,
    url: event.filename,
    line: event.lineno,
    col: event.colno,
    stack: event.error?.stack,
    userAgent: navigator.userAgent,
    memory: performance.memory?.usedJSHeapSize, // non-standard, Chrome-only
    timestamp: new Date().toISOString()
  };
  sendToMonitoring(errorData); // send to your error-tracking service
});
```
2. Common Intermittent Issues
Issue: Race Condition
Symptom: Inconsistent behavior depending on timing
Example:
Thread 1: Read count (5)
Thread 2: Read count (5), increment to 6, write
Thread 1: Increment to 6, write (overwrites Thread 2's write)
Result: Should be 7, but is 6
Debug:
1. Add detailed timestamps
2. Log all operations
3. Look for overlapping operations
4. Check if order matters
Solution:
- Use locks/mutexes
- Use atomic operations
- Use message queues
- Ensure single writer
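The lost-update scenario above can be reproduced deterministically in JavaScript with async interleaving. This is a minimal sketch (all names are illustrative): the unsafe version loses one of two concurrent increments, while a tiny promise-based mutex serializes the critical section, which is one way to implement the "use locks/mutexes" fix.

```javascript
let count = 5;

// Unsafe: both callers read the same value before either writes.
async function incrementUnsafe() {
  const current = count;                          // read
  await new Promise(r => setTimeout(r, 10));      // simulated I/O between read and write
  count = current + 1;                            // write (may clobber a concurrent write)
}

// A tiny promise-based mutex: each caller waits for the previous one to finish.
function createMutex() {
  let tail = Promise.resolve();
  return function runExclusive(fn) {
    const result = tail.then(fn);
    tail = result.catch(() => {});                // keep the chain alive on errors
    return result;
  };
}

const runExclusive = createMutex();

async function incrementSafe() {
  return runExclusive(async () => {
    const current = count;
    await new Promise(r => setTimeout(r, 10));
    count = current + 1;
  });
}

async function demo() {
  count = 5;
  await Promise.all([incrementUnsafe(), incrementUnsafe()]);
  const unsafeResult = count;                     // one increment was lost

  count = 5;
  await Promise.all([incrementSafe(), incrementSafe()]);
  const safeResult = count;                       // increments serialized

  return { unsafeResult, safeResult };
}
```

Running `demo()` shows the unsafe version ending at 6 and the mutex-guarded version at 7, matching the thread trace above.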
---
Issue: Timing-Dependent Bug
Symptom: Test passes sometimes, fails others
Example:
test_user_creation:
1. Create user (sometimes slow)
2. Check user exists
3. Fails if create took too long
Debug:
- Add timeout logging
- Increase wait time
- Add explicit waits
- Mock slow operations
Solution:
- Explicit wait for condition
- Remove time-dependent assertions
- Use proper test fixtures
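The "explicit wait for condition" fix can be sketched as a small polling helper (names and defaults here are illustrative, not from any particular test framework): instead of sleeping a fixed amount and hoping the slow operation finished, poll a predicate until it becomes true or a timeout expires.

```javascript
// Poll a predicate until it returns true or the timeout expires.
async function waitFor(predicate, { timeoutMs = 2000, intervalMs = 25 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await predicate()) return true;
    await new Promise(r => setTimeout(r, intervalMs));
  }
  throw new Error(`Condition not met within ${timeoutMs}ms`);
}

// Usage sketch: wait for a "user" created by a slow async operation,
// rather than asserting immediately (which fails when creation is slow).
async function demo() {
  let user = null;
  setTimeout(() => { user = { id: 1 }; }, 100);   // simulated slow creation
  await waitFor(() => user !== null);
  return user.id;
}
```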
---
Issue: Resource Exhaustion
Symptom: Works fine at first, then fails after running for a while
Example:
- Memory grows over time
- Connection pool exhausted
- Disk space fills up
- Max open files reached
Debug:
- Monitor resources continuously
- Check for leaks (memory growth)
- Monitor connection count
- Check long-running processes
Solution:
- Fix memory leak
- Increase resource limits
- Implement cleanup
- Add monitoring/alerts
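A common cause of pool exhaustion is an error path that skips the release call. One way to "implement cleanup" is to funnel every acquisition through a `try`/`finally` wrapper, sketched below with a hypothetical in-memory pool standing in for a real connection pool:

```javascript
// Hypothetical stand-in for a real connection pool.
function createPool(size) {
  let inUse = 0;
  return {
    acquire() {
      if (inUse >= size) throw new Error('pool exhausted');
      inUse += 1;
      return { query: () => 'ok' };
    },
    release() { inUse -= 1; },
    inUse: () => inUse,
  };
}

// The connection is always released, on success AND on error,
// so leaked connections cannot accumulate over time.
async function withConnection(pool, fn) {
  const conn = pool.acquire();
  try {
    return await fn(conn);
  } finally {
    pool.release();
  }
}
```

After any call to `withConnection`, successful or not, `pool.inUse()` returns to its previous value.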
---
Issue: Intermittent Network Failure
Symptom: API calls occasionally fail
Debug:
- Check network logs
- Identify timeout patterns
- Check if time-of-day dependent
- Check if load dependent
Solution:
- Implement exponential backoff retry
- Add circuit breaker
- Increase timeout
- Add redundancy
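The exponential-backoff retry mentioned above can be sketched as follows. This is a minimal illustration (function name and defaults are made up); production code would typically also add random jitter to avoid synchronized retry storms.

```javascript
// Retry fn up to `retries` times, doubling the delay each attempt.
async function retryWithBackoff(fn, { retries = 3, baseDelayMs = 100 } = {}) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt === retries) throw err;         // out of retries: give up
      const delay = baseDelayMs * 2 ** attempt;   // 100, 200, 400, ...
      await new Promise(r => setTimeout(r, delay));
    }
  }
}
```

Usage: a call that fails twice with a transient error and then succeeds will return normally on the third attempt, with the caller never seeing the intermediate failures.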
3. Systematic Investigation Process
Step 1: Understand the Pattern
Questions:
- How often does it occur? (1/100, 1/1000?)
- When does it occur? (time of day, load, specific user?)
- What are the conditions? (network, memory, load?)
- Is it reproducible? (deterministic or random?)
- Any recent changes?
Analysis:
- Review error logs
- Check error rate trends
- Identify patterns
- Correlate with changes
Step 2: Reproduce Reliably
Methods:
- Increase test frequency (run 1000 times)
- Stress test (heavy load)
- Simulate poor conditions (network, memory)
- Run on different machines
- Run in production-like environment
Goal: make the issue occur consistently enough to analyze it
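The "increase test frequency" method can be automated with a small harness that runs a test many times and reports the failure rate (a sketch; the helper name is illustrative):

```javascript
// Run a test function `runs` times and measure how often it fails.
async function measureFlakiness(testFn, runs = 1000) {
  let failures = 0;
  for (let i = 0; i < runs; i++) {
    try {
      await testFn();
    } catch {
      failures += 1;
    }
  }
  return { runs, failures, failureRate: failures / runs };
}
```

A failure rate of 1/100 vs 1/10000 changes both how you reproduce the issue (local loop vs production monitoring) and how you verify a fix.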
Step 3: Add Instrumentation
- Add detailed logging
- Add monitoring metrics
- Add trace IDs
- Capture errors fully
- Log system state
Step 4: Capture the Issue
- Recreate scenario
- Capture full context
- Note system state
- Document conditions
- Get reproduction case
Step 5: Analyze Data
- Review logs
- Look for patterns
- Compare normal vs error cases
- Check timing correlations
- Identify root cause
Step 6: Implement Fix
- Based on root cause
- Verify with reproduction case
- Test extensively
- Add regression test
4. Monitoring & Prevention
Monitoring Strategy:
Real User Monitoring (RUM):
- Error rates by feature
- Latency percentiles
- User impact
- Trend analysis
Application Performance Monitoring (APM):
- Request traces
- Database query performance
- External service calls
- Resource usage
Synthetic Monitoring:
- Regular test execution
- Simulate user flows
- Alert on failures
- Trend tracking
---
Alerting:
Setup alerts for:
- Error rate spike
- Response time >threshold
- Memory growth trend
- Failed transactions
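An error-rate spike alert like the first item above can be sketched as a sliding-window check (a simplified in-process illustration; real systems would do this in the monitoring backend, and the names here are made up):

```javascript
// Track recent requests and alert when the error rate in the
// sliding window exceeds the threshold.
function createErrorRateMonitor({ windowMs = 60000, threshold = 0.05 } = {}) {
  const events = []; // { ts, isError }
  return {
    record(isError, now = Date.now()) {
      events.push({ ts: now, isError });
      // Drop events that have aged out of the window.
      while (events.length && events[0].ts <= now - windowMs) events.shift();
    },
    shouldAlert(now = Date.now()) {
      const recent = events.filter(e => e.ts > now - windowMs);
      if (recent.length === 0) return false;
      const errRate = recent.filter(e => e.isError).length / recent.length;
      return errRate > threshold;
    },
  };
}
```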
---
Prevention Checklist:
[ ] Comprehensive logging in place
[ ] Error tracking configured
[ ] Performance monitoring active
[ ] Resource monitoring enabled
[ ] Correlation IDs used
[ ] Failed requests captured
[ ] Timeout values appropriate
[ ] Retry logic implemented
[ ] Circuit breakers in place
[ ] Load testing performed
[ ] Stress testing performed
[ ] Race conditions reviewed
[ ] Timing dependencies checked
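The "circuit breakers in place" item can be sketched as follows. This is a minimal consecutive-failure breaker (names and defaults are illustrative): after N failures in a row the circuit opens and calls fail fast until a cooldown elapses, after which one trial call is allowed through.

```javascript
// Wrap fn so repeated failures stop hammering a failing dependency.
function createCircuitBreaker(fn, { failureThreshold = 3, cooldownMs = 5000 } = {}) {
  let failures = 0;
  let openedAt = null;
  return async function call(...args) {
    if (openedAt !== null) {
      if (Date.now() - openedAt < cooldownMs) {
        throw new Error('circuit open: failing fast');
      }
      openedAt = null; // half-open: allow one trial call
      failures = 0;
    }
    try {
      const result = await fn(...args);
      failures = 0;    // success resets the failure count
      return result;
    } catch (err) {
      failures += 1;
      if (failures >= failureThreshold) openedAt = Date.now();
      throw err;
    }
  };
}
```

While the circuit is open, callers get an immediate error instead of waiting on timeouts, which keeps an intermittently failing dependency from dragging down the whole request path.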
---
Tools:
Monitoring:
- New Relic / DataDog
- Prometheus / Grafana
- Sentry / Rollbar
- Custom logging
Testing:
- Load testing (k6, JMeter)
- Chaos engineering (Gremlin)
- Property-based testing (Hypothesis)
- Fuzz testing
Debugging:
- Distributed tracing (Jaeger)
- Correlation IDs
- Detailed logging
- Debuggers
Key Points
- Comprehensive logging is essential
- Add correlation IDs for tracing
- Monitor for patterns and trends
- Stress test to reproduce
- Use detailed error context
- Implement exponential backoff for retries
- Monitor resource exhaustion
- Add circuit breakers for external services
- Log system state with errors
- Implement proper monitoring/alerting
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents.
Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.