Use when debugging errors that occur sporadically, chasing flaky tests, or investigating issues users report only occasionally.
npx skills add aj-geddes/useful-ai-prompts --skill "intermittent-issue-debugging"
Install specific skill from multi-skill repository
# Description
Debug issues that occur sporadically and are hard to reproduce. Use monitoring and systematic investigation to identify root causes of flaky behavior.
# SKILL.md
---
name: intermittent-issue-debugging
description: Debug issues that occur sporadically and are hard to reproduce. Use monitoring and systematic investigation to identify root causes of flaky behavior.
---
Intermittent Issue Debugging
Overview
Intermittent issues are the most difficult to debug because they don't occur consistently. A systematic approach and comprehensive monitoring are essential.
When to Use
- Sporadic errors in logs
- Users report occasional issues
- Flaky tests
- Race conditions suspected
- Timing-dependent bugs
- Resource exhaustion issues
Instructions
1. Capturing Intermittent Issues
```javascript
// Strategy 1: Comprehensive Logging
// Add detailed logging around suspected code
function processPayment(orderId) {
  const startTime = Date.now();
  console.log(`[${startTime}] Payment start: order=${orderId}`);
  try {
    const result = chargeCard(orderId);
    console.log(`[${Date.now()}] Payment success: ${orderId}`);
    return result;
  } catch (error) {
    const duration = Date.now() - startTime;
    console.error(`[${Date.now()}] Payment FAILED:`, {
      order: orderId,
      error: error.message,
      duration_ms: duration,
      error_type: error.constructor.name,
      stack: error.stack
    });
    throw error;
  }
}
```
```javascript
// Strategy 2: Correlation IDs
// Track a request across systems (assumes a structured logger such as pino or winston)
const orderId = 123;
const correlationId = crypto.randomUUID(); // global in Node 19+ and browsers
logger.info({
  correlationId,
  action: 'payment_start',
  orderId
});
chargeCard(orderId, { headers: { 'x-correlation-id': correlationId } });
logger.info({
  correlationId,
  action: 'payment_end',
  status: 'success'
});
// Later, grep logs by correlationId to reconstruct the full request trace
```
```javascript
// Strategy 3: Error Sampling
// Capture full error context when an error occurs (browser example)
window.addEventListener('error', (event) => {
  const errorData = {
    message: event.message,
    url: event.filename,
    line: event.lineno,
    col: event.colno,
    stack: event.error?.stack,
    userAgent: navigator.userAgent,
    memory: performance.memory?.usedJSHeapSize, // non-standard, Chrome-only
    timestamp: new Date().toISOString()
  };
  sendToMonitoring(errorData); // send to your error-tracking service
});
```
2. Common Intermittent Issues
Issue: Race Condition
Symptom: Inconsistent behavior depending on timing
Example:
Thread 1: Read count (5)
Thread 2: Read count (5), increment to 6, write
Thread 1: Increment to 6, write (overwrites Thread 2's write)
Result: Should be 7, but is 6
Debug:
1. Add detailed timestamps
2. Log all operations
3. Look for overlapping operations
4. Check if order matters
Solution:
- Use locks/mutexes
- Use atomic operations
- Use message queues
- Ensure single writer
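The lost-update scenario above can be reproduced deterministically in JavaScript with async interleaving. This is a minimal sketch (all names are illustrative): the unsafe version loses one of two concurrent increments, while a tiny promise-based mutex serializes the critical section, which is one way to implement the "use locks/mutexes" fix.

```javascript
let count = 5;

// Unsafe: both callers read the same value before either writes.
async function incrementUnsafe() {
  const current = count;                          // read
  await new Promise(r => setTimeout(r, 10));      // simulated I/O between read and write
  count = current + 1;                            // write (may clobber a concurrent write)
}

// A tiny promise-based mutex: each caller waits for the previous one to finish.
function createMutex() {
  let tail = Promise.resolve();
  return function runExclusive(fn) {
    const result = tail.then(fn);
    tail = result.catch(() => {});                // keep the chain alive on errors
    return result;
  };
}

const runExclusive = createMutex();

async function incrementSafe() {
  return runExclusive(async () => {
    const current = count;
    await new Promise(r => setTimeout(r, 10));
    count = current + 1;
  });
}

async function demo() {
  count = 5;
  await Promise.all([incrementUnsafe(), incrementUnsafe()]);
  const unsafeResult = count;                     // one increment was lost

  count = 5;
  await Promise.all([incrementSafe(), incrementSafe()]);
  const safeResult = count;                       // increments serialized

  return { unsafeResult, safeResult };
}
```

Running `demo()` shows the unsafe version ending at 6 and the mutex-guarded version at 7, matching the thread trace above.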
---
Issue: Timing-Dependent Bug
Symptom: Test passes sometimes, fails others
Example:
test_user_creation:
1. Create user (sometimes slow)
2. Check user exists
3. Fails if create took too long
Debug:
- Add timeout logging
- Increase wait time
- Add explicit waits
- Mock slow operations
Solution:
- Explicit wait for condition
- Remove time-dependent assertions
- Use proper test fixtures
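The "explicit wait for condition" fix can be sketched as a small polling helper (names and defaults here are illustrative, not from any particular test framework): instead of sleeping a fixed amount and hoping the slow operation finished, poll a predicate until it becomes true or a timeout expires.

```javascript
// Poll a predicate until it returns true or the timeout expires.
async function waitFor(predicate, { timeoutMs = 2000, intervalMs = 25 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await predicate()) return true;
    await new Promise(r => setTimeout(r, intervalMs));
  }
  throw new Error(`Condition not met within ${timeoutMs}ms`);
}

// Usage sketch: wait for a "user" created by a slow async operation,
// rather than asserting immediately (which fails when creation is slow).
async function demo() {
  let user = null;
  setTimeout(() => { user = { id: 1 }; }, 100);   // simulated slow creation
  await waitFor(() => user !== null);
  return user.id;
}
```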
---
Issue: Resource Exhaustion
Symptom: Works fine at first, then fails after running for a while
Example:
- Memory grows over time
- Connection pool exhausted
- Disk space fills up
- Max open files reached
Debug:
- Monitor resources continuously
- Check for leaks (memory growth)
- Monitor connection count
- Check long-running processes
Solution:
- Fix memory leak
- Increase resource limits
- Implement cleanup
- Add monitoring/alerts
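A common cause of pool exhaustion is an error path that skips the release call. One way to "implement cleanup" is to funnel every acquisition through a `try`/`finally` wrapper, sketched below with a hypothetical in-memory pool standing in for a real connection pool:

```javascript
// Hypothetical stand-in for a real connection pool.
function createPool(size) {
  let inUse = 0;
  return {
    acquire() {
      if (inUse >= size) throw new Error('pool exhausted');
      inUse += 1;
      return { query: () => 'ok' };
    },
    release() { inUse -= 1; },
    inUse: () => inUse,
  };
}

// The connection is always released, on success AND on error,
// so leaked connections cannot accumulate over time.
async function withConnection(pool, fn) {
  const conn = pool.acquire();
  try {
    return await fn(conn);
  } finally {
    pool.release();
  }
}
```

After any call to `withConnection`, successful or not, `pool.inUse()` returns to its previous value.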
---
Issue: Intermittent Network Failure
Symptom: API calls occasionally fail
Debug:
- Check network logs
- Identify timeout patterns
- Check if time-of-day dependent
- Check if load dependent
Solution:
- Implement exponential backoff retry
- Add circuit breaker
- Increase timeout
- Add redundancy
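The exponential-backoff retry mentioned above can be sketched as follows. This is a minimal illustration (function name and defaults are made up); production code would typically also add random jitter to avoid synchronized retry storms.

```javascript
// Retry fn up to `retries` times, doubling the delay each attempt.
async function retryWithBackoff(fn, { retries = 3, baseDelayMs = 100 } = {}) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt === retries) throw err;         // out of retries: give up
      const delay = baseDelayMs * 2 ** attempt;   // 100, 200, 400, ...
      await new Promise(r => setTimeout(r, delay));
    }
  }
}
```

Usage: a call that fails twice with a transient error and then succeeds will return normally on the third attempt, with the caller never seeing the intermediate failures.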
3. Systematic Investigation Process
Step 1: Understand the Pattern
Questions:
- How often does it occur? (1/100, 1/1000?)
- When does it occur? (time of day, load, specific user?)
- What are the conditions? (network, memory, load?)
- Is it reproducible? (deterministic or random?)
- Any recent changes?
Analysis:
- Review error logs
- Check error rate trends
- Identify patterns
- Correlate with changes
Step 2: Reproduce Reliably
Methods:
- Increase test frequency (run 1000 times)
- Stress test (heavy load)
- Simulate poor conditions (network, memory)
- Run on different machines
- Run in production-like environment
Goal: make the issue occur consistently enough to analyze it
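The "increase test frequency" method can be automated with a small harness that runs a test many times and reports the failure rate (a sketch; the helper name is illustrative):

```javascript
// Run a test function `runs` times and measure how often it fails.
async function measureFlakiness(testFn, runs = 1000) {
  let failures = 0;
  for (let i = 0; i < runs; i++) {
    try {
      await testFn();
    } catch {
      failures += 1;
    }
  }
  return { runs, failures, failureRate: failures / runs };
}
```

A failure rate of 1/100 vs 1/10000 changes both how you reproduce the issue (local loop vs production monitoring) and how you verify a fix.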
Step 3: Add Instrumentation
- Add detailed logging
- Add monitoring metrics
- Add trace IDs
- Capture errors fully
- Log system state
Step 4: Capture the Issue
- Recreate scenario
- Capture full context
- Note system state
- Document conditions
- Get reproduction case
Step 5: Analyze Data
- Review logs
- Look for patterns
- Compare normal vs error cases
- Check timing correlations
- Identify root cause
Step 6: Implement Fix
- Based on root cause
- Verify with reproduction case
- Test extensively
- Add regression test
4. Monitoring & Prevention
Monitoring Strategy:
Real User Monitoring (RUM):
- Error rates by feature
- Latency percentiles
- User impact
- Trend analysis
Application Performance Monitoring (APM):
- Request traces
- Database query performance
- External service calls
- Resource usage
Synthetic Monitoring:
- Regular test execution
- Simulate user flows
- Alert on failures
- Trend tracking
---
Alerting:
Setup alerts for:
- Error rate spike
- Response time >threshold
- Memory growth trend
- Failed transactions
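An error-rate spike alert like the first item above can be sketched as a sliding-window check (a simplified in-process illustration; real systems would do this in the monitoring backend, and the names here are made up):

```javascript
// Track recent requests and alert when the error rate in the
// sliding window exceeds the threshold.
function createErrorRateMonitor({ windowMs = 60000, threshold = 0.05 } = {}) {
  const events = []; // { ts, isError }
  return {
    record(isError, now = Date.now()) {
      events.push({ ts: now, isError });
      // Drop events that have aged out of the window.
      while (events.length && events[0].ts <= now - windowMs) events.shift();
    },
    shouldAlert(now = Date.now()) {
      const recent = events.filter(e => e.ts > now - windowMs);
      if (recent.length === 0) return false;
      const errRate = recent.filter(e => e.isError).length / recent.length;
      return errRate > threshold;
    },
  };
}
```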
---
Prevention Checklist:
[ ] Comprehensive logging in place
[ ] Error tracking configured
[ ] Performance monitoring active
[ ] Resource monitoring enabled
[ ] Correlation IDs used
[ ] Failed requests captured
[ ] Timeout values appropriate
[ ] Retry logic implemented
[ ] Circuit breakers in place
[ ] Load testing performed
[ ] Stress testing performed
[ ] Race conditions reviewed
[ ] Timing dependencies checked
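The "circuit breakers in place" item can be sketched as follows. This is a minimal consecutive-failure breaker (names and defaults are illustrative): after N failures in a row the circuit opens and calls fail fast until a cooldown elapses, after which one trial call is allowed through.

```javascript
// Wrap fn so repeated failures stop hammering a failing dependency.
function createCircuitBreaker(fn, { failureThreshold = 3, cooldownMs = 5000 } = {}) {
  let failures = 0;
  let openedAt = null;
  return async function call(...args) {
    if (openedAt !== null) {
      if (Date.now() - openedAt < cooldownMs) {
        throw new Error('circuit open: failing fast');
      }
      openedAt = null; // half-open: allow one trial call
      failures = 0;
    }
    try {
      const result = await fn(...args);
      failures = 0;    // success resets the failure count
      return result;
    } catch (err) {
      failures += 1;
      if (failures >= failureThreshold) openedAt = Date.now();
      throw err;
    }
  };
}
```

While the circuit is open, callers get an immediate error instead of waiting on timeouts, which keeps an intermittently failing dependency from dragging down the whole request path.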
---
Tools:
Monitoring:
- New Relic / DataDog
- Prometheus / Grafana
- Sentry / Rollbar
- Custom logging
Testing:
- Load testing (k6, JMeter)
- Chaos engineering (Gremlin)
- Property-based testing (Hypothesis)
- Fuzz testing
Debugging:
- Distributed tracing (Jaeger)
- Correlation IDs
- Detailed logging
- Debuggers
Key Points
- Comprehensive logging is essential
- Add correlation IDs for tracing
- Monitor for patterns and trends
- Stress test to reproduce
- Use detailed error context
- Implement exponential backoff for retries
- Monitor resource exhaustion
- Add circuit breakers for external services
- Log system state with errors
- Implement proper monitoring/alerting
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents.
Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.