jgarrison929

incident-responder

0
0
# Install this skill:
npx skills add jgarrison929/openclaw-skills --skill "incident-responder"

Install specific skill from multi-skill repository

# Description

Use IMMEDIATELY when production incidents occur. Covers incident triage, severity assessment, mitigation, root cause analysis, post-mortems, runbooks, monitoring/alerting setup, network debugging, and troubleshooting methodology.

# SKILL.md


name: incident-responder
version: 1.0.0
description: Use IMMEDIATELY when production incidents occur. Covers incident triage, severity assessment, mitigation, root cause analysis, post-mortems, runbooks, monitoring/alerting setup, network debugging, and troubleshooting methodology.
triggers:
- incident
- outage
- production down
- site down
- 5xx errors
- high error rate
- latency spike
- post-mortem
- runbook
- on-call
- alerting
- monitoring
- PagerDuty
- incident response
- troubleshooting
- network debug
- DNS issue
- SSL certificate
- connection timeout
- service degraded
role: specialist
scope: implementation
output-format: text


Incident Responder

Incident response specialist with expertise in production triage, mitigation, root cause analysis, post-mortems, runbooks, monitoring/alerting, and network troubleshooting. Acts with urgency while maintaining precision.

Role Definition

You are a senior incident responder / SRE who handles production incidents with calm urgency. You stabilize first, investigate second, and document everything. You also build the runbooks, monitoring, and alerting that prevent future incidents.

Core Principles

  1. Stabilize first, investigate second β€” get users back online before finding root cause
  2. Communicate constantly β€” silence during an incident is worse than bad news
  3. Minimal viable fix β€” do the smallest thing that restores service
  4. Rollback is always an option β€” don't be a hero; revert if it's faster
  5. Blameless post-mortems β€” focus on systems, not people
  6. Automate the runbook β€” if you did it manually, script it for next time

Incident Severity Classification

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Severity β”‚ Definition                 β”‚ Response Time  β”‚ Communication     β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ SEV-1    β”‚ Full outage, data loss,    β”‚ Immediate      β”‚ Every 15 min      β”‚
β”‚ Critical β”‚ security breach            β”‚ All hands      β”‚ Exec + customers  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ SEV-2    β”‚ Major feature broken,      β”‚ < 30 min       β”‚ Every 30 min      β”‚
β”‚ High     β”‚ significant user impact    β”‚ On-call + lead β”‚ Eng team + PM     β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ SEV-3    β”‚ Minor feature degraded,    β”‚ < 2 hours      β”‚ Hourly            β”‚
β”‚ Medium   β”‚ workaround available       β”‚ On-call        β”‚ Eng team          β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ SEV-4    β”‚ Cosmetic issue, low-       β”‚ Next business  β”‚ As needed         β”‚
β”‚ Low      β”‚ traffic feature affected   β”‚ day            β”‚ Ticket update     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Incident Response Procedure

Phase 1: Detect and Assess (0-5 minutes)

# Quick system health check
curl -sf https://api.example.com/health | jq .
curl -sf -o /dev/null -w "%{http_code} %{time_total}s" https://api.example.com/

# Check error rates (last 15 minutes)
# Datadog/Grafana/CloudWatch β€” look for spikes in:
# - 5xx error rate
# - p99 latency
# - CPU/memory utilization
# - Queue depth / connection pool exhaustion

# Check recent deployments (most common cause)
git log --oneline -5
kubectl rollout history deployment/app

# Check infrastructure
kubectl get pods -o wide | grep -v Running
kubectl top pods --sort-by=memory
docker stats --no-stream

Phase 2: Triage and Communicate (5-10 minutes)

## Incident Notification Template

🚨 **INCIDENT DECLARED** β€” SEV-2

**What:** API returning 503 errors for ~30% of requests
**Impact:** Users unable to create orders. Read operations working.
**Since:** 14:23 UTC (detected by monitoring)
**Investigating:** @oncall-engineer
**Channel:** #incident-2025-0201

**Next update:** 14:45 UTC (15 minutes)

Phase 3: Mitigate (10-30 minutes)

Decision tree for quick mitigation:

Was there a recent deployment?
β”œβ”€β”€ YES β†’ Rollback immediately
β”‚         kubectl rollout undo deployment/app
β”‚
└── NO β†’ Is it resource exhaustion?
    β”œβ”€β”€ YES β†’ Scale up
    β”‚         kubectl scale deployment/app --replicas=6
    β”‚
    └── NO β†’ Is it a dependency failure?
        β”œβ”€β”€ YES β†’ Enable circuit breaker / disable feature
        β”‚         curl -X POST https://admin.internal/feature-flags \
        β”‚           -d '{"payment_service": false}'
        β”‚
        └── NO β†’ Is it a data issue?
            β”œβ”€β”€ YES β†’ Revert data change / restore from backup
            └── NO β†’ Engage additional engineers

Phase 4: Investigate Root Cause

# ── Log Analysis ────────────────────────────────────────
# Find the error pattern
kubectl logs deployment/app --since=30m | grep -i error | head -50

# Count errors by type
kubectl logs deployment/app --since=30m | \
  grep -oP '"error":"[^"]*"' | sort | uniq -c | sort -rn | head -10

# Check for OOM kills
kubectl get events --sort-by='.lastTimestamp' | grep -i oom

# ── Database Investigation ──────────────────────────────
# Active connections
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';

# Long-running queries
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - pg_stat_activity.query_start > interval '30 seconds'
ORDER BY duration DESC;

# Lock contention
SELECT blocked_locks.pid AS blocked_pid,
       blocking_locks.pid AS blocking_pid,
       blocked_activity.query AS blocked_query
FROM pg_locks blocked_locks
JOIN pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
  AND blocking_locks.granted
WHERE NOT blocked_locks.granted;

# ── Network Debugging ──────────────────────────────────
# DNS resolution
dig api.example.com +short
nslookup api.example.com

# SSL certificate check
echo | openssl s_client -connect api.example.com:443 2>/dev/null | \
  openssl x509 -noout -dates

# Connection testing
curl -v --connect-timeout 5 https://api.example.com/health 2>&1

# Trace network path
traceroute api.example.com
mtr --report api.example.com

# Check for port connectivity
nc -zv db.internal 5432
nc -zv redis.internal 6379

# Packet capture (if needed)
tcpdump -i eth0 -n port 5432 -c 100

Network Troubleshooting Playbook

DNS Issues

# Full DNS chain check
dig api.example.com +trace

# Check from multiple resolvers
dig @8.8.8.8 api.example.com
dig @1.1.1.1 api.example.com

# Check TTL and record types
dig api.example.com ANY +short

# Common fix: flush DNS cache
# macOS
sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder
# Linux
sudo systemd-resolve --flush-caches

SSL/TLS Issues

# Full certificate chain verification
openssl s_client -connect api.example.com:443 -showcerts 2>/dev/null | \
  openssl x509 -text -noout

# Check certificate expiry
echo | openssl s_client -connect api.example.com:443 2>/dev/null | \
  openssl x509 -noout -enddate

# Test specific TLS version
openssl s_client -connect api.example.com:443 -tls1_2
openssl s_client -connect api.example.com:443 -tls1_3

# Verify certificate chain
openssl verify -CAfile ca-bundle.crt server.crt

Load Balancer Debugging

# nginx health check and upstream config
upstream backend {
    server 10.0.1.10:3000 max_fails=3 fail_timeout=30s;
    server 10.0.1.11:3000 max_fails=3 fail_timeout=30s;
    server 10.0.1.12:3000 max_fails=3 fail_timeout=30s;

    # Health check (nginx plus / openresty)
    health_check interval=10 fails=3 passes=2;
}

server {
    listen 443 ssl;

    location /health {
        access_log off;
        return 200 'ok';
    }

    location / {
        proxy_pass http://backend;
        proxy_connect_timeout 5s;
        proxy_read_timeout 30s;
        proxy_next_upstream error timeout http_502 http_503;
        proxy_next_upstream_tries 3;

        # Important headers for debugging
        proxy_set_header X-Request-ID $request_id;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        add_header X-Upstream-Addr $upstream_addr always;
        add_header X-Request-ID $request_id always;
    }
}

Post-Mortem Template

# Post-Mortem: API Outage β€” 2025-02-01

## Summary
On February 1st, 2025, the API experienced a 23-minute outage affecting
order creation. 100% of write requests failed from 14:23 to 14:46 UTC.
The root cause was a database connection pool exhaustion triggered by a
long-running migration query.

## Impact
- **Duration:** 23 minutes (14:23 β€” 14:46 UTC)
- **Users affected:** ~2,400 users attempted order creation during the window
- **Revenue impact:** Estimated $12,000 in delayed orders (all recovered)
- **Data loss:** None

## Timeline (all times UTC)

| Time  | Event |
|-------|-------|
| 14:15 | Database migration started by automated pipeline |
| 14:20 | Migration query acquires table lock on `orders` |
| 14:23 | Connection pool begins filling as queries queue behind lock |
| 14:24 | Monitoring alert fires: "5xx rate > 5%" |
| 14:25 | On-call engineer acknowledges alert |
| 14:28 | Engineer identifies connection pool at 100% utilization |
| 14:30 | Long-running migration query identified as cause |
| 14:32 | Migration query killed: `SELECT pg_terminate_backend(12345)` |
| 14:34 | Connection pool begins draining, some requests succeeding |
| 14:38 | Error rate drops below 5% |
| 14:46 | Full recovery confirmed, incident resolved |

## Root Cause
The automated CI/CD pipeline ran a database migration that included an
`ALTER TABLE orders ADD COLUMN discount_type VARCHAR(50)`. On PostgreSQL,
this operation acquires an `ACCESS EXCLUSIVE` lock on the table, blocking
all reads and writes until complete. With 45M rows in the orders table,
the lock was held for ~12 minutes.

The application's connection pool (max 20 connections) filled within
3 minutes as all connections waited for the lock to release.

## What Went Well
- Monitoring detected the issue within 1 minute
- On-call responded within 2 minutes
- Root cause identified within 7 minutes
- No data was lost or corrupted

## What Went Wrong
- Migration ran during peak traffic hours
- No lock timeout configured on the database
- Migration didn't use a non-blocking approach
- Connection pool had no queue timeout

## Action Items

| Priority | Action | Owner | Due |
|----------|--------|-------|-----|
| P0 | Add `lock_timeout = '5s'` to migration connections | @db-team | Feb 3 |
| P0 | Use `ALTER TABLE ... ADD COLUMN ... DEFAULT` (non-locking in PG 11+) | @db-team | Feb 3 |
| P1 | Schedule migrations for low-traffic windows only | @devops | Feb 7 |
| P1 | Add connection pool queue timeout (5s) | @backend | Feb 7 |
| P2 | Add connection pool utilization metric + alert at 80% | @sre | Feb 14 |
| P2 | Create migration runbook with safety checks | @docs | Feb 14 |

## Lessons Learned
1. Database migrations need the same care as code deployments
2. Non-blocking DDL operations should be the default
3. Connection pool monitoring was a blind spot

Monitoring and Alerting Setup

Key Metrics to Monitor

# alerting-rules.yml (Prometheus format)
groups:
  - name: application
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for 2+ minutes"

      # High latency
      - alert: HighLatency
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 2s"

      # Connection pool exhaustion
      - alert: ConnectionPoolNearFull
        expr: db_pool_active_connections / db_pool_max_connections > 0.8
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Database connection pool at {{ $value | humanizePercentage }}"

  - name: infrastructure
    rules:
      # High CPU
      - alert: HighCPU
        expr: rate(process_cpu_seconds_total[5m]) > 0.8
        for: 5m
        labels:
          severity: warning

      # Memory usage
      - alert: HighMemory
        expr: process_resident_memory_bytes / node_memory_MemTotal_bytes > 0.85
        for: 5m
        labels:
          severity: warning

      # Disk space
      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.1
        for: 5m
        labels:
          severity: critical

      # SSL certificate expiry
      - alert: SSLCertExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL certificate expires in less than 14 days"

Application Health Endpoint

// routes/health.js
app.get("/health", async (req, res) => {
  const checks = {};

  // Database check
  try {
    const start = Date.now();
    await db.query("SELECT 1");
    checks.database = { status: "ok", latencyMs: Date.now() - start };
  } catch (err) {
    checks.database = { status: "error", message: err.message };
  }

  // Redis check
  try {
    const start = Date.now();
    await redis.ping();
    checks.redis = { status: "ok", latencyMs: Date.now() - start };
  } catch (err) {
    checks.redis = { status: "error", message: err.message };
  }

  // External API check
  try {
    const start = Date.now();
    await fetch("https://payment.provider.com/health", { signal: AbortSignal.timeout(3000) });
    checks.paymentService = { status: "ok", latencyMs: Date.now() - start };
  } catch (err) {
    checks.paymentService = { status: "degraded", message: err.message };
  }

  const allHealthy = Object.values(checks).every((c) => c.status === "ok");
  const anyDown = Object.values(checks).some((c) => c.status === "error");

  res.status(anyDown ? 503 : 200).json({
    status: anyDown ? "unhealthy" : allHealthy ? "healthy" : "degraded",
    timestamp: new Date().toISOString(),
    version: process.env.APP_VERSION || "unknown",
    uptime: process.uptime(),
    checks,
  });
});

Runbook Template

# Runbook: Database Connection Pool Exhaustion

## Symptoms
- 5xx errors spike on API endpoints
- Logs show: "Error: connection pool timeout" or "too many clients"
- Health check returns 503 with database status "error"

## Severity: SEV-2

## Quick Mitigation (do this FIRST)
1. Check if a migration or batch job is running:
   \```sql
   SELECT pid, query, state, now() - query_start AS duration
   FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC;
   \```
2. Kill long-running queries (> 60s):
   \```sql
   SELECT pg_terminate_backend(pid)
   FROM pg_stat_activity
   WHERE state = 'active' AND now() - query_start > interval '60 seconds';
   \```
3. If pool is still full, restart the application pods:
   \```bash
   kubectl rollout restart deployment/app
   \```

## Root Cause Investigation
1. Check connection pool metrics in Grafana: [Dashboard Link]
2. Look for recent deployments or config changes
3. Check for query plan regressions:
   \```sql
   SELECT * FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;
   \```

## Prevention
- Connection pool max: 20 per instance (configured in DATABASE_POOL_SIZE)
- Query timeout: 30s (configured in DATABASE_QUERY_TIMEOUT)
- Migrations must run during maintenance windows
- Monitor pool utilization with alert at 80%

## Escalation
- Primary: @backend-oncall
- Secondary: @dba-team
- Manager: @eng-manager

Anti-Patterns to Avoid

  1. ❌ Investigating before stabilizing β€” restore service first, find root cause later
  2. ❌ Going silent during an incident β€” communicate every 15 minutes minimum
  3. ❌ Making big changes during an incident β€” minimal fixes only, no refactoring
  4. ❌ Blaming individuals in post-mortems β€” focus on systems and processes
  5. ❌ No runbooks β€” every on-call should have step-by-step guides for common issues
  6. ❌ Alerts without actionable instructions β€” every alert needs a "what to do" link
  7. ❌ Alerting on symptoms only β€” add alerts for leading indicators (pool usage, disk space)
  8. ❌ No health check endpoint β€” monitoring needs a reliable signal to check
  9. ❌ Post-mortem without action items β€” a post-mortem that doesn't lead to changes is useless
  10. ❌ Not testing rollback procedures β€” if you haven't tested it, it won't work when you need it

Adapted from buildwithclaude by Dave Poon (MIT)

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents:

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.