runbook-creator

by @nik-kale in Productivity

# Install this skill:

npx skills add nik-kale/sre-skills --skill "runbook-creator"

Install specific skill from multi-skill repository

# Description

Templates and patterns for creating operational runbooks and playbooks. Use when creating runbooks, writing operational documentation, playbook creation, or documenting procedures for on-call teams.

# SKILL.md

name: runbook-creator
description: Templates and patterns for creating operational runbooks and playbooks. Use when creating runbooks, writing operational documentation, playbook creation, or documenting procedures for on-call teams.

Runbook Creator

Templates and best practices for creating effective operational runbooks.

When to Use This Skill

Creating runbooks for new services
Documenting incident response procedures
Writing operational playbooks
Standardizing on-call documentation
Automating common procedures

Runbook Principles

Actionable: Every step should be executable
Testable: Verify each step works
Current: Update when systems change
Accessible: Available during incidents (not behind VPN-only)
Linked: Referenced from alerts

Standard Runbook Template

Copy and customize this template:

# [Service Name] - [Issue Type]

## Overview
Brief description of what this runbook addresses.

**Last Updated**: YYYY-MM-DD
**Owner**: [Team/Person]
**Related Alerts**: [Alert names that link here]

## Symptoms
What indicates this issue is occurring:
- [ ] Symptom 1
- [ ] Symptom 2
- [ ] Symptom 3

## Impact
- **Users Affected**: [Description]
- **Severity**: [SEV1/SEV2/SEV3/SEV4]
- **Business Impact**: [Description]

## Prerequisites
- Access to [system/tool]
- Permissions: [required permissions]
- Tools: [required CLI tools]

## Diagnostic Steps

### Step 1: [Verify the Issue]
```bash
# Command to run
kubectl get pods -n production | grep -v Running

Expected Output: [What you should see]
If Different: [What to do]

Step 2: [Gather Information]

# Command to run
kubectl logs deployment/my-service -n production --tail=100

Look For: [What to look for in output]

Resolution Steps

Option A: [Quick Fix - e.g., Restart]

Use when: [conditions]

# Step 1: Restart the service
kubectl rollout restart deployment/my-service -n production

# Step 2: Verify pods are coming up
kubectl get pods -n production -w

Verification: [How to confirm fix worked]

Option B: [Rollback]

Use when: [conditions]

# Step 1: Check rollout history
kubectl rollout history deployment/my-service -n production

# Step 2: Rollback to previous version
kubectl rollout undo deployment/my-service -n production

Verification: [How to confirm fix worked]

Verification

How to confirm the issue is resolved:
- [ ] Error rate returned to normal
- [ ] Latency within SLO
- [ ] No related alerts firing
- [ ] User-facing functionality working

Escalation

If this runbook doesn't resolve the issue:
1. First: Contact [Team/Person] via [Slack/Phone]
2. Then: Page [Escalation contact]
3. Finally: [Further escalation path]

Revision History

Date	Author	Change
YYYY-MM-DD	Name	Initial version

## Quick Runbook Templates

### Service Restart

```markdown
# [Service] - Restart Procedure

## When to Use
- Service unresponsive
- Memory leak suspected
- After configuration change

## Steps

1. **Notify team**
   ```
   Post in #incidents: "Restarting [service] due to [reason]"
   ```

2. **Restart service**
   ```bash
   kubectl rollout restart deployment/[service] -n [namespace]
   ```

3. **Monitor rollout**
   ```bash
   kubectl rollout status deployment/[service] -n [namespace]
   ```

4. **Verify health**
   ```bash
   kubectl get pods -n [namespace] | grep [service]
   # All pods should be Running, 1/1 Ready
   ```

5. **Check metrics**
   - Error rate: [dashboard link]
   - Latency: [dashboard link]

## Rollback
If restart makes things worse:
```bash
kubectl rollout undo deployment/[service] -n [namespace]

### Database Failover

```markdown
# [Database] - Failover Procedure

## When to Use
- Primary database unresponsive
- Planned maintenance
- Primary showing errors

## Prerequisites
- Database admin access
- Verify replica is in sync

## Pre-Failover Checks

1. **Check replication status**
   ```sql
   SELECT * FROM pg_stat_replication;
   ```
   Verify: `state = 'streaming'`, lag is minimal

2. **Check replica health**
   ```bash
   pg_isready -h replica-host -p 5432
   ```

## Failover Steps

1. **Stop writes to primary** (if possible)
   ```sql
   ALTER SYSTEM SET default_transaction_read_only = on;
   SELECT pg_reload_conf();
   ```

2. **Promote replica**
   ```bash
   pg_ctl promote -D /var/lib/postgresql/data
   ```

3. **Update connection strings**
   - Update DNS/load balancer to point to new primary
   - Or update application config

4. **Verify applications reconnected**
   ```sql
   SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
   ```

## Post-Failover
- [ ] Monitor error rates
- [ ] Set up new replica from old primary
- [ ] Update documentation

Cache Clear

# [Service] - Cache Clear Procedure

## When to Use
- Stale data being served
- Cache corruption suspected
- After data migration

## Impact Assessment
- Cache clear will cause temporary latency spike
- Database load will increase temporarily

## Steps

1. **Notify team**
   ```
   Post in #incidents: "Clearing [cache] cache due to [reason]"
   ```

2. **Clear cache**

   **Redis - All keys**:
   ```bash
   redis-cli -h [host] FLUSHALL
   ```

   **Redis - Specific pattern**:
   ```bash
   redis-cli -h [host] --scan --pattern "user:*" | xargs redis-cli DEL
   ```

   **Application cache**:
   ```bash
   curl -X POST http://[service]/admin/cache/clear
   ```

3. **Monitor**
   - Watch cache hit rate recover
   - Monitor database load
   - Check latency

## Verification
- Cache hit rate returning to normal
- No errors from cache operations
- Latency stabilizing

Runbook Checklist

Before publishing a runbook, verify:

Runbook Quality Checklist:
- [ ] Title clearly describes the issue/procedure
- [ ] Symptoms section helps identify when to use
- [ ] All commands are copy-pasteable
- [ ] Expected output documented for each command
- [ ] Verification steps confirm success
- [ ] Escalation path is clear
- [ ] Links to dashboards work
- [ ] Tested by someone other than author
- [ ] Linked from relevant alerts

Automation Integration

Runbook with Automation Hooks

# [Service] - Automated Recovery

## Automatic Actions
The following actions run automatically:
1. Pod restart on OOMKilled (Kubernetes)
2. Scale-up on high CPU (HPA)

## Manual Steps (if auto-recovery fails)

### Check why auto-recovery failed
```bash
kubectl describe hpa [service] -n [namespace]
kubectl get events -n [namespace] --sort-by='.lastTimestamp'

Manual intervention

[Steps here]

### Script-Backed Runbook

```markdown
# [Service] - Diagnostic Script

## Quick Diagnosis
Run the diagnostic script:
```bash
./scripts/diagnose-service.sh [service-name]

This script checks:
- Pod status
- Recent logs
- Resource usage
- Dependency health

Interpreting Results

Result	Meaning	Action
`HEALTHY`	All checks pass	No action needed
`DEGRADED`	Some issues	Follow specific recommendations
`CRITICAL`	Major issues	Escalate immediately

## Common Runbook Categories

Every service should have runbooks for:

Essential Runbooks:
- [ ] Service restart
- [ ] Rollback deployment
- [ ] Scale up/down
- [ ] Clear cache
- [ ] Database failover (if applicable)
- [ ] Dependency failure response
- [ ] High error rate investigation
- [ ] High latency investigation
```

Additional Resources

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents:

⚡ Amp 🚀 Antigravity 🤖 Claude Code 🦀 Clawdbot 📝 Codex ▶️ Cursor 🤖 Droid 💎 Gemini CLI 🐙 GitHub Copilot 🪿 Goose 📊 Kilo Code 🔧 Kiro CLI 💻 OpenCode 🦘 Roo Code 🌲 Trae 🏄 Windsurf

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.