error-handling

by @miles-knowbl in Web & API

# Install this skill:

npx skills add miles-knowbl/orchestrator --skill "error-handling"

# Description

Implement robust error handling strategies for applications. Covers error classification, custom error hierarchies, propagation patterns, recovery mechanisms (retry, fallback, circuit breaker, bulkhead), logging and observability, user-facing messages, and boundary error handling across API, UI, and background job layers.

# SKILL.md

name: error-handling
description: "Implement robust error handling strategies for applications. Covers error classification, custom error hierarchies, propagation patterns, recovery mechanisms (retry, fallback, circuit breaker, bulkhead), logging and observability, user-facing messages, and boundary error handling across API, UI, and background job layers."
phase: IMPLEMENT
category: core
version: "2.0.0"
depends_on: ["implement"]
tags: [implementation, quality, patterns, reliability]

Error Handling

Implement robust error handling strategies that keep systems reliable and debuggable.

When to Use

- Implementing a new service or feature -- needs error classification, propagation, and recovery from the start
- Adding external integrations -- API calls, databases, message queues require retry, timeout, and fallback logic
- Hardening existing code -- swallowed errors, missing context, or silent failures need systematic remediation
- Building user-facing flows -- error messages must be safe, helpful, and actionable without leaking internals
- Designing distributed systems -- cross-service errors need correlation IDs, circuit breakers, and bulkheads
- After a production incident -- post-mortem reveals gaps in error handling, logging, or alerting
- When you say: "handle errors properly", "add error handling", "make this resilient", "improve reliability"

Reference Requirements

MUST read before applying this skill:

| Reference | Why Required |
|-----------|--------------|
| `error-classification.md` | Operational vs programmer error taxonomy and severity levels |
| `recovery-strategies.md` | Retry, fallback, circuit breaker, and bulkhead patterns with decision trees |

Read if applicable:

| Reference | When Needed |
|-----------|-------------|
| `logging-standards.md` | When establishing structured logging and log levels for errors |
| `retry-patterns.md` | When implementing retry with exponential backoff, jitter, and idempotency |
| `circuit-breaker-patterns.md` | When protecting services from cascading failures via circuit breakers and bulkheads |

Verification: Ensure every public function has explicit error handling, custom errors carry classification and context, and recovery strategies are documented for each failure mode.

Required Deliverables

| Deliverable | Location | Condition |
|-------------|----------|-----------|
| Custom error hierarchy | `src/errors/` or `src/common/errors.ts` | Always |
| Error handling middleware | `src/middleware/errorHandler.ts` | When building an API |
| Recovery configuration | `src/config/resilience.ts` | When using retry, circuit breaker, or fallback |
| Error boundary components | `src/components/ErrorBoundary.tsx` | When building a UI |
| Error logging integration | `src/lib/logger.ts` | Always |

Core Concept

Error handling answers: "What happens when things go wrong?"

Robust error handling is:
- Classified -- every error has a type, severity, and whether it is operational or a programmer bug
- Contextual -- errors carry enough information to diagnose root cause without reproducing
- Recoverable -- transient failures are retried, degraded paths exist, blast radius is contained
- Observable -- errors are logged with structured data, correlated across services, and alerted on
- User-respectful -- users see helpful guidance, never stack traces or internal codes

Error handling is NOT:
- Swallowing exceptions silently
- Logging everything at ERROR level regardless of severity
- Wrapping every line in try/catch without a strategy
- Showing raw error messages to end users
- Treating all failures identically (retry everything, fail on everything)

The Error Handling Process

+-------------------------------------------------------------+
|               ERROR HANDLING PROCESS                         |
|                                                              |
|  1. CLASSIFY ERRORS                                          |
|     +-> Operational vs programmer? Transient vs permanent?   |
|                                                              |
|  2. DEFINE ERROR HIERARCHY                                   |
|     +-> Base error class, domain errors, infrastructure      |
|                                                              |
|  3. DESIGN PROPAGATION STRATEGY                              |
|     +-> Where to catch, wrap, rethrow, or transform?         |
|                                                              |
|  4. IMPLEMENT RECOVERY PATTERNS                              |
|     +-> Retry, fallback, circuit breaker, bulkhead           |
|                                                              |
|  5. ADD LOGGING & OBSERVABILITY                              |
|     +-> Structured logs, correlation IDs, metrics, alerts    |
|                                                              |
|  6. CRAFT USER-FACING MESSAGES                               |
|     +-> Safe, helpful, actionable -- no internals leaked     |
|                                                              |
|  7. HANDLE BOUNDARY ERRORS                                   |
|     +-> API responses, UI boundaries, background job DLQs    |
+-------------------------------------------------------------+

Step 1: Classify Errors

Every error falls into one of two fundamental categories. This classification determines how you handle it.

Operational vs Programmer Errors

| Aspect | Operational Error | Programmer Error |
|--------|-------------------|------------------|
| Cause | External conditions the program cannot prevent | Bugs in the code itself |
| Examples | Network timeout, disk full, invalid user input, service unavailable | Null reference, type error, index out of bounds, assertion failure |
| Expected? | Yes -- these will happen in production | No -- these indicate defects |
| Recovery | Retry, fallback, degrade, report to user | Crash fast, fix the code, deploy a patch |
| Logging | WARN or ERROR depending on impact | ERROR or FATAL -- always investigate |
| User message | Helpful guidance ("Try again", "Contact support") | Generic ("Something went wrong") |

Transient vs Permanent Classification

| Type | Description | Action |
|------|-------------|--------|
| Transient | Temporary failure that may succeed on retry | Retry with backoff |
| Permanent | Condition that will not change without intervention | Fail immediately, notify |
| Indeterminate | Cannot determine if transient or permanent | Retry with limited attempts, then escalate |

Severity Levels

enum ErrorSeverity {
  /** Informational -- operation degraded but succeeded via fallback */
  LOW = 'low',
  /** Warning -- operation failed but system is healthy, user can retry */
  MEDIUM = 'medium',
  /** Error -- operation failed, feature unavailable, needs attention */
  HIGH = 'high',
  /** Critical -- system health impacted, cascading risk, page immediately */
  CRITICAL = 'critical',
}

Classification Decision Tree

Is the error caused by a bug in our code?
+-- Yes -> PROGRAMMER ERROR -> crash fast, log FATAL, fix code
+-- No -> OPERATIONAL ERROR
             |
             Is the error transient?
             +-- Yes -> TRANSIENT OPERATIONAL
             |          Retry with backoff (see Step 4)
             +-- No -> PERMANENT OPERATIONAL
             |          Fail, notify user, log for investigation
             +-- Unknown -> INDETERMINATE
                            Limited retry, then escalate
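
The decision tree can be sketched as a small classifier. This is an illustration under assumptions, not part of the skill's reference code: `classifyError`, the `Classification` shape, and the two code lists are hypothetical, and treating every built-in `TypeError` as a bug is a simplification (some platform APIs reject with `TypeError` on network failure).

```typescript
type Transience = 'transient' | 'permanent' | 'indeterminate';

interface Classification {
  kind: 'programmer' | 'operational';
  transience?: Transience;
  action: string;
}

// Assumed examples of Node.js-style error codes; extend these lists for your stack.
const TRANSIENT_CODES = new Set(['ECONNRESET', 'ETIMEDOUT', 'EAI_AGAIN']);
const PERMANENT_CODES = new Set(['ENOENT', 'EACCES']);

function classifyError(error: unknown): Classification {
  // Built-in bug-style errors are treated as programmer errors (a simplification)
  if (error instanceof TypeError || error instanceof RangeError || error instanceof ReferenceError) {
    return { kind: 'programmer', action: 'crash fast, log FATAL, fix code' };
  }

  // Everything else is operational; the optional `code` field decides transience
  const code = (error as { code?: unknown } | null | undefined)?.code;
  if (typeof code === 'string' && TRANSIENT_CODES.has(code)) {
    return { kind: 'operational', transience: 'transient', action: 'retry with backoff' };
  }
  if (typeof code === 'string' && PERMANENT_CODES.has(code)) {
    return { kind: 'operational', transience: 'permanent', action: 'fail, notify user, log for investigation' };
  }
  return { kind: 'operational', transience: 'indeterminate', action: 'limited retry, then escalate' };
}
```

In a codebase that uses the hierarchy from Step 2, the first branch would more likely check `isOperational` on the custom error than inspect built-in types.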

Step 2: Define Error Hierarchy

Build a custom error hierarchy that carries classification, context, and operational metadata.

Base Error Class

// src/errors/AppError.ts

export interface ErrorContext {
  [key: string]: unknown;
}

export abstract class AppError extends Error {
  /** Machine-readable error code (e.g., 'PAYMENT_FAILED', 'USER_NOT_FOUND') */
  abstract readonly code: string;

  /** HTTP status code for API responses */
  abstract readonly statusCode: number;

  /** Is this an operational error (expected) vs a programmer error (bug)? */
  abstract readonly isOperational: boolean;

  /** Severity level for logging and alerting */
  readonly severity: ErrorSeverity;

  /** Structured context for debugging -- never exposed to users */
  readonly context: ErrorContext;

  /** Correlation ID for distributed tracing */
  readonly correlationId?: string;

  /** Original error that caused this one */
  readonly cause?: Error;

  constructor(
    message: string,
    options: {
      severity?: ErrorSeverity;
      context?: ErrorContext;
      correlationId?: string;
      cause?: Error;
    } = {},
  ) {
    super(message);
    this.name = this.constructor.name;
    this.severity = options.severity ?? ErrorSeverity.MEDIUM;
    this.context = options.context ?? {};
    this.correlationId = options.correlationId;
    this.cause = options.cause;

    // Capture proper stack trace (V8 engines)
    if (Error.captureStackTrace) {
      Error.captureStackTrace(this, this.constructor);
    }
  }

  /** Serialize for logging -- includes all diagnostic data */
  toLogObject(): Record<string, unknown> {
    return {
      name: this.name,
      code: this.code,
      message: this.message,
      severity: this.severity,
      isOperational: this.isOperational,
      statusCode: this.statusCode,
      context: this.context,
      correlationId: this.correlationId,
      cause: this.cause?.message,
      stack: this.stack,
    };
  }

  /** Serialize for API responses -- safe for external consumption */
  toApiResponse(): { error: { code: string; message: string } } {
    return {
      error: {
        code: this.code,
        message: this.isOperational ? this.message : 'An unexpected error occurred',
      },
    };
  }
}

Domain Error Classes

// src/errors/domain.ts

/** Validation errors -- bad user input, schema violations */
export class ValidationError extends AppError {
  readonly code = 'VALIDATION_ERROR';
  readonly statusCode = 400;
  readonly isOperational = true;

  constructor(
    message: string,
    public readonly fieldErrors: Record<string, string[]> = {},
    options?: { context?: ErrorContext; correlationId?: string },
  ) {
    super(message, { severity: ErrorSeverity.LOW, ...options });
  }

  toApiResponse() {
    return {
      error: {
        code: this.code,
        message: this.message,
        fields: this.fieldErrors,
      },
    };
  }
}

/** Resource not found */
export class NotFoundError extends AppError {
  readonly code = 'NOT_FOUND';
  readonly statusCode = 404;
  readonly isOperational = true;

  constructor(resource: string, identifier: string, options?: { correlationId?: string }) {
    super(`${resource} not found: ${identifier}`, {
      severity: ErrorSeverity.LOW,
      context: { resource, identifier },
      ...options,
    });
  }
}

/** Authorization / permission denied */
export class ForbiddenError extends AppError {
  readonly code = 'FORBIDDEN';
  readonly statusCode = 403;
  readonly isOperational = true;

  constructor(message = 'You do not have permission to perform this action', options?: { context?: ErrorContext }) {
    super(message, { severity: ErrorSeverity.MEDIUM, ...options });
  }
}

/** Business rule violation */
export class BusinessRuleError extends AppError {
  readonly code: string;
  readonly statusCode = 422;
  readonly isOperational = true;

  constructor(
    code: string,
    message: string,
    options?: { context?: ErrorContext; correlationId?: string },
  ) {
    super(message, { severity: ErrorSeverity.MEDIUM, ...options });
    this.code = code;
  }
}

/** Conflict -- duplicate, stale data, optimistic lock failure */
export class ConflictError extends AppError {
  readonly code = 'CONFLICT';
  readonly statusCode = 409;
  readonly isOperational = true;

  constructor(message: string, options?: { context?: ErrorContext }) {
    super(message, { severity: ErrorSeverity.MEDIUM, ...options });
  }
}

Infrastructure Error Classes

// src/errors/infrastructure.ts

/** External service failure -- API call, database, message queue */
export class ExternalServiceError extends AppError {
  readonly code = 'EXTERNAL_SERVICE_ERROR';
  readonly statusCode = 502;
  readonly isOperational = true;

  constructor(
    public readonly serviceName: string,
    message: string,
    options?: { cause?: Error; context?: ErrorContext; correlationId?: string },
  ) {
    super(message, { severity: ErrorSeverity.HIGH, ...options });
  }
}

/** Timeout -- operation exceeded time limit */
export class TimeoutError extends AppError {
  readonly code = 'TIMEOUT';
  readonly statusCode = 504;
  readonly isOperational = true;

  constructor(
    operation: string,
    timeoutMs: number,
    options?: { cause?: Error; correlationId?: string },
  ) {
    super(`Operation '${operation}' timed out after ${timeoutMs}ms`, {
      severity: ErrorSeverity.HIGH,
      context: { operation, timeoutMs },
      ...options,
    });
  }
}

/** Rate limited -- too many requests to external resource */
export class RateLimitError extends AppError {
  readonly code = 'RATE_LIMITED';
  readonly statusCode = 429;
  readonly isOperational = true;

  constructor(
    public readonly retryAfterMs: number,
    options?: { context?: ErrorContext },
  ) {
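    super(`Rate limited. Retry after ${retryAfterMs}ms`, {
      severity: ErrorSeverity.MEDIUM,
      ...options,
      // Merge instead of overwrite so retryAfterMs survives a caller-supplied context
      context: { ...options?.context, retryAfterMs },
    });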
  }
}

/** Internal / programmer error -- should never reach the user */
export class InternalError extends AppError {
  readonly code = 'INTERNAL_ERROR';
  readonly statusCode = 500;
  readonly isOperational = false;

  constructor(message: string, options?: { cause?: Error; context?: ErrorContext }) {
    super(message, { severity: ErrorSeverity.CRITICAL, ...options });
  }
}

Hierarchy Checklist

- [ ] Base AppError class with code, statusCode, isOperational, severity, context
- [ ] toLogObject() for structured logging (includes all diagnostic data)
- [ ] toApiResponse() for safe external serialization (no internals leaked)
- [ ] Domain errors: ValidationError, NotFoundError, ForbiddenError, BusinessRuleError, ConflictError
- [ ] Infrastructure errors: ExternalServiceError, TimeoutError, RateLimitError, InternalError
- [ ] All errors carry correlationId for distributed tracing
- [ ] All errors preserve cause chain for root cause analysis
- [ ] Error codes are UPPER_SNAKE_CASE strings, stable across versions
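
`toLogObject()` above records only the immediate cause's message. When errors are wrapped at several layers, walking the whole chain can help root cause analysis. A minimal sketch, assuming each error exposes an optional `cause`; `causeChain` and the sample objects are hypothetical names for illustration:

```typescript
// Minimal shape shared by AppError and built-in errors that carry a cause
interface WithCause {
  message: string;
  cause?: unknown;
}

/** Collect messages from the outermost error down to the root cause. */
function causeChain(error: WithCause, maxDepth = 10): string[] {
  const messages: string[] = [];
  let current: unknown = error;
  // maxDepth guards against an accidental cycle in the chain
  for (let depth = 0; depth < maxDepth && current != null; depth++) {
    const node = current as WithCause;
    messages.push(String(node.message));
    current = node.cause;
  }
  return messages;
}

// Plain objects stand in for real error instances here
const driverError = { message: 'connect ECONNREFUSED' };
const repoError = { message: 'Failed to fetch order 42', cause: driverError };
const checkoutError = { message: 'Checkout failed', cause: repoError };

console.log(causeChain(checkoutError).join(' <- '));
// Checkout failed <- Failed to fetch order 42 <- connect ECONNREFUSED
```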

Step 3: Design Propagation Strategy

Error propagation determines where errors are caught, where they are transformed, and where they surface.

The Propagation Principle

+-------------------------------------------------------------+
|               ERROR PROPAGATION LAYERS                       |
|                                                              |
|   Layer          Action            Why                       |
|   ------         ------            ---                       |
|   Origin         Throw with        Accurate context          |
|                  full context      at point of failure        |
|                                                              |
|   Service        Catch + wrap      Add business context,     |
|                  if crossing       translate infrastructure   |
|                  boundary          errors into domain errors  |
|                                                              |
|   Controller/    Catch + format    Transform into API         |
|   Handler        for consumer      response or UI state       |
|                                                              |
|   Global         Catch-all         Safety net for             |
|   Handler        for uncaught      unhandled errors           |
+-------------------------------------------------------------+

Rules of Propagation

| Rule | Description |
|------|-------------|
| Catch at the right level | Catch where you can do something useful -- recover, translate, or report |
| Never swallow silently | Every catch block must log, rethrow, or return an error value |
| Wrap at boundaries | When crossing a layer boundary, wrap with context from the new layer |
| Preserve the cause chain | Always pass the original error as `cause` when wrapping |
| Transform, do not expose | Infrastructure errors become domain errors at the service boundary |
| Let programmer errors crash | Programmer errors should propagate to the global handler and crash/restart |

Propagation in Practice

// ORIGIN: Repository throws with infrastructure context
class OrderRepository {
  async findById(id: string): Promise<Order | null> {
    try {
      return await db.selectFrom('orders').where('id', '=', id).executeTakeFirst();
    } catch (error) {
      throw new ExternalServiceError('database', `Failed to fetch order ${id}`, {
        cause: error as Error,
        context: { orderId: id, operation: 'findById' },
      });
    }
  }
}

// SERVICE: Applies domain rules; infrastructure errors from the repository pass through unchanged
class OrderService {
  async getOrder(id: string, requestingUserId: string): Promise<Order> {
    const order = await this.orderRepo.findById(id);
    // ExternalServiceError from repo propagates up -- intentional

    if (!order) {
      throw new NotFoundError('Order', id);
    }

    if (order.userId !== requestingUserId) {
      throw new ForbiddenError('You can only view your own orders', {
        context: { orderId: id, requestingUserId },
      });
    }

    return order;
  }
}

// CONTROLLER: Formats for API consumer
class OrderController {
  getById = async (req: Request, res: Response, next: NextFunction) => {
    try {
      const order = await this.orderService.getOrder(req.params.id, req.user.id);
      res.json({ data: order });
    } catch (error) {
      next(error); // Delegate to global error handler
    }
  };
}

// GLOBAL HANDLER: Safety net (see Step 7: Boundary Error Handling)

Anti-Patterns to Avoid

// ANTI-PATTERN 1: Swallowing errors
try {
  await sendEmail(user.email);
} catch (error) {
  // Silently ignored -- email failures are invisible
}

// FIX: Log even if not rethrowing
try {
  await sendEmail(user.email);
} catch (error) {
  // Log an identifier rather than the address itself (PII); `user.id` is assumed to exist
  logger.warn('Email send failed', { userId: user.id, error: (error as Error).message });
  // Intentionally not rethrowing -- email is non-critical
}

// ANTI-PATTERN 2: Losing context
try {
  await externalApi.call(payload);
} catch (error) {
  throw new Error('API call failed'); // Original error lost
}

// FIX: Preserve cause chain
try {
  await externalApi.call(payload);
} catch (error) {
  throw new ExternalServiceError('payment-api', 'Payment processing failed', {
    cause: error as Error,
    context: { payload },
  });
}

// ANTI-PATTERN 3: Catching too broadly
try {
  const data = JSON.parse(untrustedInput); // Could throw SyntaxError
  const result = processData(data);          // Could throw BusinessRuleError
  await saveResult(result);                  // Could throw ExternalServiceError
} catch (error) {
  // Which operation failed? Cannot tell.
  res.status(500).json({ error: 'Something went wrong' });
}

// FIX: Granular try/catch or let errors propagate to appropriate handler
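
The fix for anti-pattern 3 is stated above without code. One way to read it, as a sketch under assumptions: wrap only the step whose failure you can translate, and let everything else propagate. The `ValidationError` here is a simplified stand-in for the Step 2 class, and `parseJsonInput` and `handleInput` are hypothetical helpers.

```typescript
// Simplified stand-in for the ValidationError class from Step 2
class ValidationError extends Error {
  readonly code = 'VALIDATION_ERROR';
  constructor(message: string, readonly causedBy?: unknown) {
    super(message);
    this.name = 'ValidationError';
    // Keeps instanceof working when compiling to older targets
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Only the parse step is wrapped, so a failure here is unambiguous
function parseJsonInput(untrustedInput: string): unknown {
  try {
    return JSON.parse(untrustedInput);
  } catch (error) {
    throw new ValidationError('Request body is not valid JSON', error);
  }
}

// processData and saveResult are placeholders supplied by the caller
async function handleInput<R>(
  untrustedInput: string,
  processData: (data: unknown) => R,
  saveResult: (result: R) => Promise<void>,
): Promise<void> {
  const data = parseJsonInput(untrustedInput); // ValidationError on bad JSON
  const result = processData(data);            // domain errors propagate unchanged
  await saveResult(result);                    // infrastructure errors propagate unchanged
}
```

Each error type then reaches the global handler intact, which can pick the status code and message from the error itself.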

Step 4: Implement Recovery Patterns

Recovery patterns turn transient failures into successful operations and contain blast radius for permanent failures.

Pattern Selection Guide

| Situation | Pattern | When |
|-----------|---------|------|
| Transient external failure | Retry with backoff | Network blip, temporary overload |
| Repeated failures to same service | Circuit breaker | Downstream service degraded |
| Primary path unavailable | Fallback | Cache stale data, use default, degrade feature |
| Protecting shared resources | Bulkhead | Isolate failures to prevent cascade |
| Nonessential operation | Fire and forget | Analytics, logging, notifications |
| Long-running operation | Timeout | Prevent indefinite blocking |

Retry with Exponential Backoff

// src/lib/retry.ts

export interface RetryOptions {
  /** Maximum number of attempts (including the first) */
  maxAttempts: number;
  /** Base delay in milliseconds */
  baseDelayMs: number;
  /** Maximum delay cap in milliseconds */
  maxDelayMs: number;
  /** Multiplier for exponential backoff (default: 2) */
  backoffMultiplier?: number;
  /** Add random jitter to prevent thundering herd (default: true) */
  jitter?: boolean;
  /** Predicate to decide if the error is retryable */
  isRetryable?: (error: Error) => boolean;
  /** Called before each retry with attempt number and delay */
  onRetry?: (attempt: number, delayMs: number, error: Error) => void;
}

export async function withRetry<T>(
  fn: () => Promise<T>,
  options: RetryOptions,
): Promise<T> {
  const {
    maxAttempts,
    baseDelayMs,
    maxDelayMs,
    backoffMultiplier = 2,
    jitter = true,
    isRetryable = () => true,
    onRetry,
  } = options;

  let lastError: Error | undefined;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error as Error;

      if (attempt === maxAttempts || !isRetryable(lastError)) {
        throw lastError;
      }

      // Exponential backoff: baseDelay * multiplier^(attempt-1)
      let delay = Math.min(
        baseDelayMs * Math.pow(backoffMultiplier, attempt - 1),
        maxDelayMs,
      );

      // Add jitter: randomize between 0 and computed delay
      if (jitter) {
        delay = Math.floor(Math.random() * delay);
      }

      onRetry?.(attempt, delay, lastError);
      await sleep(delay);
    }
  }

  throw lastError!;
}

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Usage (retrying a charge is only safe if the call is idempotent; see retry-patterns.md)
const result = await withRetry(
  () => paymentApi.charge(userId, amountCents),
  {
    maxAttempts: 3,
    baseDelayMs: 500,
    maxDelayMs: 5000,
    isRetryable: (error) =>
      error instanceof ExternalServiceError ||
      error instanceof TimeoutError,
    onRetry: (attempt, delay, error) => {
      logger.warn('Retrying payment charge', { attempt, delay, error: error.message });
    },
  },
);
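
To make the backoff arithmetic concrete, the delay caps used by `withRetry` can be computed on their own. This is a sketch for illustration; `backoffSchedule` is a hypothetical helper that mirrors the formula `min(baseDelay * multiplier^(attempt-1), maxDelay)` with jitter left out.

```typescript
/** Upper bound (ms) of the wait before each retry, ignoring jitter. */
function backoffSchedule(
  maxAttempts: number,
  baseDelayMs: number,
  maxDelayMs: number,
  backoffMultiplier = 2,
): number[] {
  const delays: number[] = [];
  // No wait follows the final attempt, so N attempts produce N - 1 delays
  for (let attempt = 1; attempt < maxAttempts; attempt++) {
    delays.push(Math.min(baseDelayMs * Math.pow(backoffMultiplier, attempt - 1), maxDelayMs));
  }
  return delays;
}

console.log(backoffSchedule(3, 500, 5000)); // [ 500, 1000 ]
console.log(backoffSchedule(6, 500, 5000)); // [ 500, 1000, 2000, 4000, 5000 ]
```

With jitter enabled, each actual wait is drawn uniformly between 0 and the value shown, so the usage example above waits at most 1.5 seconds in total across its two retries.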

Circuit Breaker

// src/lib/circuitBreaker.ts

enum CircuitState {
  CLOSED = 'closed',       // Normal operation -- requests pass through
  OPEN = 'open',           // Failing -- requests rejected immediately
  HALF_OPEN = 'half_open', // Testing -- requests pass through as probes; any counted failure reopens
}

export interface CircuitBreakerOptions {
  /** Number of failures before opening the circuit */
  failureThreshold: number;
  /** Time in ms to wait before moving from OPEN to HALF_OPEN */
  resetTimeoutMs: number;
  /** Number of successes in HALF_OPEN before closing */
  successThreshold: number;
  /** Name for logging and metrics */
  name: string;
  /** Predicate to classify which errors count as failures */
  isFailure?: (error: Error) => boolean;
}

export class CircuitBreaker {
  private state: CircuitState = CircuitState.CLOSED;
  private failureCount = 0;
  private successCount = 0;
  private lastFailureTime = 0;

  constructor(private readonly options: CircuitBreakerOptions) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === CircuitState.OPEN) {
      if (Date.now() - this.lastFailureTime >= this.options.resetTimeoutMs) {
        this.state = CircuitState.HALF_OPEN;
        this.successCount = 0;
        logger.info(`Circuit ${this.options.name}: OPEN -> HALF_OPEN`);
      } else {
        throw new ExternalServiceError(
          this.options.name,
          `Circuit breaker is open for ${this.options.name}`,
        );
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      const shouldCountFailure = this.options.isFailure?.(error as Error) ?? true;
      if (shouldCountFailure) {
        this.onFailure();
      }
      throw error;
    }
  }

  private onSuccess(): void {
    if (this.state === CircuitState.HALF_OPEN) {
      this.successCount++;
      if (this.successCount >= this.options.successThreshold) {
        this.state = CircuitState.CLOSED;
        this.failureCount = 0;
        logger.info(`Circuit ${this.options.name}: HALF_OPEN -> CLOSED`);
      }
    } else {
      this.failureCount = 0;
    }
  }

  private onFailure(): void {
    this.failureCount++;
    this.lastFailureTime = Date.now();

    if (
      this.failureCount >= this.options.failureThreshold ||
      this.state === CircuitState.HALF_OPEN
    ) {
      this.state = CircuitState.OPEN;
      logger.warn(`Circuit ${this.options.name}: -> OPEN after ${this.failureCount} failures`);
    }
  }

  getState(): CircuitState {
    return this.state;
  }
}

// Usage
const paymentCircuit = new CircuitBreaker({
  name: 'payment-service',
  failureThreshold: 5,
  resetTimeoutMs: 30_000,
  successThreshold: 2,
  isFailure: (error) => error instanceof ExternalServiceError,
});

const result = await paymentCircuit.execute(() => paymentApi.charge(userId, amount));
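
The state changes in `CircuitBreaker` can be easier to follow as a pure transition function. The sketch below restates the same rules without timers or I/O. `transition`, `BreakerSnapshot`, and the small thresholds are hypothetical and chosen only to keep the trace short.

```typescript
type CircuitState = 'closed' | 'open' | 'half_open';
type CircuitEvent = 'success' | 'failure' | 'reset_timeout_elapsed';

interface BreakerSnapshot {
  state: CircuitState;
  failureCount: number;
  successCount: number;
}

// Illustrative thresholds, much lower than you would use in production
const FAILURE_THRESHOLD = 2;
const SUCCESS_THRESHOLD = 1;

function transition(prev: BreakerSnapshot, event: CircuitEvent): BreakerSnapshot {
  switch (event) {
    case 'reset_timeout_elapsed':
      // Only an open circuit moves to half-open, and it counts successes afresh
      return prev.state === 'open' ? { ...prev, state: 'half_open', successCount: 0 } : prev;
    case 'success': {
      if (prev.state === 'half_open') {
        const successCount = prev.successCount + 1;
        return successCount >= SUCCESS_THRESHOLD
          ? { state: 'closed', failureCount: 0, successCount }
          : { ...prev, successCount };
      }
      // A success while closed clears the consecutive-failure count
      return prev.state === 'closed' ? { ...prev, failureCount: 0 } : prev;
    }
    case 'failure': {
      if (prev.state === 'open') return prev; // open circuits reject before calling the service
      const failureCount = prev.failureCount + 1;
      const shouldOpen = failureCount >= FAILURE_THRESHOLD || prev.state === 'half_open';
      return { ...prev, failureCount, state: shouldOpen ? 'open' : prev.state };
    }
  }
}

const circuitEvents: CircuitEvent[] = ['failure', 'failure', 'reset_timeout_elapsed', 'success'];
let snapshot: BreakerSnapshot = { state: 'closed', failureCount: 0, successCount: 0 };
const stateTrace = circuitEvents.map((event) => (snapshot = transition(snapshot, event)).state);
console.log(stateTrace.join(' -> ')); // closed -> open -> half_open -> closed
```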

Fallback Pattern

// src/lib/fallback.ts

export async function withFallback<T>(
  primary: () => Promise<T>,
  fallback: () => Promise<T>,
  options?: { onFallback?: (error: Error) => void },
): Promise<T> {
  try {
    return await primary();
  } catch (error) {
    options?.onFallback?.(error as Error);
    return await fallback();
  }
}

// Usage: Cache fallback when API is down
const products = await withFallback(
  () => productApi.listFeatured(),
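  // Await before applying ??: a pending Promise is never nullish, so the default would never apply
  async () => (await cache.get<Product[]>('featured-products')) ?? [],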
  {
    onFallback: (error) => {
      logger.warn('Product API unavailable, serving cached data', {
        error: error.message,
      });
    },
  },
);

Bulkhead Pattern

// src/lib/bulkhead.ts

export class Bulkhead {
  private activeCount = 0;
  private queue: Array<{ resolve: () => void; reject: (err: Error) => void }> = [];

  constructor(
    private readonly maxConcurrent: number,
    private readonly maxQueue: number,
    private readonly name: string,
  ) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.activeCount >= this.maxConcurrent) {
      if (this.queue.length >= this.maxQueue) {
        throw new ExternalServiceError(
          this.name,
          `Bulkhead ${this.name} at capacity: ${this.maxConcurrent} active, ${this.maxQueue} queued`,
        );
      }
      // Wait for a slot. The finishing call hands its slot over directly,
      // so activeCount is not incremented again on this path.
      await new Promise<void>((resolve, reject) => {
        this.queue.push({ resolve, reject });
      });
    } else {
      this.activeCount++;
    }

    try {
      return await fn();
    } finally {
      const next = this.queue.shift();
      if (next) {
        // Transfer the slot to the next waiter so a late arrival cannot slip in between
        next.resolve();
      } else {
        this.activeCount--;
      }
    }
  }
}

// Usage: Isolate payment processing from order queries
const paymentBulkhead = new Bulkhead(10, 50, 'payment');
const orderQueryBulkhead = new Bulkhead(50, 200, 'order-query');

Recovery Checklist

- [ ] Every external call has a timeout configured
- [ ] Transient failures use retry with exponential backoff and jitter
- [ ] Retry logic has maxAttempts cap (never retry indefinitely)
- [ ] Only retryable errors trigger retry (validation errors do not)
- [ ] Critical external services use circuit breakers
- [ ] Fallback paths exist for non-critical features
- [ ] Shared resources are protected by bulkheads
- [ ] Recovery events are logged for observability
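
The first checklist item, a timeout on every external call, has no implementation in this step. A minimal sketch of one approach, racing the operation against a timer: `withTimeout` is a hypothetical helper, and the `TimeoutError` below is a simplified stand-in for the Step 2 class.

```typescript
// Simplified stand-in for the TimeoutError class from Step 2
class TimeoutError extends Error {
  readonly code = 'TIMEOUT';
  constructor(readonly operation: string, readonly timeoutMs: number) {
    super(`Operation '${operation}' timed out after ${timeoutMs}ms`);
    this.name = 'TimeoutError';
    // Keeps instanceof working when compiling to older targets
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

/** Reject with TimeoutError if `fn` has not settled within `timeoutMs`. */
async function withTimeout<T>(
  operation: string,
  timeoutMs: number,
  fn: () => Promise<T>,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(operation, timeoutMs)), timeoutMs);
  });
  try {
    // Whichever settles first wins; the losing promise is ignored, not cancelled
    return await Promise.race([fn(), timeout]);
  } finally {
    clearTimeout(timer); // do not leave a pending timer once the race is decided
  }
}
```

Note that the underlying operation keeps running after the timeout fires. Actually cancelling it needs separate support, such as passing an `AbortSignal` to APIs that accept one.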

Step 5: Add Logging and Observability

Errors that are not logged are invisible. Errors without context are useless.

Structured Error Logging

// src/lib/logger.ts

interface LogContext {
  correlationId?: string;
  userId?: string;
  operation?: string;
  duration?: number;
  [key: string]: unknown;
}

class Logger {
  error(message: string, error: Error, context?: LogContext): void {
    const logEntry = {
      level: 'error',
      timestamp: new Date().toISOString(),
      message,
      error: {
        name: error.name,
        message: error.message,
        stack: error.stack,
        ...(error instanceof AppError ? error.toLogObject() : {}),
      },
      ...context,
    };
    console.error(JSON.stringify(logEntry));
  }

  warn(message: string, context?: LogContext): void {
    const logEntry = {
      level: 'warn',
      timestamp: new Date().toISOString(),
      message,
      ...context,
    };
    console.warn(JSON.stringify(logEntry));
  }

  info(message: string, context?: LogContext): void {
    const logEntry = {
      level: 'info',
      timestamp: new Date().toISOString(),
      message,
      ...context,
    };
    console.info(JSON.stringify(logEntry));
  }
}

export const logger = new Logger();

Log Level Guidelines

| Level | When | Example |
|-------|------|---------|
| DEBUG | Diagnostic detail for development | "Retrying request, attempt 2 of 3" |
| INFO | Normal operational events | "Order created", "Payment processed" |
| WARN | Recoverable issues, degraded operation | "Using cached data, API unavailable" |
| ERROR | Operation failed, needs attention | "Payment charge failed after 3 retries" |
| FATAL | System cannot continue | "Database connection pool exhausted" |
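
One possible way to connect the severity levels from Step 1 to these log levels is a small mapping function. This is an assumption rather than something the skill prescribes: `logLevelFor` is hypothetical, and the low-to-info choice follows the enum comment ("Informational"); some teams may prefer `warn` there.

```typescript
type Severity = 'low' | 'medium' | 'high' | 'critical';
type LogLevel = 'debug' | 'info' | 'warn' | 'error' | 'fatal';

function logLevelFor(severity: Severity, isOperational: boolean): LogLevel {
  // Programmer errors are always at least ERROR so they get investigated
  if (!isOperational) {
    return severity === 'critical' ? 'fatal' : 'error';
  }
  switch (severity) {
    case 'low':
      return 'info';
    case 'medium':
      return 'warn';
    case 'high':
      return 'error';
    case 'critical':
      return 'fatal';
  }
}
```

Centralizing the choice keeps "not everything is ERROR" enforceable in one place instead of at every call site.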

Correlation IDs

// src/middleware/correlationId.ts
import { randomUUID } from 'crypto';
import { AsyncLocalStorage } from 'async_hooks';

const correlationStorage = new AsyncLocalStorage<string>();

export function correlationMiddleware(req: Request, _res: Response, next: NextFunction): void {
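  // Header values can be string | string[] | undefined; take the first value if repeated
  const header = req.headers['x-correlation-id'];
  const correlationId = (Array.isArray(header) ? header[0] : header) ?? randomUUID();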
  req.correlationId = correlationId;
  correlationStorage.run(correlationId, () => next());
}

export function getCorrelationId(): string | undefined {
  return correlationStorage.getStore();
}

// Attach to outgoing requests
export function withCorrelation(headers: Record<string, string>): Record<string, string> {
  const id = getCorrelationId();
  return id ? { ...headers, 'x-correlation-id': id } : headers;
}

Error Metrics

Track these metrics for alerting and dashboards:

| Metric | Type | Purpose |
|--------|------|---------|
| `error_total{code, severity}` | Counter | Total errors by type |
| `error_rate_5m{service}` | Gauge | Error rate over 5-minute window |
| `circuit_breaker_state{name}` | Gauge | Current state (0=closed, 1=half_open, 2=open) |
| `retry_total{operation, outcome}` | Counter | Retry attempts and outcomes |
| `recovery_fallback_total{name}` | Counter | How often fallback paths are used |
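
As an illustration of what a labelled counter such as `error_total{code, severity}` tracks, here is a tiny in-memory version. `LabelledCounter` is a hypothetical stand-in. A real service would use its metrics client instead.

```typescript
/** In-memory labelled counter; a stand-in for a real metrics client. */
class LabelledCounter {
  private readonly counts = new Map<string, number>();

  constructor(readonly metricName: string) {}

  inc(labels: Record<string, string>): void {
    const key = this.keyFor(labels);
    this.counts.set(key, (this.counts.get(key) ?? 0) + 1);
  }

  get(labels: Record<string, string>): number {
    return this.counts.get(this.keyFor(labels)) ?? 0;
  }

  // Sort label names so {code, severity} and {severity, code} address the same series
  private keyFor(labels: Record<string, string>): string {
    const parts = Object.keys(labels)
      .sort()
      .map((labelName) => `${labelName}="${labels[labelName]}"`);
    return `${this.metricName}{${parts.join(',')}}`;
  }
}

const errorTotal = new LabelledCounter('error_total');
errorTotal.inc({ code: 'TIMEOUT', severity: 'high' });
errorTotal.inc({ severity: 'high', code: 'TIMEOUT' });
console.log(errorTotal.get({ code: 'TIMEOUT', severity: 'high' })); // 2
```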

Observability Checklist

- [ ] All errors logged with structured JSON
- [ ] Correlation IDs flow through all service calls
- [ ] Log levels match severity (not everything is ERROR)
- [ ] Sensitive data (passwords, tokens, PII) stripped from logs
- [ ] Error metrics exported for dashboards and alerting
- [ ] Alert thresholds defined for error rate spikes
- [ ] Circuit breaker state changes logged
- [ ] Retry attempts and outcomes logged
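
For the checklist item on stripping sensitive data, a redaction pass can run on log context before it is serialized. A minimal sketch, assuming plain JSON-like context objects; `redact` and the key list are hypothetical and would need tuning per codebase.

```typescript
// Assumed list of sensitive key names, compared case-insensitively
const SENSITIVE_KEYS = new Set(['password', 'token', 'authorization', 'email', 'cardnumber']);
const MAX_DEPTH = 5;

/** Return a copy of `value` with sensitive fields replaced; input is not mutated. */
function redact(value: unknown, depth = 0): unknown {
  if (value === null || typeof value !== 'object') return value;
  // Stop descending rather than risk logging deeply nested secrets unredacted
  if (depth >= MAX_DEPTH) return '[TRUNCATED]';
  if (Array.isArray(value)) return value.map((item) => redact(item, depth + 1));

  const source = value as Record<string, unknown>;
  const result: Record<string, unknown> = {};
  for (const key of Object.keys(source)) {
    result[key] = SENSITIVE_KEYS.has(key.toLowerCase())
      ? '[REDACTED]'
      : redact(source[key], depth + 1);
  }
  return result;
}

console.log(JSON.stringify(redact({ userId: 'u1', password: 'hunter2', headers: { Authorization: 'Bearer abc' } })));
// {"userId":"u1","password":"[REDACTED]","headers":{"Authorization":"[REDACTED]"}}
```

Key-name matching only catches fields you anticipated. Secrets embedded in free-text messages need a different approach.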

Step 6: Craft User-Facing Messages

Users deserve helpful error messages. Internal details must never leak.

Message Guidelines

| Principle | Bad | Good |
|-----------|-----|------|
| Be specific | "An error occurred" | "We could not process your payment" |
| Be actionable | "Error 500" | "Please try again in a few minutes" |
| No internals | "NullReferenceException at line 42" | "Something went wrong on our end" |
| No blame | "You entered an invalid email" | "Please enter a valid email address" |
| Suggest next steps | "Payment failed" | "Payment failed. Please check your card details or try a different payment method." |

Error Message Map

// src/errors/userMessages.ts

const USER_MESSAGES: Record<string, { title: string; message: string; action?: string }> = {
  VALIDATION_ERROR: {
    title: 'Invalid input',
    message: 'Please check the highlighted fields and try again.',
  },
  NOT_FOUND: {
    title: 'Not found',
    message: 'The item you are looking for does not exist or has been removed.',
  },
  FORBIDDEN: {
    title: 'Access denied',
    message: 'You do not have permission to perform this action.',
    action: 'Contact your administrator if you believe this is an error.',
  },
  PAYMENT_FAILED: {
    title: 'Payment failed',
    message: 'We could not process your payment.',
    action: 'Please check your card details or try a different payment method.',
  },
  RATE_LIMITED: {
    title: 'Too many requests',
    message: 'You are making requests too quickly.',
    action: 'Please wait a moment and try again.',
  },
  EXTERNAL_SERVICE_ERROR: {
    title: 'Service temporarily unavailable',
    message: 'One of our services is experiencing issues.',
    action: 'Please try again in a few minutes.',
  },
  TIMEOUT: {
    title: 'Request timed out',
    message: 'The operation took too long to complete.',
    action: 'Please try again. If the problem persists, contact support.',
  },
  INTERNAL_ERROR: {
    title: 'Something went wrong',
    message: 'An unexpected error occurred on our end.',
    action: 'Please try again. If the problem persists, contact support.',
  },
};

export function getUserMessage(code: string): { title: string; message: string; action?: string } {
  return USER_MESSAGES[code] ?? USER_MESSAGES['INTERNAL_ERROR'];
}

API Error Response Format

// Standard error response shape
interface ApiErrorResponse {
  error: {
    code: string;            // Machine-readable: 'VALIDATION_ERROR'
    message: string;         // User-safe message
    action?: string;         // Suggested next step
    fields?: Record<string, string[]>;  // Field-level validation errors
    requestId?: string;      // Correlation ID for support reference
  };
}

// Example response:
// HTTP 422
// {
//   "error": {
//     "code": "VALIDATION_ERROR",
//     "message": "Please check the highlighted fields and try again.",
//     "fields": {
//       "email": ["Must be a valid email address"],
//       "quantity": ["Must be between 1 and 9999"]
//     },
//     "requestId": "req_abc123"
//   }
// }
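
One way to keep every handler producing this shape is to route all responses through a single builder. The sketch below is an assumption-laden illustration: `buildErrorResponse` is a hypothetical name, and the two-entry message table is a trimmed stand-in for the full message map.

```typescript
export interface ApiErrorResponse {
  error: {
    code: string;
    message: string;
    action?: string;
    fields?: Record<string, string[]>;
    requestId?: string;
  };
}

interface UserMessage {
  message: string;
  action?: string;
}

// Trimmed stand-in for the full message map (hypothetical subset)
const MESSAGES = new Map<string, UserMessage>([
  ['VALIDATION_ERROR', { message: 'Please check the highlighted fields and try again.' }],
]);

const INTERNAL_FALLBACK: UserMessage = {
  message: 'An unexpected error occurred on our end.',
  action: 'Please try again. If the problem persists, contact support.',
};

export function buildErrorResponse(
  code: string,
  requestId?: string,
  fields?: Record<string, string[]>,
): ApiErrorResponse {
  // Unknown codes fall back to the generic internal-error wording
  const { message, action } = MESSAGES.get(code) ?? INTERNAL_FALLBACK;
  return {
    error: {
      code,
      message,
      // Omit optional keys entirely instead of emitting undefined values
      ...(action ? { action } : {}),
      ...(fields ? { fields } : {}),
      ...(requestId ? { requestId } : {}),
    },
  };
}
```

Centralizing the construction means a new optional field only has to be added in one place, and clients never have to distinguish a missing key from an explicit `undefined`.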

Step 7: Handle Boundary Errors

Each system boundary (API, UI, background jobs) needs its own error handling strategy.

API Error Middleware

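// src/middleware/errorHandler.ts
// Assumes Express; swap the type import if you use another framework
import type { NextFunction, Request, Response } from 'express';

export function errorHandler(
  error: Error,
  req: Request,
  res: Response,
  next: NextFunction,
): void {
  // If the response has already started, hand off to the framework's default handler
  if (res.headersSent) {
    next(error);
    return;
  }

  // 1. Log the error
  const correlationId = req.correlationId ?? 'unknown';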

  if (error instanceof AppError) {
    const logLevel = error.isOperational ? 'warn' : 'error';
    logger[logLevel]('Request error', error, {
      correlationId,
      method: req.method,
      path: req.path,
      userId: req.user?.id,
    });

    // 2. Send appropriate response
    const userMessage = getUserMessage(error.code);
    res.status(error.statusCode).json({
      error: {
        code: error.code,
        message: userMessage.message,
        action: userMessage.action,
        ...(error instanceof ValidationError ? { fields: error.fieldErrors } : {}),
        requestId: correlationId,
      },
    });
  } else {
    // Unclassified error -- treat as programmer error
    logger.error('Unhandled error', error, {
      correlationId,
      method: req.method,
      path: req.path,
    });

    // Reuse the message map so the wording cannot drift from INTERNAL_ERROR
    const fallbackMessage = getUserMessage('INTERNAL_ERROR');
    res.status(500).json({
      error: {
        code: 'INTERNAL_ERROR',
        message: fallbackMessage.message,
        action: fallbackMessage.action,
        requestId: correlationId,
      },
    });
  }
}

// Register as the LAST middleware
app.use(errorHandler);
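
The handler above reads `req.correlationId`, so something earlier in the request pipeline has to set it. A minimal sketch of that resolution step follows, written without a framework dependency. The header name, the accepted ID format, and the `resolveCorrelationId` helper are all assumptions for illustration.

```typescript
// Header name and ID format are assumptions; align them with your gateway.
const CORRELATION_HEADER = 'x-correlation-id';
const VALID_ID = /^[A-Za-z0-9_-]{1,64}$/;

type HeaderMap = Record<string, string | string[] | undefined>;

// Default generator: time plus random suffix, good enough for a sketch
function defaultGenerate(): string {
  return `req_${Date.now().toString(36)}${Math.random().toString(36).slice(2, 10)}`;
}

export function resolveCorrelationId(
  headers: HeaderMap,
  generate: () => string = defaultGenerate,
): string {
  const raw = headers[CORRELATION_HEADER];
  // A repeated header may arrive as an array; take the first value
  const candidate = Array.isArray(raw) ? raw[0] : raw;
  // Accept only well-formed inbound IDs so untrusted values never reach logs verbatim
  if (candidate !== undefined && VALID_ID.test(candidate)) {
    return candidate;
  }
  return generate();
}
```

A middleware registered before the routes could call this with the incoming headers and assign the result to `req.correlationId`. Validating the inbound value is a design choice: it trades strict pass-through of upstream IDs for protection against log injection.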

Process-Level Error Handlers

// src/bootstrap.ts

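// Catch unhandled promise rejections
process.on('unhandledRejection', (reason: unknown) => {
  // A rejection reason can be any value, so normalize it before logging
  const error = reason instanceof Error ? reason : new Error(String(reason));
  logger.error('Unhandled promise rejection', error, {
    type: 'unhandledRejection',
  });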
  // In production: trigger graceful shutdown
  // process.exit(1);
});

// Catch uncaught exceptions
process.on('uncaughtException', (error: Error) => {
  logger.error('Uncaught exception', error, {
    type: 'uncaughtException',
  });
  // MUST exit -- process is in undefined state
  process.exit(1);
});

// Graceful shutdown on SIGTERM
process.on('SIGTERM', () => {
  logger.info('SIGTERM received, starting graceful shutdown');
  // Close server, drain connections, flush logs
  server.close(() => {
    logger.info('Server closed');
    process.exit(0);
  });
  // Force exit after timeout
  setTimeout(() => {
    logger.error('Forced shutdown after timeout', new Error('Shutdown timeout'));
    process.exit(1);
  }, 30_000);
});

UI Error Boundaries (React)

// src/components/ErrorBoundary.tsx
import { Component, type ReactNode, type ErrorInfo } from 'react';

interface ErrorBoundaryProps {
  children: ReactNode;
  fallback?: ReactNode | ((error: Error, reset: () => void) => ReactNode);
  onError?: (error: Error, errorInfo: ErrorInfo) => void;
}

interface ErrorBoundaryState {
  error: Error | null;
}

export class ErrorBoundary extends Component<ErrorBoundaryProps, ErrorBoundaryState> {
  state: ErrorBoundaryState = { error: null };

  static getDerivedStateFromError(error: Error): ErrorBoundaryState {
    return { error };
  }

  componentDidCatch(error: Error, errorInfo: ErrorInfo): void {
    // Log to error tracking service
    logger.error('React error boundary caught error', error, {
      componentStack: errorInfo.componentStack,
    });
    this.props.onError?.(error, errorInfo);
  }

  reset = (): void => {
    this.setState({ error: null });
  };

  render(): ReactNode {
    if (this.state.error) {
      if (typeof this.props.fallback === 'function') {
        return this.props.fallback(this.state.error, this.reset);
      }
      return this.props.fallback ?? <DefaultErrorFallback error={this.state.error} onRetry={this.reset} />;
    }
    return this.props.children;
  }
}

function DefaultErrorFallback({ error, onRetry }: { error: Error; onRetry: () => void }) {
  return (
    <div role="alert" className="error-fallback">
      <h2>Something went wrong</h2>
      <p>We are sorry for the inconvenience. Please try again.</p>
      <button onClick={onRetry}>Try again</button>
    </div>
  );
}

// Usage: Wrap at feature boundaries, not just the app root
// <ErrorBoundary fallback={<CheckoutErrorView />}>
//   <CheckoutFlow />
// </ErrorBoundary>

Background Job Error Handling

// src/jobs/baseJob.ts

export abstract class BaseJob<T> {
  abstract readonly name: string;
  abstract readonly maxRetries: number;

  async run(payload: T, attempt: number): Promise<void> {
    try {
      await this.execute(payload);
    } catch (error) {
      const appError = error instanceof AppError
        ? error
        : new InternalError('Job execution failed', { cause: error as Error });

      logger.error(`Job ${this.name} failed`, appError, {
        attempt,
        maxRetries: this.maxRetries,
        payload: this.sanitizePayload(payload),
      });

      if (attempt < this.maxRetries && this.isRetryable(appError)) {
        // Re-enqueue with backoff
        const delay = Math.min(1000 * Math.pow(2, attempt), 60_000);
        await this.enqueue(payload, attempt + 1, delay);
        logger.info(`Job ${this.name} re-enqueued`, { attempt: attempt + 1, delay });
      } else {
        // Send to dead letter queue for manual investigation
        await this.sendToDeadLetterQueue(payload, appError, attempt);
        logger.error(`Job ${this.name} sent to DLQ after ${attempt} attempts`, appError);
      }
    }
  }

  protected abstract execute(payload: T): Promise<void>;
  protected abstract enqueue(payload: T, attempt: number, delayMs: number): Promise<void>;
  protected abstract sendToDeadLetterQueue(payload: T, error: AppError, attempts: number): Promise<void>;

  protected isRetryable(error: AppError): boolean {
    return error.isOperational;
  }

  protected sanitizePayload(payload: T): unknown {
    return payload; // Override to strip sensitive fields
  }
}
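
The re-enqueue delay in `run` is a plain exponential curve, which means many jobs that fail together will also retry together. The references mention backoff with jitter; a minimal sketch of one common variant ("full jitter", where the delay is drawn uniformly below the exponential ceiling) is shown here. The `backoffDelayMs` helper and its injectable `random` parameter are assumptions for illustration, not part of `BaseJob`.

```typescript
export function backoffDelayMs(
  attempt: number,
  baseMs = 1000,
  capMs = 60_000,
  random: () => number = Math.random,
): number {
  // Same exponential curve and cap as the re-enqueue delay in BaseJob.run
  const ceiling = Math.min(baseMs * Math.pow(2, attempt), capMs);
  // Full jitter: pick uniformly in [0, ceiling) so retries from many jobs spread out
  return Math.floor(random() * ceiling);
}
```

Passing `random` as a parameter keeps the function deterministic under test. The trade-off of full jitter is that some retries fire almost immediately, so a floor can be added if the downstream service needs a guaranteed minimum pause.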

Output Formats

Quick Format (Single Feature)

## Error Handling: [Feature Name]

### Error Classification
| Error Code | Type | Transient? | Recovery |
|------------|------|------------|----------|
| [CODE] | operational/programmer | yes/no | retry/fallback/fail |

### Custom Errors Added
- `[ErrorClass]` -- [when thrown]

### Recovery Strategy
- [Pattern used and configuration]

### User Messages
| Code | Message | Action |
|------|---------|--------|
| [CODE] | [message] | [action] |

Full Format (System-Wide)

## Error Handling Strategy: [System Name]

### Error Hierarchy
[Class diagram or list of all custom errors]

### Error Classification Matrix
[Complete table of all error codes, types, severities, and recovery strategies]

### Recovery Configuration
[Retry policies, circuit breaker settings, fallback chains]

### Boundary Handlers
- **API**: [middleware description]
- **UI**: [error boundary strategy]
- **Jobs**: [retry and DLQ strategy]

### Logging & Observability
- [Structured logging format]
- [Correlation ID flow]
- [Metrics and alerting thresholds]

### User Message Catalog
[Complete mapping of error codes to user-safe messages]

### Testing Strategy
[How error handling is tested -- fault injection, chaos testing]

Common Patterns

Pattern 1: API Error Handling

Standard pattern for REST API error handling with middleware, validation, and consistent response format.

// Validation at API boundary
app.post('/orders', validate(createOrderSchema), async (req, res, next) => {
  try {
    const order = await orderService.create(req.body);
    res.status(201).json({ data: order });
  } catch (error) {
    next(error); // Global error handler formats response
  }
});

// Global handler produces consistent error responses
// See Step 7: API Error Middleware

Key elements: Input validation at boundary, try/catch delegates to global handler, consistent JSON error shape, correlation IDs in every response.
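
The `validate(createOrderSchema)` call above depends on a validator that is not shown here. As a rough sketch of what boundary validation can produce, the following collects messages per field in the same `Record<string, string[]>` shape used by the `fields` property of the error response. The rule helpers, the schema contents, and `collectFieldErrors` are hypothetical, and a schema library would normally replace them.

```typescript
type FieldErrors = Record<string, string[]>;
type Rule = (value: unknown) => string | null;
type Schema = Record<string, Rule[]>;

const required: Rule = (v) =>
  v === undefined || v === null || v === '' ? 'Is required' : null;

const intBetween = (min: number, max: number): Rule => (v) =>
  typeof v === 'number' && Number.isInteger(v) && v >= min && v <= max
    ? null
    : `Must be between ${min} and ${max}`;

// Hypothetical schema; field names are illustrative
export const createOrderSchema: Schema = {
  productId: [required],
  quantity: [required, intBetween(1, 9999)],
};

export function collectFieldErrors(schema: Schema, body: Record<string, unknown>): FieldErrors {
  const errors: FieldErrors = {};
  for (const [field, rules] of Object.entries(schema)) {
    // Run every rule so the client sees all problems with a field at once
    const messages = rules
      .map((rule) => rule(body[field]))
      .filter((m): m is string => m !== null);
    if (messages.length > 0) {
      errors[field] = messages;
    }
  }
  return errors;
}
```

A validation middleware built on this would pass a `ValidationError` carrying the collected messages to `next` when the result is non-empty, letting the global handler attach them as `fields`.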

Pattern 2: Background Job Errors

Jobs need retry with backoff, dead letter queues, and idempotency guarantees.

// Idempotent job execution
async execute(payload: { orderId: string }): Promise<void> {
  const lock = await this.acquireLock(`process-order:${payload.orderId}`);
  if (!lock) {
    logger.info('Job already processing, skipping', { orderId: payload.orderId });
    return; // Idempotent -- safe to skip
  }
  try {
    await this.processOrder(payload.orderId);
  } finally {
    await this.releaseLock(lock);
  }
}

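Key elements: Distributed locks to stop concurrent duplicates (a lock alone does not make a redelivered job idempotent, so pair it with a processed-key check), exponential backoff between retries, dead letter queue after max attempts, sanitized payload logging.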
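
A lock like the one above keeps two workers from processing the same order at once, but a job that is redelivered after the first run has finished would still execute again. One common complement, sketched here under stated assumptions, is an idempotency key that is claimed before the work and kept afterwards. `ProcessedKeys` is an in-memory stand-in for a durable store, `runOnce` is a hypothetical helper, and the sketch is synchronous only to keep it short; a real job would be async.

```typescript
// In-memory stand-in for a durable store (for example a table with a unique key)
export class ProcessedKeys {
  private readonly seen = new Set<string>();

  // Returns true only for the first caller to claim a key
  claim(key: string): boolean {
    if (this.seen.has(key)) {
      return false;
    }
    this.seen.add(key);
    return true;
  }

  // Drop a claim so a failed job can be retried later
  release(key: string): void {
    this.seen.delete(key);
  }
}

export function runOnce(store: ProcessedKeys, key: string, work: () => void): 'ran' | 'skipped' {
  if (!store.claim(key)) {
    return 'skipped'; // already processed (or in progress): do nothing
  }
  try {
    work();
    return 'ran';
  } catch (error) {
    // Failure must not leave the key claimed, or the retry would be skipped
    store.release(key);
    throw error;
  }
}
```

With a key such as `process-order:${orderId}`, a second delivery of the same job returns `'skipped'` without touching the order, while a failed run releases the key so the retry path still works.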

Pattern 3: UI Error Boundaries

Isolate errors to feature boundaries so one broken component does not crash the entire page.

// Feature-level isolation
<Layout>
  <ErrorBoundary fallback={<NavError />}>
    <Navigation />
  </ErrorBoundary>

  <ErrorBoundary fallback={<ContentError onRetry={refetch} />}>
    <MainContent />
  </ErrorBoundary>

  <ErrorBoundary fallback={<SidebarFallback />}>
    <Sidebar />
  </ErrorBoundary>
</Layout>

Key elements: Boundaries at feature level (not just app root), meaningful fallback UI per boundary, retry mechanism, error logging to tracking service.

Pattern 4: Distributed System Errors

Cross-service errors need correlation, circuit breakers, and graceful degradation.

// Composed resilience: circuit breaker + retry + fallback + timeout
async function getProductRecommendations(userId: string): Promise<Product[]> {
  return withFallback(
    () => recommendationCircuit.execute(
      () => withRetry(
        () => withTimeout(
          () => recommendationApi.getForUser(userId),
          { timeoutMs: 2000, operation: 'get-recommendations' },
        ),
        {
          maxAttempts: 2,
          baseDelayMs: 200,
          maxDelayMs: 1000,
          isRetryable: (e) => e instanceof TimeoutError || e instanceof ExternalServiceError,
        },
      ),
    ),
    // Await first: `??` applied to a pending Promise would never yield []
    async () => (await cache.get<Product[]>(`recommendations:${userId}`)) ?? [],
    {
      onFallback: (error) => {
        logger.warn('Recommendations unavailable, serving cached', {
          userId,
          error: error.message,
          correlationId: getCorrelationId(),
        });
      },
    },
  );
}

Key elements: Timeout wraps the innermost call, retry wraps timeout, circuit breaker wraps retry, fallback wraps everything. Correlation ID flows through all layers. Degraded experience is better than no experience.
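
Because the layers nest, their budgets add up, and it is worth checking the worst case before choosing values. The arithmetic below is a sketch that assumes `maxAttempts` counts the first attempt, that each attempt can run to its full timeout, and that the delay doubles from `baseDelayMs` without jitter up to `maxDelayMs`. The real `withRetry` may differ on any of these points, and `worstCaseLatencyMs` is a hypothetical helper.

```typescript
export interface RetryBudget {
  maxAttempts: number; // total attempts, including the first (assumed)
  timeoutMs: number;   // per-attempt timeout
  baseDelayMs: number;
  maxDelayMs: number;
}

export function worstCaseLatencyMs(budget: RetryBudget): number {
  const { maxAttempts, timeoutMs, baseDelayMs, maxDelayMs } = budget;
  let total = 0;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Every attempt may run until its timeout fires
    total += timeoutMs;
    if (attempt < maxAttempts) {
      // Delay before the next attempt: base, 2x base, 4x base, ... capped
      total += Math.min(baseDelayMs * Math.pow(2, attempt - 1), maxDelayMs);
    }
  }
  return total;
}

// Values from the recommendations example: 2000 + 200 + 2000
const recommendationsWorstCase = worstCaseLatencyMs({
  maxAttempts: 2,
  timeoutMs: 2000,
  baseDelayMs: 200,
  maxDelayMs: 1000,
}); // 4200
```

Under those assumptions a caller can wait about 4.2 seconds before the fallback is served, which may be too long for a page render. Lowering `timeoutMs` or `maxAttempts` shortens that wait.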

Relationship to Other Skills

Skill	Relationship
`implement`	Error handling is integral to implementation -- every service needs classified errors, propagation strategy, and recovery
`code-verification`	Verification confirms error handling exists at all boundaries and follows the defined hierarchy
`test-generation`	Tests must cover every error path -- happy path is not enough without error case coverage
`deploy`	Deployment must configure monitoring, alerting thresholds, and circuit breaker settings per environment

Key Principles

Classify before you catch. Know whether an error is operational or a bug, transient or permanent. The classification determines the handling strategy.

Errors are data, not just strings. Every error should carry a code, severity, context, and correlation ID. Strings alone cannot drive automated recovery or alerting.

Recover where possible, fail fast where necessary. Transient failures deserve retry and fallback. Programmer errors should crash the process -- do not mask bugs with catch blocks.

Never swallow, never expose. Every catch block must recover, log and rethrow, or report to the user. Stack traces, internal identifiers, and infrastructure details must never reach end users.

Contain the blast radius. Use circuit breakers, bulkheads, and error boundaries to prevent one failure from cascading across the system. A broken recommendation engine should not prevent checkout.

Observe everything. Errors without structured logging, correlation IDs, and metrics are invisible. You cannot fix what you cannot see.

References

references/error-classification.md: Operational vs programmer error taxonomy, severity levels, and classification decision trees
references/recovery-strategies.md: Retry, fallback, circuit breaker, and bulkhead patterns with configuration guidance
references/logging-standards.md: Structured logging format, log levels, sensitive data handling, and correlation ID propagation
references/retry-patterns.md: Exponential backoff with jitter, idempotency requirements, and retry budget management
references/circuit-breaker-patterns.md: Circuit breaker state machine, configuration tuning, and bulkhead isolation strategies
