soilmass

observability

0
0
# Install this skill:
npx skills add soilmass/vibe-coding-plugin --skill "observability"

Install specific skill from multi-skill repository

# Description

>

# SKILL.md


name: observability
description: >
OpenTelemetry tracing, health check endpoints, circuit breakers, graceful degradation for production Next.js 15
allowed-tools: Read, Grep, Glob


Observability

Purpose

Production observability for Next.js 15 with OpenTelemetry tracing, health/readiness endpoints,
circuit breakers, and graceful degradation. Extends the logging skill with distributed tracing
and resilience patterns.

When to Use

  • Setting up OpenTelemetry SDK with exporters (Vercel, Datadog, Grafana)
  • Adding /api/health (liveness) and /api/ready (readiness) endpoints
  • Implementing trace propagation across Server Components, Actions, and Prisma
  • Adding circuit breakers for external service calls
  • Configuring retry with exponential backoff

When NOT to Use

  • Basic structured logging β†’ logging
  • Error boundaries and error.tsx β†’ error-handling
  • Vercel Analytics / PostHog tracking β†’ analytics
  • Performance profiling and bundle analysis β†’ performance

Pattern

OpenTelemetry SDK setup

// src/lib/tracing.ts
import "server-only";
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { Resource } from "@opentelemetry/resources";
import { ATTR_SERVICE_NAME } from "@opentelemetry/semantic-conventions";

const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: process.env.OTEL_SERVICE_NAME ?? "next-app",
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

Instrumentation file

// instrumentation.ts (root β€” Next.js auto-loads this)
export async function register() {
  if (process.env.NEXT_RUNTIME === "nodejs") {
    await import("./src/lib/tracing");
  }
}

Health check endpoint (liveness)

// src/app/api/health/route.ts
import { NextResponse } from "next/server";

export async function GET() {
  return NextResponse.json({ status: "ok", timestamp: Date.now() });
}

Readiness check endpoint

// src/app/api/ready/route.ts
import { NextResponse } from "next/server";
import { db } from "@/lib/db";

let cachedStatus: { ok: boolean; checkedAt: number } | null = null;
const CACHE_TTL = 10_000; // 10 seconds

async function checkDependencies() {
  const now = Date.now();
  if (cachedStatus && now - cachedStatus.checkedAt < CACHE_TTL) {
    return cachedStatus.ok;
  }

  try {
    await db.$queryRaw`SELECT 1`;
    cachedStatus = { ok: true, checkedAt: now };
    return true;
  } catch {
    cachedStatus = { ok: false, checkedAt: now };
    return false;
  }
}

export async function GET() {
  const dbReady = await checkDependencies();

  if (!dbReady) {
    return NextResponse.json(
      { status: "not_ready", db: "down" },
      { status: 503 }
    );
  }

  return NextResponse.json({ status: "ready", db: "up" });
}

Trace ID correlation with logger

// In Server Actions or route handlers
import { trace, context } from "@opentelemetry/api";
import { logger } from "@/lib/logger";

export async function myAction(formData: FormData) {
  const span = trace.getActiveSpan();
  const traceId = span?.spanContext().traceId;
  const log = logger.child({ traceId });

  log.info("Action started");
  // ... business logic
  log.info("Action completed");
}

Custom span instrumentation

import { trace } from "@opentelemetry/api";
import { after } from "next/server";

const tracer = trace.getTracer("app");

export async function fetchExternalData(userId: string) {
  return tracer.startActiveSpan("fetchExternalData", async (span) => {
    span.setAttribute("user.id", userId);
    try {
      const data = await fetch("https://api.example.com/data", {
        signal: AbortSignal.timeout(5000),
      });
      span.setAttribute("http.status_code", data.status);
      return data.json();
    } catch (error) {
      span.recordException(error as Error);
      span.setAttribute("error.type", (error as Error).name);
      throw error;
    } finally {
      span.end();
    }
  });
}

Custom metrics (counters, gauges, histograms)

// src/lib/metrics.ts
import "server-only";
import { metrics } from "@opentelemetry/api";

const meter = metrics.getMeter("app");

export const httpRequestDuration = meter.createHistogram("http.request.duration", {
  description: "HTTP request duration in milliseconds",
  unit: "ms",
});

export const activeConnections = meter.createUpDownCounter("db.connections.active", {
  description: "Number of active database connections",
});

export const ordersCreated = meter.createCounter("orders.created", {
  description: "Total number of orders created",
});

// Usage in Server Action
export async function createOrder(formData: FormData) {
  const start = Date.now();
  try {
    // ... order logic
    ordersCreated.add(1, { plan: "pro" });
  } finally {
    httpRequestDuration.record(Date.now() - start, { route: "/api/orders" });
  }
}

Circuit breaker pattern

// src/lib/circuit-breaker.ts
import CircuitBreaker from "opossum";
import { logger } from "@/lib/logger";

export function createBreaker<T>(
  fn: (...args: unknown[]) => Promise<T>,
  options?: Partial<CircuitBreaker.Options>
) {
  const breaker = new CircuitBreaker(fn, {
    timeout: 5000,
    errorThresholdPercentage: 50,
    resetTimeout: 30000,
    ...options,
  });

  breaker.on("open", () => logger.warn("Circuit breaker opened"));
  breaker.on("halfOpen", () => logger.info("Circuit breaker half-open"));
  breaker.on("close", () => logger.info("Circuit breaker closed"));

  return breaker;
}

Retry with exponential backoff

// src/lib/retry.ts
import pRetry from "p-retry";

export async function withRetry<T>(
  fn: () => Promise<T>,
  options?: { retries?: number; label?: string }
) {
  return pRetry(fn, {
    retries: options?.retries ?? 3,
    onFailedAttempt: (error) => {
      logger.warn({
        msg: `Retry attempt ${error.attemptNumber} for ${options?.label}`,
        retriesLeft: error.retriesLeft,
      });
    },
  });
}

Graceful degradation

// Fallback when external service is unavailable
const paymentBreaker = createBreaker(processPayment);

export async function checkout(orderId: string) {
  try {
    return await paymentBreaker.fire(orderId);
  } catch {
    // Fallback: queue for retry instead of failing
    await queuePaymentRetry(orderId);
    return { status: "queued", message: "Payment will be processed shortly" };
  }
}

Anti-pattern

Tracing everything

Don't add spans to every function β€” high cardinality kills performance.
Trace boundaries: HTTP handlers, Server Actions, database queries, external APIs.
Internal utility functions don't need individual spans.

Health check hitting DB on every request

Cache health status with a TTL. Kubernetes probes hit health endpoints frequently β€”
an uncached check creates unnecessary database load.

No timeout on external calls

Always use AbortSignal.timeout() or set explicit timeouts. Hanging requests
consume server resources and eventually cascade into failures.

Common Mistakes

  • Forgetting instrumentation.ts β€” Next.js won't auto-load tracing without it
  • Tracing in Edge Runtime β€” OpenTelemetry Node SDK doesn't work in Edge
  • Not setting OTEL_EXPORTER_OTLP_ENDPOINT β€” traces go nowhere
  • Health check returning 200 when DB is down β€” defeats the purpose
  • No fallback for circuit breaker β€” open circuit throws instead of degrading gracefully

Checklist

  • [ ] instrumentation.ts exists at project root
  • [ ] OpenTelemetry SDK configured with OTLP exporter
  • [ ] /api/health returns liveness status
  • [ ] /api/ready checks database and returns readiness
  • [ ] External API calls have AbortSignal.timeout()
  • [ ] Circuit breaker wraps non-critical external services
  • [ ] Trace IDs correlate with Pino logger
  • [ ] after() used for non-blocking span export where needed

Composes With

  • logging β€” trace ID correlation with Pino structured logs
  • prisma β€” trace database queries via auto-instrumentation
  • api-routes β€” instrument route handlers with spans
  • deploy β€” health endpoints for Kubernetes/Vercel probes
  • error-handling β€” error spans with attributes
  • performance β€” tracing identifies slow paths

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents:

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.