
Observability Stack Configurator

# Install this skill:
npx skills add williamzujkowski/cognitive-toolworks --skill "Observability Stack Configurator"

Installs a specific skill from a multi-skill repository.

# Description

Configure comprehensive observability with metrics, logging, tracing, and alerting using Prometheus, OpenTelemetry, CloudWatch, and Grafana.

# SKILL.md


name: "Observability Stack Configurator"
slug: observability-stack-configurator
description: "Configure comprehensive observability with metrics, logging, tracing, and alerting using Prometheus, OpenTelemetry, CloudWatch, and Grafana."
capabilities:
  - configure_metrics_collection
  - configure_logging_aggregation
  - configure_distributed_tracing
  - configure_alerting_rules
inputs:
  platform:
    type: string
    description: "Platform: kubernetes, aws, azure, gcp, on-premise"
    required: true
  tech_stack:
    type: array
    description: "Application technologies to instrument"
    required: true
  requirements:
    type: object
    description: "SLIs, alerting rules, retention policies, dashboard specifications"
    required: true
outputs:
  metrics_config:
    type: code
    description: "Prometheus, CloudWatch, or Datadog configuration"
  logging_config:
    type: code
    description: "Logging stack configuration (ELK, Loki, CloudWatch Logs)"
  tracing_config:
    type: code
    description: "OpenTelemetry or X-Ray instrumentation"
  dashboards:
    type: array
    description: "Grafana or CloudWatch dashboard definitions"
keywords:
  - observability
  - prometheus
  - opentelemetry
  - grafana
  - cloudwatch
  - logging
  - tracing
  - metrics
  - alerting
  - monitoring
version: 1.0.0
owner: william@cognitive-toolworks
license: MIT
security:
  pii: false
  secrets: false
  sandbox: required
links:
  - https://prometheus.io/docs/
  - https://opentelemetry.io/docs/
  - https://grafana.com/docs/
  - https://aws.amazon.com/cloudwatch/


Purpose & When-To-Use

Trigger conditions:

  • Production incidents reveal lack of visibility into system behavior
  • Application deployment without monitoring or alerting
  • Troubleshooting requires distributed tracing across microservices
  • SLO/SLA commitments require metrics and alerting
  • Compliance or audit requires centralized logging
  • Performance optimization needs detailed metrics

Use this skill when you need a complete observability stack with metrics collection, log aggregation, distributed tracing, and intelligent alerting.


Pre-Checks

Before execution, verify:

  1. Time normalization: NOW_ET = 2025-10-26T01:33:56-04:00 (NIST/time.gov semantics, America/New_York)
  2. Input schema validation (see the input sketch after this list):
    • platform is one of: kubernetes, aws, azure, gcp, on-premise
    • tech_stack contains instrumentable technologies
    • requirements.slis defines key service level indicators
    • requirements.alerting_rules specifies conditions and thresholds
    • requirements.retention_policies defines data retention periods
  3. Source freshness: All cited sources accessed on NOW_ET; verify documentation links are current
  4. Platform compatibility: Confirm observability tools are available on the target platform
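
A minimal sketch of inputs that would pass these checks (values are hypothetical; field names follow the inputs schema above):

# Hypothetical input values; field names follow the inputs schema above
platform: kubernetes
tech_stack: [python, postgresql, redis]
requirements:
  slis:
    availability: "99.9%"
    latency_p95_ms: 300
  alerting_rules:
    - "error rate > 5% for 5m"
  retention_policies:
    metrics: 15d
    logs: 30d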

Abort conditions:

  • Platform doesn't support required observability tools
  • Tech stack cannot be instrumented (proprietary, closed-source without metrics endpoint)
  • Conflicting requirements (e.g., "zero cost" with "15-second granularity metrics")
  • Retention requirements violate regulatory constraints

Procedure

Tier 1 (Fast Path, ≤2k tokens)

Token budget: ≤2k tokens

Scope: Basic observability with essential metrics, logs, and simple alerting.

Steps:

  1. Design observability architecture (500 tokens):
    • Select observability stack based on platform:
      - Kubernetes: Prometheus + Grafana + Loki
      - AWS: CloudWatch Metrics + Logs + X-Ray
      - Azure: Azure Monitor + Application Insights
      - GCP: Cloud Monitoring + Cloud Logging + Cloud Trace
    • Identify instrumentation points in application code
    • Define essential metrics (RED: Rate, Errors, Duration; USE: Utilization, Saturation, Errors)

  2. Generate observability configurations (1500 tokens):
    • Metrics:
      - Prometheus scrape configs or CloudWatch metric filters
      - Application instrumentation snippets (client libraries)
      - Essential metrics: request rate, error rate, latency percentiles (p50, p95, p99)
    • Logging:
      - Log aggregation configuration (Loki, CloudWatch Logs, ELK)
      - Structured logging format (JSON; see the example after this list)
      - Log retention policies (7-30 days for development)
    • Basic alerting:
      - Critical alerts: service down, error rate >5%, latency >1s
      - Alert routing configuration (email, Slack, PagerDuty)
    • Simple dashboards:
      - Service health overview (uptime, request rate, error rate, latency)
      - Infrastructure metrics (CPU, memory, disk, network)
Decision point: If requirements include distributed tracing, SLO tracking, advanced analytics, or multi-cluster → escalate to T2.


Tier 2 (Extended Analysis, ≤6k tokens)

Token budget: ≤6k tokens

Scope: Comprehensive observability with distributed tracing, SLO tracking, advanced alerting, and correlation.

Steps:

  1. Design comprehensive observability (2000 tokens):
    • Distributed tracing (accessed 2025-10-26T01:33:56-04:00):
      - OpenTelemetry: Language-agnostic instrumentation for metrics, logs, traces
      - Trace context propagation across service boundaries (W3C Trace Context)
      - Sampling strategies (head-based, tail-based) for cost optimization
      - Integration with Jaeger, Zipkin, or cloud-native solutions (X-Ray, Cloud Trace)
    • SLO tracking:
      - Define SLIs from requirements (availability, latency, error rate)
      - Calculate SLO compliance and error budgets
      - Configure SLO dashboards with burn rate alerts
    • Advanced metrics:
      - Business metrics (conversion rate, transaction volume)
      - Application performance monitoring (APM) with detailed breakdowns
      - Custom metrics for domain-specific monitoring
    • Log correlation:
      - Trace ID injection into logs for correlation
      - Structured logging with consistent fields
      - Log-based metrics for pattern detection

  2. Generate comprehensive configurations (4000 tokens):
    • Prometheus/CloudWatch advanced:
      - Recording rules for precomputed aggregations
      - Federation for multi-cluster metrics
      - Long-term storage (Thanos, Cortex, or cloud-native)
      - Service discovery for dynamic targets
    • OpenTelemetry instrumentation:
      - Auto-instrumentation for common frameworks
      - Custom spans for business-critical operations
      - Baggage propagation for cross-service context
      - Collector configuration with processors and exporters (see the sketch after this list)
    • Advanced alerting:
      - Multi-condition alerts with logical operators
      - Anomaly detection for dynamic thresholds
      - Alert grouping and deduplication
      - Escalation policies and on-call schedules
      - Runbook links in alert descriptions
    • Comprehensive dashboards:
      - Service dependency maps
      - SLO compliance tracking
      - Cost attribution and optimization
      - Capacity planning metrics
    • Log analytics:
      - Full-text search and filtering
      - Log-based alerting
      - Anomaly detection in logs
      - Compliance audit trails
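
The Collector configuration mentioned above could follow a sketch like the one below, assuming an OTLP receiver, memory-limiter and batch processors, and Prometheus and Jaeger backends (the endpoints are illustrative assumptions, not prescribed by this skill):

# otel-collector-config.yaml (endpoints are illustrative assumptions)
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  batch:
    timeout: 5s

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  otlp/jaeger:
    endpoint: "jaeger-collector:4317"
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]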

Sources cited (accessed 2025-10-26T01:33:56-04:00):

  • Prometheus Best Practices: https://prometheus.io/docs/practices/
  • OpenTelemetry: https://opentelemetry.io/docs/concepts/
  • Grafana Dashboards: https://grafana.com/docs/grafana/latest/dashboards/
  • Google SRE Monitoring: https://sre.google/sre-book/monitoring-distributed-systems/

Tier 3 (Deep Dive, ≤12k tokens)

Token budget: ≤12k tokens

Scope: Enterprise observability with AI/ML insights, cost optimization, and security monitoring.

Steps:

  1. AI/ML-enhanced observability (4000 tokens):
    • Anomaly detection with machine learning models
    • Predictive alerting based on historical patterns
    • Root cause analysis automation
    • Capacity forecasting with time-series prediction
    • Automated incident triage and correlation

  2. Advanced analytics and optimization (4000 tokens):
    • Observability data lake for long-term analysis
    • Cost optimization through sampling and aggregation strategies
    • Multi-tenancy with namespace isolation
    • Cardinality management for high-dimensional metrics (see the sketch after this list)
    • Query optimization and performance tuning
    • Data retention tiering (hot/warm/cold storage)

  3. Security and compliance monitoring (4000 tokens):
    • Security event logging and SIEM integration
    • Audit trail generation for compliance (SOC2, HIPAA, PCI-DSS)
    • Sensitive data masking in logs
    • Access control and authentication for observability tools
    • Encryption at rest and in transit for telemetry data
    • Compliance reporting and evidence collection
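
One concrete form of the cardinality management referenced above is dropping a high-cardinality label at scrape time with Prometheus metric_relabel_configs (the user_id label and target are hypothetical examples):

# Drop a hypothetical high-cardinality label before ingestion
scrape_configs:
  - job_name: 'api-service'
    static_configs:
      - targets: ['api-service:9090']
    metric_relabel_configs:
      - action: labeldrop
        regex: user_id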

Additional sources (accessed 2025-10-26T01:33:56-04:00):

  • OpenTelemetry Collector: https://opentelemetry.io/docs/collector/
  • Thanos: https://thanos.io/tip/thanos/quick-tutorial.md
  • AWS Observability Best Practices: https://aws-observability.github.io/observability-best-practices/

Decision Rules

Observability stack selection:

  • Prometheus + Grafana: Open-source, Kubernetes-native, vendor-neutral
  • CloudWatch: AWS-native, tight integration, managed service
  • Datadog/New Relic: Comprehensive SaaS, fast setup, higher cost
  • Elastic Stack (ELK): Powerful log analytics, full-text search
  • OpenTelemetry: Vendor-agnostic instrumentation, future-proof

Metric collection strategy:

  • Pull-based (Prometheus): Good for dynamic environments, service discovery
  • Push-based (CloudWatch): Good for ephemeral workloads (Lambda, batch jobs)
  • Hybrid: Use both based on workload characteristics

Retention policies:

  • Metrics: 15 days high-resolution, 90 days aggregated, 1 year downsampled
  • Logs: 7-30 days searchable, longer for compliance (1-7 years)
  • Traces: 7-14 days with sampling (1-10% of traces)
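
As one way to express the metric retention defaults above, the Prometheus server accepts retention flags (the values shown mirror the high-resolution tier and are adjustable; aggregated and downsampled tiers would be handled by Thanos, Cortex, or a cloud-native backend):

# Prometheus server flags (e.g., container args); values mirror the policy above
args:
  - '--storage.tsdb.retention.time=15d'
  - '--storage.tsdb.retention.size=50GB'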

Escalation conditions:

  • Novel platform without established observability patterns
  • Requirements exceed T3 scope (custom data pipeline, ML model training)
  • Compliance requirements need specialized tools (SIEM, DLP)

Abort conditions:

  • Platform restrictions prevent telemetry export
  • Conflicting requirements (e.g., "no network egress" with "SaaS monitoring")
  • Cost constraints incompatible with retention/granularity requirements

Output Contract

Required outputs:

{
  "metrics_config": {
    "type": "object",
    "properties": {
      "platform": "string (prometheus|cloudwatch|datadog)",
      "scrape_configs": "string (YAML configuration)",
      "recording_rules": "string (optional aggregation rules)",
      "retention": "string (duration)"
    }
  },
  "logging_config": {
    "type": "object",
    "properties": {
      "platform": "string (loki|cloudwatch-logs|elasticsearch)",
      "aggregation_config": "string (configuration)",
      "retention_policy": "string (duration or storage class)",
      "structured_format": "string (JSON schema)"
    }
  },
  "tracing_config": {
    "type": "object",
    "properties": {
      "platform": "string (opentelemetry|jaeger|x-ray)",
      "instrumentation": "string (language-specific code)",
      "sampling_rate": "number (0.0 to 1.0)",
      "exporter_config": "string (backend configuration)"
    }
  },
  "dashboards": {
    "type": "array",
    "items": {
      "name": "string",
      "platform": "string (grafana|cloudwatch)",
      "definition": "string (JSON or YAML)"
    }
  },
  "alerting_rules": {
    "type": "array",
    "items": {
      "name": "string",
      "condition": "string (PromQL or equivalent)",
      "severity": "string (critical|warning|info)",
      "notification_channel": "string"
    }
  }
}
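
A hypothetical alerting_rules entry conforming to the contract above (the threshold, expression, and notification channel are illustrative):

[
  {
    "name": "HighErrorRate",
    "condition": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) > 0.05",
    "severity": "critical",
    "notification_channel": "pagerduty"
  }
]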

Quality guarantees:

  • Metrics cover RED (Rate, Errors, Duration) and USE (Utilization, Saturation, Errors) methods
  • Logs are structured with consistent fields (timestamp, level, message, trace_id)
  • Traces propagate context across service boundaries
  • Alerting rules use appropriate thresholds and evaluation windows to minimize false positives
  • Dashboards provide actionable insights (not vanity metrics)

Examples

Example: Prometheus scrape config with an alerting rule (Kubernetes)

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - 'rules/api-service.yml'

scrape_configs:
  - job_name: 'api-service'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: ['production']
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: api
        action: keep

# rules/api-service.yml
groups:
  - name: api-service
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
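
Since the skill also covers OpenTelemetry tracing, the following is a minimal sketch of enabling an auto-instrumented service through the standard OTel SDK environment variables in a Kubernetes Deployment (the image, service name, and Collector endpoint are hypothetical):

# deployment.yaml (excerpt); standard OTel SDK environment variables, names and endpoint are hypothetical
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  template:
    spec:
      containers:
        - name: api
          image: example/api-service:1.0.0
          env:
            - name: OTEL_SERVICE_NAME
              value: api-service
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://otel-collector:4317"
            - name: OTEL_TRACES_SAMPLER
              value: parentbased_traceidratio
            - name: OTEL_TRACES_SAMPLER_ARG
              value: "0.1"
            - name: OTEL_RESOURCE_ATTRIBUTES
              value: deployment.environment=production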

Quality Gates

Token budgets:

  • T1: ≤2k tokens (basic metrics, logs, alerting)
  • T2: ≤6k tokens (distributed tracing, SLO tracking, advanced alerting)
  • T3: ≤12k tokens (AI/ML insights, security monitoring, compliance)

Safety checks:

  • No sensitive data (PII, credentials) in logs or metrics
  • Encryption configured for telemetry data in transit and at rest
  • Access controls on observability dashboards and data
  • Cost controls to prevent runaway metric cardinality

Auditability:

  • All configuration changes version-controlled
  • Alert history retained for incident retrospectives
  • Compliance logs immutable and tamper-evident

Determinism:

  • Same inputs produce identical observability configurations
  • Alerting thresholds based on data-driven baselines
  • Dashboard definitions reproducible from code

Resources

Official Documentation (accessed 2025-10-26T01:33:56-04:00):

  • Prometheus: https://prometheus.io/docs/
  • OpenTelemetry: https://opentelemetry.io/docs/
  • Grafana: https://grafana.com/docs/
  • AWS CloudWatch: https://docs.aws.amazon.com/cloudwatch/

Best Practices (accessed 2025-10-26T01:33:56-04:00):

  • Google SRE Book - Monitoring: https://sre.google/sre-book/monitoring-distributed-systems/
  • RED Method: https://www.weave.works/blog/the-red-method-key-metrics-for-microservices-architecture/
  • USE Method: https://www.brendangregg.com/usemethod.html

Templates (in repository /resources/):

  • Prometheus configurations for common platforms
  • OpenTelemetry instrumentation examples
  • Grafana dashboard templates
  • CloudWatch alarm and dashboard definitions

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.