# Install this skill:
npx skills add Mindrally/skills --skill "monitoring-guidelines"

This installs one specific skill from a multi-skill repository.

# Description

Monitoring guidelines for applications and infrastructure including metrics collection, alerting strategies, and SLO-based monitoring

# SKILL.md


---
name: monitoring-guidelines
description: Monitoring guidelines for applications and infrastructure including metrics collection, alerting strategies, and SLO-based monitoring
---


Monitoring Guidelines

Apply these monitoring principles to ensure system reliability, performance visibility, and proactive issue detection.

Core Monitoring Principles

  • Monitor the four golden signals: latency, traffic, errors, and saturation
  • Implement monitoring as code for reproducibility
  • Design monitoring around user experience and business impact
  • Use SLOs (Service Level Objectives) to guide alerting decisions
  • Balance comprehensive coverage with actionable insights

Key Metrics to Monitor

Application Metrics

  • Request rate (requests per second)
  • Error rate (percentage of failed requests)
  • Response time (p50, p90, p95, p99 latencies)
  • Active connections and concurrent users
  • Queue depths and processing times
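
As a minimal sketch of how the request rate, error rate, and latency percentiles above could be derived from raw request records: the `Request` type, the `summarize` helper, and the nearest-rank percentile method are illustrative assumptions, not part of any particular monitoring library. It expects a non-empty list of requests.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # time taken to serve the request
    failed: bool        # True if the request counts as an error


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(requests, window_seconds):
    """Request rate, error rate, and latency percentiles for one time window."""
    durations = sorted(r.duration_ms for r in requests)
    failures = sum(1 for r in requests if r.failed)
    return {
        "requests_per_second": len(requests) / window_seconds,
        "error_rate_percent": 100 * failures / len(requests),
        "latency_ms": {f"p{p}": percentile(durations, p) for p in (50, 90, 95, 99)},
    }
```

In practice most metrics backends compute percentiles from histograms rather than raw samples; the sketch only shows what each number means.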

Infrastructure Metrics

  • CPU utilization and load average
  • Memory usage and available memory
  • Disk I/O and available storage
  • Network throughput and error rates
  • Container and pod health (for Kubernetes)

Business Metrics

  • Transaction volumes and values
  • User signups and conversions
  • Feature usage and adoption rates
  • Revenue-impacting events
  • Customer satisfaction indicators

Alerting Strategy

Alert Design Principles

  • Alert on symptoms, not causes
  • Make alerts actionable with clear remediation steps
  • Set appropriate severity levels (critical, warning, info)
  • Avoid alert fatigue through proper threshold tuning
  • Include runbook links in alert notifications
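
One way to enforce the principles above is to keep alert definitions in code and lint them before deployment. The following is a sketch under stated assumptions: the `AlertRule` fields, the `lint` helper, and the example URL are all hypothetical and not tied to any specific alerting tool.

```python
from dataclasses import dataclass

# Severity levels named in the guidelines above.
SEVERITIES = ("critical", "warning", "info")


@dataclass(frozen=True)
class AlertRule:
    name: str
    symptom: str       # user-visible symptom the alert detects
    severity: str
    remediation: str   # first steps for the responder
    runbook_url: str


def lint(rule):
    """Return a list of problems; an empty list means the rule passes."""
    problems = []
    if rule.severity not in SEVERITIES:
        problems.append(f"unknown severity: {rule.severity!r}")
    if not rule.remediation.strip():
        problems.append("missing remediation steps")
    if not rule.runbook_url.startswith("https://"):
        problems.append("missing or non-HTTPS runbook link")
    return problems


# Hypothetical example rule; the name and URL are placeholders.
checkout_errors = AlertRule(
    name="CheckoutErrorRateHigh",
    symptom="Users see failed checkouts",
    severity="critical",
    remediation="Check recent deployments, then roll back if errors persist.",
    runbook_url="https://runbooks.example.com/checkout-errors",
)
```

Running `lint` in a pre-merge check keeps rules without a severity, remediation steps, or runbook link from reaching production.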

SLO-Based Alerting

  • Define SLOs for critical user journeys
  • Calculate error budgets and burn rates
  • Alert when the error budget is being consumed faster than the SLO allows (a high burn rate)
  • Use multi-window, multi-burn-rate alerts
  • Review and adjust SLOs quarterly
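
The error budget and burn rate calculations above can be sketched as follows. The window pairs and thresholds in `ALERT_RULES` are values commonly cited for a 30-day SLO period and are treated here as assumptions to tune; the function names and the shape of `error_ratios` are illustrative.

```python
def error_budget(slo_target):
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(error_ratio, slo_target):
    """How many times faster than sustainable the budget is being spent.

    A burn rate of 1.0 spends exactly the whole budget over the SLO period.
    """
    return error_ratio / error_budget(slo_target)


# (long window, short window, burn-rate threshold, severity).
# Assumed values; adjust them to your own SLO period and paging policy.
ALERT_RULES = [
    ("1h", "5m", 14.4, "critical"),
    ("6h", "30m", 6.0, "critical"),
    ("3d", "6h", 1.0, "warning"),
]


def evaluate(error_ratios, slo_target):
    """Return the rules whose long AND short windows both exceed the threshold.

    error_ratios maps a window name to the observed error ratio in that window.
    """
    fired = []
    for long_w, short_w, threshold, severity in ALERT_RULES:
        long_burn = burn_rate(error_ratios[long_w], slo_target)
        short_burn = burn_rate(error_ratios[short_w], slo_target)
        if long_burn >= threshold and short_burn >= threshold:
            fired.append((long_w, short_w, severity))
    return fired
```

Requiring both windows to exceed the threshold means the alert fires only when the problem is significant (long window) and still ongoing (short window).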

Alert Configuration

  • Set meaningful thresholds based on baseline data
  • Use hysteresis to prevent flapping alerts
  • Implement alert dependencies to reduce noise
  • Route alerts to appropriate teams
  • Configure escalation policies
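
Hysteresis, mentioned above, means using a higher threshold to fire an alert than to clear it, so a value hovering near one threshold does not toggle the alert repeatedly. A minimal sketch, with the class name and threshold values chosen purely for illustration:

```python
class HysteresisAlert:
    """Fires above one threshold and clears only below a lower one."""

    def __init__(self, fire_above, clear_below):
        if clear_below >= fire_above:
            raise ValueError("clear_below must be lower than fire_above")
        self.fire_above = fire_above
        self.clear_below = clear_below
        self.firing = False

    def update(self, value):
        """Feed one new sample and return whether the alert is firing."""
        if not self.firing and value > self.fire_above:
            self.firing = True
        elif self.firing and value < self.clear_below:
            self.firing = False
        return self.firing


# Example: CPU percentage samples with fire_above=90 and clear_below=80.
alert = HysteresisAlert(fire_above=90, clear_below=80)
states = [alert.update(v) for v in [85, 91, 89, 92, 88, 79, 85]]
```

With a single threshold of 90, the samples 91, 89, 92, 88 would flip the alert four times; here it stays firing until the value drops below 80.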

Dashboard Design

Effective Dashboards

  • Create overview dashboards for service health
  • Build detailed dashboards for debugging
  • Use consistent layouts and naming conventions
  • Include time range selectors and drill-down capabilities
  • Display SLO status prominently

Dashboard Content

  • Show current state and recent trends
  • Include comparison to baseline or previous periods
  • Display deployment markers for correlation
  • Add annotations for significant events
  • Include links to related dashboards and logs

Monitoring Tools Integration

Data Collection

  • Use agents or sidecars for metric collection
  • Implement service discovery for dynamic environments
  • Configure appropriate scrape intervals
  • Choose push-based or pull-based collection according to the use case
  • Ensure metric cardinality is manageable
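
To make the cardinality point concrete: the number of time series a metric can produce is bounded by the product of the distinct values of each of its labels. The sketch below estimates that upper bound; the function name and the label sets are hypothetical.

```python
from math import prod


def max_series(label_values):
    """Upper bound on time series for one metric.

    label_values maps each label name to the collection of values it can take.
    """
    return prod(len(values) for values in label_values.values())


# Hypothetical labels for an HTTP request counter.
labels = {
    "method": ["GET", "POST"],
    "status": ["200", "404", "500"],
    "endpoint": ["/cart", "/checkout", "/login", "/search"],
}
bounded = max_series(labels)  # 2 * 3 * 4 = 24 series

# Adding an unbounded label such as a user ID multiplies the total.
labels_with_user = dict(labels, user_id=range(10_000))
exploded = max_series(labels_with_user)  # 24 * 10,000 = 240,000 series
```

This is why identifiers such as user IDs or request IDs usually belong in logs or traces rather than in metric labels.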

Data Storage and Retention

  • Set retention periods based on use case
  • Implement downsampling for long-term storage
  • Use appropriate storage backends for scale
  • Plan for disaster recovery of monitoring data
  • Monitor your monitoring infrastructure
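
Downsampling for long-term storage can be sketched as grouping raw points into fixed time buckets and keeping a few aggregates per bucket. The function below is an illustration under the assumption that points are `(unix_timestamp, value)` pairs with integer timestamps; real storage backends implement this internally.

```python
from collections import defaultdict


def downsample(points, bucket_seconds):
    """Aggregate (timestamp, value) points into fixed-size buckets.

    Returns a list of (bucket_start, min, mean, max), ordered by time.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return [
        (start, min(values), sum(values) / len(values), max(values))
        for start, values in sorted(buckets.items())
    ]
```

Keeping the minimum and maximum alongside the mean preserves spikes that an average alone would hide.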

Health Checks and Probes

  • Implement liveness probes for crash detection
  • Use readiness probes for traffic management
  • Create deep health checks that verify dependencies
  • Expose health endpoints in a standard format
  • Monitor health check latency as a metric
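
A deep health check that verifies dependencies and records its own latency might look like the sketch below. The response shape (`status`, `checks`, `latency_ms`) is an assumption for illustration rather than a specific standard, and the dependency checks are placeholders supplied by the caller.

```python
import time


def run_health_checks(checks):
    """Run each dependency check and build a structured health report.

    checks maps a dependency name to a zero-argument callable that
    raises an exception when the dependency is unhealthy.
    """
    results = {}
    healthy = True
    for name, check in checks.items():
        started = time.perf_counter()
        try:
            check()
            status = "pass"
        except Exception:
            status = "fail"
            healthy = False
        elapsed_ms = (time.perf_counter() - started) * 1000
        results[name] = {"status": status, "latency_ms": round(elapsed_ms, 3)}
    return {"status": "pass" if healthy else "fail", "checks": results}
```

A liveness probe would normally skip these dependency checks and only confirm the process responds, so that a failing database does not cause every instance to be restarted.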

Incident Response

  • Use monitoring data to detect incidents early
  • Correlate metrics, logs, and traces during investigation
  • Document findings and update monitoring post-incident
  • Track MTTR (Mean Time to Recovery) metrics
  • Conduct regular monitoring reviews and improvements
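
As a small illustration of the MTTR bullet above, the mean can be computed from incident records. The sketch assumes each incident is a `(detected_at, recovered_at)` pair of datetimes and that the list is non-empty; where the clock starts (detection versus first impact) is a definition each team has to choose.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average duration between detection and recovery.

    incidents is a non-empty list of (detected_at, recovered_at) datetimes.
    """
    durations = [recovered - detected for detected, recovered in incidents]
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incidents lasting 30 and 90 minutes.
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),
    (datetime(2024, 2, 9, 22, 15), datetime(2024, 2, 9, 23, 45)),
]
mttr = mean_time_to_recovery(incidents)  # one hour
```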

Capacity Planning

  • Track resource utilization trends
  • Set alerts for approaching capacity limits
  • Use forecasting for proactive scaling
  • Document capacity requirements and headroom
  • Review capacity quarterly
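
Forecasting for proactive scaling can start with something as simple as a straight-line fit over recent utilization. The sketch below assumes one usage sample per day and linear growth, which is a strong simplification; the function names are illustrative.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept for paired samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def days_until_limit(daily_usage, limit):
    """Days after the latest sample until the trend line reaches the limit.

    daily_usage holds at least two samples, one per day, oldest first.
    Returns None when usage is flat or shrinking.
    """
    days = list(range(len(daily_usage)))
    slope, intercept = linear_fit(days, daily_usage)
    if slope <= 0:
        return None
    crossing_day = (limit - intercept) / slope
    return max(0.0, crossing_day - days[-1])
```

For example, disk usage of 50, 52, 54, 56, and 58 percent over five days grows by 2 points per day, so `days_until_limit([50, 52, 54, 56, 58], 80)` returns `11.0`. Seasonal or bursty workloads need a richer model than a straight line.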

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.
