npx skills add Mindrally/skills --skill "monitoring-guidelines"
Install specific skill from multi-skill repository
# Description
Monitoring guidelines for applications and infrastructure including metrics collection, alerting strategies, and SLO-based monitoring
# SKILL.md
name: monitoring-guidelines
description: Monitoring guidelines for applications and infrastructure including metrics collection, alerting strategies, and SLO-based monitoring
Monitoring Guidelines
Apply these monitoring principles to ensure system reliability, performance visibility, and proactive issue detection.
Core Monitoring Principles
- Monitor the four golden signals: latency, traffic, errors, and saturation
- Implement monitoring as code for reproducibility
- Design monitoring around user experience and business impact
- Use SLOs (Service Level Objectives) to guide alerting decisions
- Balance comprehensive coverage with actionable insights
Key Metrics to Monitor
Application Metrics
- Request rate (requests per second)
- Error rate (percentage of failed requests)
- Response time (p50, p90, p95, p99 latencies)
- Active connections and concurrent users
- Queue depths and processing times
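As an illustration of the request rate, error rate, and latency percentile metrics listed above, here is a minimal sketch that derives them from a window of raw request samples. The `Request` type, the `summarize` helper, and the choice to count only HTTP 5xx responses as failures are assumptions made for this example; production systems usually rely on histogram metrics from their monitoring library rather than sorting raw samples.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical sample: one completed request."""
    duration_ms: float
    status: int  # HTTP status code


def percentile(sorted_values, p):
    """Nearest-rank percentile of an already-sorted list (p is 0-100)."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(requests, window_seconds):
    """Summarize one observation window of requests."""
    if not requests:
        raise ValueError("no requests in window")
    durations = sorted(r.duration_ms for r in requests)
    # Assumption: only 5xx responses count as failed requests.
    failed = sum(1 for r in requests if r.status >= 500)
    return {
        "request_rate": len(requests) / window_seconds,
        "error_rate_pct": 100 * failed / len(requests),
        "p50_ms": percentile(durations, 50),
        "p90_ms": percentile(durations, 90),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }
```

Reporting several percentiles rather than a single average keeps tail latency visible, which is why the list above names p50 through p99.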
Infrastructure Metrics
- CPU utilization and load average
- Memory usage and available memory
- Disk I/O and available storage
- Network throughput and error rates
- Container and pod health (for Kubernetes)
Business Metrics
- Transaction volumes and values
- User signups and conversions
- Feature usage and adoption rates
- Revenue-impacting events
- Customer satisfaction indicators
Alerting Strategy
Alert Design Principles
- Alert on symptoms, not causes
- Make alerts actionable with clear remediation steps
- Set appropriate severity levels (critical, warning, info)
- Avoid alert fatigue through proper threshold tuning
- Include runbook links in alert notifications
SLO-Based Alerting
- Define SLOs for critical user journeys
- Calculate error budgets and burn rates
- Alert when the error budget is being consumed faster than the SLO allows
- Use multi-window, multi-burn-rate alerts
- Review and adjust SLOs quarterly
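The error budget and burn rate ideas above can be sketched as follows. The error budget is `1 - SLO target`, and the burn rate is the observed error ratio divided by that budget, so a burn rate of 1 spends the budget exactly over the SLO period. The window names and the thresholds of 14.4 and 6 are assumptions borrowed from commonly cited values for a 30-day SLO; tune them for your own services.

```python
# Each rule: (severity, long window, short window, burn-rate threshold).
# Assumed values for illustration; both windows must exceed the threshold.
RULES = [
    ("page", "1h", "5m", 14.4),
    ("ticket", "6h", "30m", 6.0),
]


def burn_rate(error_ratio, slo_target):
    """How many times faster than sustainable the budget is being spent."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("SLO target must be below 1.0")
    return error_ratio / budget


def evaluate(error_ratios, slo_target, rules=RULES):
    """Return the first matching severity, or None if nothing should fire.

    error_ratios maps a window name (e.g. "1h") to the observed error ratio.
    """
    for severity, long_window, short_window, threshold in rules:
        long_burn = burn_rate(error_ratios[long_window], slo_target)
        short_burn = burn_rate(error_ratios[short_window], slo_target)
        if long_burn >= threshold and short_burn >= threshold:
            return severity
    return None
```

Requiring both the long and the short window to exceed the threshold means the alert fires only for a sustained problem that is still ongoing, and it stops firing soon after the error rate recovers.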
Alert Configuration
- Set meaningful thresholds based on baseline data
- Use hysteresis to prevent flapping alerts
- Implement alert dependencies to reduce noise
- Route alerts to appropriate teams
- Configure escalation policies
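Hysteresis, mentioned above, means using a higher threshold to start an alert than to clear it, so a value hovering around one threshold does not cause repeated fire and resolve cycles. A minimal sketch, with the class name and threshold values chosen only for illustration:

```python
class HysteresisAlert:
    """Fires above `fire_above`; clears only once below `clear_below`."""

    def __init__(self, fire_above, clear_below):
        if clear_below >= fire_above:
            raise ValueError("clear_below must be lower than fire_above")
        self.fire_above = fire_above
        self.clear_below = clear_below
        self.firing = False

    def update(self, value):
        """Feed one new measurement and return the current alert state."""
        if not self.firing and value > self.fire_above:
            self.firing = True
        elif self.firing and value < self.clear_below:
            self.firing = False
        return self.firing


# Example: CPU utilization alert that fires above 90% and clears below 80%.
alert = HysteresisAlert(fire_above=90, clear_below=80)
states = [alert.update(v) for v in [85, 91, 89, 92, 85, 79, 85]]
```

In this example the readings of 89 and 85 that follow the first breach do not clear the alert, because they are still above the lower threshold.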
Dashboard Design
Effective Dashboards
- Create overview dashboards for service health
- Build detailed dashboards for debugging
- Use consistent layouts and naming conventions
- Include time range selectors and drill-down capabilities
- Display SLO status prominently
Dashboard Content
- Show current state and recent trends
- Include comparison to baseline or previous periods
- Display deployment markers for correlation
- Add annotations for significant events
- Include links to related dashboards and logs
Monitoring Tools Integration
Data Collection
- Use agents or sidecars for metric collection
- Implement service discovery for dynamic environments
- Configure appropriate scrape intervals
- Choose push-based or pull-based collection according to the use case
- Ensure metric cardinality is manageable
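To make the cardinality point above concrete: every distinct combination of label values creates a separate time series, so a label such as a user ID can multiply storage and query cost. The sketch below counts distinct label sets per metric from a batch of samples. The function name and the sample shape are assumptions for illustration, not part of any particular monitoring tool.

```python
from collections import defaultdict


def series_cardinality(samples):
    """Count distinct label sets per metric name.

    samples: iterable of (metric_name, labels_dict) pairs.
    """
    seen = defaultdict(set)
    for name, labels in samples:
        # frozenset makes the label set hashable and order-independent.
        seen[name].add(frozenset(labels.items()))
    return {name: len(label_sets) for name, label_sets in seen.items()}


def over_budget(cardinality, max_series):
    """Return the metrics whose series count exceeds a chosen budget."""
    return sorted(name for name, count in cardinality.items() if count > max_series)
```

Running a check like this against a sample of scraped data can flag a metric whose label set is growing without bound before it affects the storage backend.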
Data Storage and Retention
- Set retention periods based on use case
- Implement downsampling for long-term storage
- Use appropriate storage backends for scale
- Plan for disaster recovery of monitoring data
- Monitor your monitoring infrastructure
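Downsampling for long-term storage, as listed above, typically replaces raw points with one aggregate per fixed time bucket. A minimal sketch that averages values per bucket, assuming points arrive as `(unix_timestamp, value)` pairs with integer timestamps (the function name and data shape are illustrative):

```python
def downsample(points, bucket_seconds):
    """Average raw points into fixed-width time buckets.

    points: iterable of (unix_timestamp, value) pairs.
    Returns a list of (bucket_start, mean_value), ordered by time.
    """
    buckets = {}
    for timestamp, value in points:
        bucket_start = timestamp - timestamp % bucket_seconds
        buckets.setdefault(bucket_start, []).append(value)
    return [
        (start, sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]
```

Averaging hides short spikes, so it can be worth keeping the minimum and maximum per bucket as well when peaks matter for later analysis.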
Health Checks and Probes
- Implement liveness probes for crash detection
- Use readiness probes for traffic management
- Create deep health checks that verify dependencies
- Expose health endpoints in a standard format
- Monitor health check latency as a metric
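A deep health check, as described above, runs a probe against each dependency and reports both the result and how long the probe took. The sketch below is one possible shape under stated assumptions: the `run_health_checks` name, the `pass`/`fail` status values, and the response layout are illustrative, and each check is a hypothetical callable that raises an exception on failure.

```python
import time


def run_health_checks(checks):
    """Run each dependency check and build a health report.

    checks: dict mapping a dependency name to a zero-argument callable
    that raises an exception when the dependency is unhealthy.
    """
    results = {}
    for name, check in checks.items():
        start = time.perf_counter()
        try:
            check()
            status, error = "pass", None
        except Exception as exc:  # any failure marks the dependency unhealthy
            status, error = "fail", str(exc)
        latency_ms = (time.perf_counter() - start) * 1000
        results[name] = {
            "status": status,
            "latency_ms": round(latency_ms, 2),
            "error": error,
        }
    healthy = all(r["status"] == "pass" for r in results.values())
    return {"status": "pass" if healthy else "fail", "checks": results}
```

A report like this suits a readiness endpoint. Liveness probes are usually kept shallow, since restarting a process rarely fixes an unavailable downstream dependency.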
Incident Response
- Use monitoring data to detect incidents early
- Correlate metrics, logs, and traces during investigation
- Document findings and update monitoring post-incident
- Track MTTR (Mean Time to Recovery) metrics
- Conduct regular monitoring reviews and improvements
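MTTR, mentioned above, is the average time between the start of an incident and recovery. A minimal sketch, assuming incidents are recorded as `(started_at, recovered_at)` datetime pairs (the function name and record shape are illustrative):

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average duration from incident start to recovery.

    incidents: list of (started_at, recovered_at) datetime pairs.
    """
    if not incidents:
        raise ValueError("no incidents recorded")
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)


# Example with two hypothetical incidents lasting 30 and 90 minutes.
example_incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),
    (datetime(2024, 2, 9, 14, 0), datetime(2024, 2, 9, 15, 30)),
]
example_mttr = mean_time_to_recovery(example_incidents)
```

Teams differ on whether the clock starts at the onset of impact or at detection, so state which definition is in use when tracking the trend over time.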
Capacity Planning
- Track resource utilization trends
- Set alerts for approaching capacity limits
- Use forecasting for proactive scaling
- Document capacity requirements and headroom
- Review capacity quarterly
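The forecasting step above can be approximated with a simple linear trend. The sketch below fits a least-squares line to daily utilization samples and estimates how many days remain until a capacity limit is reached. It assumes one sample per day and roughly linear growth; the function name and inputs are illustrative, and real workloads with seasonality need a richer model.

```python
def days_until_limit(daily_usage, limit):
    """Estimate days from the latest sample until `limit` is reached.

    daily_usage: list of utilization values, one per day, oldest first.
    Returns None when the fitted trend is flat or decreasing.
    """
    n = len(daily_usage)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_usage))
    slope = sxy / sxx
    if slope <= 0:
        return None  # no growth, so the limit is never reached on this trend
    intercept = mean_y - slope * mean_x
    day_at_limit = (limit - intercept) / slope
    return max(0.0, day_at_limit - (n - 1))
```

For example, disk usage of 50, 52, 54, 56, and 58 percent over five days grows by 2 points per day, so an 80 percent limit is about 11 days away from the latest sample. An alert on this estimate gives earlier warning than a static threshold on current utilization.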
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents.