Use when you have a written implementation plan to execute in a separate session with review checkpoints
npx skills add williamzujkowski/cognitive-toolworks --skill "SRE SLO/SLI Calculator"
Install specific skill from multi-skill repository
# Description
Alert rules based on error budget consumption
# SKILL.md
name: "SRE SLO/SLI Calculator"
slug: observability-slo-calculator
description: "Define and calculate SLIs, SLOs, SLAs, and error budgets with monitoring integration for site reliability engineering."
capabilities:
- define_slis_from_user_journey
- calculate_slo_thresholds
- compute_error_budgets
- generate_monitoring_queries
- design_alerting_policies
inputs:
service_type:
type: string
description: "Service category: api, frontend, batch, streaming, storage, or database"
required: true
user_journey:
type: array
description: "Critical user paths or workflows that define service success"
required: true
availability_target:
type: number
description: "Desired availability percentage (e.g., 99.9, 99.95, 99.99)"
required: false
monitoring_platform:
type: string
description: "Target platform: prometheus, datadog, cloudwatch, newrelic, or generic"
required: false
outputs:
sli_definitions:
type: json
description: "Service Level Indicators with measurement methods and data sources"
slo_targets:
type: json
description: "Service Level Objectives with thresholds and time windows"
error_budget:
type: json
description: "Error budget calculations with burn rate and remaining budget"
monitoring_queries:
type: code
description: "Platform-specific queries for SLI measurement"
alerting_policy:
type: json
description: "Alert rules based on error budget consumption"
keywords:
- sre
- slo
- sli
- sla
- error-budget
- reliability
- monitoring
- prometheus
- observability
- availability
version: 1.0.0
owner: william@cognitive-toolworks
license: MIT
security:
pii: false
secrets: false
sandbox: not_required
links:
- https://sre.google/sre-book/service-level-objectives/
- https://sre.google/workbook/implementing-slos/
- https://landing.google.com/sre/workbook/chapters/alerting-on-slos/
- https://prometheus.io/docs/practices/alerting/
Purpose & When-To-Use
Trigger conditions:
- New service needs reliability targets and SLO definition
- Existing service requires error budget calculation and tracking
- Stakeholders demand SLA commitments with measurable guarantees
- Incident postmortem reveals need for better reliability metrics
- Service experiences reliability issues; need objective measurement
- DevOps team adopting SRE practices and needs SLO framework
- Monitoring alerts are noisy; need error budget-based alerting
Use this skill when you need to establish measurable reliability targets (SLIs/SLOs), calculate error budgets, and integrate with monitoring platforms to track service reliability against user expectations.
Pre-Checks
Before execution, verify:
- Time normalization:
NOW_ET = 2025-10-25T21:30:36-04:00(NIST/time.gov semantics, America/New_York) - Input schema validation:
service_typeis one of:api,frontend,batch,streaming,storage,databaseuser_journeyis non-empty array describing critical user workflowsavailability_target(if provided) is between 90.0 and 100.0monitoring_platformis one of:prometheus,datadog,cloudwatch,newrelic,generic- Source freshness: All cited sources (Google SRE books, monitoring docs) accessed on
NOW_ET - Baseline data: Historical service metrics available for realistic SLO calibration
Abort conditions:
- Service has no measurable user-facing behavior (cannot define meaningful SLIs)
- No monitoring/observability infrastructure exists (cannot measure SLIs)
- Availability target >99.999% without justification (unrealistic for most services)
- User journey description is vague or non-specific
Procedure
Tier 1 (Fast Path)
Token budget: T1 ≤2k tokens
Scope: Common 80% case - standard service with typical reliability requirements.
Steps:
- Identify service golden signals based on
service_type(accessed 2025-10-25T21:30:36-04:00: https://sre.google/sre-book/monitoring-distributed-systems/): - API/Frontend: Request latency, error rate, throughput
- Batch: Job completion rate, processing time, data quality
- Streaming: Event processing lag, throughput, data loss rate
- Storage: Read/write latency, availability, durability
-
Database: Query latency, connection availability, replication lag
-
Map user journey to SLIs:
- For each critical path in
user_journey, identify primary golden signal - Define measurement window (rolling 30-day default per SRE Workbook, accessed 2025-10-25T21:30:36-04:00: https://sre.google/workbook/implementing-slos/)
-
Specify aggregation method (e.g., 99th percentile latency, error ratio)
-
Calculate SLO thresholds:
- Use
availability_targetor default to 99.9% for user-facing services - Convert availability to error budget:
error_budget = (1 - availability_target) * measurement_window -
Example: 99.9% over 30 days = 43.2 minutes downtime allowed
-
Generate basic monitoring query (Prometheus example):
promql # Availability SLI sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m])) -
Output: SLI definitions, SLO targets, error budget, and sample query
Tier 2 (Extended Design)
Token budget: T2 ≤6k tokens
Scope: Production services requiring comprehensive SRE implementation with alerting and error budget policies.
Steps:
- Detailed SLI specification (accessed 2025-10-25T21:30:36-04:00: https://sre.google/workbook/implementing-slos/):
- Availability SLI: Ratio of successful requests to total requests
- Latency SLI: Proportion of requests faster than threshold (e.g., 95% <200ms)
- Freshness SLI: For batch/streaming - data staleness acceptable threshold
- Correctness SLI: Data quality or functional correctness metric
-
Throughput SLI: Minimum requests/events processed per time window
-
Multi-window SLO definition:
- 30-day rolling window: Primary SLO for long-term reliability trends
- 7-day rolling window: Short-term health indicator
- 28-day sliding window: Monthly reporting alignment
-
Specify different thresholds per window if needed (looser short-term, stricter long-term)
-
Error budget calculation with burn rate (accessed 2025-10-25T21:30:36-04:00: https://landing.google.com/sre/workbook/chapters/alerting-on-slos/):
- Calculate total error budget for each time window
- Define burn rate thresholds:
- Fast burn (2% budget consumed in 1 hour): Page immediately
- Moderate burn (5% budget consumed in 6 hours): Alert on-call
- Slow burn (10% budget consumed in 3 days): Ticket for review
-
Generate burn rate queries:
promql # 1-hour burn rate (fast burn) (1 - (sum(rate(http_requests_total{status!~"5.."}[1h])) / sum(rate(http_requests_total[1h])))) > (14.4 * (1 - 0.999)) # 14.4x faster than budget allows -
Platform-specific query generation:
- Prometheus: PromQL with recording rules for SLI components
- Datadog: Datadog Query Language with monitors
- CloudWatch: CloudWatch Metrics Math expressions
- New Relic: NRQL queries with baseline
-
Include both SLI measurement and error budget consumption queries
-
Alerting policy design:
- Define alert severity levels (page, alert, ticket)
- Map burn rate to severity:
- Critical: >2% error budget consumed in 1 hour
- Warning: >5% error budget consumed in 6 hours
- Info: >10% error budget consumed in 3 days
-
Include alert content: current error rate, time to budget exhaustion, remediation guidance
-
SLA derivation (if customer-facing):
- SLA = SLO - safety margin (typically 0.1-0.5%)
- Example: Internal SLO 99.95%, external SLA 99.9%
-
Include consequences for SLA breach (credits, penalties)
-
Error budget policy:
- Define feature freeze conditions: when error budget <10% remaining in 30-day window
- Escalation path: SRE team → Engineering manager → VP Engineering
-
Recovery actions: halt risky deploys, prioritize reliability work
-
Comprehensive output:
- Detailed SLI definitions with measurement methodology
- Multi-window SLO targets
- Error budget calculations with burn rates
- Complete monitoring queries for chosen platform
- Alerting policy with severity mapping
- Error budget policy document
- (Optional) SLA terms if customer-facing
Sources cited (accessed 2025-10-25T21:30:36-04:00):
- Google SRE Book Ch. 4: https://sre.google/sre-book/service-level-objectives/
- Google SRE Workbook Ch. 2: https://sre.google/workbook/implementing-slos/
- Google SRE Workbook Ch. 5: https://landing.google.com/sre/workbook/chapters/alerting-on-slos/
- Prometheus Best Practices: https://prometheus.io/docs/practices/alerting/
Tier 3 (Deep Dive, not implemented)
Token budget: T3 not implemented for this skill
Rationale: T2 tier provides comprehensive SRE implementation with multi-window SLOs, error budget policies, and platform-specific monitoring integration. T3 would potentially cover highly specialized scenarios (custom SLI aggregation algorithms, multi-cluster federation, advanced anomaly detection) that are better handled through expert SRE consultation or custom tooling development.
Decision Rules
SLI selection criteria:
- Choose availability SLI if user journey depends on service being reachable (most common)
- Choose latency SLI if user experience degrades with slow responses (interactive services)
- Choose freshness SLI if data staleness impacts user decisions (analytics, dashboards)
- Choose correctness SLI if wrong data is worse than no data (financial, healthcare)
SLO threshold selection:
- 99.9% (three nines): Standard for most user-facing services; 43.2 min downtime/month
- 99.95% (three nines five): High-value services; 21.6 min downtime/month
- 99.99% (four nines): Mission-critical services; 4.32 min downtime/month
- <99.9%: Internal tools, non-critical batch jobs
- >99.99%: Rarely justified; requires significant investment and architectural complexity
Ambiguity thresholds:
- If
user_journeycontains >10 critical paths → ask user to prioritize top 3-5 - If
availability_targetnot specified → default to 99.9% for user-facing, 99.5% for internal - If service has multiple user types (free vs paid) → ask if different SLOs needed per tier
- If no historical data available → start with conservative SLO (99.0%), iterate after 1 month
Abort/stop conditions:
- User cannot articulate what "service working" means → SLI definition impossible
- Requested SLO (e.g., 99.999%) exceeds architectural capabilities → warn and recommend realistic target
- Monitoring platform cannot measure proposed SLI → suggest alternative SLI or platform upgrade
Output Contract
Required fields:
{
"sli_definitions": [
{
"name": "string (e.g., 'API Availability')",
"type": "availability | latency | freshness | correctness | throughput",
"description": "string (what this SLI measures)",
"measurement_method": "string (how to calculate)",
"data_source": "string (metrics endpoint, logs, traces)",
"good_events": "string (numerator definition)",
"total_events": "string (denominator definition)",
"threshold": "number (for latency/freshness SLIs)",
"unit": "string (ms, %, count)"
}
],
"slo_targets": [
{
"sli_name": "string (references SLI above)",
"target": "number (e.g., 99.9)",
"window": "string (30d, 7d, 28d)",
"window_type": "rolling | calendar",
"budget_threshold": "number (alert threshold %)"
}
],
"error_budget": {
"total_minutes_30d": "number (allowable downtime)",
"burn_rate_thresholds": {
"critical_1h": "number (14.4x for 99.9%)",
"warning_6h": "number (6x for 99.9%)",
"info_3d": "number (1x for 99.9%)"
},
"current_consumption": "number | null (if historical data available)"
},
"monitoring_queries": {
"platform": "string (prometheus, datadog, etc.)",
"sli_queries": ["array of query strings"],
"error_budget_queries": ["array of query strings"]
},
"alerting_policy": [
{
"severity": "critical | warning | info",
"condition": "string (burn rate threshold)",
"notification": "string (page, email, ticket)",
"duration": "string (alert evaluation window)"
}
]
}
Optional fields:
sla_terms: Customer-facing SLA commitments (if applicable)error_budget_policy: Freeze/escalation proceduresdashboard_config: Monitoring dashboard layout recommendations
Validation:
- All SLI names must be unique
- SLO targets must reference defined SLIs
- Error budget calculations must align with SLO targets
- Monitoring queries must be syntactically valid for target platform
Examples
Input:
{
"service_type": "api",
"user_journey": [
"User searches products",
"User adds item to cart",
"User completes checkout"
],
"availability_target": 99.9,
"monitoring_platform": "prometheus"
}
Output (abbreviated):
{
"sli_definitions": [
{
"name": "API Availability",
"type": "availability",
"measurement_method": "ratio of successful HTTP responses (2xx/3xx) to total requests",
"good_events": "http_requests_total{status=~\"2..|3..\"}",
"total_events": "http_requests_total"
}
],
"slo_targets": [{"sli_name": "API Availability", "target": 99.9, "window": "30d"}],
"error_budget": {"total_minutes_30d": 43.2},
"monitoring_queries": {
"platform": "prometheus",
"sli_queries": [
"sum(rate(http_requests_total{status=~\"2..|3..\"}[5m])) / sum(rate(http_requests_total[5m]))"
]
}
}
Quality Gates
Token budgets:
- T1: ≤2k tokens for simple SLI/SLO definition with basic monitoring
- T2: ≤6k tokens for comprehensive SRE implementation with error budget policies
Safety requirements:
- Validate SLO targets are achievable given current architecture
- Warn if SLO >99.99% (requires significant investment)
- Ensure error budget policies prevent excessive risk-taking
Auditability:
- All SLI definitions include explicit measurement methodology
- Monitoring queries are reproducible and version-controlled
- Error budget calculations show math and assumptions
Determinism:
- Same inputs produce same SLI definitions
- Error budget formulas are consistent with SRE literature
- Burn rate thresholds use standard multipliers (1x, 6x, 14.4x)
Resources
Primary sources:
- Google SRE Book: https://sre.google/sre-book/table-of-contents/
- Google SRE Workbook: https://sre.google/workbook/table-of-contents/
- Prometheus Documentation: https://prometheus.io/docs/
- Datadog SLO Monitoring: https://docs.datadoghq.com/monitors/service_level_objectives/
Reference implementations:
- Prometheus SLO Recording Rules (accessed 2025-10-25T21:30:36-04:00)
- Sloth SLO Generator (accessed 2025-10-25T21:30:36-04:00)
- OpenSLO Specification (accessed 2025-10-25T21:30:36-04:00)
Additional reading:
- The Art of SLOs (book): https://www.alex-hidalgo.com/the-art-of-slos
- SLO Workshop: https://www.usenix.org/conference/srecon19americas/presentation/fong-jones
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents:
Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.