Install this skill from the multi-skill repository:

npx skills add cosmix/loom --skill "prometheus"
# SKILL.md
---
name: prometheus
description: |
  Prometheus monitoring and alerting for cloud-native observability.
  USE WHEN: Writing PromQL queries, configuring Prometheus scrape targets, creating alerting rules, setting up recording rules, instrumenting applications with Prometheus metrics, configuring service discovery.
  DO NOT USE: For building dashboards (use /grafana), for log analysis (use /logging-observability), for general observability architecture (use senior-infrastructure-engineer).
  TRIGGERS: metrics, prometheus, promql, counter, gauge, histogram, summary, alert, alertmanager, alerting rule, recording rule, scrape, target, label, service discovery, relabeling, exporter, instrumentation, slo, error budget.
triggers:
  - metrics
  - prometheus
  - promql
  - counter
  - gauge
  - histogram
  - summary
  - alert
  - alertmanager
  - alerting rule
  - recording rule
  - scrape
  - target
  - label
  - service discovery
  - relabeling
  - exporter
  - instrumentation
  - slo
  - error budget
allowed-tools: Read, Grep, Glob, Edit, Write, Bash
---
Prometheus Monitoring and Alerting
Overview
Prometheus is an open-source monitoring and alerting system designed for reliability and scalability in cloud-native environments. It stores multi-dimensional time-series data and provides flexible querying through PromQL.
Architecture Components
- Prometheus Server: Core component that scrapes and stores time-series data with local TSDB
- Alertmanager: Handles alerts, deduplication, grouping, routing, and notifications to receivers
- Pushgateway: Allows ephemeral jobs to push metrics (use sparingly - prefer pull model)
- Exporters: Convert metrics from third-party systems to Prometheus format (node, blackbox, etc.)
- Client Libraries: Instrument application code (Go, Java, Python, Rust, etc.)
- Prometheus Operator: Kubernetes-native deployment and management via CRDs
- Remote Storage: Long-term storage via Thanos, Cortex, Mimir for multi-cluster federation
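To make the moving parts concrete, a minimal single-host wiring of the core components might look like the following Docker Compose sketch (an illustration only; image tags, file paths, and the decision to run everything on one host are assumptions):

```yaml
# Illustrative wiring of Prometheus, Alertmanager, and node_exporter on one host
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml   # scrape config (see below)
      - ./alerts:/etc/prometheus/alerts                    # alerting/recording rules
    ports:
      - "9090:9090"
  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
```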
Data Model
- Metrics: Time-series data identified by metric name and key-value labels
- Format: `metric_name{label1="value1", label2="value2"} sample_value timestamp`
- Metric Types:
  - Counter: Monotonically increasing value (requests, errors) - use `rate()` or `increase()` for querying
  - Gauge: Value that can go up/down (temperature, memory usage, queue length)
  - Histogram: Observations in configurable buckets (latency, request size) - exposes `_bucket`, `_sum`, `_count`
  - Summary: Similar to histogram but calculates quantiles client-side - use histograms when quantiles must be aggregated
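For reference, this is roughly what a scrape of an instrumented application's `/metrics` endpoint returns in the text exposition format (sample values are made up; metric names match the examples used throughout this skill):

```text
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/users",status="200"} 1027
# HELP active_connections Number of active connections
# TYPE active_connections gauge
active_connections 42
# HELP http_request_duration_seconds HTTP request duration in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 940
http_request_duration_seconds_bucket{le="0.5"} 1020
http_request_duration_seconds_bucket{le="+Inf"} 1027
http_request_duration_seconds_sum 123.4
http_request_duration_seconds_count 1027
```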
Setup and Configuration
Basic Prometheus Server Configuration
# prometheus.yml
global:
scrape_interval: 15s
scrape_timeout: 10s
evaluation_interval: 15s
external_labels:
cluster: "production"
region: "us-east-1"
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
# Load rules files
rule_files:
- "alerts/*.yml"
- "rules/*.yml"
# Scrape configurations
scrape_configs:
# Prometheus itself
- job_name: "prometheus"
static_configs:
- targets: ["localhost:9090"]
# Application services
- job_name: "application"
metrics_path: "/metrics"
static_configs:
- targets:
- "app-1:8080"
- "app-2:8080"
labels:
env: "production"
team: "backend"
# Kubernetes service discovery
- job_name: "kubernetes-pods"
kubernetes_sd_configs:
- role: pod
relabel_configs:
# Only scrape pods with prometheus.io/scrape annotation
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
# Use custom metrics path if specified
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
# Use custom port if specified
- source_labels:
[__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
# Add namespace label
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
# Add pod name label
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
# Add service name label
- source_labels: [__meta_kubernetes_pod_label_app]
action: replace
target_label: app
# Node Exporter for host metrics
- job_name: "node-exporter"
static_configs:
- targets:
- "node-exporter:9100"
Alertmanager Configuration
# alertmanager.yml
global:
resolve_timeout: 5m
slack_api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
pagerduty_url: "https://events.pagerduty.com/v2/enqueue"
# Template files for custom notifications
templates:
- "/etc/alertmanager/templates/*.tmpl"
# Route alerts to appropriate receivers
route:
group_by: ["alertname", "cluster", "service"]
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
receiver: "default"
routes:
# Critical alerts go to PagerDuty
- match:
severity: critical
receiver: "pagerduty"
continue: true
# Database alerts to DBA team
- match:
team: database
receiver: "dba-team"
group_by: ["alertname", "instance"]
# Development environment alerts
- match:
env: development
receiver: "slack-dev"
group_wait: 5m
repeat_interval: 4h
# Inhibition rules (suppress alerts)
inhibit_rules:
# Suppress warning alerts if critical alert is firing
- source_match:
severity: "critical"
target_match:
severity: "warning"
equal: ["alertname", "instance"]
# Suppress instance alerts if entire service is down
- source_match:
alertname: "ServiceDown"
target_match_re:
alertname: ".*"
equal: ["service"]
receivers:
- name: "default"
slack_configs:
- channel: "#alerts"
title: "Alert: {{ .GroupLabels.alertname }}"
text: "{{ range .Alerts }}{{ .Annotations.description }}{{ end }}"
- name: "pagerduty"
pagerduty_configs:
- service_key: "YOUR_PAGERDUTY_SERVICE_KEY"
description: "{{ .GroupLabels.alertname }}"
- name: "dba-team"
slack_configs:
- channel: "#database-alerts"
email_configs:
- to: "[email protected]"
headers:
Subject: "Database Alert: {{ .GroupLabels.alertname }}"
- name: "slack-dev"
slack_configs:
- channel: "#dev-alerts"
send_resolved: true
Best Practices
Metric Naming Conventions
Follow these naming patterns for consistency:
# Format: <namespace>_<subsystem>_<metric>_<unit>
# Counters (always use _total suffix)
http_requests_total
http_request_errors_total
cache_hits_total
# Gauges
memory_usage_bytes
active_connections
queue_size
# Histograms (use _bucket, _sum, _count suffixes automatically)
http_request_duration_seconds
response_size_bytes
db_query_duration_seconds
# Use consistent base units
- seconds for duration (not milliseconds)
- bytes for size (not kilobytes)
- ratio for percentages (0.0-1.0, not 0-100)
Label Cardinality Management
DO
# Good: Bounded cardinality
http_requests_total{method="GET", status="200", endpoint="/api/users"}
# Good: Reasonable number of label values
db_queries_total{table="users", operation="select"}
DON'T
# Bad: Unbounded cardinality (user IDs, email addresses, timestamps)
http_requests_total{user_id="12345"}
http_requests_total{email="[email protected]"}
http_requests_total{timestamp="1234567890"}
# Bad: High cardinality (full URLs, IP addresses)
http_requests_total{url="/api/users/12345/profile"}
http_requests_total{client_ip="192.168.1.100"}
Guidelines
- Keep label values to < 10 per label (ideally)
- Total unique time-series per metric should be < 10,000
- Use recording rules to pre-aggregate high-cardinality metrics
- Avoid labels with unbounded values (IDs, timestamps, user input)
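One practical way to enforce the last point is to normalize unbounded request paths into a small, fixed set of endpoint values before they ever become labels. A minimal Python sketch, assuming hypothetical route patterns:

```python
import re

# Hypothetical route templates: anything containing an ID collapses to one label value
ROUTE_PATTERNS = [
    (re.compile(r"^/api/users/\d+/profile$"), "/api/users/:id/profile"),
    (re.compile(r"^/api/users/\d+$"), "/api/users/:id"),
]

def normalize_endpoint(path: str) -> str:
    """Map a raw request path to a bounded endpoint label value."""
    for pattern, template in ROUTE_PATTERNS:
        if pattern.match(path):
            return template
    return "other"  # catch-all keeps cardinality bounded even for unexpected paths

# Result: http_requests_total{endpoint="/api/users/:id"} -- one series per route,
# instead of one series per user ID
```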
Recording Rules for Performance
Use recording rules to pre-compute expensive queries:
# rules/recording_rules.yml
groups:
- name: performance_rules
interval: 30s
rules:
# Pre-calculate request rates
- record: job:http_requests:rate5m
expr: sum(rate(http_requests_total[5m])) by (job)
# Pre-calculate error rates
- record: job:http_request_errors:rate5m
expr: sum(rate(http_request_errors_total[5m])) by (job)
# Pre-calculate error ratio
- record: job:http_request_error_ratio:rate5m
expr: |
job:http_request_errors:rate5m
/
job:http_requests:rate5m
# Pre-aggregate latency percentiles
- record: job:http_request_duration_seconds:p95
expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))
- record: job:http_request_duration_seconds:p99
expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))
- name: aggregation_rules
interval: 1m
rules:
# Multi-level aggregation for dashboards
- record: instance:node_cpu_utilization:ratio
expr: |
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)
- record: cluster:node_cpu_utilization:ratio
expr: avg(instance:node_cpu_utilization:ratio)
# Memory aggregation
- record: instance:node_memory_utilization:ratio
expr: |
1 - (
node_memory_MemAvailable_bytes
/
node_memory_MemTotal_bytes
)
Alert Design (Symptoms vs Causes)
Alert on symptoms (user-facing impact), not causes
# alerts/symptom_based.yml
groups:
- name: symptom_alerts
rules:
# GOOD: Alert on user-facing symptoms
- alert: HighErrorRate
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
) > 0.05
for: 5m
labels:
severity: critical
team: backend
annotations:
summary: "High error rate detected"
description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
runbook: "https://wiki.example.com/runbooks/high-error-rate"
- alert: HighLatency
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
) > 1
for: 5m
labels:
severity: warning
team: backend
annotations:
summary: "High latency on {{ $labels.service }}"
description: "P95 latency is {{ $value }}s (threshold: 1s)"
impact: "Users experiencing slow page loads"
# GOOD: SLO-based alerting
- alert: SLOBudgetBurnRate
expr: |
(
1 - (
sum(rate(http_requests_total{status!~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
)
) > (14.4 * (1 - 0.999)) # 14.4x burn rate for 99.9% SLO
for: 5m
labels:
severity: critical
team: sre
annotations:
summary: "SLO budget burning too fast"
description: "At current rate, monthly error budget will be exhausted in {{ $value | humanizeDuration }}"
Cause-based alerts (use for debugging, not paging)
# alerts/cause_based.yml
groups:
- name: infrastructure_alerts
rules:
# Lower severity for infrastructure issues
- alert: HighMemoryUsage
expr: |
(
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
) / node_memory_MemTotal_bytes > 0.9
for: 10m
labels:
severity: warning # Not critical unless symptoms appear
team: infrastructure
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is {{ $value | humanizePercentage }}"
- alert: DiskSpaceLow
expr: |
(
node_filesystem_avail_bytes{mountpoint="/"}
/
node_filesystem_size_bytes{mountpoint="/"}
) < 0.1
for: 5m
labels:
severity: warning
team: infrastructure
annotations:
summary: "Low disk space on {{ $labels.instance }}"
description: "Only {{ $value | humanizePercentage }} disk space remaining"
action: "Clean up logs or expand disk"
Alert Best Practices
- For duration: Use the `for` clause to avoid flapping
- Meaningful annotations: Include summary, description, runbook URL, impact
- Proper severity levels: critical (page immediately), warning (ticket), info (log)
- Actionable alerts: Every alert should require human action
- Include context: Add labels for team ownership, service, environment
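Rules that follow these practices can also be unit-tested before they reach production with `promtool test rules`. A minimal sketch, assuming the HighErrorRate rule above lives in `alerts/symptom_based.yml` and that the file layout and series values here are illustrative:

```yaml
# tests/high_error_rate_test.yml -- run with: promtool test rules tests/high_error_rate_test.yml
rule_files:
  - ../alerts/symptom_based.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # ~10% of requests fail, which is above the 5% threshold
      - series: 'http_requests_total{status="500"}'
        values: '0+10x15'
      - series: 'http_requests_total{status="200"}'
        values: '0+90x15'
    alert_rule_test:
      - eval_time: 10m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
              team: backend
```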
PromQL Query Patterns
PromQL is the query language for Prometheus. Key concepts: instant vectors, range vectors, scalar, string literals, selectors, operators, functions, and aggregation.
Selectors and Matchers
# Instant vector selector (latest sample for each time-series)
http_requests_total
# Filter by label values
http_requests_total{method="GET", status="200"}
# Regex matching (=~) and negative regex (!~)
http_requests_total{status=~"5.."} # 5xx errors
http_requests_total{endpoint!~"/admin.*"} # exclude admin endpoints
# Label absence/presence
http_requests_total{job="api", status=""} # empty label
http_requests_total{job="api", status!=""} # non-empty label
# Range vector selector (samples over time)
http_requests_total[5m] # last 5 minutes of samples
Rate Calculations
# Request rate (requests per second) - ALWAYS use rate() for counters
rate(http_requests_total[5m])
# Sum by service
sum(rate(http_requests_total[5m])) by (service)
# Increase over time window (total count) - for alerts/dashboards showing total
increase(http_requests_total[1h])
# irate() for volatile, fast-moving counters (more sensitive to spikes)
irate(http_requests_total[5m])
Error Ratios
# Error rate ratio
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# Success rate
sum(rate(http_requests_total{status=~"2.."}[5m]))
/
sum(rate(http_requests_total[5m]))
Histogram Queries
# P95 latency
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
# P50, P95, P99 latency by service
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
# Average request duration
sum(rate(http_request_duration_seconds_sum[5m])) by (service)
/
sum(rate(http_request_duration_seconds_count[5m])) by (service)
Aggregation Operations
# Sum across all instances
sum(node_memory_MemTotal_bytes) by (cluster)
# Average CPU usage
avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)
# Maximum value
max(http_request_duration_seconds) by (service)
# Minimum value
min(node_filesystem_avail_bytes) by (instance)
# Count number of instances
count(up == 1) by (job)
# Standard deviation
stddev(http_request_duration_seconds) by (service)
Advanced Queries
# Top 5 services by request rate
topk(5, sum(rate(http_requests_total[5m])) by (service))
# Bottom 3 instances by available memory
bottomk(3, node_memory_MemAvailable_bytes)
# Predict disk full time (linear regression)
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 4 * 3600) < 0
# Compare with 1 day ago
http_requests_total - http_requests_total offset 1d
# Rate of change (derivative)
deriv(node_memory_MemAvailable_bytes[5m])
# Absent metric detection
absent(up{job="critical-service"})
Complex Aggregations
# Calculate Apdex score (Application Performance Index)
# Buckets are cumulative, so (satisfied_bucket + tolerated_bucket) / 2 / total
# is equivalent to (satisfied + 0.5 * tolerating) / total
(
  sum(rate(http_request_duration_seconds_bucket{le="0.1"}[5m]))
  +
  sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
) / 2
/
sum(rate(http_request_duration_seconds_count[5m]))
# Multi-window multi-burn-rate SLO
(
sum(rate(http_requests_total{status=~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
> 0.001 * 14.4
)
and
(
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
> 0.001 * 14.4
)
Binary Operators and Vector Matching
# Arithmetic operators (+, -, *, /, %, ^)
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
# Comparison operators (==, !=, >, <, >=, <=) - filter to matching values
http_request_duration_seconds > 1
# Logical operators (and, or, unless)
up{job="api"} and rate(http_requests_total[5m]) > 100
# One-to-one matching (default)
method:http_requests:rate5m / method:http_requests:total
# Many-to-one matching with group_left
sum(rate(http_requests_total[5m])) by (instance, method)
/ on(instance) group_left
sum(rate(http_requests_total[5m])) by (instance)
# One-to-many matching with group_right
sum(rate(http_requests_total[5m])) by (instance)
/ on(instance) group_right
sum(rate(http_requests_total[5m])) by (instance, method)
Time Functions and Offsets
# Compare with previous time period
rate(http_requests_total[5m]) / rate(http_requests_total[5m] offset 1h)
# Day-over-day comparison
http_requests_total - http_requests_total offset 1d
# Time-based filtering
http_requests_total and on() (hour() >= 9 and hour() < 17) # business hours only
day_of_week() == 0 or day_of_week() == 6 # weekends (0 = Sunday, 6 = Saturday)
# Timestamp functions
time() - process_start_time_seconds # uptime in seconds
Service Discovery
Prometheus supports multiple service discovery mechanisms for dynamic environments where targets appear and disappear.
Static Configuration
scrape_configs:
- job_name: 'static-targets'
static_configs:
- targets:
- 'host1:9100'
- 'host2:9100'
labels:
env: production
region: us-east-1
File-based Service Discovery
scrape_configs:
- job_name: 'file-sd'
file_sd_configs:
- files:
- '/etc/prometheus/targets/*.json'
- '/etc/prometheus/targets/*.yml'
refresh_interval: 30s
# targets/webservers.json
[
{
"targets": ["web1:8080", "web2:8080"],
"labels": {
"job": "web",
"env": "prod"
}
}
]
Kubernetes Service Discovery
scrape_configs:
# Pod-based discovery
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
namespaces:
names:
- production
- staging
relabel_configs:
# Keep only pods with prometheus.io/scrape=true annotation
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
# Extract custom scrape path from annotation
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
# Extract custom port from annotation
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
# Add standard Kubernetes labels
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
target_label: kubernetes_pod_name
# Service-based discovery
- job_name: 'kubernetes-services'
kubernetes_sd_configs:
- role: service
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
action: replace
target_label: __scheme__
regex: (https?)
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
# Node-based discovery (for node exporters)
- job_name: 'kubernetes-nodes'
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics
# Endpoints discovery (for service endpoints)
- job_name: 'kubernetes-endpoints'
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_endpoint_port_name]
action: keep
regex: metrics
Consul Service Discovery
scrape_configs:
- job_name: 'consul-services'
consul_sd_configs:
- server: 'consul.example.com:8500'
datacenter: 'dc1'
services: ['web', 'api', 'cache']
tags: ['production']
relabel_configs:
- source_labels: [__meta_consul_service]
target_label: service
- source_labels: [__meta_consul_tags]
target_label: tags
EC2 Service Discovery
scrape_configs:
- job_name: 'ec2-instances'
ec2_sd_configs:
- region: us-east-1
access_key: YOUR_ACCESS_KEY
secret_key: YOUR_SECRET_KEY
port: 9100
filters:
- name: tag:Environment
values: [production]
- name: instance-state-name
values: [running]
relabel_configs:
- source_labels: [__meta_ec2_tag_Name]
target_label: instance_name
- source_labels: [__meta_ec2_availability_zone]
target_label: availability_zone
- source_labels: [__meta_ec2_instance_type]
target_label: instance_type
DNS Service Discovery
scrape_configs:
- job_name: 'dns-srv-records'
dns_sd_configs:
- names:
- '_prometheus._tcp.example.com'
type: 'SRV'
refresh_interval: 30s
relabel_configs:
- source_labels: [__meta_dns_name]
target_label: instance
Relabeling Actions Reference
| Action | Description | Use Case |
|---|---|---|
| `keep` | Keep targets where regex matches source labels | Filter targets by annotation/label |
| `drop` | Drop targets where regex matches source labels | Exclude specific targets |
| `replace` | Replace target label with value from source labels | Extract custom labels/paths/ports |
| `labelmap` | Map source label names to target labels via regex | Copy all Kubernetes labels |
| `labeldrop` | Drop labels matching regex | Remove internal metadata labels |
| `labelkeep` | Keep only labels matching regex | Reduce cardinality |
| `hashmod` | Set target label to hash of source labels modulo N | Sharding/routing |
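The target-level actions (`keep`, `drop`, `replace`, `hashmod`) run in `relabel_configs` before scraping, while the label-level actions (`labeldrop`, `labelkeep`) are most often applied in `metric_relabel_configs` after scraping. A hedged sketch, with assumed label names:

```yaml
scrape_configs:
  - job_name: "application"
    static_configs:
      - targets: ["app-1:8080"]
    metric_relabel_configs:
      # Remove a noisy internal label from every ingested series
      - action: labeldrop
        regex: pod_template_hash
      # Or keep only a known-safe allowlist (regex must include __name__
      # so the metric name itself survives)
      # - action: labelkeep
      #   regex: (__name__|job|instance|method|status|endpoint)
```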
High Availability and Scalability
Prometheus High Availability Setup
# Deploy multiple identical Prometheus instances scraping same targets
# Use external labels to distinguish instances
global:
external_labels:
replica: prometheus-1 # Change to prometheus-2, etc.
cluster: production
# Alertmanager will deduplicate alerts from multiple Prometheus instances
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager-1:9093
- alertmanager-2:9093
- alertmanager-3:9093
Alertmanager Clustering
# alertmanager.yml - HA cluster configuration
global:
resolve_timeout: 5m
route:
receiver: 'default'
group_by: ['alertname', 'cluster']
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
receivers:
- name: 'default'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK'
channel: '#alerts'
# Start Alertmanager cluster members
# alertmanager-1: --cluster.peer=alertmanager-2:9094 --cluster.peer=alertmanager-3:9094
# alertmanager-2: --cluster.peer=alertmanager-1:9094 --cluster.peer=alertmanager-3:9094
# alertmanager-3: --cluster.peer=alertmanager-1:9094 --cluster.peer=alertmanager-2:9094
Federation for Hierarchical Monitoring
# Global Prometheus federating from regional instances
scrape_configs:
- job_name: 'federate'
scrape_interval: 15s
honor_labels: true
metrics_path: '/federate'
params:
'match[]':
# Pull aggregated metrics only
- '{job="prometheus"}'
- '{__name__=~"job:.*"}' # Recording rules
- 'up'
static_configs:
- targets:
- 'prometheus-us-east-1:9090'
- 'prometheus-us-west-2:9090'
- 'prometheus-eu-west-1:9090'
labels:
region: 'us-east-1'
Remote Storage for Long-term Retention
# Prometheus remote write to Thanos/Cortex/Mimir
remote_write:
- url: "http://thanos-receive:19291/api/v1/receive"
queue_config:
capacity: 10000
max_shards: 50
min_shards: 1
max_samples_per_send: 5000
batch_send_deadline: 5s
min_backoff: 30ms
max_backoff: 100ms
write_relabel_configs:
# Drop high-cardinality metrics before remote write
- source_labels: [__name__]
regex: 'go_.*'
action: drop
# Prometheus remote read from long-term storage
remote_read:
- url: "http://thanos-query:9090/api/v1/read"
read_recent: true
Thanos Architecture for Global View
# Thanos Sidecar - runs alongside Prometheus
thanos sidecar \
--prometheus.url=http://localhost:9090 \
--tsdb.path=/prometheus \
--objstore.config-file=/etc/thanos/bucket.yml \
--grpc-address=0.0.0.0:10901 \
--http-address=0.0.0.0:10902
# Thanos Store - queries object storage
thanos store \
--data-dir=/var/thanos/store \
--objstore.config-file=/etc/thanos/bucket.yml \
--grpc-address=0.0.0.0:10901 \
--http-address=0.0.0.0:10902
# Thanos Query - global query interface
thanos query \
--http-address=0.0.0.0:9090 \
--grpc-address=0.0.0.0:10901 \
--store=prometheus-1-sidecar:10901 \
--store=prometheus-2-sidecar:10901 \
--store=thanos-store:10901
# Thanos Compactor - downsample and compact blocks
thanos compact \
--data-dir=/var/thanos/compact \
--objstore.config-file=/etc/thanos/bucket.yml \
--retention.resolution-raw=30d \
--retention.resolution-5m=90d \
--retention.resolution-1h=365d
Horizontal Sharding with Hashmod
# Split scrape targets across multiple Prometheus instances using hashmod
scrape_configs:
- job_name: 'kubernetes-pods-shard-0'
kubernetes_sd_configs:
- role: pod
relabel_configs:
# Hash pod name and keep only shard 0 (mod 3)
- source_labels: [__meta_kubernetes_pod_name]
modulus: 3
target_label: __tmp_hash
action: hashmod
- source_labels: [__tmp_hash]
regex: "0"
action: keep
- job_name: 'kubernetes-pods-shard-1'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_name]
modulus: 3
target_label: __tmp_hash
action: hashmod
- source_labels: [__tmp_hash]
regex: "1"
action: keep
# shard-2 similar pattern...
Kubernetes Integration
ServiceMonitor for Prometheus Operator
# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: app-metrics
namespace: monitoring
labels:
app: myapp
release: prometheus
spec:
# Select services to monitor
selector:
matchLabels:
app: myapp
# Define namespaces to search
namespaceSelector:
matchNames:
- production
- staging
# Endpoint configuration
endpoints:
- port: metrics # Service port name
path: /metrics
interval: 30s
scrapeTimeout: 10s
# Relabeling
relabelings:
- sourceLabels: [__meta_kubernetes_pod_name]
targetLabel: pod
- sourceLabels: [__meta_kubernetes_namespace]
targetLabel: namespace
# Metric relabeling (filter/modify metrics)
metricRelabelings:
- sourceLabels: [__name__]
regex: "go_.*"
action: drop # Drop Go runtime metrics
- sourceLabels: [status]
regex: "[45].."
targetLabel: error
replacement: "true"
# Optional: TLS configuration
# tlsConfig:
# insecureSkipVerify: true
# ca:
# secret:
# name: prometheus-tls
# key: ca.crt
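A ServiceMonitor selects Kubernetes Services, so for the example above to discover anything, a Service with the matching label and a named metrics port has to exist. A minimal sketch (names assumed):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
  namespace: production
  labels:
    app: myapp            # matched by spec.selector.matchLabels in the ServiceMonitor
spec:
  selector:
    app: myapp            # routes to the application pods
  ports:
    - name: metrics       # must match endpoints[].port in the ServiceMonitor
      port: 8080
      targetPort: 8080
```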
PodMonitor for Direct Pod Scraping
# podmonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
name: app-pods
namespace: monitoring
labels:
release: prometheus
spec:
# Select pods to monitor
selector:
matchLabels:
app: myapp
# Namespace selection
namespaceSelector:
matchNames:
- production
# Pod metrics endpoints
podMetricsEndpoints:
- port: metrics
path: /metrics
interval: 15s
# Relabeling
relabelings:
- sourceLabels: [__meta_kubernetes_pod_label_version]
targetLabel: version
- sourceLabels: [__meta_kubernetes_pod_node_name]
targetLabel: node
PrometheusRule for Alerts and Recording Rules
# prometheusrule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: app-rules
namespace: monitoring
labels:
release: prometheus
role: alert-rules
spec:
groups:
- name: app_alerts
interval: 30s
rules:
- alert: HighErrorRate
expr: |
(
sum(rate(http_requests_total{status=~"5..", app="myapp"}[5m]))
/
sum(rate(http_requests_total{app="myapp"}[5m]))
) > 0.05
for: 5m
labels:
severity: critical
team: backend
annotations:
summary: "High error rate on {{ $labels.namespace }}/{{ $labels.pod }}"
description: "Error rate is {{ $value | humanizePercentage }}"
dashboard: "https://grafana.example.com/d/app-overview"
runbook: "https://wiki.example.com/runbooks/high-error-rate"
- alert: PodCrashLooping
expr: |
rate(kube_pod_container_status_restarts_total[15m]) > 0
for: 5m
labels:
severity: warning
annotations:
summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
description: "Container {{ $labels.container }} has restarted {{ $value }} times in 15m"
- name: app_recording_rules
interval: 30s
rules:
- record: app:http_requests:rate5m
expr: sum(rate(http_requests_total{app="myapp"}[5m])) by (namespace, pod, method, status)
- record: app:http_request_duration_seconds:p95
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket{app="myapp"}[5m])) by (le, namespace, pod)
)
Prometheus Custom Resource
# prometheus.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
name: prometheus
namespace: monitoring
spec:
replicas: 2
version: v2.45.0
# Service account for Kubernetes API access
serviceAccountName: prometheus
# Select ServiceMonitors
serviceMonitorSelector:
matchLabels:
release: prometheus
# Select PodMonitors
podMonitorSelector:
matchLabels:
release: prometheus
# Select PrometheusRules
ruleSelector:
matchLabels:
release: prometheus
role: alert-rules
# Resource limits
resources:
requests:
memory: 2Gi
cpu: 1000m
limits:
memory: 4Gi
cpu: 2000m
# Storage
storage:
volumeClaimTemplate:
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 50Gi
storageClassName: fast-ssd
# Retention
retention: 30d
retentionSize: 45GB
# Alertmanager configuration
alerting:
alertmanagers:
- namespace: monitoring
name: alertmanager
port: web
# External labels
externalLabels:
cluster: production
region: us-east-1
# Security context
securityContext:
fsGroup: 2000
runAsNonRoot: true
runAsUser: 1000
# Enable admin API for management operations
enableAdminAPI: false
# Additional scrape configs (from Secret)
additionalScrapeConfigs:
name: additional-scrape-configs
key: prometheus-additional.yaml
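The `additionalScrapeConfigs` field above references a Secret that has to be created separately; one way to do that, assuming the file and key names match the spec above:

```bash
# prometheus-additional.yaml holds raw scrape_configs entries not managed by ServiceMonitors
kubectl create secret generic additional-scrape-configs \
  --from-file=prometheus-additional.yaml=./prometheus-additional.yaml \
  --namespace monitoring
```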
Application Instrumentation Examples
Go Application
// main.go
package main
import (
"net/http"
"time"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
"github.com/prometheus/client_golang/prometheus/promhttp"
)
var (
// Counter for total requests
httpRequestsTotal = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "http_requests_total",
Help: "Total number of HTTP requests",
},
[]string{"method", "endpoint", "status"},
)
// Histogram for request duration
httpRequestDuration = promauto.NewHistogramVec(
prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Help: "HTTP request duration in seconds",
Buckets: []float64{.001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
},
[]string{"method", "endpoint"},
)
// Gauge for active connections
activeConnections = promauto.NewGauge(
prometheus.GaugeOpts{
Name: "active_connections",
Help: "Number of active connections",
},
)
// Summary for response sizes
responseSizeBytes = promauto.NewSummaryVec(
prometheus.SummaryOpts{
Name: "http_response_size_bytes",
Help: "HTTP response size in bytes",
Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
},
[]string{"endpoint"},
)
)
// Middleware to instrument HTTP handlers
func instrumentHandler(endpoint string, handler http.HandlerFunc) http.HandlerFunc {
return func(w http.ResponseWriter, r *http.Request) {
start := time.Now()
activeConnections.Inc()
defer activeConnections.Dec()
// Wrap response writer to capture status code
wrapped := &responseWriter{ResponseWriter: w, statusCode: 200}
handler(wrapped, r)
duration := time.Since(start).Seconds()
httpRequestDuration.WithLabelValues(r.Method, endpoint).Observe(duration)
        // Record the numeric status code so PromQL matchers like status=~"5.." keep working
        httpRequestsTotal.WithLabelValues(r.Method, endpoint,
            strconv.Itoa(wrapped.statusCode)).Inc()
}
}
type responseWriter struct {
http.ResponseWriter
statusCode int
}
func (rw *responseWriter) WriteHeader(code int) {
rw.statusCode = code
rw.ResponseWriter.WriteHeader(code)
}
func handleUsers(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Content-Type", "application/json")
w.Write([]byte(`{"users": []}`))
}
func main() {
// Register handlers
http.HandleFunc("/api/users", instrumentHandler("/api/users", handleUsers))
http.Handle("/metrics", promhttp.Handler())
// Start server
http.ListenAndServe(":8080", nil)
}
Python Application (Flask)
# app.py
from flask import Flask, request
from prometheus_client import Counter, Histogram, Gauge, generate_latest, CONTENT_TYPE_LATEST
import time
app = Flask(__name__)
# Define metrics
request_count = Counter(
'http_requests_total',
'Total HTTP requests',
['method', 'endpoint', 'status']
)
request_duration = Histogram(
'http_request_duration_seconds',
'HTTP request duration in seconds',
['method', 'endpoint'],
buckets=[.001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)
active_requests = Gauge(
'active_requests',
'Number of active requests'
)
# Middleware for instrumentation
@app.before_request
def before_request():
active_requests.inc()
request.start_time = time.time()
@app.after_request
def after_request(response):
active_requests.dec()
duration = time.time() - request.start_time
request_duration.labels(
method=request.method,
endpoint=request.endpoint or 'unknown'
).observe(duration)
request_count.labels(
method=request.method,
endpoint=request.endpoint or 'unknown',
status=response.status_code
).inc()
return response
@app.route('/metrics')
def metrics():
    # Serve the text exposition format with the correct content type
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}
@app.route('/api/users')
def users():
return {'users': []}
if __name__ == '__main__':
app.run(host='0.0.0.0', port=8080)
Production Deployment Checklist
- [ ] Set appropriate retention period (balance storage vs history needs)
- [ ] Configure persistent storage with adequate size
- [ ] Enable high availability (multiple Prometheus replicas or federation)
- [ ] Set up remote storage for long-term retention (Thanos, Cortex, Mimir)
- [ ] Configure service discovery for dynamic environments
- [ ] Implement recording rules for frequently-used queries
- [ ] Create symptom-based alerts with proper annotations
- [ ] Set up Alertmanager with appropriate routing and receivers
- [ ] Configure inhibition rules to reduce alert noise
- [ ] Add runbook URLs to all critical alerts
- [ ] Implement proper label hygiene (avoid high cardinality)
- [ ] Monitor Prometheus itself (meta-monitoring; see the sketch after this checklist)
- [ ] Set up authentication and authorization
- [ ] Enable TLS for scrape targets and remote storage
- [ ] Configure rate limiting for queries
- [ ] Test alert and recording rule validity (`promtool check rules`)
- [ ] Implement backup and disaster recovery procedures
- [ ] Document metric naming conventions for the team
- [ ] Create dashboards in Grafana for common queries
- [ ] Set up log aggregation alongside metrics (Loki)
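For the meta-monitoring item above, a reasonable starting point is to alert on Prometheus's own health metrics. A hedged sketch (thresholds and windows are assumptions):

```yaml
groups:
  - name: prometheus_meta_monitoring
    rules:
      - alert: PrometheusTargetMissing
        expr: up{job="prometheus"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus self-scrape target is down"
      - alert: PrometheusRuleEvaluationFailures
        expr: increase(prometheus_rule_evaluation_failures_total[15m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus is failing to evaluate rules"
      - alert: PrometheusNotConnectedToAlertmanager
        expr: prometheus_notifications_alertmanagers_discovered < 1
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus has no Alertmanager connected"
```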
Troubleshooting Commands
# Check Prometheus configuration syntax
promtool check config prometheus.yml
# Check rules file syntax
promtool check rules alerts/*.yml
# Test PromQL queries
promtool query instant http://localhost:9090 'up'
# Check which targets are up
curl http://localhost:9090/api/v1/targets
# Query current metric values
curl 'http://localhost:9090/api/v1/query?query=up'
# Check service discovery
curl http://localhost:9090/api/v1/targets/metadata
# View TSDB stats
curl http://localhost:9090/api/v1/status/tsdb
# Check runtime information
curl http://localhost:9090/api/v1/status/runtimeinfo
Quick Reference
Common PromQL Patterns
# Request rate per second
rate(http_requests_total[5m])
# Error ratio percentage
100 * sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# P95 latency from histogram
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Average latency from histogram
sum(rate(http_request_duration_seconds_sum[5m])) / sum(rate(http_request_duration_seconds_count[5m]))
# Memory utilization percentage
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
# CPU utilization (non-idle)
100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))
# Disk space remaining percentage
100 * node_filesystem_avail_bytes / node_filesystem_size_bytes
# Top 5 endpoints by request rate
topk(5, sum(rate(http_requests_total[5m])) by (endpoint))
# Service uptime in days
(time() - process_start_time_seconds) / 86400
# Request rate growth compared to 1 hour ago
rate(http_requests_total[5m]) / rate(http_requests_total[5m] offset 1h)
Alert Rule Patterns
# High error rate (symptom-based)
alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "Error rate is {{ $value | humanizePercentage }}"
runbook: "https://runbooks.example.com/high-error-rate"
# High latency P95
alert: HighLatency
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
) > 1
for: 5m
labels:
severity: warning
# Service down
alert: ServiceDown
expr: up{job="critical-service"} == 0
for: 2m
labels:
severity: critical
# Disk space low (cause-based, warning only)
alert: DiskSpaceLow
expr: |
node_filesystem_avail_bytes{mountpoint="/"}
/ node_filesystem_size_bytes{mountpoint="/"} < 0.1
for: 10m
labels:
severity: warning
# Pod crash looping
alert: PodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
for: 5m
labels:
severity: warning
Recording Rule Naming Convention
# Format: level:metric:operations
# level = aggregation level (job, instance, cluster)
# metric = base metric name
# operations = transformations applied (rate5m, sum, ratio)
groups:
- name: aggregation_rules
rules:
# Instance-level aggregation
- record: instance:node_cpu_utilization:ratio
expr: 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)
# Job-level aggregation
- record: job:http_requests:rate5m
expr: sum(rate(http_requests_total[5m])) by (job)
# Job-level error ratio
- record: job:http_request_errors:ratio
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
/ sum(rate(http_requests_total[5m])) by (job)
# Cluster-level aggregation
- record: cluster:cpu_utilization:ratio
expr: avg(instance:node_cpu_utilization:ratio)
Metric Naming Best Practices
| Pattern | Good Example | Bad Example |
|---|---|---|
| Counter suffix | `http_requests_total` | `http_requests` |
| Base units | `http_request_duration_seconds` | `http_request_duration_ms` |
| Ratio range | `cache_hit_ratio` (0.0-1.0) | `cache_hit_percentage` (0-100) |
| Byte units | `response_size_bytes` | `response_size_kb` |
| Namespace prefix | `myapp_http_requests_total` | `http_requests_total` |
| Label naming | `{method="GET", status="200"}` | `{httpMethod="GET", statusCode="200"}` |
Label Cardinality Guidelines
| Cardinality | Examples | Recommendation |
|---|---|---|
| Low (<10) | HTTP method, status code, environment | Safe for all labels |
| Medium (10-100) | API endpoint, service name, pod name | Safe with aggregation |
| High (100-1000) | Container ID, hostname | Use only when necessary |
| Unbounded | User ID, IP address, timestamp, URL path | Never use as label |
Kubernetes Annotation-based Scraping
# Pod annotations for automatic Prometheus scraping
apiVersion: v1
kind: Pod
metadata:
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
prometheus.io/scheme: "http"
spec:
containers:
- name: app
image: myapp:latest
ports:
- containerPort: 8080
name: metrics
Alertmanager Routing Patterns
route:
receiver: default
group_by: ['alertname', 'cluster']
routes:
# Critical alerts to PagerDuty
- match:
severity: critical
receiver: pagerduty
continue: true # Also send to default
# Team-based routing
- match:
team: database
receiver: dba-team
group_by: ['alertname', 'instance']
# Environment-based routing
- match:
env: development
receiver: slack-dev
repeat_interval: 4h
# Time-based routing (office hours only)
- match:
severity: warning
receiver: email
active_time_intervals:
- business-hours
time_intervals:
- name: business-hours
time_intervals:
- times:
- start_time: '09:00'
end_time: '17:00'
weekdays: ['monday:friday']