# grafana
# Install this skill

npx skills add cosmix/loom --skill "grafana"

This command installs the grafana skill from the cosmix/loom multi-skill repository.

# Description

Observability visualization with Grafana and LGTM stack. Dashboard design, panel configuration, alerting, variables/templating, and data sources.

# SKILL.md


---
name: grafana
description: |
  Observability visualization with Grafana and LGTM stack. Dashboard design, panel configuration, alerting, variables/templating, and data sources.

  USE WHEN: Creating Grafana dashboards, configuring panels and visualizations, writing LogQL/TraceQL queries, setting up Grafana data sources, configuring dashboard variables and templates, building Grafana alerts.
  DO NOT USE: For writing PromQL queries (use /prometheus), for alerting rule strategy (use /prometheus), for general observability architecture (use senior-infrastructure-engineer).

  TRIGGERS: grafana, dashboard, panel, visualization, logql, traceql, loki, tempo, mimir, data source, annotation, variable, template, row, stat, graph, table, heatmap, gauge, bar chart, pie chart, time series, logs panel, traces panel, LGTM stack.
triggers:
- grafana
- dashboard
- panel
- visualization
- logql
- traceql
- loki
- tempo
- mimir
- data source
- annotation
- variable
- template
- row
- stat
- graph
- table
- heatmap
- gauge
- bar chart
- pie chart
- time series
- logs panel
- traces panel
- LGTM stack
allowed-tools: Read, Grep, Glob, Edit, Write, Bash
---


Grafana and LGTM Stack Skill

Overview

The LGTM stack provides a complete observability solution with comprehensive visualization and dashboard capabilities:

  • Loki: Log aggregation and querying (LogQL)
  • Grafana: Visualization, dashboarding, alerting, and exploration
  • Tempo: Distributed tracing (TraceQL)
  • Mimir: Long-term metrics storage (Prometheus-compatible)

This skill covers setup, configuration, dashboard creation, panel design, querying, alerting, templating, and production observability best practices.

When to Use This Skill

Primary Use Cases

  • Creating or modifying Grafana dashboards
  • Designing panels and visualizations (graphs, stats, tables, heatmaps, etc.)
  • Writing queries (PromQL, LogQL, TraceQL)
  • Configuring data sources (Prometheus, Loki, Tempo, Mimir)
  • Setting up alerting rules and notification policies
  • Implementing dashboard variables and templates
  • Dashboard provisioning and GitOps workflows
  • Troubleshooting observability queries
  • Analyzing application performance, errors, or system behavior

Who Uses This Skill

  • senior-infrastructure-engineer (PRIMARY): Production observability setup, LGTM stack deployment, dashboard architecture
  • software-engineer: Application dashboards, service metrics visualization
  • devops-engineer: Infrastructure monitoring, deployment dashboards

LGTM Stack Components

Loki - Log Aggregation

Architecture - Loki

Horizontally scalable log aggregation inspired by Prometheus

  • Indexes only metadata (labels), not log content
  • Cost-effective storage with object stores (S3, GCS, etc.)
  • LogQL query language similar to PromQL

Key Concepts - Loki

  • Labels for indexing (low cardinality)
  • Log streams identified by unique label sets
  • Parsers: logfmt, JSON, regex, pattern
  • Line filters and label filters
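
A small illustration of how these pieces fit together (the label and field names are illustrative; LogQL querying is covered in depth later in this skill):

{namespace="production", app="api"}   # stream selector: indexed, low-cardinality labels
  |= "timeout"                        # line filter on the raw log content
  | json                              # parser: extract per-line fields (not indexed)
  | level="error"                     # label filter on an extracted field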

Grafana - Visualization

Features

  • Multi-datasource dashboarding
  • Panel types: Graph, Stat, Table, Heatmap, Bar Chart, Pie Chart, Gauge, Logs, Traces, Time Series
  • Templating and variables for dynamic dashboards
  • Alerting (unified alerting with contact points and notification policies)
  • Dashboard provisioning and GitOps integration
  • Role-based access control (RBAC)
  • Explore mode for ad-hoc queries
  • Annotations for event markers
  • Dashboard folders and organization

Tempo - Distributed Tracing

Architecture - Tempo

Scalable distributed tracing backend

  • Cost-effective trace storage
  • TraceQL for trace querying
  • Integration with logs and metrics (trace-to-logs, trace-to-metrics)
  • OpenTelemetry compatible

Mimir - Metrics Storage

Architecture - Mimir

Horizontally scalable long-term Prometheus storage

  • Multi-tenancy support
  • Query federation
  • High availability
  • Prometheus remote_write compatible (see the example below)
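
As a sketch, a Prometheus instance can ship metrics to Mimir with a remote_write block like the following. The push endpoint matches the one used by Tempo's metrics_generator later in this document; the tenant header is only needed when multi-tenancy is enabled, and the tenant name is illustrative.

# prometheus.yml (excerpt)
remote_write:
  - url: http://mimir:9009/api/v1/push
    headers:
      X-Scope-OrgID: tenant-1        # illustrative tenant ID; only required with multi-tenancy
    queue_config:
      max_samples_per_send: 2000     # batch size per remote-write shard
      max_shards: 10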

Dashboard Design and Best Practices

Dashboard Organization Principles

  1. Hierarchy: Overview -> Service -> Component -> Deep Dive
  2. Golden Signals: Latency, Traffic, Errors, Saturation (RED/USE method; see the example queries after this list)
  3. Variable-driven: Use templates for flexibility across environments
  4. Consistent Layouts: Grid alignment (24-column grid), logical top-to-bottom flow
  5. Performance: Limit queries, use query caching, appropriate time intervals
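
As an illustration of the Golden Signals / RED approach, the core panels of a service dashboard often reduce to three queries like these (metric and label names are illustrative and match the dashboard example later in this skill):

# Rate: requests per second
sum(rate(http_requests_total{app="$app"}[$__rate_interval]))

# Errors: fraction of requests returning 5xx
sum(rate(http_requests_total{app="$app", status=~"5.."}[$__rate_interval]))
/
sum(rate(http_requests_total{app="$app"}[$__rate_interval]))

# Duration: P95 latency from a histogram
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{app="$app"}[$__rate_interval])) by (le)
)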

Panel Types and When to Use Them

| Panel Type | Use Case | Best For |
| --- | --- | --- |
| Time Series / Graph | Trends over time | Request rates, latency, resource usage |
| Stat | Single metric value | Error rates, current values, percentages |
| Gauge | Progress toward limit | CPU usage, memory, disk space |
| Bar Gauge | Comparative values | Top N items, distribution |
| Table | Structured data | Service lists, error details, resource inventory |
| Pie Chart | Proportions | Traffic distribution, error breakdown |
| Heatmap | Distribution over time | Latency percentiles, request patterns |
| Logs | Log streams | Error investigation, debugging |
| Traces | Distributed tracing | Performance analysis, dependency mapping |

Panel Configuration Best Practices

Titles and Descriptions

  • Clear, descriptive titles: Include units and metric context
  • Tooltips: Add description fields for panel documentation
  • Examples:
      • Good: "P95 Latency (seconds) by Endpoint"
      • Bad: "Latency"

Legends and Labels

  • Show legends only when needed (multiple series)
  • Use {{label}} format for dynamic legend names
  • Place legends appropriately (bottom, right, or hidden)
  • Sort by value when showing Top N

Axes and Units

  • Always label axes with units
  • Use appropriate unit formats (seconds, bytes, percent, requests/sec)
  • Set reasonable min/max ranges to avoid misleading scales
  • Use logarithmic scales for wide value ranges

Thresholds and Colors

  • Use thresholds for visual cues (green/yellow/red)
  • Standard threshold pattern:
      • Green: Normal operation
      • Yellow: Warning (action may be needed)
      • Red: Critical (immediate attention required)
  • Examples:
      • Error rate: 0% (green), 1% (yellow), 5% (red)
      • P95 latency: <1s (green), 1-3s (yellow), >3s (red)
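
In panel JSON, thresholds for the error-rate example above might look like this sketch of the fieldConfig block (the 1%/5% steps mirror the illustrative values above):

"fieldConfig": {
  "defaults": {
    "unit": "percentunit",
    "thresholds": {
      "mode": "absolute",
      "steps": [
        { "color": "green", "value": null },
        { "color": "yellow", "value": 0.01 },
        { "color": "red", "value": 0.05 }
      ]
    }
  }
}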

Links and Drill-Downs

  • Link panels to related dashboards
  • Use data links for context (logs, traces, related services)
  • Create drill-down paths: Overview -> Service -> Component -> Details
  • Link to runbooks for alert panels

Dashboard Variables and Templating

Dashboard variables enable reusable, dynamic dashboards that work across environments, services, and time ranges.

Variable Types

| Type | Purpose | Example |
| --- | --- | --- |
| Query | Populate from data source | Namespaces, services, pods |
| Custom | Static list of options | Environments (prod/staging/dev) |
| Interval | Time interval selection | Auto-adjusted query intervals |
| Datasource | Switch between data sources | Multiple Prometheus instances |
| Constant | Hidden values for queries | Cluster name, region |
| Text box | Free-form input | Custom filters |

Common Variable Patterns

{
  "templating": {
    "list": [
      {
        "name": "datasource",
        "type": "datasource",
        "query": "prometheus",
        "description": "Select Prometheus data source"
      },
      {
        "name": "namespace",
        "type": "query",
        "datasource": "${datasource}",
        "query": "label_values(kube_pod_info, namespace)",
        "multi": true,
        "includeAll": true,
        "description": "Kubernetes namespace filter"
      },
      {
        "name": "app",
        "type": "query",
        "datasource": "${datasource}",
        "query": "label_values(kube_pod_info{namespace=~\"$namespace\"}, app)",
        "multi": true,
        "includeAll": true,
        "description": "Application filter (depends on namespace)"
      },
      {
        "name": "interval",
        "type": "interval",
        "auto": true,
        "auto_count": 30,
        "auto_min": "10s",
        "options": ["1m", "5m", "15m", "30m", "1h", "6h", "12h", "1d"],
        "description": "Query resolution interval"
      },
      {
        "name": "environment",
        "type": "custom",
        "options": [
          { "text": "Production", "value": "prod" },
          { "text": "Staging", "value": "staging" },
          { "text": "Development", "value": "dev" }
        ],
        "current": { "text": "Production", "value": "prod" }
      }
    ]
  }
}

Variable Usage in Queries

Variables are referenced with $variable_name or ${variable_name} syntax:

# Simple variable reference
rate(http_requests_total{namespace="$namespace"}[5m])

# Multi-select with regex match
rate(http_requests_total{namespace=~"$namespace"}[5m])

# Variable in legend
sum by (method) (rate(http_requests_total{app="$app"}[5m]))
# Legend format: "{{method}}"

# Using interval variable for adaptive queries
rate(http_requests_total[$__interval])

# Chained variables (app depends on namespace)
rate(http_requests_total{namespace="$namespace", app="$app"}[5m])

Advanced Variable Techniques

Regex filtering:

{
  "name": "pod",
  "type": "query",
  "query": "label_values(kube_pod_info{namespace=\"$namespace\"}, pod)",
  "regex": "/^$app-.*/",
  "description": "Filter pods by app prefix"
}

All option with custom value:

{
  "name": "status",
  "type": "custom",
  "options": ["200", "404", "500"],
  "includeAll": true,
  "allValue": ".*",
  "description": "HTTP status code filter"
}

Dependent variables (variable chain):
1. $datasource (datasource type)
2. $cluster (query: depends on datasource)
3. $namespace (query: depends on cluster)
4. $app (query: depends on namespace)
5. $pod (query: depends on app)
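
The corresponding variable queries each narrow on the variables above them, using the same label_values pattern shown earlier (label names are illustrative):

cluster:   label_values(kube_pod_info, cluster)
namespace: label_values(kube_pod_info{cluster=~"$cluster"}, namespace)
app:       label_values(kube_pod_info{namespace=~"$namespace"}, app)
pod:       label_values(kube_pod_info{namespace=~"$namespace", app=~"$app"}, pod)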

Annotations

Annotations display events as vertical markers on time series panels:

{
  "annotations": {
    "list": [
      {
        "name": "Deployments",
        "datasource": "Prometheus",
        "expr": "changes(kube_deployment_spec_replicas{namespace=\"$namespace\"}[5m])",
        "tagKeys": "deployment,namespace",
        "textFormat": "Deployment: {{deployment}}",
        "iconColor": "blue"
      },
      {
        "name": "Alerts",
        "datasource": "Loki",
        "expr": "{app=\"alertmanager\"} | json | alertname!=\"\"",
        "textFormat": "Alert: {{alertname}}",
        "iconColor": "red"
      }
    ]
  }
}

Dashboard Performance Optimization

Query Optimization

  • Limit number of panels (< 15 per dashboard)
  • Use appropriate time ranges (avoid queries over months)
  • Leverage $__interval for adaptive sampling
  • Avoid high-cardinality grouping (too many series)
  • Use query caching when available

Panel Performance

  • Set max data points to reasonable values
  • Use instant queries for current-state panels
  • Combine related metrics into single queries when possible
  • Disable auto-refresh on heavy dashboards
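
For current-state panels such as Stat or Gauge, an instant query with a capped number of data points keeps load low. A hedged sketch of the relevant panel JSON fields (the query is illustrative):

{
  "type": "stat",
  "maxDataPoints": 100,
  "interval": "1m",
  "targets": [
    {
      "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m]))",
      "instant": true,
      "refId": "A"
    }
  ]
}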

Dashboard as Code and Provisioning

Dashboard Provisioning

Dashboard provisioning enables GitOps workflows and version-controlled dashboard definitions.

Provisioning Provider Configuration

File: /etc/grafana/provisioning/dashboards/dashboards.yaml

apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/provisioning/dashboards

  - name: 'application'
    orgId: 1
    folder: 'Applications'
    type: file
    disableDeletion: true
    editable: false
    options:
      path: /var/lib/grafana/dashboards/application

  - name: 'infrastructure'
    orgId: 1
    folder: 'Infrastructure'
    type: file
    options:
      path: /var/lib/grafana/dashboards/infrastructure

Dashboard JSON Structure

Complete dashboard JSON with metadata and provisioning:

{
  "dashboard": {
    "title": "Application Observability - ${app}",
    "uid": "app-observability",
    "tags": ["observability", "application"],
    "timezone": "browser",
    "editable": true,
    "graphTooltip": 1,
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "refresh": "30s",
    "templating": { "list": [] },
    "panels": [],
    "links": []
  },
  "overwrite": true,
  "folderId": null,
  "folderUid": null
}
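
This payload shape is what the Grafana dashboard HTTP API expects. As a sketch, a CI job could push it with something like the following (the hostname, token variable, and file name are illustrative):

curl -sS -X POST "https://grafana.example.com/api/dashboards/db" \
  -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d @app-observability-dashboard.json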

Kubernetes ConfigMap Provisioning

apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  application-dashboard.json: |
    {
      "dashboard": {
        "title": "Application Metrics",
        "uid": "app-metrics",
        "tags": ["application"],
        "panels": []
      }
    }

Grafana Operator (CRD)

apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: application-observability
  namespace: monitoring
spec:
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
  json: |
    {
      "dashboard": {
        "title": "Application Observability",
        "panels": []
      }
    }

Data Source Provisioning

Loki Data Source

File: /etc/grafana/provisioning/datasources/loki.yaml

apiVersion: 1

datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      maxLines: 1000
      derivedFields:
        - datasourceUid: tempo_uid
          matcherRegex: "trace_id=(\\w+)"
          name: TraceID
          url: "$${__value.raw}"
    editable: false

Tempo Data Source

File: /etc/grafana/provisioning/datasources/tempo.yaml

apiVersion: 1

datasources:
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    uid: tempo_uid
    jsonData:
      httpMethod: GET
      tracesToLogs:
        datasourceUid: loki_uid
        tags: ["job", "instance", "pod", "namespace"]
        mappedTags: [{ key: "service.name", value: "service" }]
        spanStartTimeShift: "1h"
        spanEndTimeShift: "1h"
      tracesToMetrics:
        datasourceUid: prometheus_uid
        tags: [{ key: "service.name", value: "service" }]
      serviceMap:
        datasourceUid: prometheus_uid
      nodeGraph:
        enabled: true
    editable: false

Mimir/Prometheus Data Source

File: /etc/grafana/provisioning/datasources/mimir.yaml

apiVersion: 1

datasources:
  - name: Mimir
    type: prometheus
    access: proxy
    url: http://mimir:8080/prometheus
    uid: prometheus_uid
    jsonData:
      httpMethod: POST
      exemplarTraceIdDestinations:
        - datasourceUid: tempo_uid
          name: trace_id
      prometheusType: Mimir
      prometheusVersion: 2.40.0
      cacheLevel: "High"
      incrementalQuerying: true
      incrementalQueryOverlapWindow: 10m
    editable: false

Alerting

Alert Rule Configuration

Grafana unified alerting supports multi-datasource alerts with flexible evaluation and routing.

Prometheus/Mimir Alert Rule

File: /etc/grafana/provisioning/alerting/rules.yaml

apiVersion: 1

groups:
  - name: application_alerts
    interval: 1m
    rules:
      - uid: error_rate_high
        title: High Error Rate
        condition: A
        data:
          - refId: A
            queryType: ""
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: prometheus_uid
            model:
              expr: |
                sum(rate(http_requests_total{status=~"5.."}[5m]))
                /
                sum(rate(http_requests_total[5m]))
                > 0.05
              intervalMs: 1000
              maxDataPoints: 43200
        noDataState: NoData
        execErrState: Error
        for: 5m
        annotations:
          description: 'Error rate is {{ printf "%.2f" $values.A.Value }} (as a ratio; threshold: 0.05)'
          summary: Application error rate is above threshold
          runbook_url: https://wiki.company.com/runbooks/high-error-rate
        labels:
          severity: critical
          team: platform
        isPaused: false

      - uid: high_latency
        title: High P95 Latency
        condition: A
        data:
          - refId: A
            datasourceUid: prometheus_uid
            model:
              expr: |
                histogram_quantile(0.95,
                  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
                ) > 2
        for: 10m
        annotations:
          description: "P95 latency is {{ $values.A.Value }}s on endpoint {{ $labels.endpoint }}"
          runbook_url: https://wiki.company.com/runbooks/high-latency
        labels:
          severity: warning

Loki Alert Rule

apiVersion: 1

groups:
  - name: log_based_alerts
    interval: 1m
    rules:
      - uid: error_spike
        title: Error Log Spike
        condition: A
        data:
          - refId: A
            queryType: ""
            datasourceUid: loki_uid
            model:
              expr: |
                sum(rate({app="api"} | json | level="error" [5m]))
                > 10
        for: 2m
        annotations:
          description: "Error log rate is {{ $values.A.Value }} logs/sec"
          summary: Spike in error logs detected
        labels:
          severity: warning

      - uid: critical_error_pattern
        title: Critical Error Pattern Detected
        condition: A
        data:
          - refId: A
            datasourceUid: loki_uid
            model:
              expr: |
                sum(count_over_time({app="api"}
                  |~ "OutOfMemoryError|StackOverflowError|FatalException" [5m]
                )) > 0
        for: 0m
        annotations:
          description: "Critical error pattern found in logs"
        labels:
          severity: critical
          page: true

Contact Points and Notification Policies

File: /etc/grafana/provisioning/alerting/contactpoints.yaml

apiVersion: 1

contactPoints:
  - orgId: 1
    name: slack-critical
    receivers:
      - uid: slack_critical
        type: slack
        settings:
          url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
          title: "{{ .GroupLabels.alertname }}"
          text: |
            {{ range .Alerts }}
            *Alert:* {{ .Labels.alertname }}
            *Summary:* {{ .Annotations.summary }}
            *Description:* {{ .Annotations.description }}
            *Severity:* {{ .Labels.severity }}
            {{ end }}
        disableResolveMessage: false

  - orgId: 1
    name: pagerduty-oncall
    receivers:
      - uid: pagerduty_oncall
        type: pagerduty
        settings:
          integrationKey: YOUR_INTEGRATION_KEY
          severity: critical
          class: infrastructure

  - orgId: 1
    name: email-team
    receivers:
      - uid: email_team
        type: email
        settings:
          addresses: team@company.com
          singleEmail: true

notificationPolicies:
  - orgId: 1
    receiver: slack-critical
    group_by: ["alertname", "namespace"]
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 4h
    routes:
      - receiver: pagerduty-oncall
        matchers:
          - severity = critical
          - page = true
        group_wait: 10s
        repeat_interval: 1h
        continue: true

      - receiver: email-team
        matchers:
          - severity = warning
          - team = platform
        group_interval: 10m
        repeat_interval: 12h

LogQL Query Patterns

Basic Log Queries

Stream Selection

# Simple label matching
{namespace="production", app="api"}

# Regex matching
{app=~"api|web|worker"}

# Not equal
{env!="staging"}

# Multiple conditions
{namespace="production", app="api", level!="debug"}

Line Filters

# Contains
{app="api"} |= "error"

# Does not contain
{app="api"} != "debug"

# Regex match
{app="api"} |~ "error|exception|fatal"

# Case insensitive
{app="api"} |~ "(?i)error"

# Chaining filters
{app="api"} |= "error" != "timeout"

Parsing and Extraction

JSON Parsing

# Parse JSON logs
{app="api"} | json

# Extract specific fields
{app="api"} | json message="msg", level="severity"

# Filter on extracted field
{app="api"} | json | level="error"

# Nested JSON
{app="api"} | json | line_format "{{.response.status}}"

Logfmt Parsing

# Parse logfmt (key=value)
{app="api"} | logfmt

# Extract specific fields
{app="api"} | logfmt level, caller, msg

# Filter parsed fields
{app="api"} | logfmt | level="error"

Pattern Parsing

# Extract with pattern
{app="nginx"} | pattern `<ip> - - <_> "<method> <uri> <_>" <status> <_>`

# Filter on extracted values
{app="nginx"} | pattern `<_> <status> <_>` | status >= 400

# Complex pattern
{app="api"} | pattern `level=<level> msg="<msg>" duration=<duration>ms`

Aggregations and Metrics

Count Queries

# Count log lines over time
count_over_time({app="api"}[5m])

# Rate of logs
rate({app="api"}[5m])

# Errors per second
sum(rate({app="api"} |= "error" [5m])) by (namespace)

# Error ratio
sum(rate({app="api"} |= "error" [5m]))
/
sum(rate({app="api"}[5m]))

Extracted Metrics

# Average duration
avg_over_time({app="api"}
  | logfmt
  | unwrap duration [5m]) by (endpoint)

# P95 latency
quantile_over_time(0.95, {app="api"}
  | regexp `duration=(?P<duration>[0-9.]+)ms`
  | unwrap duration [5m]) by (method)

# Top 10 error messages
topk(10,
  sum by (msg) (
    count_over_time({app="api"}
      | json
      | level="error" [1h]
    )
  )
)

TraceQL Query Patterns

Basic Trace Queries

# Find traces by service
{ .service.name = "api" }

# HTTP status codes
{ .http.status_code = 500 }

# Combine conditions
{ .service.name = "api" && .http.status_code >= 400 }

# Duration filter
{ duration > 1s }

Advanced TraceQL

# Parent-child relationship (direct child)
{ .service.name = "frontend" }
  > { .service.name = "backend" && .http.status_code = 500 }

# Descendant spans (any depth)
{ .service.name = "api" }
  >> { .db.system = "postgresql" && duration > 1s }

# Failed database queries
{ .service.name = "api" }
  >> { .db.system = "postgresql" && status = "error" }
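
TraceQL also supports aggregates over the matched spans; a couple of hedged examples (syntax per recent Tempo releases, service and attribute names illustrative):

# Traces where the api service's matched spans average more than 1s
{ .service.name = "api" } | avg(duration) > 1s

# Traces containing more than 5 matching database spans
{ .db.system = "postgresql" } | count() > 5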

Complete Dashboard Examples

Application Observability Dashboard

{
  "dashboard": {
    "title": "Application Observability - ${app}",
    "tags": ["observability", "application"],
    "timezone": "browser",
    "editable": true,
    "graphTooltip": 1,
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "templating": {
      "list": [
        {
          "name": "app",
          "type": "query",
          "datasource": "Mimir",
          "query": "label_values(up, app)",
          "current": {
            "selected": false,
            "text": "api",
            "value": "api"
          }
        },
        {
          "name": "namespace",
          "type": "query",
          "datasource": "Mimir",
          "query": "label_values(up{app=\"$app\"}, namespace)",
          "multi": true,
          "includeAll": true
        }
      ]
    },
    "panels": [
      {
        "id": 1,
        "title": "Request Rate",
        "type": "graph",
        "datasource": "Mimir",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval])) by (method, status)",
            "legendFormat": "{{method}} - {{status}}"
          }
        ],
        "gridPos": {
          "h": 8,
          "w": 12,
          "x": 0,
          "y": 0
        },
        "yaxes": [
          {
            "format": "reqps",
            "label": "Requests/sec"
          }
        ]
      },
      {
        "id": 2,
        "title": "P95 Latency",
        "type": "graph",
        "datasource": "Mimir",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval])) by (le, endpoint))",
            "legendFormat": "{{endpoint}}"
          }
        ],
        "gridPos": {
          "h": 8,
          "w": 12,
          "x": 12,
          "y": 0
        },
        "yaxes": [
          {
            "format": "s",
            "label": "Duration"
          }
        ],
        "thresholds": [
          {
            "value": 1,
            "colorMode": "critical",
            "fill": true,
            "line": true,
            "op": "gt"
          }
        ]
      },
      {
        "id": 3,
        "title": "Error Rate",
        "type": "graph",
        "datasource": "Mimir",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\", status=~\"5..\"}[$__rate_interval])) / sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval]))",
            "legendFormat": "Error %"
          }
        ],
        "gridPos": {
          "h": 8,
          "w": 12,
          "x": 0,
          "y": 8
        },
        "yaxes": [
          {
            "format": "percentunit",
            "max": 1,
            "min": 0
          }
        ],
        "alert": {
          "conditions": [
            {
              "evaluator": {
                "params": [0.01],
                "type": "gt"
              },
              "operator": {
                "type": "and"
              },
              "query": {
                "params": ["A", "5m", "now"]
              },
              "reducer": {
                "type": "avg"
              },
              "type": "query"
            }
          ],
          "frequency": "1m",
          "handler": 1,
          "name": "Error Rate Alert",
          "noDataState": "no_data",
          "notifications": []
        }
      },
      {
        "id": 4,
        "title": "Recent Error Logs",
        "type": "logs",
        "datasource": "Loki",
        "targets": [
          {
            "expr": "{app=\"$app\", namespace=~\"$namespace\"} | json | level=\"error\"",
            "refId": "A"
          }
        ],
        "options": {
          "showTime": true,
          "showLabels": false,
          "showCommonLabels": false,
          "wrapLogMessage": true,
          "dedupStrategy": "none",
          "enableLogDetails": true
        },
        "gridPos": {
          "h": 8,
          "w": 12,
          "x": 12,
          "y": 8
        }
      }
    ],
    "links": [
      {
        "title": "Explore Logs",
        "url": "/explore?left={\"datasource\":\"Loki\",\"queries\":[{\"expr\":\"{app=\\\"$app\\\",namespace=~\\\"$namespace\\\"}\"}]}",
        "type": "link",
        "icon": "doc"
      },
      {
        "title": "Explore Traces",
        "url": "/explore?left={\"datasource\":\"Tempo\",\"queries\":[{\"query\":\"{resource.service.name=\\\"$app\\\"}\",\"queryType\":\"traceql\"}]}",
        "type": "link",
        "icon": "gf-traces"
      }
    ]
  }
}

LGTM Stack Configuration

Loki Configuration

File: loki.yaml

auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: info

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  aws:
    s3: s3://us-east-1/my-loki-bucket
    s3forcepathstyle: true
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
    cache_location: /loki/tsdb-cache
    shared_store: s3

limits_config:
  retention_period: 744h # 31 days
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  max_query_series: 500
  max_query_lookback: 30d
  reject_old_samples: true
  reject_old_samples_max_age: 168h

compactor:
  working_directory: /loki/compactor
  shared_store: s3
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h

Tempo Configuration

File: tempo.yaml

server:
  http_listen_port: 3200
  grpc_listen_port: 9096

distributor:
  receivers:
    otlp:
      protocols:
        http:
        grpc:
    jaeger:
      protocols:
        thrift_http:
        grpc:

ingester:
  max_block_duration: 5m

compactor:
  compaction:
    block_retention: 720h # 30 days

storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces
      endpoint: s3.amazonaws.com
      region: us-east-1
    wal:
      path: /var/tempo/wal

metrics_generator:
  registry:
    external_labels:
      source: tempo
      cluster: primary
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://mimir:9009/api/v1/push
        send_exemplars: true
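
Mimir Configuration (sketch)

File: mimir.yaml (illustrative)

For completeness, a minimal single-binary Mimir configuration might look roughly like the following. This is a hedged sketch rather than a production config: field names follow recent Mimir releases, and the bucket name and paths are illustrative.

target: all                      # run Mimir as a single monolithic binary
multitenancy_enabled: false

server:
  http_listen_port: 9009         # matches the remote_write URL used elsewhere in this skill

blocks_storage:
  backend: s3
  s3:
    bucket_name: mimir-blocks
    endpoint: s3.amazonaws.com
    region: us-east-1
  tsdb:
    dir: /data/tsdb

compactor:
  data_dir: /data/compactor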

Production Best Practices

Performance Optimization

Query Optimization

  • Use label filters before line filters
  • Limit time ranges for expensive queries
  • Use unwrap instead of parsing when possible
  • Cache query results with query frontend
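
For example, narrowing the stream selector with indexed labels avoids scanning unrelated streams before any parsing or line filtering happens (labels are illustrative):

# Slower: broad stream selector, relies on post-parse filtering
{namespace="production"} | json | app="api" | level="error"

# Faster: indexed labels narrow the streams before any line is read
{namespace="production", app="api"} |= "error" | json | level="error"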

Dashboard Performance

  • Limit number of panels (< 15 per dashboard)
  • Use appropriate time intervals
  • Avoid high-cardinality grouping
  • Use $__interval for adaptive sampling

Storage Optimization

  • Configure retention policies
  • Use compaction for Loki and Tempo
  • Implement tiered storage (hot/warm/cold)
  • Monitor storage growth

Security Best Practices

Authentication

  • Enable auth (auth_enabled: true in Loki/Tempo)
  • Use OAuth/LDAP for Grafana
  • Implement multi-tenancy with org isolation
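
With multi-tenancy enabled, each Grafana data source passes its tenant via the X-Scope-OrgID header. A hedged provisioning sketch (the data source name, URL, and tenant ID are illustrative):

apiVersion: 1

datasources:
  - name: Loki-TenantA
    type: loki
    access: proxy
    url: http://loki-gateway:3100
    jsonData:
      httpHeaderName1: X-Scope-OrgID
    secureJsonData:
      httpHeaderValue1: tenant-a       # illustrative tenant ID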

Authorization

  • Configure RBAC in Grafana
  • Limit datasource access by team
  • Use folder permissions for dashboards

Network Security

  • TLS for all components
  • Network policies in Kubernetes
  • Rate limiting at ingress

Troubleshooting

Common Issues

  1. High Cardinality: Too many unique label combinations
     Solution: Reduce label dimensions, use log parsing instead

  2. Query Timeouts: Complex queries on large datasets
     Solution: Reduce time range, use aggregations, add query limits

  3. Storage Growth: Unbounded retention
     Solution: Configure retention policies, enable compaction

  4. Missing Traces: Incomplete trace data
     Solution: Check sampling rates, verify instrumentation

Resources
