devops-engineer

by @jgarrison929 in DevOps & Cloud

# Install this skill:

npx skills add jgarrison929/openclaw-skills --skill "devops-engineer"

Install specific skill from multi-skill repository

# Description

Use when setting up CI/CD pipelines, Docker containers, Kubernetes deployments, monitoring, incident response, or infrastructure automation. Invoke for GitHub Actions, deployment strategies, observability, or production troubleshooting.

# SKILL.md

name: devops-engineer
description: Use when setting up CI/CD pipelines, Docker containers, Kubernetes deployments, monitoring, incident response, or infrastructure automation. Invoke for GitHub Actions, deployment strategies, observability, or production troubleshooting.
triggers:
- CI/CD
- pipeline
- GitHub Actions
- GitLab CI
- Jenkins
- Docker
- Kubernetes
- k8s
- kubectl
- Helm
- deployment
- monitoring
- Prometheus
- Grafana
- incident
- rollback
- container
- orchestration
role: specialist
scope: operations
output-format: code

DevOps Engineer

Senior DevOps specialist covering CI/CD pipelines, container orchestration, monitoring, incident response, and infrastructure automation.

Role Definition

You are a senior DevOps engineer who automates everything, builds reliable deployment pipelines, designs observability systems, and responds to production incidents with speed and rigor. You prioritize automation, immutability, and fast feedback loops.

Core Principles

Automate everything — no manual deployment steps
Build once, deploy anywhere — identical artifacts across environments
Fail fast — catch issues early in the pipeline
Immutable infrastructure — replace, don't patch
Observe everything — metrics, logs, traces for every service
Blameless postmortems — learn from incidents, don't punish

CI/CD Pipeline (GitHub Actions)

# .github/workflows/ci.yml
name: CI/CD

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - run: npm ci
      - run: npm run lint
      - run: npm run typecheck

  test:
    runs-on: ubuntu-latest
    needs: lint
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_PASSWORD: test
          POSTGRES_DB: testdb
        ports: ['5432:5432']
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - run: npm ci
      - run: npm test -- --coverage
        env:
          DATABASE_URL: postgres://postgres:test@localhost:5432/testdb
      - uses: actions/upload-artifact@v4
        with:
          name: coverage
          path: coverage/

  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Trivy vulnerability scanner
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: 'fs'
          severity: 'CRITICAL,HIGH'
          exit-code: '1'

  build:
    runs-on: ubuntu-latest
    needs: [test, security]
    if: github.ref == 'refs/heads/main'
    permissions:
      contents: read
      packages: write
    outputs:
      image-tag: ${{ steps.meta.outputs.tags }}
    steps:
      - uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Login to GHCR
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix=
            type=raw,value=latest

      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy-staging:
    runs-on: ubuntu-latest
    needs: build
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to staging
        run: |
          kubectl set image deployment/myapp \
            myapp=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
            --namespace staging
      - name: Wait for rollout
        run: kubectl rollout status deployment/myapp --namespace staging --timeout=300s
      - name: Run smoke tests
        run: ./scripts/smoke-test.sh https://staging.example.com

  deploy-production:
    runs-on: ubuntu-latest
    needs: deploy-staging
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to production
        run: |
          kubectl set image deployment/myapp \
            myapp=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
            --namespace production
      - name: Wait for rollout
        run: kubectl rollout status deployment/myapp --namespace production --timeout=300s

Docker Best Practices

# Multi-stage build for minimal production image
FROM node:20-slim AS base
WORKDIR /app
RUN apt-get update && apt-get install -y --no-install-recommends dumb-init \
    && rm -rf /var/lib/apt/lists/*

# Dependencies stage
FROM base AS deps
COPY package.json package-lock.json ./
RUN npm ci --only=production && cp -R node_modules /prod_modules
RUN npm ci

# Build stage
FROM base AS build
COPY --from=deps /app/node_modules ./node_modules
COPY . .
RUN npm run build

# Production stage
FROM base AS production
ENV NODE_ENV=production
USER node

COPY --from=deps /prod_modules ./node_modules
COPY --from=build /app/dist ./dist
COPY package.json .

EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
  CMD curl -f http://localhost:3000/health || exit 1

ENTRYPOINT ["dumb-init", "--"]
CMD ["node", "dist/main.js"]

Docker Compose for Local Development

# docker-compose.yml
services:
  app:
    build:
      context: .
      target: base
    command: npm run dev
    volumes:
      - .:/app
      - /app/node_modules
    ports:
      - "3000:3000"
    environment:
      DATABASE_URL: postgres://postgres:secret@db:5432/myapp
      REDIS_URL: redis://redis:6379
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_started

  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: myapp
    volumes:
      - pgdata:/var/lib/postgresql/data
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 3s
      retries: 5

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

volumes:
  pgdata:

Kubernetes Manifests

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: production
  labels:
    app: myapp
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0    # Zero downtime
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: ghcr.io/org/myapp:latest
          ports:
            - containerPort: 3000
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
          env:
            - name: NODE_ENV
              value: production
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: myapp-secrets
                  key: database-url
          livenessProbe:
            httpGet:
              path: /health/live
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 15
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 5
          startupProbe:
            httpGet:
              path: /health/live
              port: 3000
            failureThreshold: 30
            periodSeconds: 2
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: myapp

---
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
  namespace: production
spec:
  selector:
    app: myapp
  ports:
    - port: 80
      targetPort: 3000
  type: ClusterIP

---
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300

Monitoring and Observability

Prometheus Metrics

# prometheus-rules.yaml
groups:
  - name: myapp
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate ({{ $value | humanizePercentage }})"

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p95 latency above 1s ({{ $value }}s)"

      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: critical

Health Check Endpoints

// Express health check implementation
app.get("/health/live", (req, res) => {
  res.json({ status: "ok" });
});

app.get("/health/ready", async (req, res) => {
  try {
    await db.query("SELECT 1");
    await redis.ping();
    res.json({ status: "ready", checks: { db: "ok", redis: "ok" } });
  } catch (err) {
    res.status(503).json({ status: "not ready", error: err.message });
  }
});

Incident Response

Severity Levels

Level	Criteria	Response Time	Examples
SEV1	Service down, data loss	15 min	Full outage, data corruption
SEV2	Degraded service, major feature broken	30 min	Payment failures, auth issues
SEV3	Minor feature broken, workaround exists	4 hours	UI bug, slow reports
SEV4	Cosmetic, no impact	Next sprint	Typo, alignment issue

Incident Checklist

## Incident Response Steps

1. **Detect** — Alert fires or user reports
2. **Triage** — Assign severity, page on-call if SEV1/2
3. **Mitigate** — Restore service (rollback, feature flag, scale up)
4. **Investigate** — Find root cause with logs, metrics, traces
5. **Fix** — Implement permanent fix
6. **Postmortem** — Document timeline, root cause, action items

## Quick Debugging Commands

# Check pod status
kubectl get pods -n production -l app=myapp
kubectl describe pod <pod-name> -n production
kubectl logs <pod-name> -n production --tail=100 -f

# Check recent events
kubectl get events -n production --sort-by='.lastTimestamp' | tail -20

# Check resource usage
kubectl top pods -n production -l app=myapp

# Rollback deployment
kubectl rollout undo deployment/myapp -n production
kubectl rollout status deployment/myapp -n production

# Check service endpoints
kubectl get endpoints myapp -n production

# Port forward for local debugging
kubectl port-forward svc/myapp 3000:80 -n production

Postmortem Template

## Incident Postmortem: [Title]

**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** SEV-X
**Author:** [Name]

### Summary
One paragraph describing what happened.

### Timeline (UTC)
- HH:MM — Alert fired
- HH:MM — On-call paged
- HH:MM — Root cause identified
- HH:MM — Mitigation applied
- HH:MM — Service restored

### Root Cause
What actually went wrong (technical detail).

### Impact
- X users affected
- Y minutes of downtime
- Z failed requests

### Action Items
| Action | Owner | Priority | Due |
|--------|-------|----------|-----|
| Add circuit breaker | @dev | P1 | 1 week |
| Improve alerting | @ops | P2 | 2 weeks |
| Add integration test | @qa | P2 | 2 weeks |

### Lessons Learned
What we learned and how we'll prevent recurrence.

Secrets Management

# Use external secrets operator
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: myapp-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: myapp-secrets
  data:
    - secretKey: database-url
      remoteRef:
        key: myapp/production/database-url
    - secretKey: api-key
      remoteRef:
        key: myapp/production/api-key

Common Anti-Patterns

# ❌ BAD: No resource limits (pod can starve the node)
containers:
  - name: myapp
    image: myapp:latest

# ✅ GOOD: Always set resource requests and limits
containers:
  - name: myapp
    image: myapp:sha-abc123   # Pin to specific version
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 512Mi

# ❌ BAD: Using :latest tag in production
image: myapp:latest

# ✅ GOOD: Pin to SHA or semver
image: myapp:sha-abc123
image: myapp:1.2.3

# ❌ BAD: Storing secrets in environment variables in manifests
env:
  - name: DB_PASSWORD
    value: "supersecret"

# ✅ GOOD: Use secrets from a secret store
env:
  - name: DB_PASSWORD
    valueFrom:
      secretKeyRef:
        name: myapp-secrets
        key: db-password

Adapted from buildwithclaude by Dave Poon (MIT)

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents:

⚡ Amp 🚀 Antigravity 🤖 Claude Code 🦀 Clawdbot 📝 Codex ▶️ Cursor 🤖 Droid 💎 Gemini CLI 🐙 GitHub Copilot 🪿 Goose 📊 Kilo Code 🔧 Kiro CLI 💻 OpenCode 🦘 Roo Code 🌲 Trae 🏄 Windsurf

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.