Install this skill from the multi-skill repository:

```bash
npx skills add jgarrison929/openclaw-skills --skill "devops-engineer"
```
# Description
Use when setting up CI/CD pipelines, Docker containers, Kubernetes deployments, monitoring, incident response, or infrastructure automation. Invoke for GitHub Actions, deployment strategies, observability, or production troubleshooting.
# SKILL.md
```yaml
---
name: devops-engineer
description: Use when setting up CI/CD pipelines, Docker containers, Kubernetes deployments, monitoring, incident response, or infrastructure automation. Invoke for GitHub Actions, deployment strategies, observability, or production troubleshooting.
triggers:
  - CI/CD
  - pipeline
  - GitHub Actions
  - GitLab CI
  - Jenkins
  - Docker
  - Kubernetes
  - k8s
  - kubectl
  - Helm
  - deployment
  - monitoring
  - Prometheus
  - Grafana
  - incident
  - rollback
  - container
  - orchestration
role: specialist
scope: operations
output-format: code
---
```
# DevOps Engineer

Senior DevOps specialist covering CI/CD pipelines, container orchestration, monitoring, incident response, and infrastructure automation.

## Role Definition

You are a senior DevOps engineer who automates everything, builds reliable deployment pipelines, designs observability systems, and responds to production incidents with speed and rigor. You prioritize automation, immutability, and fast feedback loops.

## Core Principles

- Automate everything – no manual deployment steps
- Build once, deploy anywhere – identical artifacts across environments (a sketch follows this list)
- Fail fast – catch issues early in the pipeline
- Immutable infrastructure – replace, don't patch
- Observe everything – metrics, logs, traces for every service
- Blameless postmortems – learn from incidents, don't punish
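As one way to realize "build once, deploy anywhere", a Kustomize overlay can promote the identical image across environments; a minimal sketch, where the directory layout, image name, and tag are illustrative:

```yaml
# overlays/production/kustomization.yaml — hypothetical layout
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
images:
  - name: ghcr.io/org/myapp
    newTag: sha-abc123  # the one artifact built by CI, promoted unchanged
```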
## CI/CD Pipeline (GitHub Actions)

```yaml
# .github/workflows/ci.yml
name: CI/CD

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - run: npm ci
      - run: npm run lint
      - run: npm run typecheck

  test:
    runs-on: ubuntu-latest
    needs: lint
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_PASSWORD: test
          POSTGRES_DB: testdb
        ports: ['5432:5432']
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - run: npm ci
      - run: npm test -- --coverage
        env:
          DATABASE_URL: postgres://postgres:test@localhost:5432/testdb
      - uses: actions/upload-artifact@v4
        with:
          name: coverage
          path: coverage/

  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Trivy vulnerability scanner
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: 'fs'
          severity: 'CRITICAL,HIGH'
          exit-code: '1'

  build:
    runs-on: ubuntu-latest
    needs: [test, security]
    if: github.ref == 'refs/heads/main'
    permissions:
      contents: read
      packages: write
    outputs:
      image-tag: ${{ steps.meta.outputs.tags }}
    steps:
      - uses: actions/checkout@v4
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Login to GHCR
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,format=long,prefix=
            type=raw,value=latest
      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy-staging:
    runs-on: ubuntu-latest
    needs: build
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to staging
        run: |
          kubectl set image deployment/myapp \
            myapp=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
            --namespace staging
      - name: Wait for rollout
        run: kubectl rollout status deployment/myapp --namespace staging --timeout=300s
      - name: Run smoke tests
        run: ./scripts/smoke-test.sh https://staging.example.com

  deploy-production:
    runs-on: ubuntu-latest
    needs: deploy-staging
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to production
        run: |
          kubectl set image deployment/myapp \
            myapp=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
            --namespace production
      - name: Wait for rollout
        run: kubectl rollout status deployment/myapp --namespace production --timeout=300s
```
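One gap worth noting: the deploy jobs call kubectl without configuring cluster access, so on a hosted runner they would fail as written. A minimal sketch of the missing auth steps for an EKS cluster, where the role secret name, region, and cluster name are all assumptions:

```yaml
# Hypothetical auth steps to run before `kubectl set image`.
# OIDC requires `permissions: id-token: write` on the job.
- name: Configure AWS credentials
  uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: ${{ secrets.AWS_DEPLOY_ROLE_ARN }}  # assumed secret name
    aws-region: us-east-1                               # assumed region
- name: Point kubectl at the cluster
  run: aws eks update-kubeconfig --name my-cluster --region us-east-1
```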
## Docker Best Practices

```dockerfile
# Multi-stage build for a minimal production image
FROM node:20-slim AS base
WORKDIR /app
# curl is installed so the HEALTHCHECK below works in the slim image
RUN apt-get update && apt-get install -y --no-install-recommends dumb-init curl \
    && rm -rf /var/lib/apt/lists/*

# Dependencies stage: stash production-only modules, then install everything for the build
FROM base AS deps
COPY package.json package-lock.json ./
RUN npm ci --omit=dev && cp -R node_modules /prod_modules
RUN npm ci

# Build stage
FROM base AS build
COPY --from=deps /app/node_modules ./node_modules
COPY . .
RUN npm run build

# Production stage
FROM base AS production
ENV NODE_ENV=production
USER node
COPY --from=deps /prod_modules ./node_modules
COPY --from=build /app/dist ./dist
COPY package.json .
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
    CMD curl -f http://localhost:3000/health || exit 1
ENTRYPOINT ["dumb-init", "--"]
CMD ["node", "dist/main.js"]
```
## Docker Compose for Local Development

```yaml
# docker-compose.yml
services:
  app:
    build:
      context: .
      target: base
    command: npm run dev
    volumes:
      - .:/app
      - /app/node_modules
    ports:
      - "3000:3000"
    environment:
      DATABASE_URL: postgres://postgres:secret@db:5432/myapp
      REDIS_URL: redis://redis:6379
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_started
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: myapp
    volumes:
      - pgdata:/var/lib/postgresql/data
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 3s
      retries: 5
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

volumes:
  pgdata:
```
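Compose also merges a `docker-compose.override.yml` automatically when present, which is a tidy home for per-developer tweaks that shouldn't be committed into the base file; a minimal sketch (the variable and debugger port are illustrative and depend on how the `dev` script is defined):

```yaml
# docker-compose.override.yml — merged automatically by `docker compose up`
services:
  app:
    environment:
      LOG_LEVEL: debug   # hypothetical variable; local verbosity only
    ports:
      - "9229:9229"      # expose a debugger port if the dev script enables one
```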
## Kubernetes Manifests

```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: production
  labels:
    app: myapp
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0  # Zero downtime
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: ghcr.io/org/myapp:1.2.3  # pin a tag or digest; CI updates this via kubectl set image
          ports:
            - containerPort: 3000
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
          env:
            - name: NODE_ENV
              value: production
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: myapp-secrets
                  key: database-url
          livenessProbe:
            httpGet:
              path: /health/live
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 15
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 5
          startupProbe:
            httpGet:
              path: /health/live
              port: 3000
            failureThreshold: 30
            periodSeconds: 2
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: myapp
---
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
  namespace: production
spec:
  selector:
    app: myapp
  ports:
    - port: 80
      targetPort: 3000
  type: ClusterIP
---
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
```
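A PodDisruptionBudget complements the zero-downtime rollout strategy and the HPA by bounding voluntary evictions during node drains and upgrades; a minimal sketch for the Deployment above, where the floor of two pods is an assumed value to tune against the replica count:

```yaml
# pdb.yaml — keep at least two replicas up during voluntary disruptions
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp
  namespace: production
spec:
  minAvailable: 2  # assumed floor; adjust to minReplicas
  selector:
    matchLabels:
      app: myapp
```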
## Monitoring and Observability

### Prometheus Metrics

```yaml
# prometheus-rules.yaml
groups:
  - name: myapp
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate ({{ $value | humanizePercentage }})"
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p95 latency above 1s ({{ $value }}s)"
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
```
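The `severity` labels above only matter if Alertmanager routes on them; a minimal routing sketch, assuming Slack and PagerDuty receivers that would be defined elsewhere in the config:

```yaml
# alertmanager.yml (routing excerpt) — receiver names are assumptions
route:
  receiver: slack-warnings        # default: everything else goes to Slack
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall  # page a human only for critical alerts
```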
### Health Check Endpoints

```javascript
// Express health check implementation (assumes app, db, and redis are initialized elsewhere)
app.get("/health/live", (req, res) => {
  res.json({ status: "ok" });
});

app.get("/health/ready", async (req, res) => {
  try {
    await db.query("SELECT 1");
    await redis.ping();
    res.json({ status: "ready", checks: { db: "ok", redis: "ok" } });
  } catch (err) {
    res.status(503).json({ status: "not ready", error: err.message });
  }
});
```
## Incident Response

### Severity Levels
| Level | Criteria | Response Time | Examples |
|---|---|---|---|
| SEV1 | Service down, data loss | 15 min | Full outage, data corruption |
| SEV2 | Degraded service, major feature broken | 30 min | Payment failures, auth issues |
| SEV3 | Minor feature broken, workaround exists | 4 hours | UI bug, slow reports |
| SEV4 | Cosmetic, no impact | Next sprint | Typo, alignment issue |
### Incident Response Checklist

1. **Detect** – Alert fires or user reports
2. **Triage** – Assign severity, page on-call if SEV1/2
3. **Mitigate** – Restore service (rollback, feature flag, scale up)
4. **Investigate** – Find root cause with logs, metrics, traces
5. **Fix** – Implement permanent fix
6. **Postmortem** – Document timeline, root cause, action items
### Quick Debugging Commands

```bash
# Check pod status
kubectl get pods -n production -l app=myapp
kubectl describe pod <pod-name> -n production
kubectl logs <pod-name> -n production --tail=100 -f

# Check recent events
kubectl get events -n production --sort-by='.lastTimestamp' | tail -20

# Check resource usage
kubectl top pods -n production -l app=myapp

# Roll back a deployment
kubectl rollout undo deployment/myapp -n production
kubectl rollout status deployment/myapp -n production

# Check service endpoints
kubectl get endpoints myapp -n production

# Port-forward for local debugging
kubectl port-forward svc/myapp 3000:80 -n production
```
### Postmortem Template

```markdown
## Incident Postmortem: [Title]

**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** SEV-X
**Author:** [Name]

### Summary
One paragraph describing what happened.

### Timeline (UTC)
- HH:MM – Alert fired
- HH:MM – On-call paged
- HH:MM – Root cause identified
- HH:MM – Mitigation applied
- HH:MM – Service restored

### Root Cause
What actually went wrong (technical detail).

### Impact
- X users affected
- Y minutes of downtime
- Z failed requests

### Action Items
| Action | Owner | Priority | Due |
|--------|-------|----------|-----|
| Add circuit breaker | @dev | P1 | 1 week |
| Improve alerting | @ops | P2 | 2 weeks |
| Add integration test | @qa | P2 | 2 weeks |

### Lessons Learned
What we learned and how we'll prevent recurrence.
```
## Secrets Management

```yaml
# Use the External Secrets Operator
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: myapp-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: myapp-secrets
  data:
    - secretKey: database-url
      remoteRef:
        key: myapp/production/database-url
    - secretKey: api-key
      remoteRef:
        key: myapp/production/api-key
```
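The `secretStoreRef` above points at a ClusterSecretStore that must exist separately; a minimal sketch for AWS Secrets Manager, where the region, namespace, and service account name are illustrative assumptions:

```yaml
# cluster-secret-store.yaml — backing store referenced by the ExternalSecret above
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: aws-secrets-manager
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1  # assumed region
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets-sa   # hypothetical IRSA-bound service account
            namespace: external-secrets
```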
## Common Anti-Patterns

```yaml
# ❌ BAD: No resource limits (pod can starve the node)
containers:
  - name: myapp
    image: myapp:latest

# ✅ GOOD: Always set resource requests and limits
containers:
  - name: myapp
    image: myapp:sha-abc123  # Pin to a specific version
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 512Mi

# ❌ BAD: Using the :latest tag in production
image: myapp:latest

# ✅ GOOD: Pin to a SHA or semver tag
image: myapp:sha-abc123
image: myapp:1.2.3

# ❌ BAD: Storing secrets as plain values in manifests
env:
  - name: DB_PASSWORD
    value: "supersecret"

# ✅ GOOD: Reference secrets from a secret store
env:
  - name: DB_PASSWORD
    valueFrom:
      secretKeyRef:
        name: myapp-secrets
        key: db-password
```
Adapted from buildwithclaude by Dave Poon (MIT)
# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents. Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.