deploy

by @miles-knowbl in Development

# Install this skill:

npx skills add miles-knowbl/orchestrator --skill "deploy"


# Description

Guides deployment of systems to production environments. Covers deployment strategies, production validation, rollback procedures, and monitoring setup. Completes the engineering loop by bridging the gap between merged code and running production systems.

# SKILL.md

name: deploy
description: "Guides deployment of systems to production environments. Covers deployment strategies, production validation, rollback procedures, and monitoring setup. Completes the engineering loop by bridging the gap between merged code and running production systems."
phase: SHIP
category: core
version: "1.0.0"
depends_on: ["code-review"]
tags: [shipping, operations, core-workflow]

Deploy

Ship code to production safely.

When to Use

After PR merge — Deploy new system or feature
Hotfix — Emergency production fix
Rollback — Revert problematic deployment
Environment promotion — Move from staging to production
Infrastructure changes — Deploy configuration or infrastructure

Reference Requirements

MUST read before applying this skill:

Reference	Why Required
`rollback-procedures.md`	How to safely rollback if needed

Read if applicable:

Reference	When Needed
Stack-specific deployment guides	For specific deployment targets

Verification: Ensure DEPLOY.md is produced and that it documents the rollback procedure.

Required Deliverables

Deliverable	Location	Condition
`DEPLOY.md`	Project root	Always

Core Concept

Deployment answers: "How do we safely get this code running in production?"

┌─────────────────────────────────────────────────────────────────────────────┐
│                       DEPLOYMENT PIPELINE                                    │
│                                                                             │
│  Code Merged    Build &       Deploy to      Validate      Monitor &       │
│  to Main    →   Package   →   Production  →  Production →  Observe         │
│                                                                             │
│  ┌─────────┐    ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐     │
│  │  PR     │    │ Docker  │   │  K8s/   │   │  Smoke  │   │ Metrics │     │
│  │ Merged  │───▶│  Build  │──▶│  ECS    │──▶│  Tests  │──▶│ Alerts  │     │
│  └─────────┘    └─────────┘   └─────────┘   └─────────┘   └─────────┘     │
│                                    │              │                         │
│                                    │    FAIL      │                         │
│                                    ▼              │                         │
│                              ┌─────────┐         │                         │
│                              │Rollback │◀────────┘                         │
│                              └─────────┘                                    │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Deployment Strategies

Strategy Comparison

Strategy	Risk	Rollback Speed	Resource Overhead	Best For
Rolling	Low	Medium	Low	Standard deploys
Blue-Green	Very Low	Fast	High (2x)	Critical services
Canary	Very Low	Fast	Medium	High-traffic services
Recreate	High	Slow	None	Dev/test environments
Feature Flag	Very Low	Instant	None	Gradual rollouts

Rolling Deployment

Deploy to instances one at a time:

┌─────────────────────────────────────────────────────────────────────────────┐
│                    ROLLING DEPLOYMENT                                        │
│                                                                             │
│  Time 0:     [v1] [v1] [v1] [v1]                                           │
│                                                                             │
│  Time 1:     [v2] [v1] [v1] [v1]   ← First instance updated                │
│                                                                             │
│  Time 2:     [v2] [v2] [v1] [v1]   ← Second instance updated               │
│                                                                             │
│  Time 3:     [v2] [v2] [v2] [v1]   ← Third instance updated                │
│                                                                             │
│  Time 4:     [v2] [v2] [v2] [v2]   ← All instances updated                 │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
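The rolling pattern above can be sketched as a loop that updates one instance, checks its health, and only then moves on. This is a minimal sketch, not a platform-specific implementation: `rolling_update` and the two command arguments are hypothetical names, and the caller supplies whatever actually updates and probes an instance.

```shell
# rolling_update <update_cmd> <health_cmd> <instance>...
# Hypothetical helper: updates instances one at a time and stops at the
# first failure, so the remaining instances keep serving the old version.
rolling_update() {
  local update_cmd="$1" health_cmd="$2" instance
  shift 2
  for instance in "$@"; do
    echo "updating $instance"
    "$update_cmd" "$instance" || { echo "update failed on $instance"; return 1; }
    "$health_cmd" "$instance" || { echo "health check failed on $instance"; return 1; }
  done
  echo "all instances updated"
}
```

Stopping at the first unhealthy instance limits the blast radius to a single instance, which is the main reason the rolling strategy is rated low risk in the comparison table.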

Blue-Green Deployment

Run two identical environments, switch traffic:

┌─────────────────────────────────────────────────────────────────────────────┐
│                    BLUE-GREEN DEPLOYMENT                                     │
│                                                                             │
│  Before:                                                                    │
│                                                                             │
│           ┌──────────────┐                                                  │
│  Traffic ─┤  Blue (v1)   │ ← Active                                        │
│           └──────────────┘                                                  │
│           ┌──────────────┐                                                  │
│           │ Green (idle) │ ← Idle                                          │
│           └──────────────┘                                                  │
│                                                                             │
│  After:                                                                     │
│                                                                             │
│           ┌──────────────┐                                                  │
│           │  Blue (v1)   │ ← Idle (rollback target)                        │
│           └──────────────┘                                                  │
│           ┌──────────────┐                                                  │
│  Traffic ─┤ Green (v2)   │ ← Active                                        │
│           └──────────────┘                                                  │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Canary Deployment

Gradually shift traffic to new version:

┌─────────────────────────────────────────────────────────────────────────────┐
│                    CANARY DEPLOYMENT                                         │
│                                                                             │
│  Stage 1:   5% traffic to v2                                                │
│                                                                             │
│           ┌──────────────┐                                                  │
│  95% ─────┤  v1 (stable) │                                                  │
│           └──────────────┘                                                  │
│           ┌──────────────┐                                                  │
│  5%  ─────┤ v2 (canary)  │                                                  │
│           └──────────────┘                                                  │
│                                                                             │
│  Stage 2:  25% traffic to v2 (if metrics OK)                                │
│  Stage 3:  50% traffic to v2 (if metrics OK)                                │
│  Stage 4: 100% traffic to v2 (if metrics OK)                                │
│                                                                             │
│  At any stage: Rollback if errors spike                                     │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

→ See references/deployment-strategies.md
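The canary stages in the diagram (5%, 25%, 50%, 100%) can be expressed as a small promotion function. The sketch below is illustrative only: `next_canary_weight` is a hypothetical name, and the default 1% error-rate limit is an assumption standing in for "metrics OK".

```shell
# next_canary_weight <current_weight> <error_rate> [max_error_rate]
# Prints the next traffic percentage for v2, or 0 to signal a rollback.
# The default max_error_rate (0.01) is an assumed threshold.
next_canary_weight() {
  local current="$1" error_rate="$2" max_error="${3:-0.01}"
  # awk does the floating-point comparison; exits 0 when the rate is too high
  if awk -v e="$error_rate" -v m="$max_error" 'BEGIN { exit !(e > m) }'; then
    echo 0
    return
  fi
  case "$current" in
    0)  echo 5 ;;
    5)  echo 25 ;;
    25) echo 50 ;;
    *)  echo 100 ;;
  esac
}
```

A caller would apply the printed weight to its traffic router, wait for a soak period, then call the function again with fresh metrics.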

The Deployment Process

┌─────────────────────────────────────────────────────────────────────────────┐
│                    DEPLOYMENT PROCESS                                        │
│                                                                             │
│  1. PRE-DEPLOYMENT                                                          │
│     ├─→ Verify build artifacts exist                                        │
│     ├─→ Run pre-deployment checks                                           │
│     ├─→ Notify stakeholders                                                 │
│     └─→ Create deployment record                                            │
│                                                                             │
│  2. DEPLOYMENT                                                              │
│     ├─→ Apply infrastructure changes (if any)                               │
│     ├─→ Deploy application                                                  │
│     ├─→ Run database migrations (if any)                                    │
│     └─→ Update service configuration                                        │
│                                                                             │
│  3. VALIDATION                                                              │
│     ├─→ Health checks pass                                                  │
│     ├─→ Smoke tests pass                                                    │
│     ├─→ Metrics within thresholds                                           │
│     └─→ No error spikes                                                     │
│                                                                             │
│  4. POST-DEPLOYMENT                                                         │
│     ├─→ Update deployment record                                            │
│     ├─→ Notify stakeholders                                                 │
│     ├─→ Monitor for issues                                                  │
│     └─→ Document any issues                                                 │
│                                                                             │
│  5. ROLLBACK (if needed)                                                    │
│     ├─→ Trigger rollback                                                    │
│     ├─→ Verify rollback successful                                          │
│     ├─→ Investigate failure                                                 │
│     └─→ Document incident                                                   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
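The process above can be sketched as a driver that runs the phases in order and falls back to the rollback step when deployment or validation fails. `run_deployment` and its four command arguments are hypothetical; each argument is assumed to be a command or function that returns non-zero on failure.

```shell
# run_deployment <pre_cmd> <deploy_cmd> <validate_cmd> <rollback_cmd>
# Hypothetical driver for the phases above.
run_deployment() {
  local pre_cmd="$1" deploy_cmd="$2" validate_cmd="$3" rollback_cmd="$4"

  # 1. Pre-deployment: abort before anything changes in production
  "$pre_cmd" || { echo "pre-deployment checks failed; nothing deployed"; return 1; }

  # 2-3. Deployment and validation
  if "$deploy_cmd" && "$validate_cmd"; then
    echo "deployment succeeded"
    return 0
  fi

  # 5. Rollback
  echo "deployment or validation failed; rolling back"
  "$rollback_cmd" || echo "rollback failed; escalate"
  return 1
}
```

Post-deployment steps (updating the record, notifying stakeholders) are left to the caller, since they run on both the success and failure paths.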

Pre-Deployment Checklist

Required Checks

## Pre-Deployment Checklist

### Build Verification
- [ ] Build artifacts exist
- [ ] Docker image tagged and pushed
- [ ] Version number correct
- [ ] All tests passed in CI

### Database
- [ ] Migrations tested in staging
- [ ] Migrations are reversible
- [ ] No destructive changes (or approved)
- [ ] Backup taken (if significant migration)

### Dependencies
- [ ] All dependent services available
- [ ] External API changes coordinated
- [ ] Feature flags configured

### Notifications
- [ ] Team notified of deployment
- [ ] Stakeholders aware (if significant)
- [ ] On-call engineer aware

### Documentation
- [ ] Changelog updated
- [ ] Runbook updated (if applicable)
- [ ] Known issues documented

Pre-Deployment Commands

# Verify build
docker pull $REGISTRY/$IMAGE:$VERSION
docker inspect $REGISTRY/$IMAGE:$VERSION

# Check staging
curl -s https://staging.example.com/health | jq .status

# Notify
slack-notify "#deployments" "Starting deployment of $SERVICE v$VERSION"

# Create deployment record
gh api repos/$REPO/deployments -f ref=$SHA -f environment=production
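The commands above depend on several environment variables being set. A small guard, sketched here with the hypothetical name `require_vars`, can fail fast before any of them run with an empty value.

```shell
# require_vars <name>...
# Returns non-zero and lists every named variable that is unset or empty.
require_vars() {
  local missing="" name value
  for name in "$@"; do
    # Indirect lookup that also works in plain POSIX sh
    eval "value=\${$name:-}"
    [ -n "$value" ] || missing="$missing $name"
  done
  if [ -n "$missing" ]; then
    echo "missing required variables:$missing" >&2
    return 1
  fi
}
```

For the commands in this section, a call such as `require_vars REGISTRY IMAGE VERSION SERVICE REPO SHA || exit 1` at the top of the script would cover every variable they reference.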

Deployment Execution

Kubernetes Deployment

# Update deployment (assumes the container is named after the service)
kubectl set image deployment/$SERVICE $SERVICE=$REGISTRY/$IMAGE:$VERSION

# Watch rollout
kubectl rollout status deployment/$SERVICE --timeout=5m

# Check pods
kubectl get pods -l app=$SERVICE
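The same commands can be combined so that a rollout that does not finish in time is undone automatically. This is a sketch under assumptions: `deploy_with_auto_undo` is a hypothetical wrapper, the container is assumed to share the deployment's name, and the 5 minute timeout is copied from the command above.

```shell
# deploy_with_auto_undo <service> <image_ref>
# Hypothetical wrapper: sets the image, waits for the rollout, and runs
# `kubectl rollout undo` if the rollout fails or times out.
deploy_with_auto_undo() {
  local service="$1" image_ref="$2"

  kubectl set image "deployment/$service" "$service=$image_ref" || return 1

  if kubectl rollout status "deployment/$service" --timeout=5m; then
    echo "rollout of $service complete"
    return 0
  fi

  echo "rollout of $service did not finish; undoing" >&2
  kubectl rollout undo "deployment/$service"
  return 1
}
```

The function returns non-zero after an undo, so a CI job calling it still reports the deployment as failed.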

Kubernetes Manifest

apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
        version: v1.2.3
    spec:
      containers:
      - name: order-service
        image: registry.example.com/order-service:v1.2.3
        ports:
        - containerPort: 3000
        readinessProbe:
          httpGet:
            path: /health
            port: 3000
          initialDelaySeconds: 5
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 3000
          initialDelaySeconds: 15
          periodSeconds: 20
        resources:
          requests:
            memory: "256Mi"
            cpu: "100m"
          limits:
            memory: "512Mi"
            cpu: "500m"

AWS ECS Deployment

# Update service ($TASK_REVISION is the numeric task definition revision,
# not the application version)
aws ecs update-service \
  --cluster production \
  --service order-service \
  --task-definition order-service:$TASK_REVISION \
  --desired-count 3

# Wait for stability
aws ecs wait services-stable \
  --cluster production \
  --services order-service

Docker Compose (Simple)

# Pull new images
docker-compose pull

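# Deploy with minimal downtime: start a second container on the new image
# while the old one keeps running (--no-recreate leaves it untouched)
docker-compose up -d --no-deps --no-recreate --scale service=2 service
sleep 30
# Scale back to one container; verify the one left is running the new image
docker-compose up -d --no-deps --no-recreate --scale service=1 service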

→ See references/deployment-commands.md

Database Migrations

Migration Safety Rules

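Backward compatible first: New code must work with the old schema, and the previous code must keep working after the migration so a rollback stays possible
Forward migrations only: Run migrations during the deploy, not as part of an application rollback
Test in staging: Always run migrations in staging first
Small batches: Don't lock tables for long
Backup first: For destructive changes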

Safe Migration Patterns

Change	Safe Approach
Add column	Add with default or nullable, backfill later
Remove column	Stop using, then remove in later deploy
Rename column	Add new, migrate data, remove old
Add index	CREATE INDEX CONCURRENTLY
Change type	Add new column, migrate, drop old
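Some of the patterns in the table can be checked mechanically before a migration reaches staging. The sketch below is a rough, line-based lint with a hypothetical name (`lint_migration`). It assumes PostgreSQL-style SQL with each statement on one line, and it only flags candidates for human review.

```shell
# lint_migration <sql_file>
# Prints a warning for each risky pattern and returns 1 if any were found.
lint_migration() {
  local file="$1" status=0
  if grep -Eiq 'drop[[:space:]]+column' "$file"; then
    echo "WARN: DROP COLUMN found; confirm no deployed code still uses it"
    status=1
  fi
  if grep -Eiq 'rename[[:space:]]+column' "$file"; then
    echo "WARN: RENAME COLUMN found; prefer add new, migrate data, remove old"
    status=1
  fi
  # Any CREATE INDEX line that does not mention CONCURRENTLY
  if grep -Ei 'create[[:space:]]+(unique[[:space:]]+)?index' "$file" | grep -Eivq 'concurrently'; then
    echo "WARN: CREATE INDEX without CONCURRENTLY may block writes"
    status=1
  fi
  return "$status"
}
```

A non-zero return is a prompt to review the migration, not proof that it is unsafe; an approved destructive change would still trip the first check.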

Migration Commands

# Check pending migrations
npm run db:migrate:status

# Run migrations
npm run db:migrate

# Rollback (if needed)
npm run db:migrate:rollback

→ See references/migration-safety.md

Production Validation

Health Checks

# Basic health
curl -s https://api.example.com/health | jq

# Expected response
{
  "status": "healthy",
  "version": "1.2.3",
  "uptime": 123,
  "dependencies": {
    "database": "healthy",
    "redis": "healthy"
  }
}

Smoke Tests

# Run smoke tests against production
npm run test:smoke -- --env=production

# Or specific checks
curl -s -o /dev/null -w "%{http_code}" https://api.example.com/orders
# Should return 401 (unauthorized, but service is up)

curl -s https://api.example.com/orders -H "Authorization: Bearer $TOKEN" | jq '.data | length'
# Should return order count
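The status-code check above can be wrapped in a reusable helper so that a smoke script fails as soon as an endpoint returns something unexpected. `expect_status` is a hypothetical name; the curl flags are the same ones used in the command above.

```shell
# expect_status <expected_code> <url> [extra curl args...]
# Prints PASS or FAIL and returns non-zero when the status code differs.
expect_status() {
  local expected="$1" url="$2" actual
  shift 2
  actual=$(curl -s -o /dev/null -w '%{http_code}' "$@" "$url")
  if [ "$actual" = "$expected" ]; then
    echo "PASS $url -> $actual"
  else
    echo "FAIL $url -> $actual (expected $expected)"
    return 1
  fi
}
```

For example, `expect_status 401 https://api.example.com/orders` covers the unauthenticated check, and `expect_status 200 https://api.example.com/orders -H "Authorization: Bearer $TOKEN"` covers the authenticated one, assuming the endpoint answers 200 for a valid token.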

Metric Validation

Check key metrics are within thresholds:

Metric	Warning	Critical	Action
Error rate	>1%	>5%	Rollback
Latency p95	>500ms	>2000ms	Investigate
CPU	>70%	>90%	Scale up
Memory	>80%	>95%	Investigate

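# Query Prometheus for the 5xx error ratio (5xx requests / all requests)
# -G with --data-urlencode avoids curl globbing on {} and [] in the query
QUERY='sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'
curl -s -G "http://prometheus:9090/api/v1/query" --data-urlencode "query=$QUERY"

# Check error rate (jq -r strips the quotes around the value so bc can read it)
ERROR_RATE=$(curl -s ... | jq -r '.data.result[0].value[1]')
if (( $(echo "$ERROR_RATE > 0.05" | bc -l) )); then
  echo "ERROR: Error rate $ERROR_RATE exceeds 5%"
  exit 1
fi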
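The warning and critical columns of the threshold table can be applied with one generic comparison. The helper below is a sketch with a hypothetical name (`classify_metric`); it uses awk for the floating-point comparison so it does not depend on `bc`.

```shell
# classify_metric <value> <warning_threshold> <critical_threshold>
# Prints "critical", "warning", or "ok" using strict greater-than,
# matching the ">" thresholds in the table.
classify_metric() {
  awk -v v="$1" -v w="$2" -v c="$3" 'BEGIN {
    if (v > c)      print "critical"
    else if (v > w) print "warning"
    else            print "ok"
  }'
}
```

With the table's values, `classify_metric "$ERROR_RATE" 0.01 0.05` handles the error-rate row and `classify_metric "$P95_MS" 500 2000` the latency row, where `P95_MS` is an assumed variable holding the p95 latency in milliseconds.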

Validation Checklist

## Post-Deployment Validation

### Immediate (< 5 minutes)
- [ ] Health endpoint returns healthy
- [ ] All pods/instances running
- [ ] No crash loops
- [ ] Smoke tests pass

### Short-term (5-30 minutes)
- [ ] Error rate stable
- [ ] Latency within SLA
- [ ] No memory leaks
- [ ] Logs look normal

### Medium-term (30 min - 2 hours)
- [ ] User reports (if any)
- [ ] Downstream systems healthy
- [ ] Background jobs running
- [ ] Metrics trending normally

→ See references/validation-checklist.md

Rollback Procedures

When to Rollback

Signal	Severity	Action
Error rate > 5%	Critical	Immediate rollback
Service unavailable	Critical	Immediate rollback
Data corruption	Critical	Immediate rollback + investigation
Error rate > 1%	Warning	Investigate, consider rollback
Latency > 2x normal	Warning	Investigate, consider rollback
User reports	Varies	Investigate

Rollback Commands

Kubernetes

# Rollback to previous revision
kubectl rollout undo deployment/$SERVICE

# Rollback to specific revision
kubectl rollout undo deployment/$SERVICE --to-revision=3

# Check rollout history
kubectl rollout history deployment/$SERVICE

Docker Compose

# Set the image tag in docker-compose.yml back to the previous version, then pull it
docker-compose pull

# Recreate the service on the previous image
docker-compose up -d --no-deps service

Blue-Green

# Switch traffic back to blue
kubectl patch service $SERVICE -p '{"spec":{"selector":{"version":"blue"}}}'

Rollback Checklist

## Rollback Checklist

### Before Rollback
- [ ] Confirm rollback decision with team lead
- [ ] Note current state and symptoms
- [ ] Alert stakeholders

### Execute Rollback
- [ ] Trigger rollback command
- [ ] Verify rollback in progress
- [ ] Wait for completion

### After Rollback
- [ ] Verify service healthy
- [ ] Verify error rate dropping
- [ ] Notify stakeholders
- [ ] Create incident ticket
- [ ] Begin investigation

→ See references/rollback-procedures.md

Monitoring Setup

Essential Metrics

Category	Metrics
RED	Rate, Errors, Duration
USE	Utilization, Saturation, Errors
Business	Orders/min, Revenue, Active users

Alerting Rules

# Prometheus alerting rules
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        expr: sum by (service) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (service) (rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.service }}"

Dashboard Essentials

## Deployment Dashboard

### Top Row - Health
- Service status (up/down)
- Error rate (current)
- Request rate (current)

### Second Row - Trends
- Error rate (24h)
- Latency p50/p95/p99 (24h)
- Request rate (24h)

### Third Row - Resources
- CPU usage
- Memory usage
- Pod/instance count

### Bottom Row - Deployments
- Deployment annotations
- Version distribution
- Recent deployments

→ See references/monitoring-setup.md

Deployment Record

Track deployments for audit and debugging:

{
  "id": "deploy-2024-01-17-001",
  "service": "order-service",
  "version": "1.2.3",
  "environment": "production",
  "deployedAt": "2024-01-17T14:30:00Z",
  "deployedBy": "agent-001",
  "status": "success",
  "duration": 180,
  "previousVersion": "1.2.2",
  "changes": {
    "commits": ["abc1234", "def5678"],
    "prNumber": 123,
    "releaseNotes": "Added work order completion flow"
  },
  "validation": {
    "healthCheck": "pass",
    "smokeTests": "pass",
    "errorRate": 0.001,
    "latencyP95": 245
  },
  "rollback": null
}
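A record like the one above can be emitted at the end of a deploy script. The sketch below covers only a subset of the fields; `deployment_record` is a hypothetical name, the environment is hard-coded to production, and the values are assumed to contain no characters that need JSON escaping.

```shell
# deployment_record <service> <version> <previous_version> <status>
# Prints a one-line JSON record with a UTC timestamp.
# Assumption: arguments contain no quotes or backslashes.
deployment_record() {
  local deployed_at
  deployed_at=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  printf '{"service":"%s","version":"%s","previousVersion":"%s","environment":"production","deployedAt":"%s","status":"%s"}\n' \
    "$1" "$2" "$3" "$deployed_at" "$4"
}
```

Appending each record to a log file, for example `deployment_record order-service 1.2.3 1.2.2 success >> deployments.jsonl`, keeps a simple audit trail; the file name is only an illustration.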

Relationship to Other Skills

Skill	Relationship
`distribute`	Sets up CI/CD pipeline; deploy focuses on production strategies
`loop-controller`	Deploy is final stage before COMPLETE
`code-review`	Creates PR that triggers deploy
`git-workflow`	Manages merge that triggers deploy
`integration-test`	Validates before deploy
`security-audit`	Approves security-sensitive deploys

Note: Use distribute to set up the CI/CD pipeline (GitHub Actions, Vercel, tarball releases). Use deploy for production deployment strategies (blue-green, canary, rollback procedures, monitoring).

Documentation Site Deployment (Non-Web Apps)

For CLI tools, MCP servers, and other non-web applications, the SHIP phase should still deploy a documentation site to Vercel (or similar). Users need a browsable reference even when the core product isn't a web application.

What to Deploy

Content	Source	Purpose
Installation guide	README.md	How to install
Usage examples	Examples/ or README	How to use
API reference	Generated from source	Complete reference
Changelog	CHANGELOG.md	What changed

Documentation Site Structure

docs/
├── index.md          # Overview + quick start
├── installation.md   # Installation methods
├── usage.md          # Usage guide with examples
├── api/              # Generated API reference
├── examples/         # Runnable examples
└── changelog.md      # Version history

Vercel Deployment

# Install docs framework (e.g., VitePress, Nextra, Docusaurus)
npm add -D vitepress
npx vitepress init

# vercel.json (file contents, not a shell command)
{
  "buildCommand": "npm run docs:build",
  "outputDirectory": "docs/.vitepress/dist"
}

# Deploy
vercel --prod

When to Skip

Skip documentation site deployment only if:
- Project is internal-only with no external users
- Documentation exists elsewhere (company wiki)
- User explicitly opts out

Default behavior: Always deploy a docs site unless explicitly skipped.

Key Principles

Deploy frequently. Small, frequent deploys are safer than big-bang releases.

Automate everything. Manual steps cause errors.

Validate thoroughly. Trust but verify.

Roll back fast. When in doubt, roll back.

Monitor continuously. Watch metrics after deploy.

Document decisions. Record what was deployed and why.

Ship docs too. Every project deserves a browsable documentation site.

Mode-Specific Behavior

Deployment strategy and validation differ by orchestrator mode:

Greenfield Mode

Aspect	Behavior
Scope	Full deployment pipeline setup
Approach	Comprehensive deployment strategy design
Patterns	Free choice of deployment strategy
Deliverables	Full DEPLOY.md + monitoring setup
Validation	Standard smoke test suite
Constraints	Minimal - standard deployment risk

Brownfield-Polish Mode

Aspect	Behavior
Scope	Gap-specific deployment additions
Approach	Extend existing deployment patterns
Patterns	Should match existing CI/CD patterns
Deliverables	Delta deployment changes
Validation	Existing tests + gap-specific validation
Constraints	Don't break existing deployment process

Polish considerations:
- [ ] Deploy process matches existing CI/CD
- [ ] New features behind feature flags if needed
- [ ] Existing functionality smoke tested
- [ ] Gap functionality validated
- [ ] No breaking changes to existing APIs

Brownfield-Enterprise Mode

Aspect	Behavior
Scope	Change-specific deployment only
Approach	Surgical deployment with canary rollout
Patterns	Must conform exactly to existing procedures
Deliverables	Change record with rollback documentation
Validation	Full regression + change-specific testing
Constraints	Requires approval - change window scheduled

Enterprise deployment requirements:
- Change approval required before deploy
- Deployment window must be scheduled
- On-call engineer must be available
- Rollback tested in staging first
- Post-deploy monitoring period required

Enterprise deployment record:

{
  "changeId": "CHG-12345",
  "approvedBy": "change-board",
  "deployWindow": "2024-01-17T02:00:00Z",
  "rollbackTested": true,
  "monitoringPeriod": "4h",
  "escalationPath": ["on-call", "team-lead", "director"]
}

References

references/deployment-strategies.md: Detailed strategy comparison
references/deployment-commands.md: Platform-specific commands
references/migration-safety.md: Safe database migration patterns
references/validation-checklist.md: Comprehensive validation steps
references/rollback-procedures.md: Emergency rollback guide
references/monitoring-setup.md: Monitoring configuration
