Install this skill from the multi-skill repository:
npx skills add miles-knowbl/orchestrator --skill "deploy"
# Description
Guides deployment of systems to production environments. Covers deployment strategies, production validation, rollback procedures, and monitoring setup. Completes the engineering loop by bridging the gap between merged code and running production systems.
# SKILL.md
name: deploy
description: "Guides deployment of systems to production environments. Covers deployment strategies, production validation, rollback procedures, and monitoring setup. Completes the engineering loop by bridging the gap between merged code and running production systems."
phase: SHIP
category: core
version: "1.0.0"
depends_on: ["code-review"]
tags: [shipping, operations, core-workflow]
Deploy
Ship code to production safely.
When to Use
- After PR merge — Deploy new system or feature
- Hotfix — Emergency production fix
- Rollback — Revert problematic deployment
- Environment promotion — Move from staging to production
- Infrastructure changes — Deploy configuration or infrastructure
Reference Requirements
MUST read before applying this skill:
| Reference | Why Required |
|---|---|
| rollback-procedures.md | How to safely rollback if needed |
Read if applicable:
| Reference | When Needed |
|---|---|
| Stack-specific deployment guides | For specific deployment targets |
Verification: Ensure DEPLOY.md is produced with rollback procedure documented.
Required Deliverables
| Deliverable | Location | Condition |
|---|---|---|
| DEPLOY.md | Project root | Always |
Core Concept
Deployment answers: "How do we safely get this code running in production?"
┌─────────────────────────────────────────────────────────────────────────────┐
│                             DEPLOYMENT PIPELINE                             │
│                                                                             │
│  Code Merged      Build &       Deploy to      Validate       Monitor &     │
│    to Main   →    Package   →  Production  →  Production  →    Observe      │
│                                                                             │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐    │
│  │   PR    │    │ Docker  │    │  K8s/   │    │  Smoke  │    │ Metrics │    │
│  │ Merged  │───▶│  Build  │───▶│   ECS   │───▶│  Tests  │───▶│ Alerts  │    │
│  └─────────┘    └─────────┘    └─────────┘    └─────────┘    └─────────┘    │
│                                     │              │                        │
│                                     │     FAIL     │                        │
│                                     ▼              │                        │
│                                ┌─────────┐         │                        │
│                                │Rollback │◀────────┘                        │
│                                └─────────┘                                  │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
Deployment Strategies
Strategy Comparison
| Strategy | Risk | Rollback Speed | Resource Overhead | Best For |
|---|---|---|---|---|
| Rolling | Low | Medium | Low | Standard deploys |
| Blue-Green | Very Low | Fast | High (2x) | Critical services |
| Canary | Very Low | Fast | Medium | High-traffic services |
| Recreate | High | Slow | None | Dev/test environments |
| Feature Flag | Very Low | Instant | None | Gradual rollouts |
Rolling Deployment
Deploy to instances one at a time:
┌─────────────────────────────────────────────────────────────────────────────┐
│ ROLLING DEPLOYMENT │
│ │
│ Time 0: [v1] [v1] [v1] [v1] │
│ │
│ Time 1: [v2] [v1] [v1] [v1] ← First instance updated │
│ │
│ Time 2: [v2] [v2] [v1] [v1] ← Second instance updated │
│ │
│ Time 3: [v2] [v2] [v2] [v1] ← Third instance updated │
│ │
│ Time 4: [v2] [v2] [v2] [v2] ← All instances updated │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Blue-Green Deployment
Run two identical environments, switch traffic:
┌─────────────────────────────────────────────────────────────────────────────┐
│ BLUE-GREEN DEPLOYMENT │
│ │
│ Before: │
│ │
│ ┌──────────────┐ │
│ Traffic ─┤ Blue (v1) │ ← Active │
│ └──────────────┘ │
│ ┌──────────────┐ │
│ │ Green (idle) │ ← Idle │
│ └──────────────┘ │
│ │
│ After: │
│ │
│ ┌──────────────┐ │
│ │ Blue (v1) │ ← Idle (rollback target) │
│ └──────────────┘ │
│ ┌──────────────┐ │
│ Traffic ─┤ Green (v2) │ ← Active │
│ └──────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Canary Deployment
Gradually shift traffic to new version:
┌─────────────────────────────────────────────────────────────────────────────┐
│ CANARY DEPLOYMENT │
│ │
│ Stage 1: 5% traffic to v2 │
│ │
│ ┌──────────────┐ │
│ 95% ─────┤ v1 (stable) │ │
│ └──────────────┘ │
│ ┌──────────────┐ │
│ 5% ─────┤ v2 (canary) │ │
│ └──────────────┘ │
│ │
│ Stage 2: 25% traffic to v2 (if metrics OK) │
│ Stage 3: 50% traffic to v2 (if metrics OK) │
│ Stage 4: 100% traffic to v2 (if metrics OK) │
│ │
│ At any stage: Rollback if errors spike │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
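The staged progression in the diagram can be expressed as a small decision helper. This is a minimal sketch under stated assumptions: `next_canary_weight` and its arguments are hypothetical names, and the 5% default threshold is borrowed from the critical error-rate level used elsewhere in this skill.

```shell
#!/bin/sh
# Sketch: pick the next canary traffic weight (percent of traffic on v2).
# Stages mirror the diagram: 5 -> 25 -> 50 -> 100.
# Prints 0 to signal "roll back" when the error rate is above the threshold.
# next_canary_weight and its arguments are hypothetical, for illustration only.
next_canary_weight() {
  _current="$1"            # current percent of traffic on v2 (0 before stage 1)
  _error_rate="$2"         # observed error ratio, e.g. 0.012 for 1.2%
  _threshold="${3:-0.05}"  # assumed rollback threshold (5%)

  # sh arithmetic is integer-only, so awk does the floating-point comparison.
  if awk -v e="$_error_rate" -v t="$_threshold" 'BEGIN { exit !(e > t) }'; then
    echo 0
    return 0
  fi

  case "$_current" in
    0)      echo 5 ;;
    5)      echo 25 ;;
    25)     echo 50 ;;
    50|100) echo 100 ;;
    *)      echo "unknown stage: $_current" >&2; return 1 ;;
  esac
}
```

For example, `next_canary_weight 5 0.01` prints `25`, while `next_canary_weight 25 0.08` prints `0`. How the weight is actually applied (service mesh, load balancer, ingress annotations) depends on the platform and is left out here.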
→ See references/deployment-strategies.md
The Deployment Process
┌─────────────────────────────────────────────────────────────────────────────┐
│ DEPLOYMENT PROCESS │
│ │
│ 1. PRE-DEPLOYMENT │
│ ├─→ Verify build artifacts exist │
│ ├─→ Run pre-deployment checks │
│ ├─→ Notify stakeholders │
│ └─→ Create deployment record │
│ │
│ 2. DEPLOYMENT │
│ ├─→ Apply infrastructure changes (if any) │
│ ├─→ Deploy application │
│ ├─→ Run database migrations (if any) │
│ └─→ Update service configuration │
│ │
│ 3. VALIDATION │
│ ├─→ Health checks pass │
│ ├─→ Smoke tests pass │
│ ├─→ Metrics within thresholds │
│ └─→ No error spikes │
│ │
│ 4. POST-DEPLOYMENT │
│ ├─→ Update deployment record │
│ ├─→ Notify stakeholders │
│ ├─→ Monitor for issues │
│ └─→ Document any issues │
│ │
│ 5. ROLLBACK (if needed) │
│ ├─→ Trigger rollback │
│ ├─→ Verify rollback successful │
│ ├─→ Investigate failure │
│ └─→ Document incident │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
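The deploy, validate, and rollback phases above can be wired together in a small wrapper. The following is a sketch rather than a definitive implementation: `run_deployment` and its three arguments are hypothetical, and each argument stands in for whatever real command performs that step in a given stack.

```shell
#!/bin/sh
# Sketch: run deploy -> validate, and trigger rollback when either step fails.
# Each argument is a command name (shell function or executable) supplied by
# the caller; they are placeholders for real steps such as a kubectl rollout
# or a smoke-test run.
run_deployment() {
  _deploy_cmd="$1"
  _validate_cmd="$2"
  _rollback_cmd="$3"

  echo "phase: deployment"
  if ! "$_deploy_cmd"; then
    echo "deployment step failed, rolling back"
    "$_rollback_cmd"
    return 1
  fi

  echo "phase: validation"
  if ! "$_validate_cmd"; then
    echo "validation failed, rolling back"
    "$_rollback_cmd"
    return 1
  fi

  echo "phase: post-deployment"
  return 0
}
```

A caller would define three functions (for instance `do_deploy`, `do_validate`, `do_rollback`, all hypothetical) and invoke `run_deployment do_deploy do_validate do_rollback`. Pre-deployment checks and notifications are omitted to keep the sketch short.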
Pre-Deployment Checklist
Required Checks
## Pre-Deployment Checklist
### Build Verification
- [ ] Build artifacts exist
- [ ] Docker image tagged and pushed
- [ ] Version number correct
- [ ] All tests passed in CI
### Database
- [ ] Migrations tested in staging
- [ ] Migrations are reversible
- [ ] No destructive changes (or approved)
- [ ] Backup taken (if significant migration)
### Dependencies
- [ ] All dependent services available
- [ ] External API changes coordinated
- [ ] Feature flags configured
### Notifications
- [ ] Team notified of deployment
- [ ] Stakeholders aware (if significant)
- [ ] On-call engineer aware
### Documentation
- [ ] Changelog updated
- [ ] Runbook updated (if applicable)
- [ ] Known issues documented
Pre-Deployment Commands
# Verify build
docker pull $REGISTRY/$IMAGE:$VERSION
docker inspect $REGISTRY/$IMAGE:$VERSION
# Check staging
curl -s https://staging.example.com/health | jq .status
# Notify
slack-notify "#deployments" "Starting deployment of $SERVICE v$VERSION"
# Create deployment record
gh api repos/$REPO/deployments -f ref=$SHA -f environment=production
Deployment Execution
Kubernetes Deployment
# Update deployment
kubectl set image deployment/$SERVICE $SERVICE=$IMAGE:$VERSION
# Watch rollout
kubectl rollout status deployment/$SERVICE --timeout=5m
# Check pods
kubectl get pods -l app=$SERVICE
Kubernetes Manifest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
        version: v1.2.3
    spec:
      containers:
        - name: order-service
          image: registry.example.com/order-service:v1.2.3
          ports:
            - containerPort: 3000
          readinessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 15
            periodSeconds: 20
          resources:
            requests:
              memory: "256Mi"
              cpu: "100m"
            limits:
              memory: "512Mi"
              cpu: "500m"
AWS ECS Deployment
# Update service
aws ecs update-service \
--cluster production \
--service order-service \
--task-definition order-service:$VERSION \
--desired-count 3
# Wait for stability
aws ecs wait services-stable \
--cluster production \
--services order-service
Docker Compose (Simple)
# Pull new images
docker-compose pull
# Deploy with zero downtime
docker-compose up -d --no-deps --scale service=2 service
sleep 30
docker-compose up -d --no-deps --scale service=1 service
→ See references/deployment-commands.md
Database Migrations
Migration Safety Rules
- Backward compatible first — New code must work with old schema
- Forward migration only — During deploy, not rollback
- Test in staging — Always run migrations in staging first
- Small batches — Don't lock tables for long
- Backup first — For destructive changes
Safe Migration Patterns
| Change | Safe Approach |
|---|---|
| Add column | Add with default or nullable, backfill later |
| Remove column | Stop using, then remove in later deploy |
| Rename column | Add new, migrate data, remove old |
| Add index | CREATE INDEX CONCURRENTLY |
| Change type | Add new column, migrate, drop old |
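The patterns in the table can be partly enforced with a pre-deploy lint over migration files. This is a rough sketch: `check_migration` is a hypothetical helper, it assumes plain SQL files with one statement per line, and it only flags the three cases named in the table.

```shell
#!/bin/sh
# Sketch: flag SQL statements that the safe-migration table treats as risky.
# check_migration is hypothetical; it scans line by line, so statements that
# span several lines may be missed.
check_migration() {
  _file="$1"
  _status=0

  if grep -Eiq 'DROP[[:space:]]+COLUMN' "$_file"; then
    echo "WARN: DROP COLUMN found; stop using the column first, remove it in a later deploy"
    _status=1
  fi

  if grep -Eiq 'RENAME[[:space:]]+COLUMN' "$_file"; then
    echo "WARN: RENAME COLUMN found; add the new column, migrate data, then remove the old one"
    _status=1
  fi

  # Select CREATE [UNIQUE] INDEX lines, then keep only those without CONCURRENTLY.
  if grep -Ei 'CREATE[[:space:]]+(UNIQUE[[:space:]]+)?INDEX' "$_file" | grep -Eivq 'CONCURRENTLY'; then
    echo "WARN: CREATE INDEX without CONCURRENTLY may lock the table"
    _status=1
  fi

  return "$_status"
}
```

Running `check_migration path/to/migration.sql` (the path is a placeholder) returns a non-zero status when any warning is printed, so it can gate a CI step. Approved destructive changes would need an explicit override, which this sketch does not model.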
Migration Commands
# Check pending migrations
npm run db:migrate:status
# Run migrations
npm run db:migrate
# Rollback (if needed)
npm run db:migrate:rollback
→ See references/migration-safety.md
Production Validation
Health Checks
# Basic health
curl -s https://api.example.com/health | jq
# Expected response
{
"status": "healthy",
"version": "1.2.3",
"uptime": 123,
"dependencies": {
"database": "healthy",
"redis": "healthy"
}
}
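Right after a rollout the health endpoint may fail for a few seconds while instances start, so a single probe can give a false alarm. One way to handle that is to retry the check a bounded number of times. The helper below is a sketch; `retry_check` is a hypothetical name, and the attempt count and delay are up to the caller.

```shell
#!/bin/sh
# Sketch: retry a check until it succeeds or the attempts run out.
# retry_check is a hypothetical helper; the command to run is passed after the
# attempt count and the delay in seconds.
retry_check() {
  _attempts="$1"
  _delay="$2"
  shift 2
  _n=1
  while [ "$_n" -le "$_attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "check failed (attempt $_n of $_attempts)" >&2
    # Only sleep when another attempt will follow.
    if [ "$_n" -lt "$_attempts" ]; then
      sleep "$_delay"
    fi
    _n=$((_n + 1))
  done
  return 1
}
```

A possible invocation is `retry_check 10 3 curl -fsS -o /dev/null https://api.example.com/health`, where `-f` makes curl exit non-zero on HTTP error statuses. The numbers 10 and 3 are illustrative, not recommendations from this skill.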
Smoke Tests
# Run smoke tests against production
npm run test:smoke -- --env=production
# Or specific checks
curl -s -o /dev/null -w "%{http_code}" https://api.example.com/orders
# Should return 401 (unauthorized, but service is up)
curl -s https://api.example.com/orders -H "Authorization: Bearer $TOKEN" | jq '.data | length'
# Should return order count
Metric Validation
Check key metrics are within thresholds:
| Metric | Warning | Critical | Action |
|---|---|---|---|
| Error rate | >1% | >5% | Rollback |
| Latency p95 | >500ms | >2000ms | Investigate |
| CPU | >70% | >90% | Scale up |
| Memory | >80% | >95% | Investigate |
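# Query Prometheus for the 5xx error ratio
# (-G with --data-urlencode keeps curl from globbing the {} and [] in the query)
curl -s -G "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'
# Check error rate (jq -r strips the JSON quotes so bc can parse the number)
ERROR_RATE=$(curl -s ... | jq -r '.data.result[0].value[1]')
if (( $(echo "$ERROR_RATE > 0.05" | bc -l) )); then
  echo "ERROR: Error rate $ERROR_RATE exceeds 5%"
  exit 1
fi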
Validation Checklist
## Post-Deployment Validation
### Immediate (< 5 minutes)
- [ ] Health endpoint returns healthy
- [ ] All pods/instances running
- [ ] No crash loops
- [ ] Smoke tests pass
### Short-term (5-30 minutes)
- [ ] Error rate stable
- [ ] Latency within SLA
- [ ] No memory leaks
- [ ] Logs look normal
### Medium-term (30 min - 2 hours)
- [ ] User reports (if any)
- [ ] Downstream systems healthy
- [ ] Background jobs running
- [ ] Metrics trending normally
→ See references/validation-checklist.md
Rollback Procedures
When to Rollback
| Signal | Severity | Action |
|---|---|---|
| Error rate > 5% | Critical | Immediate rollback |
| Service unavailable | Critical | Immediate rollback |
| Data corruption | Critical | Immediate rollback + investigation |
| Error rate > 1% | Warning | Investigate, consider rollback |
| Latency > 2x normal | Warning | Investigate, consider rollback |
| User reports | Varies | Investigate |
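The numeric rows of the table translate into a simple decision function. This is a sketch under stated assumptions: `rollback_action` is a hypothetical helper, the error rate is passed as a ratio, and latency is passed as a multiple of the normal baseline. Signals such as data corruption or user reports still need human judgment and are not modeled.

```shell
#!/bin/sh
# Sketch: map error rate and latency to the action suggested by the table.
# rollback_action and its arguments are hypothetical, for illustration only.
rollback_action() {
  _error_rate="$1"     # error ratio, e.g. 0.02 for 2%
  _latency_ratio="$2"  # current latency divided by normal latency

  # awk is used because sh arithmetic cannot compare fractional numbers.
  awk -v e="$_error_rate" -v l="$_latency_ratio" 'BEGIN {
    if (e > 0.05)               print "rollback"
    else if (e > 0.01 || l > 2) print "investigate"
    else                        print "ok"
  }'
}
```

For example, `rollback_action 0.06 1` prints `rollback`, `rollback_action 0.02 1` prints `investigate`, and `rollback_action 0.001 1.1` prints `ok`.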
Rollback Commands
Kubernetes
# Rollback to previous revision
kubectl rollout undo deployment/$SERVICE
# Rollback to specific revision
kubectl rollout undo deployment/$SERVICE --to-revision=3
# Check rollout history
kubectl rollout history deployment/$SERVICE
Docker Compose
# Set the image tag in docker-compose.yml back to the previous version, then pull that image
docker-compose pull
# Recreate only the affected service from the previous image
docker-compose up -d --no-deps service
Blue-Green
# Switch traffic back to blue
kubectl patch service $SERVICE -p '{"spec":{"selector":{"version":"blue"}}}'
Rollback Checklist
## Rollback Checklist
### Before Rollback
- [ ] Confirm rollback decision with team lead
- [ ] Note current state and symptoms
- [ ] Alert stakeholders
### Execute Rollback
- [ ] Trigger rollback command
- [ ] Verify rollback in progress
- [ ] Wait for completion
### After Rollback
- [ ] Verify service healthy
- [ ] Verify error rate dropping
- [ ] Notify stakeholders
- [ ] Create incident ticket
- [ ] Begin investigation
→ See references/rollback-procedures.md
Monitoring Setup
Essential Metrics
| Category | Metrics |
|---|---|
| RED | Rate, Errors, Duration |
| USE | Utilization, Saturation, Errors |
| Business | Orders/min, Revenue, Active users |
Alerting Rules
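# Prometheus alerting rules
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        # Share of 5xx responses among all responses, per service (a ratio, so 0.05 means 5%)
        expr: sum by (service) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (service) (rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
      - alert: HighLatency
        # p95 latency in seconds, aggregated across instances per service
        expr: histogram_quantile(0.95, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.service }}"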
Dashboard Essentials
## Deployment Dashboard
### Top Row - Health
- Service status (up/down)
- Error rate (current)
- Request rate (current)
### Second Row - Trends
- Error rate (24h)
- Latency p50/p95/p99 (24h)
- Request rate (24h)
### Third Row - Resources
- CPU usage
- Memory usage
- Pod/instance count
### Bottom Row - Deployments
- Deployment annotations
- Version distribution
- Recent deployments
→ See references/monitoring-setup.md
Deployment Record
Track deployments for audit and debugging:
{
"id": "deploy-2024-01-17-001",
"service": "order-service",
"version": "1.2.3",
"environment": "production",
"deployedAt": "2024-01-17T14:30:00Z",
"deployedBy": "agent-001",
"status": "success",
"duration": 180,
"previousVersion": "1.2.2",
"changes": {
"commits": ["abc1234", "def5678"],
"prNumber": 123,
"releaseNotes": "Added work order completion flow"
},
"validation": {
"healthCheck": "pass",
"smokeTests": "pass",
"errorRate": 0.001,
"latencyP95": 245
},
"rollback": null
}
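A record like the one above can be emitted by the deploy script itself so that it is never written by hand. The helper below is a minimal sketch: `deployment_record` is a hypothetical name, it covers only a subset of the fields, and it assumes the arguments contain no quotes or backslashes because it does no JSON escaping.

```shell
#!/bin/sh
# Sketch: print a minimal deployment record as JSON.
# deployment_record is a hypothetical helper. It does no JSON escaping, so it
# assumes the arguments contain no quotes or backslashes.
deployment_record() {
  _service="$1"
  _version="$2"
  _previous="$3"
  _status="$4"
  _deployed_at=$(date -u +%Y-%m-%dT%H:%M:%SZ)  # current time in UTC, ISO 8601

  printf '{\n'
  printf '  "service": "%s",\n' "$_service"
  printf '  "version": "%s",\n' "$_version"
  printf '  "environment": "production",\n'
  printf '  "deployedAt": "%s",\n' "$_deployed_at"
  printf '  "status": "%s",\n' "$_status"
  printf '  "previousVersion": "%s"\n' "$_previous"
  printf '}\n'
}
```

Calling `deployment_record order-service 1.2.3 1.2.2 success` prints a JSON object that can be redirected to a file or posted to whatever system stores deployment history.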
Relationship to Other Skills
| Skill | Relationship |
|---|---|
| distribute | Sets up CI/CD pipeline; deploy focuses on production strategies |
| loop-controller | Deploy is final stage before COMPLETE |
| code-review | Creates PR that triggers deploy |
| git-workflow | Manages merge that triggers deploy |
| integration-test | Validates before deploy |
| security-audit | Approves security-sensitive deploys |
Note: Use distribute to set up the CI/CD pipeline (GitHub Actions, Vercel, tarball releases). Use deploy for production deployment strategies (blue-green, canary, rollback procedures, monitoring).
Documentation Site Deployment (Non-Web Apps)
For CLI tools, MCP servers, and other non-web applications, the SHIP phase should still deploy a documentation site to Vercel (or similar). Users need a browsable reference even when the core product isn't a web application.
What to Deploy
| Content | Source | Purpose |
|---|---|---|
| Installation guide | README.md | How to install |
| Usage examples | Examples/ or README | How to use |
| API reference | Generated from source | Complete reference |
| Changelog | CHANGELOG.md | What changed |
Documentation Site Structure
docs/
├── index.md # Overview + quick start
├── installation.md # Installation methods
├── usage.md # Usage guide with examples
├── api/ # Generated API reference
├── examples/ # Runnable examples
└── changelog.md # Version history
Vercel Deployment
# Install docs framework (e.g., VitePress, Nextra, Docusaurus)
npm add -D vitepress
# Run the setup wizard and choose ./docs as the site directory
npx vitepress init
# Configure vercel.json
{
"buildCommand": "npm run docs:build",
"outputDirectory": "docs/.vitepress/dist"
}
# Deploy
vercel --prod
When to Skip
Skip documentation site deployment only if:
- Project is internal-only with no external users
- Documentation exists elsewhere (company wiki)
- User explicitly opts out
Default behavior: Always deploy a docs site unless explicitly skipped.
Key Principles
Deploy frequently. Small, frequent deploys are safer than big-bang releases.
Automate everything. Manual steps cause errors.
Validate thoroughly. Trust but verify.
Roll back fast. When in doubt, roll back.
Monitor continuously. Watch metrics after deploy.
Document decisions. Record what was deployed and why.
Ship docs too. Every project deserves a browsable documentation site.
Mode-Specific Behavior
Deployment strategy and validation differ by orchestrator mode:
Greenfield Mode
| Aspect | Behavior |
|---|---|
| Scope | Full deployment pipeline setup |
| Approach | Comprehensive deployment strategy design |
| Patterns | Free choice of deployment strategy |
| Deliverables | Full DEPLOY.md + monitoring setup |
| Validation | Standard smoke test suite |
| Constraints | Minimal - standard deployment risk |
Brownfield-Polish Mode
| Aspect | Behavior |
|---|---|
| Scope | Gap-specific deployment additions |
| Approach | Extend existing deployment patterns |
| Patterns | Should match existing CI/CD patterns |
| Deliverables | Delta deployment changes |
| Validation | Existing tests + gap-specific validation |
| Constraints | Don't break existing deployment process |
Polish considerations:
- [ ] Deploy process matches existing CI/CD
- [ ] New features behind feature flags if needed
- [ ] Existing functionality smoke tested
- [ ] Gap functionality validated
- [ ] No breaking changes to existing APIs
Brownfield-Enterprise Mode
| Aspect | Behavior |
|---|---|
| Scope | Change-specific deployment only |
| Approach | Surgical deployment with canary rollout |
| Patterns | Must conform exactly to existing procedures |
| Deliverables | Change record with rollback documentation |
| Validation | Full regression + change-specific testing |
| Constraints | Requires approval - change window scheduled |
Enterprise deployment requirements:
- Change approval required before deploy
- Deployment window must be scheduled
- On-call engineer must be available
- Rollback tested in staging first
- Post-deploy monitoring period required
Enterprise deployment record:
{
"changeId": "CHG-12345",
"approvedBy": "change-board",
"deployWindow": "2024-01-17T02:00:00Z",
"rollbackTested": true,
"monitoringPeriod": "4h",
"escalationPath": ["on-call", "team-lead", "director"]
}
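Some of the enterprise requirements can be checked mechanically before any deploy command runs. The gate below is a sketch, not an established procedure: `enterprise_gate` and its arguments are hypothetical, and it covers only change approval, rollback testing, and on-call availability. Checking the deployment window is left out because it depends on how windows are stored.

```shell
#!/bin/sh
# Sketch: refuse to deploy unless the enterprise requirements are met.
# enterprise_gate and its arguments are hypothetical, for illustration only.
enterprise_gate() {
  _change_id="$1"        # approved change record, e.g. CHG-12345 (empty if none)
  _rollback_tested="$2"  # "true" once rollback was rehearsed in staging
  _on_call="$3"          # on-call engineer's handle (empty if nobody is available)

  if [ -z "$_change_id" ]; then
    echo "BLOCKED: no approved change record"
    return 1
  fi
  if [ "$_rollback_tested" != "true" ]; then
    echo "BLOCKED: rollback not tested in staging"
    return 1
  fi
  if [ -z "$_on_call" ]; then
    echo "BLOCKED: no on-call engineer available"
    return 1
  fi
  echo "OK: $_change_id may proceed"
}
```

A deploy script could start with `enterprise_gate "$CHANGE_ID" "$ROLLBACK_TESTED" "$ON_CALL" || exit 1`, where the three variables are placeholders for values pulled from the change-management system.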
References
- references/deployment-strategies.md: Detailed strategy comparison
- references/deployment-commands.md: Platform-specific commands
- references/migration-safety.md: Safe database migration patterns
- references/validation-checklist.md: Comprehensive validation steps
- references/rollback-procedures.md: Emergency rollback guide
- references/monitoring-setup.md: Monitoring configuration