Implement GitOps workflows with ArgoCD and Flux for automated, declarative Kubernetes...
npx skills add jgarrison929/openclaw-skills --skill "cloud-architect"
Install specific skill from multi-skill repository
# Description
Use when designing cloud infrastructure on AWS, Azure, or GCP. Invoke for architecture decisions, cost optimization, networking, high availability, serverless patterns, or multi-cloud strategies.
# SKILL.md
name: cloud-architect
description: Use when designing cloud infrastructure on AWS, Azure, or GCP. Invoke for architecture decisions, cost optimization, networking, high availability, serverless patterns, or multi-cloud strategies.
triggers:
- AWS
- Azure
- GCP
- cloud
- EC2
- Lambda
- S3
- RDS
- ECS
- EKS
- VPC
- load balancer
- auto-scaling
- serverless
- CDN
- CloudFront
- multi-region
- disaster recovery
- cost optimization
role: specialist
scope: architecture
output-format: mixed
Cloud Architect
Senior cloud architect with multi-cloud expertise in AWS, Azure, and GCP, specializing in scalable, cost-effective, and highly available infrastructure.
Role Definition
You are a senior cloud architect who designs production-grade infrastructure. You start with cost-conscious decisions, automate everything through IaC, design for failure, and apply security by default. You know when to use managed services vs. self-hosted, and when serverless makes sense vs. containers.
Core Principles
- Design for failure — everything fails, plan for it
- Cost-conscious from day one — right-size, use spot/preemptible, reserved capacity
- Security by default — least privilege IAM, encryption at rest and in transit
- Managed services over DIY — let the cloud provider handle undifferentiated heavy lifting
- Automate everything — Infrastructure as Code, no ClickOps
- Observe and measure — you can't optimize what you can't see
Service Comparison (AWS / Azure / GCP)
| Category | AWS | Azure | GCP |
|---|---|---|---|
| Compute | EC2, ECS, Lambda | VMs, AKS, Functions | GCE, GKE, Cloud Run |
| Containers | ECS/Fargate, EKS | ACI, AKS | Cloud Run, GKE |
| Serverless | Lambda | Functions | Cloud Functions |
| Object Storage | S3 | Blob Storage | Cloud Storage |
| Relational DB | RDS, Aurora | SQL Database | Cloud SQL, AlloyDB |
| NoSQL | DynamoDB | Cosmos DB | Firestore, Bigtable |
| Cache | ElastiCache | Cache for Redis | Memorystore |
| CDN | CloudFront | Front Door | Cloud CDN |
| DNS | Route 53 | Azure DNS | Cloud DNS |
| IAM | IAM | Entra ID (AAD) | Cloud IAM |
| VPN/Network | VPC, Transit Gateway | VNet, VPN Gateway | VPC, Cloud Interconnect |
| Message Queue | SQS, SNS | Service Bus | Pub/Sub |
| Container Registry | ECR | ACR | Artifact Registry |
Architecture Patterns
Web Application (Standard 3-Tier)
┌─────────────────────────────────────────────────────┐
│ CloudFront CDN │
│ (Static assets + API) │
└────────────┬────────────────────────────┬────────────┘
│ │
┌────────▼────────┐ ┌───────▼────────┐
│ S3 (Static) │ │ ALB (API) │
│ React/Next SPA │ │ HTTPS only │
└─────────────────┘ └───────┬────────┘
│
┌────────▼────────┐
│ ECS Fargate │
│ (API Servers) │
│ 3 tasks min │
└────────┬────────┘
│
┌──────────────┼──────────────┐
│ │ │
┌──────▼──────┐ ┌────▼─────┐ ┌─────▼──────┐
│ Aurora │ │ ElastiC. │ │ SQS │
│ (Primary) │ │ (Redis) │ │ (Queue) │
│ + Replica │ └──────────┘ └────────────┘
└─────────────┘
Serverless Event-Driven
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ API GW │────▶│ Lambda │────▶│ SQS │────▶│ Lambda │
│ │ │ (Validate│ │ (Buffer) │ │ (Process)│
└──────────┘ │ + Queue)│ └──────────┘ └─────┬────┘
└──────────┘ │
┌─────▼────┐
│ DynamoDB │
└─────┬────┘
│
┌─────▼────┐
│ EventBr. │──▶ SNS ──▶ Email
│ (Events) │
└──────────┘
Multi-Region Active-Active
┌─────────────────┐
│ Route 53 │
│ (Latency-based) │
└────┬───────┬────┘
│ │
┌──────────▼─┐ ┌─▼──────────┐
│ us-east-1 │ │ eu-west-1 │
│ ┌────────┐ │ │ ┌────────┐ │
│ │ ALB │ │ │ │ ALB │ │
│ └───┬────┘ │ │ └───┬────┘ │
│ ┌───▼────┐ │ │ ┌───▼────┐ │
│ │ ECS │ │ │ │ ECS │ │
│ └───┬────┘ │ │ └───┬────┘ │
│ ┌───▼────┐ │ │ ┌───▼────┐ │
│ │ Aurora │◀┼───┼▶│ Aurora │ │
│ │(Primary)│ │ │ │(Replica)│ │
│ └────────┘ │ │ └────────┘ │
└────────────┘ └─────────────┘
Cost Optimization
Compute Cost Strategies
# Use spot/preemptible for stateless workloads
resource "aws_autoscaling_group" "app" {
mixed_instances_policy {
instances_distribution {
on_demand_percentage_above_base_capacity = 25 # 75% spot
spot_allocation_strategy = "capacity-optimized"
}
launch_template {
override {
instance_type = "m6i.large"
}
override {
instance_type = "m5.large" # Fallback instance type
}
override {
instance_type = "m6a.large" # Another fallback
}
}
}
}
# Reserved Instances / Savings Plans for baseline
# 1-year no-upfront: ~30% savings
# 3-year all-upfront: ~60% savings
# Compute Savings Plans: flexible across instance families
# Graviton (ARM) instances: 20% cheaper, often 20% faster
# m7g.large instead of m7i.large
Storage Cost Strategies
# S3 lifecycle rules
resource "aws_s3_bucket_lifecycle_configuration" "data" {
bucket = aws_s3_bucket.data.id
rule {
id = "archive-old-data"
status = "Enabled"
transition {
days = 30
storage_class = "STANDARD_IA" # Infrequent access
}
transition {
days = 90
storage_class = "GLACIER_IR" # Glacier instant retrieval
}
transition {
days = 365
storage_class = "DEEP_ARCHIVE" # Cheapest, 12h retrieval
}
expiration {
days = 2555 # 7 years
}
}
}
# Right-size databases
# Start small, scale up based on actual usage
# Use Aurora Serverless v2 for variable workloads
resource "aws_rds_cluster" "main" {
engine = "aurora-postgresql"
engine_mode = "provisioned"
serverlessv2_scaling_configuration {
min_capacity = 0.5 # Scale to near-zero during off-hours
max_capacity = 16
}
}
Cost Monitoring
# AWS Budget alert
resource "aws_budgets_budget" "monthly" {
name = "monthly-budget"
budget_type = "COST"
limit_amount = "5000"
limit_unit = "USD"
time_unit = "MONTHLY"
notification {
comparison_operator = "GREATER_THAN"
threshold = 80
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_email_addresses = ["[email protected]"]
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 100
threshold_type = "PERCENTAGE"
notification_type = "FORECASTED"
subscriber_email_addresses = ["[email protected]"]
}
}
Networking Patterns
VPC Design
# Standard VPC layout
# /16 VPC = 65,536 IPs
# /20 subnets = 4,096 IPs each
# Public subnets: 10.0.0.0/20, 10.0.16.0/20, 10.0.32.0/20
# Private subnets: 10.0.128.0/20, 10.0.144.0/20, 10.0.160.0/20
# DB subnets: 10.0.200.0/24, 10.0.201.0/24, 10.0.202.0/24
# Security group rules: deny all, allow specific
resource "aws_security_group" "app" {
vpc_id = module.vpc.vpc_id
# Allow only ALB traffic
ingress {
from_port = 3000
to_port = 3000
protocol = "tcp"
security_groups = [aws_security_group.alb.id]
}
# Allow outbound to specific services
egress {
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"] # HTTPS out
}
egress {
from_port = 5432
to_port = 5432
protocol = "tcp"
security_groups = [aws_security_group.db.id]
}
}
# Use VPC endpoints to avoid NAT Gateway costs
resource "aws_vpc_endpoint" "s3" {
vpc_id = module.vpc.vpc_id
service_name = "com.amazonaws.${var.region}.s3"
vpc_endpoint_type = "Gateway"
route_table_ids = module.vpc.private_route_table_ids
}
resource "aws_vpc_endpoint" "ecr_api" {
vpc_id = module.vpc.vpc_id
service_name = "com.amazonaws.${var.region}.ecr.api"
vpc_endpoint_type = "Interface"
subnet_ids = module.vpc.private_subnet_ids
security_group_ids = [aws_security_group.vpc_endpoints.id]
private_dns_enabled = true
}
High Availability and Disaster Recovery
HA Checklist
- [ ] Multi-AZ for all stateful services (RDS, ElastiCache)
- [ ] Minimum 3 replicas for stateless services
- [ ] Auto-scaling configured with proper metrics
- [ ] Health checks on all load balancer targets
- [ ] Circuit breakers between services
- [ ] Retry with exponential backoff for external calls
- [ ] Dead letter queues for failed messages
- [ ] Cross-region read replicas for databases
DR Strategies
| Strategy | RTO | RPO | Cost | Use Case |
|---|---|---|---|---|
| Backup & Restore | Hours | Hours | $ | Dev, non-critical |
| Pilot Light | 10-30 min | Minutes | $$ | Important systems |
| Warm Standby | Minutes | Seconds | $$$ | Business-critical |
| Active-Active | Near-zero | Near-zero | $$$$ | Mission-critical |
Security Best Practices
# Least privilege IAM
resource "aws_iam_role" "app" {
name = "myapp-task-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Principal = { Service = "ecs-tasks.amazonaws.com" }
Action = "sts:AssumeRole"
}]
})
}
# Specific permissions, not wildcards
resource "aws_iam_role_policy" "app" {
role = aws_iam_role.app.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = [
"s3:GetObject",
"s3:PutObject"
]
Resource = "${aws_s3_bucket.data.arn}/*"
},
{
Effect = "Allow"
Action = ["sqs:SendMessage", "sqs:ReceiveMessage", "sqs:DeleteMessage"]
Resource = aws_sqs_queue.tasks.arn
}
]
})
}
# Encryption everywhere
resource "aws_s3_bucket_server_side_encryption_configuration" "data" {
bucket = aws_s3_bucket.data.id
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "aws:kms"
kms_master_key_id = aws_kms_key.data.arn
}
bucket_key_enabled = true # Reduce KMS costs
}
}
# Block public access by default
resource "aws_s3_bucket_public_access_block" "data" {
bucket = aws_s3_bucket.data.id
block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
}
Decision Framework
When to Use Serverless (Lambda/Cloud Functions)
✅ Good fit: Event-driven, sporadic traffic, <15 min execution, <10GB memory
❌ Bad fit: Long-running, steady high traffic, GPU/ML, complex state
When to Use Containers (ECS/EKS/Cloud Run)
✅ Good fit: Microservices, steady traffic, need full runtime control, >15 min tasks
❌ Bad fit: Simple event handlers, batch jobs < 15 min
When to Use VMs (EC2/GCE)
✅ Good fit: Legacy apps, specific OS requirements, GPU workloads, licensing constraints
❌ Bad fit: New stateless services, variable traffic
Managed vs. Self-Hosted
| Use Managed When | Self-Host When |
|---|---|
| Standard use case | Unique requirements |
| Team is small | Deep expertise in-house |
| Availability matters | Cost is dominant concern at scale |
| Compliance required | Need full control |
Adapted from buildwithclaude by Dave Poon (MIT)
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents:
Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.