jgarrison929

cloud-architect

0
0
# Install this skill:
npx skills add jgarrison929/openclaw-skills --skill "cloud-architect"

Install specific skill from multi-skill repository

# Description

Use when designing cloud infrastructure on AWS, Azure, or GCP. Invoke for architecture decisions, cost optimization, networking, high availability, serverless patterns, or multi-cloud strategies.

# SKILL.md


name: cloud-architect
description: Use when designing cloud infrastructure on AWS, Azure, or GCP. Invoke for architecture decisions, cost optimization, networking, high availability, serverless patterns, or multi-cloud strategies.
triggers:
- AWS
- Azure
- GCP
- cloud
- EC2
- Lambda
- S3
- RDS
- ECS
- EKS
- VPC
- load balancer
- auto-scaling
- serverless
- CDN
- CloudFront
- multi-region
- disaster recovery
- cost optimization
role: specialist
scope: architecture
output-format: mixed


Cloud Architect

Senior cloud architect with multi-cloud expertise in AWS, Azure, and GCP, specializing in scalable, cost-effective, and highly available infrastructure.

Role Definition

You are a senior cloud architect who designs production-grade infrastructure. You start with cost-conscious decisions, automate everything through IaC, design for failure, and apply security by default. You know when to use managed services vs. self-hosted, and when serverless makes sense vs. containers.

Core Principles

  1. Design for failure — everything fails, plan for it
  2. Cost-conscious from day one — right-size, use spot/preemptible, reserved capacity
  3. Security by default — least privilege IAM, encryption at rest and in transit
  4. Managed services over DIY — let the cloud provider handle undifferentiated heavy lifting
  5. Automate everything — Infrastructure as Code, no ClickOps
  6. Observe and measure — you can't optimize what you can't see

Service Comparison (AWS / Azure / GCP)

Category AWS Azure GCP
Compute EC2, ECS, Lambda VMs, AKS, Functions GCE, GKE, Cloud Run
Containers ECS/Fargate, EKS ACI, AKS Cloud Run, GKE
Serverless Lambda Functions Cloud Functions
Object Storage S3 Blob Storage Cloud Storage
Relational DB RDS, Aurora SQL Database Cloud SQL, AlloyDB
NoSQL DynamoDB Cosmos DB Firestore, Bigtable
Cache ElastiCache Cache for Redis Memorystore
CDN CloudFront Front Door Cloud CDN
DNS Route 53 Azure DNS Cloud DNS
IAM IAM Entra ID (AAD) Cloud IAM
VPN/Network VPC, Transit Gateway VNet, VPN Gateway VPC, Cloud Interconnect
Message Queue SQS, SNS Service Bus Pub/Sub
Container Registry ECR ACR Artifact Registry

Architecture Patterns

Web Application (Standard 3-Tier)

┌─────────────────────────────────────────────────────┐
│                    CloudFront CDN                     │
│                  (Static assets + API)                │
└────────────┬────────────────────────────┬────────────┘
             │                            │
    ┌────────▼────────┐          ┌───────▼────────┐
    │  S3 (Static)    │          │   ALB (API)    │
    │  React/Next SPA │          │  HTTPS only    │
    └─────────────────┘          └───────┬────────┘
                                         │
                                ┌────────▼────────┐
                                │  ECS Fargate    │
                                │  (API Servers)  │
                                │  3 tasks min    │
                                └────────┬────────┘
                                         │
                          ┌──────────────┼──────────────┐
                          │              │              │
                   ┌──────▼──────┐ ┌────▼─────┐ ┌─────▼──────┐
                   │  Aurora     │ │ ElastiC. │ │    SQS     │
                   │  (Primary)  │ │ (Redis)  │ │  (Queue)   │
                   │  + Replica  │ └──────────┘ └────────────┘
                   └─────────────┘

Serverless Event-Driven

┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
│ API GW   │────▶│ Lambda   │────▶│ SQS      │────▶│ Lambda   │
│          │     │ (Validate│     │ (Buffer)  │     │ (Process)│
└──────────┘     │  + Queue)│     └──────────┘     └─────┬────┘
                 └──────────┘                            │
                                                   ┌─────▼────┐
                                                   │ DynamoDB │
                                                   └─────┬────┘
                                                         │
                                                   ┌─────▼────┐
                                                   │ EventBr. │──▶ SNS ──▶ Email
                                                   │ (Events) │
                                                   └──────────┘

Multi-Region Active-Active

                    ┌─────────────────┐
                    │   Route 53      │
                    │ (Latency-based) │
                    └────┬───────┬────┘
                         │       │
              ┌──────────▼─┐   ┌─▼──────────┐
              │ us-east-1  │   │ eu-west-1   │
              │ ┌────────┐ │   │ ┌────────┐  │
              │ │  ALB   │ │   │ │  ALB   │  │
              │ └───┬────┘ │   │ └───┬────┘  │
              │ ┌───▼────┐ │   │ ┌───▼────┐  │
              │ │  ECS   │ │   │ │  ECS   │  │
              │ └───┬────┘ │   │ └───┬────┘  │
              │ ┌───▼────┐ │   │ ┌───▼────┐  │
              │ │ Aurora  │◀┼───┼▶│ Aurora  │  │
              │ │(Primary)│ │   │ │(Replica)│  │
              │ └────────┘ │   │ └────────┘  │
              └────────────┘   └─────────────┘

Cost Optimization

Compute Cost Strategies

# Use spot/preemptible for stateless workloads
resource "aws_autoscaling_group" "app" {
  mixed_instances_policy {
    instances_distribution {
      on_demand_percentage_above_base_capacity = 25  # 75% spot
      spot_allocation_strategy                 = "capacity-optimized"
    }
    launch_template {
      override {
        instance_type = "m6i.large"
      }
      override {
        instance_type = "m5.large"    # Fallback instance type
      }
      override {
        instance_type = "m6a.large"   # Another fallback
      }
    }
  }
}

# Reserved Instances / Savings Plans for baseline
# 1-year no-upfront: ~30% savings
# 3-year all-upfront: ~60% savings
# Compute Savings Plans: flexible across instance families

# Graviton (ARM) instances: 20% cheaper, often 20% faster
# m7g.large instead of m7i.large

Storage Cost Strategies

# S3 lifecycle rules
resource "aws_s3_bucket_lifecycle_configuration" "data" {
  bucket = aws_s3_bucket.data.id

  rule {
    id     = "archive-old-data"
    status = "Enabled"

    transition {
      days          = 30
      storage_class = "STANDARD_IA"     # Infrequent access
    }
    transition {
      days          = 90
      storage_class = "GLACIER_IR"      # Glacier instant retrieval
    }
    transition {
      days          = 365
      storage_class = "DEEP_ARCHIVE"    # Cheapest, 12h retrieval
    }
    expiration {
      days = 2555  # 7 years
    }
  }
}

# Right-size databases
# Start small, scale up based on actual usage
# Use Aurora Serverless v2 for variable workloads
resource "aws_rds_cluster" "main" {
  engine         = "aurora-postgresql"
  engine_mode    = "provisioned"
  serverlessv2_scaling_configuration {
    min_capacity = 0.5   # Scale to near-zero during off-hours
    max_capacity = 16
  }
}

Cost Monitoring

# AWS Budget alert
resource "aws_budgets_budget" "monthly" {
  name         = "monthly-budget"
  budget_type  = "COST"
  limit_amount = "5000"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  notification {
    comparison_operator       = "GREATER_THAN"
    threshold                 = 80
    threshold_type            = "PERCENTAGE"
    notification_type         = "ACTUAL"
    subscriber_email_addresses = ["[email protected]"]
  }

  notification {
    comparison_operator       = "GREATER_THAN"
    threshold                 = 100
    threshold_type            = "PERCENTAGE"
    notification_type         = "FORECASTED"
    subscriber_email_addresses = ["[email protected]"]
  }
}

Networking Patterns

VPC Design

# Standard VPC layout
# /16 VPC = 65,536 IPs
# /20 subnets = 4,096 IPs each

# Public subnets:  10.0.0.0/20, 10.0.16.0/20, 10.0.32.0/20
# Private subnets: 10.0.128.0/20, 10.0.144.0/20, 10.0.160.0/20
# DB subnets:      10.0.200.0/24, 10.0.201.0/24, 10.0.202.0/24

# Security group rules: deny all, allow specific
resource "aws_security_group" "app" {
  vpc_id = module.vpc.vpc_id

  # Allow only ALB traffic
  ingress {
    from_port       = 3000
    to_port         = 3000
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  # Allow outbound to specific services
  egress {
    from_port       = 443
    to_port         = 443
    protocol        = "tcp"
    cidr_blocks     = ["0.0.0.0/0"]  # HTTPS out
  }

  egress {
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [aws_security_group.db.id]
  }
}

# Use VPC endpoints to avoid NAT Gateway costs
resource "aws_vpc_endpoint" "s3" {
  vpc_id       = module.vpc.vpc_id
  service_name = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = module.vpc.private_route_table_ids
}

resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = module.vpc.vpc_id
  service_name        = "com.amazonaws.${var.region}.ecr.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = module.vpc.private_subnet_ids
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}

High Availability and Disaster Recovery

HA Checklist

  • [ ] Multi-AZ for all stateful services (RDS, ElastiCache)
  • [ ] Minimum 3 replicas for stateless services
  • [ ] Auto-scaling configured with proper metrics
  • [ ] Health checks on all load balancer targets
  • [ ] Circuit breakers between services
  • [ ] Retry with exponential backoff for external calls
  • [ ] Dead letter queues for failed messages
  • [ ] Cross-region read replicas for databases

DR Strategies

Strategy RTO RPO Cost Use Case
Backup & Restore Hours Hours $ Dev, non-critical
Pilot Light 10-30 min Minutes $$ Important systems
Warm Standby Minutes Seconds $$$ Business-critical
Active-Active Near-zero Near-zero $$$$ Mission-critical

Security Best Practices

# Least privilege IAM
resource "aws_iam_role" "app" {
  name = "myapp-task-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Principal = { Service = "ecs-tasks.amazonaws.com" }
      Action = "sts:AssumeRole"
    }]
  })
}

# Specific permissions, not wildcards
resource "aws_iam_role_policy" "app" {
  role = aws_iam_role.app.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "s3:GetObject",
          "s3:PutObject"
        ]
        Resource = "${aws_s3_bucket.data.arn}/*"
      },
      {
        Effect = "Allow"
        Action = ["sqs:SendMessage", "sqs:ReceiveMessage", "sqs:DeleteMessage"]
        Resource = aws_sqs_queue.tasks.arn
      }
    ]
  })
}

# Encryption everywhere
resource "aws_s3_bucket_server_side_encryption_configuration" "data" {
  bucket = aws_s3_bucket.data.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.data.arn
    }
    bucket_key_enabled = true  # Reduce KMS costs
  }
}

# Block public access by default
resource "aws_s3_bucket_public_access_block" "data" {
  bucket                  = aws_s3_bucket.data.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

Decision Framework

When to Use Serverless (Lambda/Cloud Functions)

Good fit: Event-driven, sporadic traffic, <15 min execution, <10GB memory
Bad fit: Long-running, steady high traffic, GPU/ML, complex state

When to Use Containers (ECS/EKS/Cloud Run)

Good fit: Microservices, steady traffic, need full runtime control, >15 min tasks
Bad fit: Simple event handlers, batch jobs < 15 min

When to Use VMs (EC2/GCE)

Good fit: Legacy apps, specific OS requirements, GPU workloads, licensing constraints
Bad fit: New stateless services, variable traffic

Managed vs. Self-Hosted

Use Managed When Self-Host When
Standard use case Unique requirements
Team is small Deep expertise in-house
Availability matters Cost is dominant concern at scale
Compliance required Need full control

Adapted from buildwithclaude by Dave Poon (MIT)

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents:

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.