Install a specific skill from a multi-skill repository:
npx skills add Tomlord1122/tomtom-skill --skill "cloud-architect"
# Description
Cloud architecture expert for Kubernetes, Helm, Terraform, and AWS EKS. Use when designing cloud infrastructure, writing K8s manifests, creating Helm charts, or building Terraform modules.
# SKILL.md
name: cloud-architect
description: Cloud architecture expert for Kubernetes, Helm, Terraform, and AWS EKS. Use when designing cloud infrastructure, writing K8s manifests, creating Helm charts, or building Terraform modules.
Cloud Architecture Expert
Expert assistant for Kubernetes deployments, Helm chart design, Terraform infrastructure as code, and AWS EKS configuration.
Thinking Process
When activated, follow this structured thinking approach to design cloud infrastructure:
Step 1: Requirements Discovery
Goal: Understand the complete infrastructure requirements before designing.
Key Questions to Ask:
- What is the workload type? (stateless API, stateful database, batch processing)
- What is the expected traffic pattern? (steady, spiky, scheduled)
- What are the availability requirements? (99.9%, 99.99%, multi-region)
- What are the data persistence needs? (ephemeral, persistent, backup)
- What are the compliance requirements? (HIPAA, GDPR, SOC2)
- What is the budget constraint?
Actions:
1. Identify all services/applications to be deployed
2. Map dependencies between services
3. Determine resource requirements (CPU, memory, storage)
4. Clarify networking requirements (public, private, VPN)
Decision Point: You should be able to articulate:
- "This workload requires [X] with [Y] availability"
- "The key constraints are [Z]"
Step 2: Architecture Pattern Selection
Goal: Choose the appropriate deployment pattern for the requirements.
Thinking Framework - Match Requirements to Patterns:
| Requirement | Recommended Pattern |
|---|---|
| Simple stateless API | Deployment + HPA + Service |
| Database with persistence | StatefulSet + PVC |
| Background processing | Job / CronJob |
| Event-driven | KEDA with queue triggers |
| Multi-tenant | Namespace isolation |
| High availability | Multi-AZ, PodDisruptionBudget |
| Zero-downtime deploys | Rolling update, blue-green |
Decision Criteria:
- Deployment vs StatefulSet: Is ordering/identity important?
- Ingress vs LoadBalancer: Internal or external traffic?
- HPA vs KEDA: CPU-based or event-based scaling?
Decision Point: Select and justify:
- "I recommend [X] pattern because [Y]"
- "The trade-offs are [Z]"
Step 3: Security Design
Goal: Build security into the architecture from the start.
Thinking Framework - Defense in Depth:
1. Network Level: What can talk to what?
2. Identity Level: Who can do what?
3. Data Level: How is data protected?
Security Checklist:
- [ ] Network Policies: Default deny, explicit allow
- [ ] RBAC: Least privilege service accounts
- [ ] IRSA/Workload Identity: Pod-level cloud permissions
- [ ] Secrets Management: External secrets, sealed secrets, or KMS
- [ ] Pod Security Standards: Restricted or baseline
- [ ] Image Security: Signed images, vulnerability scanning
- [ ] Encryption: In-transit (TLS) and at-rest (KMS)
Decision Point: For each service, answer:
- "What permissions does this service need?"
- "What network access does it require?"
Step 4: High Availability Design
Goal: Ensure the system remains available during failures.
Thinking Framework:
- "What happens when a node fails?"
- "What happens when an AZ goes down?"
- "What happens during deployments?"
HA Checklist:
- [ ] Replicas: Minimum 2 replicas for production
- [ ] Anti-affinity: Spread pods across nodes/zones
- [ ] PodDisruptionBudget: Maintain minimum availability
- [ ] Health Checks: Liveness and readiness probes
- [ ] Graceful Shutdown: preStop hooks, terminationGracePeriodSeconds
- [ ] Multi-AZ Storage: For persistent volumes
Decision Point: Define:
- "Recovery Time Objective (RTO): [X]"
- "Recovery Point Objective (RPO): [Y]"
Step 5: Scaling Strategy
Goal: Design for appropriate scaling behavior.
Thinking Framework:
- "What metric indicates load?" (CPU, memory, queue depth, RPS)
- "How quickly must we scale?"
- "What is the cost implication of over-provisioning?"
Scaling Options:
| Scenario | Solution |
|---|---|
| CPU-bound workload | HPA with CPU target |
| Memory-bound | HPA with memory target |
| Queue-based | KEDA with queue length |
| Traffic-based | HPA with custom metrics |
| Scheduled load | CronJob for scaling |
Capacity Planning:
- Set resource requests based on p50 usage
- Set resource limits based on p99 usage
- Plan for 20-30% headroom
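For the queue-based row in the table, a hedged sketch of a KEDA ScaledObject using the SQS scaler. It assumes KEDA is installed in the cluster; the queue URL, region, target Deployment, and TriggerAuthentication name are placeholders.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
spec:
  scaleTargetRef:
    name: worker                     # Deployment that consumes the queue (placeholder)
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/jobs   # placeholder
        queueLength: "10"            # target messages per replica
        awsRegion: us-east-1
      authenticationRef:
        name: keda-aws-credentials   # assumes a pre-existing TriggerAuthentication
```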
Step 6: Observability Design
Goal: Ensure the system is observable from day one.
Thinking Framework:
- "How do we know if the system is healthy?"
- "How do we debug issues?"
- "How do we track business metrics?"
Observability Checklist:
- [ ] Metrics: Prometheus + Grafana (or CloudWatch)
- [ ] Logs: Structured JSON, centralized aggregation
- [ ] Traces: OpenTelemetry instrumentation
- [ ] Alerts: SLO-based alerting (latency, error rate)
- [ ] Dashboards: Golden signals (latency, traffic, errors, saturation)
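If the Prometheus Operator is the metrics stack (an assumption, not a requirement), scrape configuration is typically declared as a ServiceMonitor; the labels and port name below are illustrative.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
  labels:
    release: prometheus        # must match the Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: myapp               # matches the app's Service labels
  endpoints:
    - port: metrics            # named port on the Service
      path: /metrics
      interval: 30s
```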
Step 7: Cost Optimization
Goal: Design for cost efficiency without sacrificing reliability.
Thinking Framework:
- "Are we right-sized for the workload?"
- "Can we use spot/preemptible for this?"
- "What can be turned off during low traffic?"
Cost Optimization Strategies:
1. Right-size resource requests
2. Use Spot instances for fault-tolerant workloads
3. Implement cluster autoscaler
4. Schedule scale-down for dev/staging
5. Use savings plans for predictable workloads
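As one concrete example of strategy 2, a pod-spec fragment (not a complete manifest) that pins a fault-tolerant worker onto Spot capacity on EKS. It assumes the managed node group was created with `capacity_type = "SPOT"`, which EKS exposes through the `eks.amazonaws.com/capacityType` node label.

```yaml
# Fragment of a Deployment's pod template spec (illustrative).
spec:
  template:
    spec:
      nodeSelector:
        eks.amazonaws.com/capacityType: SPOT
      # If Spot nodes are tainted (a local convention, not an EKS default),
      # add a matching toleration here as well.
```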
Step 8: IaC Structure
Goal: Organize infrastructure code for maintainability.
Thinking Framework:
- "How will this evolve over time?"
- "How do we manage multiple environments?"
- "How do we prevent configuration drift?"
Recommended Structure:
infrastructure/
├── terraform/
│   ├── modules/              # Reusable modules
│   │   ├── eks-cluster/
│   │   ├── networking/
│   │   └── iam/
│   ├── environments/         # Environment configs
│   │   ├── dev/
│   │   ├── staging/
│   │   └── prod/
│   └── global/               # Shared resources
├── helm/
│   └── charts/
│       └── my-app/
└── k8s/
    └── base/                 # Kustomize base
GitOps Principles:
- All changes through Git
- Automated sync (ArgoCD/Flux)
- Drift detection and remediation
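A sketch of what automated sync and drift remediation look like with Argo CD, assuming the repository layout above; the repo URL, path, and namespaces are placeholders.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/infrastructure.git   # placeholder
    targetRevision: main
    path: k8s/base
  destination:
    server: https://kubernetes.default.svc
    namespace: my-app
  syncPolicy:
    automated:
      prune: true        # delete resources removed from Git
      selfHeal: true     # revert manual drift back to the Git state
    syncOptions:
      - CreateNamespace=true
```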
Usage
Validate Helm Chart
bash /mnt/skills/user/cloud-architect/scripts/validate-helm.sh [chart-path] [values-file] [kube-version]
Arguments:
- chart-path - Path to Helm chart directory (default: current directory)
- values-file - Custom values file for validation (optional)
- kube-version - Kubernetes version to validate against (default: 1.28.0)
Examples:
bash /mnt/skills/user/cloud-architect/scripts/validate-helm.sh ./my-chart
bash /mnt/skills/user/cloud-architect/scripts/validate-helm.sh ./my-chart values-prod.yaml 1.29.0
Validate Terraform
bash /mnt/skills/user/cloud-architect/scripts/validate-terraform.sh [tf-dir] [check-format]
Arguments:
- tf-dir - Path to Terraform directory (default: current directory)
- check-format - Check formatting: true/false (default: true)
Examples:
bash /mnt/skills/user/cloud-architect/scripts/validate-terraform.sh
bash /mnt/skills/user/cloud-architect/scripts/validate-terraform.sh ./infrastructure false
Documentation Resources
Official Documentation:
- Kubernetes: https://kubernetes.io/docs/
- Helm: https://helm.sh/docs/
- Terraform: https://developer.hashicorp.com/terraform/docs
- AWS EKS: https://docs.aws.amazon.com/eks/
Architecture Principles
- Infrastructure as Code - All resources trackable and reproducible
- GitOps - Use ArgoCD/Flux for continuous deployment
- Least Privilege - Minimal IAM permissions
- Multi-AZ - High availability design
- Observability - Logging, metrics, tracing from day one
Kubernetes Patterns
Deployment Template
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
  labels:
    app: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: app
          image: myapp:v1.0.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
Helm Chart Structure
my-chart/
├── Chart.yaml
├── values.yaml
├── values-prod.yaml
├── templates/
│   ├── _helpers.tpl
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── ingress.yaml
│   ├── configmap.yaml
│   └── secrets.yaml
└── charts/
values.yaml Pattern
replicaCount: 3

image:
  repository: myapp
  tag: "v1.0.0"
  pullPolicy: IfNotPresent

service:
  type: ClusterIP
  port: 80

ingress:
  enabled: true
  className: nginx
  hosts:
    - host: app.example.com
      paths:
        - path: /
          pathType: Prefix

resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi
Terraform Module Structure
modules/
├── eks-cluster/
│   ├── main.tf
│   ├── variables.tf
│   ├── outputs.tf
│   └── versions.tf
├── networking/
│   ├── vpc.tf
│   ├── subnets.tf
│   └── security-groups.tf
└── iam/
    └── roles.tf
EKS Module Example
module "eks" {
source = "terraform-aws-modules/eks/aws"
version = "~> 19.0"
cluster_name = var.cluster_name
cluster_version = "1.28"
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.private_subnets
eks_managed_node_groups = {
default = {
min_size = 2
max_size = 10
desired_size = 3
instance_types = ["t3.medium"]
}
}
}
Security Best Practices
- [ ] Use IRSA (IAM Roles for Service Accounts)
- [ ] Enable pod security standards
- [ ] Encrypt secrets with KMS
- [ ] Implement network policies
- [ ] Regular security scanning
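For the IRSA item, the binding is a ServiceAccount annotation plus a `serviceAccountName` reference in the pod spec. The role ARN below is a placeholder, and the IAM role's trust policy must reference the cluster's OIDC provider.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: myapp
  namespace: prod
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/myapp-irsa   # placeholder ARN
```

Pods pick this up via `serviceAccountName: myapp` in the Deployment's pod template.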
Present Results to User
When providing cloud architecture solutions:
- Provide complete, deployable code
- Include security configurations
- Estimate cost implications
- Provide validation commands
- Note version-specific features
Troubleshooting
"Pod stuck in Pending"
- Check resource quotas: kubectl describe node
- Verify PVC availability
- Check node selectors/taints
"Helm install fails"
- Validate chart: helm lint
- Check values: helm template . -f values.yaml
- Verify RBAC permissions
"Terraform state conflict"
- Use remote state with locking
- Run terraform init -reconfigure
- Check for concurrent operations
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents.
Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.