Install this specific skill from the multi-skill repository:

    npx skills add javiermontano-sofka/sdf --skill "sofka-cloud-native-architecture"
# SKILL.md
---
name: sofka-cloud-native-architecture
description: >
  Cloud-native design -- containers, service mesh, serverless, multi-cloud, FinOps.
  Use when the user asks to "design cloud-native architecture", "containerize the application", "evaluate service mesh",
  "plan serverless migration", "implement multi-cloud strategy", "optimize cloud costs", or mentions Kubernetes, Istio, Docker, Helm, Terraform, FinOps, or 12-factor.
model: opus
context: fork
allowed-tools:
  - Read
  - Write
  - Edit
  - Glob
  - Grep
  - Bash
---
# Cloud-Native Architecture: Containers, Mesh, Serverless & Cost Optimization
Cloud-native architecture designs applications to fully exploit cloud platforms -- containers, orchestration, service mesh, serverless, infrastructure as code, and cost-aware engineering. This skill produces architecture documentation guiding teams from cloud readiness assessment through production-grade deployment.
## Guiding Principle
Cloud-native is not "moving to the cloud"; it is designing FOR the cloud. Containers by default, service mesh for observability, serverless where stateless, FinOps built in from day one. The goal is not to consume cloud services but to exploit the cloud's properties: elasticity, resilience, observability, and continuous cost optimization.
## Cloud-Native Philosophy
- Containers by default. Every workload starts as a container. Deviate to serverless (stateless, event-driven) or VMs (legacy without refactoring) only with explicit justification.
- Service mesh for observability. mTLS, traffic management, and distributed tracing are not optional; they are baseline infrastructure for operating microservices in production.
- Serverless where stateless. Serverless functions for event processing, transformations, and glue code. Never for stateful logic or latency-critical paths.
- FinOps built in. Cost is a quality attribute: it is measured, allocated, and optimized. OpenCost from day one.
## Inputs
The user provides a system or platform name as $ARGUMENTS. Parse $1 as the system/platform name used throughout all output artifacts.
Parameters:
- {MODO}: piloto-auto (default) | desatendido | supervisado | paso-a-paso
  - piloto-auto: autonomous for assessment and container strategy; human-in-the-loop for mesh/serverless decisions and FinOps targets.
  - desatendido: zero interruptions; the cloud-native architecture is documented automatically, with all assumptions recorded.
  - supervisado: autonomous, with checkpoints at mesh adoption and multi-cloud decisions.
  - paso-a-paso: confirms every 12-factor assessment, container design, mesh configuration, and FinOps plan.
- {FORMATO}: markdown (default) | html | dual
- {VARIANTE}: ejecutiva (~40%: S1 assessment + S2 container strategy + S6 FinOps) | técnica (all 6 sections, default)
Before generating architecture, detect cloud-native context:
!find . -name "Dockerfile" -o -name "*.yaml" -o -name "*.tf" -o -name "helm" -type d -o -name "serverless.yml" | head -20
If reference materials exist, load them:
Read ${CLAUDE_SKILL_DIR}/references/cloud-native-patterns.md
## When to Use
- Assessing application readiness for cloud-native transformation
- Designing container strategy and Kubernetes architecture
- Evaluating service mesh adoption (Istio, Linkerd, Cilium)
- Making serverless vs. container decisions per workload
- Planning multi-cloud or cloud-agnostic architecture
- Implementing FinOps practices for cost visibility and optimization
## When NOT to Use
- Infrastructure platform design (VPCs, compute, storage) --> use infrastructure-architecture skill
- CI/CD pipelines and supply chain security --> use devsecops-architecture skill
- Application internal structure and patterns --> use software-architecture skill
- Migrating existing workloads to cloud --> use cloud-migration skill
## Delivery Structure: 6 Sections
### S1: Cloud-Native Assessment
Evaluate the application against cloud-native principles to identify gaps and transformation priorities.
12-Factor Compliance Audit:
Rate each factor as compliant / partial / non-compliant with remediation effort (S/M/L):
1. Codebase: one repo, many deploys
2. Dependencies: explicitly declared, isolated
3. Config: stored in environment, not code
4. Backing services: attached resources (DB, cache, queue)
5. Build/release/run: strict stage separation
6. Processes: stateless, share-nothing
7. Port binding: self-contained, export via port
8. Concurrency: scale out via process model
9. Disposability: fast startup (<10s), graceful shutdown (SIGTERM handler, drain connections)
10. Dev/prod parity: minimize environment drift
11. Logs: treat as event streams (stdout)
12. Admin processes: run as one-off tasks
Containerization Readiness Checklist:
- Stateful components identified (database, file storage, sessions)
- External dependency inventory (APIs, queues, caches)
- Configuration externalized to env vars or config maps
- Health check endpoints (liveness + readiness + startup probes)
- Graceful shutdown with connection draining
- Secret management via Vault, AWS Secrets Manager, or sealed-secrets
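The probe and shutdown items above can be sketched as a Kubernetes Deployment fragment. Endpoint paths (`/healthz`, `/ready`), image names, and timings are illustrative assumptions, not prescribed values:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 2
  selector:
    matchLabels: { app: example-app }
  template:
    metadata:
      labels: { app: example-app }
    spec:
      terminationGracePeriodSeconds: 30   # time for in-flight requests to drain
      containers:
        - name: app
          image: registry.example.com/app:1.4.2   # immutable tag, never :latest
          ports:
            - containerPort: 8080
          startupProbe:                   # protects slow-starting apps from the liveness probe
            httpGet: { path: /healthz, port: 8080 }
            failureThreshold: 30
            periodSeconds: 2
          livenessProbe:
            httpGet: { path: /healthz, port: 8080 }
            periodSeconds: 10
          readinessProbe:                 # gate traffic until dependencies are reachable
            httpGet: { path: /ready, port: 8080 }
            periodSeconds: 5
          lifecycle:
            preStop:                      # let the endpoint controller deregister the pod first
              exec: { command: ["sleep", "5"] }
          envFrom:
            - configMapRef: { name: example-app-config }   # config externalized (factor 3)
```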
### S2: Container & Orchestration Strategy
Container Image Best Practices:
- Base images: distroless (Google) or Alpine (<5MB). Never use latest tag.
- Multi-stage builds: separate build and runtime layers. Final image <100MB target.
- Image scanning in CI: Trivy (CNCF, free), Grype, or Snyk Container. Block Critical/High CVEs.
- Registry: private, image signing (cosign/Sigstore), tag immutability enforced.
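A CI gate for the scanning requirement above might look like this GitHub Actions fragment using the Trivy action; the action version and registry name are assumptions, so adapt them to your pipeline:

```yaml
jobs:
  image-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build image
        run: docker build -t registry.example.com/app:${{ github.sha }} .
      - name: Scan with Trivy, failing on Critical/High CVEs
        uses: aquasecurity/trivy-action@0.28.0
        with:
          image-ref: registry.example.com/app:${{ github.sha }}
          severity: CRITICAL,HIGH
          exit-code: "1"   # non-zero exit blocks the pipeline
```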
Kubernetes Architecture:
- Cluster topology: single-cluster per env (simple) vs. multi-cluster per region (HA/DR).
- Namespace strategy: per-team or per-service. Enforce with NetworkPolicies + RBAC.
- Pod design: sidecar for cross-cutting (logging, proxy), init containers for bootstrapping, PodDisruptionBudgets for availability.
Resource Request/Limit Guidance:
| Resource | Request | Limit | Rationale |
|---|---|---|---|
| CPU | Set at P95 usage (from VPA data) | Omit or set 5x request | Avoids CPU throttling; burst on idle cores |
| Memory | Set at P95 usage | Set at 1.5-2x request | OOMKill preferred over node instability |
| Ephemeral storage | Set if logs/cache grow | Set at 2x request | Prevents eviction |
- Start with VPA in recommend-only mode for 7+ days before tuning.
- Default LimitRange guardrails: 100m CPU / 128Mi request, 500m CPU / 512Mi limit.
- Overcommit ratio: 1.2-1.5x for CPU (safe burst), 1.0x for memory (no overcommit).
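The guidance above can be expressed as a namespace LimitRange plus per-container resources. The concrete values are placeholders standing in for hypothetical VPA P95 data; note the omitted CPU limit (burst on idle cores) and the memory limit at ~2x request:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-guardrails
  namespace: team-a
spec:
  limits:
    - type: Container
      defaultRequest: { cpu: 100m, memory: 128Mi }   # applied when a pod omits requests
      default: { cpu: 500m, memory: 512Mi }          # applied when a pod omits limits
---
apiVersion: v1
kind: Pod
metadata:
  name: example-app
  namespace: team-a
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.4.2
      resources:
        requests: { cpu: 250m, memory: 256Mi }   # P95 from VPA recommendations
        limits: { memory: 512Mi }                # no CPU limit: avoids throttling
```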
Node Autoscaling Decision Matrix:
| Tool | Mechanism | Provision Speed | Best For |
|---|---|---|---|
| Cluster Autoscaler | ASG-based, node group templates | 2-5 min | Homogeneous workloads, simple setups |
| Karpenter (AWS) | Direct EC2 API, right-sized nodes | 30-60s | Heterogeneous workloads, spot optimization, cost-sensitive |
| GKE NAP | GKE-native, auto node pools | 1-2 min | GKE clusters, managed simplicity |
- Prefer Karpenter on EKS for new deployments: faster provisioning, better bin-packing, native spot/OD mix, consolidation (replaces underutilized nodes automatically).
- Use Cluster Autoscaler only when Karpenter is unavailable (AKS, on-prem) or organizational policy requires ASG-based scaling.
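A Karpenter NodePool mixing spot and on-demand capacity with consolidation enabled could be sketched as below. Field names follow the `karpenter.sh/v1` schema at the time of writing; verify against your installed Karpenter version:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]   # Karpenter prefers spot when available
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized   # replace underutilized nodes
  limits:
    cpu: "200"   # cap total provisioned CPU for this pool
```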
Pod Autoscaling Decision Matrix:
| Tool | Trigger | Scales To Zero | Best For |
|---|---|---|---|
| HPA | CPU/memory/custom metrics | No | HTTP traffic, steady load spikes |
| VPA | Historical usage analysis | N/A | Right-sizing, legacy apps (recommend-only mode safe) |
| KEDA | External events (Kafka lag, SQS, cron, Prometheus) | Yes | Queue workers, batch jobs, event-driven |
- Never combine VPA and HPA on the same metric.
- Combination pattern: KEDA for async + HPA for HTTP + VPA recommend-only + Karpenter for nodes.
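The KEDA leg of that combination pattern, scaling a queue worker to zero on Kafka consumer lag, might look like this; broker address, topic, and threshold are illustrative assumptions:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-worker
spec:
  scaleTargetRef:
    name: orders-worker        # the Deployment to scale
  minReplicaCount: 0           # scale to zero when the queue is empty
  maxReplicaCount: 20
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.example.svc:9092
        consumerGroup: orders-worker
        topic: orders
        lagThreshold: "100"    # target lag per replica
```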
GitOps Deployment:
- ArgoCD or Flux for declarative, auditable, rollback-capable deployments.
- Helm charts with values-per-environment. OCI registry for chart storage.
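An ArgoCD Application tying this together, pulling a Helm chart with per-environment values, could be sketched as follows; the repository URL and chart path are placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-app-prod
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/charts.git
    path: charts/example-app
    targetRevision: main
    helm:
      valueFiles:
        - values-prod.yaml     # values-per-environment
  destination:
    server: https://kubernetes.default.svc
    namespace: example-app
  syncPolicy:
    automated:
      prune: true              # delete resources removed from git
      selfHeal: true           # revert out-of-band cluster changes
```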
### S3: Service Mesh & Networking
Gateway API vs. Ingress (2025-2026):
| Aspect | Ingress (Legacy) | Gateway API (Standard) |
|---|---|---|
| Status | Ingress-NGINX retiring March 2026 | GA, v1.2+, CNCF standard |
| Role model | Single resource, annotation-heavy | HTTPRoute, GRPCRoute, TCPRoute (role-oriented) |
| TLS | Annotation-based | First-class TLSRoute |
| Multi-tenancy | Weak | Built-in (Gateway per team, shared GatewayClass) |
| Implementations | NGINX, HAProxy | Envoy Gateway, Cilium, Istio, Kong, Contour |
- Migrate all new clusters to Gateway API. Existing Ingress: plan migration before NGINX retirement.
- Recommended implementations: Envoy Gateway (highest conformance), Cilium Gateway (if already using Cilium CNI).
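The role-oriented model above splits ownership: the platform team owns a shared Gateway, application teams attach HTTPRoutes from their own namespaces. A minimal sketch, with placeholder hostnames and an assumed Envoy Gateway installation:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: shared-gateway
  namespace: infra
spec:
  gatewayClassName: envoy-gateway   # assumes Envoy Gateway is installed
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      tls:
        certificateRefs:
          - name: wildcard-example-com
      allowedRoutes:
        namespaces:
          from: All               # multi-tenancy: teams attach routes from their namespaces
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: example-app
  namespace: example-app
spec:
  parentRefs:
    - name: shared-gateway
      namespace: infra
  hostnames: ["app.example.com"]
  rules:
    - backendRefs:
        - name: example-app
          port: 8080
```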
CNI & Service Mesh Comparison:
| Tool | Type | Data Plane | Resource Overhead | Best For |
|---|---|---|---|---|
| Cilium | CNI + mesh + observability | eBPF (kernel) | Lowest (no sidecar for L3/L4) | Teams wanting unified networking + mesh + observability |
| Calico | CNI + network policy | iptables or eBPF | Low | Network policy enforcement, simple CNI |
| Istio Ambient | Mesh (L4 ztunnel + L7 waypoint) | Per-node + per-namespace | 90% less than sidecar mode | Zero-trust mTLS at scale, new deployments |
| Istio Sidecar | Mesh (Envoy per pod) | Per-pod sidecar | ~50-100MB/sidecar | Complex L7 traffic management |
| Linkerd | Mesh (Rust proxy) | Per-pod sidecar (~10MB) | Very low | Teams wanting simplicity over features |
- Decision rule: <10 services with simple patterns = no mesh. Need mTLS only = Cilium or Istio Ambient. Need L7 traffic management = Istio Sidecar or Linkerd. Already using Cilium CNI = Cilium Service Mesh.
mTLS & Zero Trust: Mesh-managed short-lived certificates (hours). Service-to-service RBAC, deny-by-default. SPIFFE identities.
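In Istio, mesh-wide STRICT mTLS plus deny-by-default can be sketched with two resources in the root namespace; API versions vary by Istio release, so verify against your installed version:

```yaml
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # root namespace = mesh-wide scope
spec:
  mtls:
    mode: STRICT            # reject plaintext service-to-service traffic
---
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: deny-all
  namespace: istio-system
spec: {}                    # empty spec matches nothing, so all requests are denied
                            # until narrower ALLOW policies are added per service
```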
Traffic Management: Canary (gradual shift), blue-green (instant), A/B (header/weight). Circuit breaking, rate limiting, retry/timeout (idempotent operations only).
Observability Stack:
- Distributed tracing: OpenTelemetry (CNCF standard) with Jaeger or Grafana Tempo.
- Metrics: Prometheus + Grafana. RED metrics per service (Rate, Errors, Duration).
- eBPF-based observability (zero-instrumentation): Cilium Hubble (network flows), Pixie (auto-instrumented traces), Tetragon (security events). Use for polyglot environments where manual instrumentation is impractical.
### S4: Serverless Decision Framework
Decision Matrix:
| Factor | Favor Serverless | Favor Containers |
|---|---|---|
| Traffic pattern | Spiky, unpredictable | Steady, predictable |
| Execution time | <15 minutes | Long-running |
| State | Stateless | Stateful |
| Cold start tolerance | Acceptable (100-500ms) | Not acceptable (<50ms) |
| Cost at volume | <1M invocations/month | >10M invocations/month |
| Vendor lock-in | Acceptable | Not acceptable |
Cold Start Mitigation: Provisioned concurrency, smaller packages (tree-shaking, layers), language choice (Go/Rust <100ms, Java/C# 500ms-2s), SnapStart (Java on Lambda), warm-up pings.
State Management: External stores (DynamoDB, Redis, S3). Step Functions / Durable Functions for orchestration. Event-driven decoupling via queues.
Vendor Lock-in Assessment: Abstraction layers (SST, Pulumi, Serverless Framework). Exit cost per component. Prefer open standards (CloudEvents, OpenTelemetry).
### S5: Multi-Cloud & Portability
Strategy Tiers:
- Tier 1: Cloud-agnostic app (Kubernetes, standard APIs). Cost: low. Benefit: portability.
- Tier 2: Portable infrastructure (Terraform, Crossplane). Cost: medium. Benefit: negotiation leverage.
- Tier 3: Active multi-cloud (workloads distributed). Cost: high. Benefit: DR, compliance, best-of-breed.
Abstraction Approaches:
- Kubernetes as portability layer: same manifests across EKS/GKE/AKS.
- Terraform: provider-agnostic HCL modules. State in remote backend.
- Crossplane: Kubernetes-native infrastructure provisioning across clouds.
- Application-level: abstract cloud SDKs behind interfaces (storage, queue, identity).
Cloud-Agnostic Patterns:
- Use open standards: S3-compatible storage (MinIO), OpenTelemetry, OIDC, CloudEvents.
- Data gravity: place compute near data; minimize cross-cloud data transfer.
- Policy-as-code: OPA/Gatekeeper enforced across all clusters.
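A Gatekeeper constraint enforced identically on every cluster might require cost-allocation labels on Deployments. This assumes a K8sRequiredLabels constraint template (as in the Gatekeeper demo library) is installed; the parameter schema depends on the template you use:

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-cost-labels
spec:
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment"]
  parameters:
    labels: ["team", "service", "environment", "cost-center"]
```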
### S6: FinOps Integration
FinOps Tooling Comparison:
| Tool | License | Scope | Unique Value |
|---|---|---|---|
| OpenCost | Open source (CNCF Incubating) | K8s workload costs | Free, Prometheus-native, MCP server for AI-driven cost queries |
| Kubecost | Freemium (backed by IBM) | K8s + cloud costs | Savings recommendations, network cost visibility, enterprise support |
| Vantage | Commercial SaaS | Multi-cloud + SaaS | Unified dashboard across AWS/Azure/GCP/Datadog/Snowflake |
| FOCUS | Open standard (FinOps Foundation) | Billing data format | Normalize billing across providers for consistent reporting |
- Start with OpenCost for Kubernetes-native cost allocation. Add Kubecost for savings recommendations. Use Vantage or CloudHealth for multi-cloud executive dashboards.
- OpenCost 2025: runs without Prometheus (Collector Datasource), MCP server for AI agent cost queries, plugin framework for Datadog/OpenAI/MongoDB Atlas cost monitoring.
Cost Allocation: Namespace/pod-level via OpenCost/Kubecost. Label strategy: team, service, environment, cost-center. Showback reports per team.
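Applied to a workload, that label strategy looks like the sketch below; OpenCost/Kubecost aggregate costs by pod labels, so they must appear on the pod template, not only on the Deployment. All values are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
  labels:
    team: payments
    service: checkout
    environment: prod
    cost-center: cc-1042
spec:
  selector:
    matchLabels: { app: checkout }
  template:
    metadata:
      labels:                  # pod-level labels drive per-pod cost allocation
        app: checkout
        team: payments
        service: checkout
        environment: prod
        cost-center: cc-1042
    spec:
      containers:
        - name: checkout
          image: registry.example.com/checkout:2.1.0
```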
Optimization Levers:
- Rightsizing: VPA recommendations after 2+ weeks of data.
- Spot/preemptible: 60-90% savings for fault-tolerant workloads. Karpenter automates spot/OD mix.
- Scale to zero: KEDA for queue workers, serverless for event-driven.
- Ephemeral environments: spin up per PR, tear down on merge.
- Storage lifecycle: S3 Intelligent Tiering, EBS snapshot cleanup.
- Network: minimize cross-AZ traffic via topology-aware routing.
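Topology-aware routing can be enabled on a Service as sketched below. The `trafficDistribution` field requires a recent Kubernetes version (beta around 1.30); older clusters use the `service.kubernetes.io/topology-mode: Auto` annotation instead:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example-app
spec:
  selector:
    app: example-app
  ports:
    - port: 80
      targetPort: 8080
  trafficDistribution: PreferClose   # prefer endpoints in the caller's zone
```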
Cost Governance:
- Daily cost by team/service/environment dashboard.
- Unit economics: cost per user, cost per transaction.
- Anomaly alerts: >20% daily spike triggers investigation.
- Budget alerts per account, per namespace.
## Trade-off Matrix
| Decision | Enables | Constrains | When to Use |
|---|---|---|---|
| Kubernetes | Portability, scaling, ecosystem | Operational complexity | Polyglot microservices, experienced teams |
| Service Mesh | mTLS, traffic control, observability | Resource overhead, complexity | >10 services, zero-trust required |
| Serverless | Zero ops, pay-per-use | Cold start, vendor lock-in | Event-driven, low-volume, spiky traffic |
| Multi-Cloud | Avoid lock-in, negotiate pricing | Complexity, lowest-common-denominator | Regulatory, negotiation leverage, DR |
| GitOps (ArgoCD) | Auditable, declarative, rollback | Learning curve, git as bottleneck | Kubernetes-native, compliance-driven |
| Spot Instances | 60-90% cost savings | Interruption risk | Stateless, fault-tolerant workloads |
| Karpenter over CA | Faster scaling, better bin-packing | AWS-only (EKS) | EKS clusters with heterogeneous workloads |
| Gateway API over Ingress | Multi-tenancy, role-based, extensible | Newer ecosystem | All new clusters; migrate existing before NGINX retirement |
## Assumptions
- Application is being modernized or built for cloud deployment
- Team has or is developing container and orchestration skills
- Cloud provider(s) selected or shortlisted
- Budget includes cloud-native tooling (mesh, GitOps, cost tools)
## Limits
- Does not design internal application architecture (use software-architecture skill)
- Does not cover infrastructure platform setup (use infrastructure-architecture skill)
- Does not plan CI/CD pipelines (use devsecops-architecture skill)
- FinOps practices require organizational buy-in beyond architecture decisions
## Edge Cases
Monolith Containerization:
Containerize the monolith first (lift-and-shift to container), then decompose. Use strangler fig pattern. Do not attempt simultaneous containerization and decomposition.
Stateful Workloads on Kubernetes:
Use operators (CloudNativePG for PostgreSQL, Strimzi for Kafka). Alternative: managed services outside K8s. Evaluate operational burden vs. portability.
Serverless at Scale (>10M invocations/month):
Model break-even point. Container alternative often cheaper at high volume. Reserved concurrency or Fargate may be more cost-effective.
Regulated Industries:
Service mesh mTLS may be mandatory. Image provenance required (SLSA, Sigstore/cosign). Multi-cloud may be required for data residency. Audit logging at infrastructure layer.
Small Team (<5 developers):
Full K8s + mesh is likely over-engineered. Use managed Kubernetes, skip mesh, use cloud-managed services. Revisit as team grows.
## Validation Gate
Before finalizing delivery, verify:
- [ ] 12-factor compliance gaps identified with remediation plan
- [ ] Container strategy includes security scanning and minimal base images
- [ ] Resource requests/limits configured per guidance table (CPU burst, memory capped)
- [ ] Node autoscaler selected (Karpenter vs. CA) with rationale
- [ ] Service mesh decision justified (adopt vs. defer) with comparison matrix
- [ ] Gateway API adopted for ingress (or migration plan from Ingress documented)
- [ ] Serverless vs. container decision documented per workload
- [ ] FinOps tooling deployed (OpenCost minimum) with cost allocation labels
- [ ] Auto-scaling configured for all stateless workloads (HPA/KEDA + Karpenter)
- [ ] GitOps deployment pipeline in place
## Output Format Protocol
| Format | Default | Description |
|---|---|---|
| markdown | Yes | Rich Markdown + Mermaid diagrams. Token-efficient. |
| html | On demand | Branded HTML (Design System). Visual impact. |
| dual | On demand | Both formats. |
Default output is Markdown with embedded Mermaid diagrams. HTML generation requires explicit {FORMATO}=html parameter.
## Output Artifact
Primary: A-01_Cloud_Native_Architecture (.md by default; .html when {FORMATO}=html) -- Executive summary, 12-factor assessment, container strategy, Kubernetes architecture, service mesh design, serverless decisions, multi-cloud plan, FinOps dashboard.
Secondary: Kubernetes manifest templates, Helm chart structure, service mesh configuration, cost allocation report, 12-factor compliance checklist.
Author: Javier Montaño | Last updated: March 12, 2026
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents. Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.