javiermontano-sofka

sofka-cloud-native-architecture

0
0
# Install this skill:
npx skills add javiermontano-sofka/sdf --skill "sofka-cloud-native-architecture"

Install specific skill from multi-skill repository

# Description

>

# SKILL.md


name: sofka-cloud-native-architecture
description: >
Cloud-native design -- containers, service mesh, serverless, multi-cloud, FinOps.
Use when the user asks to "design cloud-native architecture", "containerize the application", "evaluate service mesh",
"plan serverless migration", "implement multi-cloud strategy", "optimize cloud costs", or mentions Kubernetes, Istio, Docker, Helm, Terraform, FinOps, or 12-factor.
model: opus
context: fork
allowed-tools:
- Read
- Write
- Edit
- Glob
- Grep
- Bash


Cloud-Native Architecture: Containers, Mesh, Serverless & Cost Optimization

Cloud-native architecture designs applications to fully exploit cloud platforms -- containers, orchestration, service mesh, serverless, infrastructure as code, and cost-aware engineering. This skill produces architecture documentation guiding teams from cloud readiness assessment through production-grade deployment.

Principio Rector

Cloud-native no es "mover a la nube" — es diseñar PARA la nube. Contenedores por defecto, service mesh para observabilidad, serverless donde stateless, FinOps integral desde el día uno. El objetivo no es usar servicios cloud — es explotar las propiedades de la nube: elasticidad, resiliencia, observabilidad y optimización continua de costos.

Filosofía de Cloud-Native

  1. Containers by default. Todo workload comienza como contenedor. Solo se desvía a serverless (stateless, event-driven) o VM (legacy sin refactor) con justificación explícita.
  2. Service mesh para observabilidad. mTLS, traffic management y distributed tracing no son opcionales — son infraestructura base para operar microservicios en producción.
  3. Serverless where stateless. Funciones serverless para procesamiento de eventos, transformaciones, y glue code. Nunca para lógica stateful o latencia crítica.
  4. FinOps integral. El costo es un atributo de calidad. Se mide, se asigna, se optimiza. OpenCost desde el día uno.

Inputs

The user provides a system or platform name as $ARGUMENTS. Parse $1 as the system/platform name used throughout all output artifacts.

Parameters:
- {MODO}: piloto-auto (default) | desatendido | supervisado | paso-a-paso
- piloto-auto: Auto para assessment y container strategy, HITL para decisiones mesh/serverless y FinOps targets.
- desatendido: Cero interrupciones. Arquitectura cloud-native documentada automáticamente. Supuestos documentados.
- supervisado: Autónomo con checkpoint en mesh adoption y multi-cloud decisions.
- paso-a-paso: Confirma cada 12-factor assessment, container design, mesh config, y FinOps plan.
- {FORMATO}: markdown (default) | html | dual
- {VARIANTE}: ejecutiva (~40% — S1 assessment + S2 container strategy + S6 FinOps) | técnica (full 6 sections, default)

Before generating architecture, detect cloud-native context:

!find . -name "Dockerfile" -o -name "*.yaml" -o -name "*.tf" -o -name "helm" -type d -o -name "serverless.yml" | head -20

If reference materials exist, load them:

Read ${CLAUDE_SKILL_DIR}/references/cloud-native-patterns.md

When to Use

  • Assessing application readiness for cloud-native transformation
  • Designing container strategy and Kubernetes architecture
  • Evaluating service mesh adoption (Istio, Linkerd, Cilium)
  • Making serverless vs. container decisions per workload
  • Planning multi-cloud or cloud-agnostic architecture
  • Implementing FinOps practices for cost visibility and optimization

When NOT to Use

  • Infrastructure platform design (VPCs, compute, storage) --> use infrastructure-architecture skill
  • CI/CD pipelines and supply chain security --> use devsecops-architecture skill
  • Application internal structure and patterns --> use software-architecture skill
  • Migrating existing workloads to cloud --> use cloud-migration skill

Delivery Structure: 6 Sections

S1: Cloud-Native Assessment

Evaluate the application against cloud-native principles to identify gaps and transformation priorities.

12-Factor Compliance Audit:
Rate each factor as compliant / partial / non-compliant with remediation effort (S/M/L):
1. Codebase: one repo, many deploys
2. Dependencies: explicitly declared, isolated
3. Config: stored in environment, not code
4. Backing services: attached resources (DB, cache, queue)
5. Build/release/run: strict stage separation
6. Processes: stateless, share-nothing
7. Port binding: self-contained, export via port
8. Concurrency: scale out via process model
9. Disposability: fast startup (<10s), graceful shutdown (SIGTERM handler, drain connections)
10. Dev/prod parity: minimize environment drift
11. Logs: treat as event streams (stdout)
12. Admin processes: run as one-off tasks

Containerization Readiness Checklist:
- Stateful components identified (database, file storage, sessions)
- External dependency inventory (APIs, queues, caches)
- Configuration externalized to env vars or config maps
- Health check endpoints (liveness + readiness + startup probes)
- Graceful shutdown with connection draining
- Secret management via Vault, AWS Secrets Manager, or sealed-secrets

S2: Container & Orchestration Strategy

Container Image Best Practices:
- Base images: distroless (Google) or Alpine (<5MB). Never use latest tag.
- Multi-stage builds: separate build and runtime layers. Final image <100MB target.
- Image scanning in CI: Trivy (CNCF, free), Grype, or Snyk Container. Block Critical/High CVEs.
- Registry: private, image signing (cosign/Sigstore), tag immutability enforced.

Kubernetes Architecture:
- Cluster topology: single-cluster per env (simple) vs. multi-cluster per region (HA/DR).
- Namespace strategy: per-team or per-service. Enforce with NetworkPolicies + RBAC.
- Pod design: sidecar for cross-cutting (logging, proxy), init containers for bootstrapping, PodDisruptionBudgets for availability.

Resource Request/Limit Guidance:

Resource Request Limit Rationale
CPU Set at P95 usage (from VPA data) Omit or set 5x request Avoids CPU throttling; burst on idle cores
Memory Set at P95 usage Set at 1.5-2x request OOMKill preferred over node instability
Ephemeral storage Set if logs/cache grow Set at 2x request Prevents eviction
  • Start with VPA in recommend-only mode for 7+ days before tuning.
  • Default LimitRange guardrails: 100m CPU / 128Mi request, 500m CPU / 512Mi limit.
  • Overcommit ratio: 1.2-1.5x for CPU (safe burst), 1.0x for memory (no overcommit).

Node Autoscaling Decision Matrix:

Tool Mechanism Provision Speed Best For
Cluster Autoscaler ASG-based, node group templates 2-5 min Homogeneous workloads, simple setups
Karpenter (AWS) Direct EC2 API, right-sized nodes 30-60s Heterogeneous workloads, spot optimization, cost-sensitive
GKE NAP GKE-native, auto node pools 1-2 min GKE clusters, managed simplicity
  • Prefer Karpenter on EKS for new deployments: faster provisioning, better bin-packing, native spot/OD mix, consolidation (replaces underutilized nodes automatically).
  • Use Cluster Autoscaler only when Karpenter is unavailable (AKS, on-prem) or organizational policy requires ASG-based scaling.

Pod Autoscaling Decision Matrix:

Tool Trigger Scales To Zero Best For
HPA CPU/memory/custom metrics No HTTP traffic, steady load spikes
VPA Historical usage analysis N/A Right-sizing, legacy apps (recommend-only mode safe)
KEDA External events (Kafka lag, SQS, cron, Prometheus) Yes Queue workers, batch jobs, event-driven
  • Never combine VPA and HPA on the same metric.
  • Combination pattern: KEDA for async + HPA for HTTP + VPA recommend-only + Karpenter for nodes.

GitOps Deployment:
- ArgoCD or Flux for declarative, auditable, rollback-capable deployments.
- Helm charts with values-per-environment. OCI registry for chart storage.

S3: Service Mesh & Networking

Gateway API vs. Ingress (2025-2026):

Aspect Ingress (Legacy) Gateway API (Standard)
Status Ingress-NGINX retiring March 2026 GA, v1.2+, CNCF standard
Role model Single resource, annotation-heavy HTTPRoute, GRPCRoute, TCPRoute (role-oriented)
TLS Annotation-based First-class TLSRoute
Multi-tenancy Weak Built-in (Gateway per team, shared GatewayClass)
Implementations NGINX, HAProxy Envoy Gateway, Cilium, Istio, Kong, Contour
  • Migrate all new clusters to Gateway API. Existing Ingress: plan migration before NGINX retirement.
  • Recommended implementations: Envoy Gateway (highest conformance), Cilium Gateway (if already using Cilium CNI).

CNI & Service Mesh Comparison:

Tool Type Data Plane Resource Overhead Best For
Cilium CNI + mesh + observability eBPF (kernel) Lowest (no sidecar for L3/L4) Teams wanting unified networking + mesh + observability
Calico CNI + network policy iptables or eBPF Low Network policy enforcement, simple CNI
Istio Ambient Mesh (L4 ztunnel + L7 waypoint) Per-node + per-namespace 90% less than sidecar mode Zero-trust mTLS at scale, new deployments
Istio Sidecar Mesh (Envoy per pod) Per-pod sidecar ~50-100MB/sidecar Complex L7 traffic management
Linkerd Mesh (Rust proxy) Per-pod sidecar (~10MB) Very low Teams wanting simplicity over features
  • Decision rule: <10 services with simple patterns = no mesh. Need mTLS only = Cilium or Istio Ambient. Need L7 traffic management = Istio Sidecar or Linkerd. Already using Cilium CNI = Cilium Service Mesh.

mTLS & Zero Trust: Mesh-managed short-lived certificates (hours). Service-to-service RBAC, deny-by-default. SPIFFE identities.

Traffic Management: Canary (gradual shift), blue-green (instant), A/B (header/weight). Circuit breaking, rate limiting, retry/timeout (idempotent operations only).

Observability Stack:
- Distributed tracing: OpenTelemetry (CNCF standard) with Jaeger or Grafana Tempo.
- Metrics: Prometheus + Grafana. RED metrics per service (Rate, Errors, Duration).
- eBPF-based observability (zero-instrumentation): Cilium Hubble (network flows), Pixie (auto-instrumented traces), Tetragon (security events). Use for polyglot environments where manual instrumentation is impractical.

S4: Serverless Decision Framework

Decision Matrix:

Factor Favor Serverless Favor Containers
Traffic pattern Spiky, unpredictable Steady, predictable
Execution time <15 minutes Long-running
State Stateless Stateful
Cold start tolerance Acceptable (100-500ms) Not acceptable (<50ms)
Cost at volume <1M invocations/month >10M invocations/month
Vendor lock-in Acceptable Not acceptable

Cold Start Mitigation: Provisioned concurrency, smaller packages (tree-shaking, layers), language choice (Go/Rust <100ms, Java/C# 500ms-2s), SnapStart (Java on Lambda), warm-up pings.

State Management: External stores (DynamoDB, Redis, S3). Step Functions / Durable Functions for orchestration. Event-driven decoupling via queues.

Vendor Lock-in Assessment: Abstraction layers (SST, Pulumi, Serverless Framework). Exit cost per component. Prefer open standards (CloudEvents, OpenTelemetry).

S5: Multi-Cloud & Portability

Strategy Tiers:
- Tier 1: Cloud-agnostic app (Kubernetes, standard APIs). Cost: low. Benefit: portability.
- Tier 2: Portable infrastructure (Terraform, Crossplane). Cost: medium. Benefit: negotiation leverage.
- Tier 3: Active multi-cloud (workloads distributed). Cost: high. Benefit: DR, compliance, best-of-breed.

Abstraction Approaches:
- Kubernetes as portability layer: same manifests across EKS/GKE/AKS.
- Terraform: provider-agnostic HCL modules. State in remote backend.
- Crossplane: Kubernetes-native infrastructure provisioning across clouds.
- Application-level: abstract cloud SDKs behind interfaces (storage, queue, identity).

Cloud-Agnostic Patterns:
- Use open standards: S3-compatible storage (MinIO), OpenTelemetry, OIDC, CloudEvents.
- Data gravity: place compute near data; minimize cross-cloud data transfer.
- Policy-as-code: OPA/Gatekeeper enforced across all clusters.

S6: FinOps Integration

FinOps Tooling Comparison:

Tool License Scope Unique Value
OpenCost Open source (CNCF Incubating) K8s workload costs Free, Prometheus-native, MCP server for AI-driven cost queries
Kubecost Freemium (backed by IBM) K8s + cloud costs Savings recommendations, network cost visibility, enterprise support
Vantage Commercial SaaS Multi-cloud + SaaS Unified dashboard across AWS/Azure/GCP/Datadog/Snowflake
FOCUS Open standard (FinOps Foundation) Billing data format Normalize billing across providers for consistent reporting
  • Start with OpenCost for Kubernetes-native cost allocation. Add Kubecost for savings recommendations. Use Vantage or CloudHealth for multi-cloud executive dashboards.
  • OpenCost 2025: runs without Prometheus (Collector Datasource), MCP server for AI agent cost queries, plugin framework for Datadog/OpenAI/MongoDB Atlas cost monitoring.

Cost Allocation: Namespace/pod-level via OpenCost/Kubecost. Label strategy: team, service, environment, cost-center. Showback reports per team.

Optimization Levers:
- Rightsizing: VPA recommendations after 2+ weeks of data.
- Spot/preemptible: 60-90% savings for fault-tolerant workloads. Karpenter automates spot/OD mix.
- Scale to zero: KEDA for queue workers, serverless for event-driven.
- Ephemeral environments: spin up per PR, tear down on merge.
- Storage lifecycle: S3 Intelligent Tiering, EBS snapshot cleanup.
- Network: minimize cross-AZ traffic via topology-aware routing.

Cost Governance:
- Daily cost by team/service/environment dashboard.
- Unit economics: cost per user, cost per transaction.
- Anomaly alerts: >20% daily spike triggers investigation.
- Budget alerts per account, per namespace.


Trade-off Matrix

Decision Enables Constrains When to Use
Kubernetes Portability, scaling, ecosystem Operational complexity Polyglot microservices, experienced teams
Service Mesh mTLS, traffic control, observability Resource overhead, complexity >10 services, zero-trust required
Serverless Zero ops, pay-per-use Cold start, vendor lock-in Event-driven, low-volume, spiky traffic
Multi-Cloud Avoid lock-in, negotiate pricing Complexity, lowest-common-denominator Regulatory, negotiation leverage, DR
GitOps (ArgoCD) Auditable, declarative, rollback Learning curve, git as bottleneck Kubernetes-native, compliance-driven
Spot Instances 60-90% cost savings Interruption risk Stateless, fault-tolerant workloads
Karpenter over CA Faster scaling, better bin-packing AWS-only (EKS) EKS clusters with heterogeneous workloads
Gateway API over Ingress Multi-tenancy, role-based, extensible Newer ecosystem All new clusters; migrate existing before NGINX retirement

Assumptions

  • Application is being modernized or built for cloud deployment
  • Team has or is developing container and orchestration skills
  • Cloud provider(s) selected or shortlisted
  • Budget includes cloud-native tooling (mesh, GitOps, cost tools)

Limits

  • Does not design internal application architecture (use software-architecture skill)
  • Does not cover infrastructure platform setup (use infrastructure-architecture skill)
  • Does not plan CI/CD pipelines (use devsecops-architecture skill)
  • FinOps practices require organizational buy-in beyond architecture decisions

Edge Cases

Monolith Containerization:
Containerize the monolith first (lift-and-shift to container), then decompose. Use strangler fig pattern. Do not attempt simultaneous containerization and decomposition.

Stateful Workloads on Kubernetes:
Use operators (CloudNativePG for PostgreSQL, Strimzi for Kafka). Alternative: managed services outside K8s. Evaluate operational burden vs. portability.

Serverless at Scale (>10M invocations/month):
Model break-even point. Container alternative often cheaper at high volume. Reserved concurrency or Fargate may be more cost-effective.

Regulated Industries:
Service mesh mTLS may be mandatory. Image provenance required (SLSA, Sigstore/cosign). Multi-cloud may be required for data residency. Audit logging at infrastructure layer.

Small Team (<5 developers):
Full K8s + mesh is likely over-engineered. Use managed Kubernetes, skip mesh, use cloud-managed services. Revisit as team grows.


Validation Gate

Before finalizing delivery, verify:

  • [ ] 12-factor compliance gaps identified with remediation plan
  • [ ] Container strategy includes security scanning and minimal base images
  • [ ] Resource requests/limits configured per guidance table (CPU burst, memory capped)
  • [ ] Node autoscaler selected (Karpenter vs. CA) with rationale
  • [ ] Service mesh decision justified (adopt vs. defer) with comparison matrix
  • [ ] Gateway API adopted for ingress (or migration plan from Ingress documented)
  • [ ] Serverless vs. container decision documented per workload
  • [ ] FinOps tooling deployed (OpenCost minimum) with cost allocation labels
  • [ ] Auto-scaling configured for all stateless workloads (HPA/KEDA + Karpenter)
  • [ ] GitOps deployment pipeline in place

Output Format Protocol

Format Default Description
markdown Yes Rich Markdown + Mermaid diagrams. Token-efficient.
html On demand Branded HTML (Design System). Visual impact.
dual On demand Both formats.

Default output is Markdown with embedded Mermaid diagrams. HTML generation requires explicit {FORMATO}=html parameter.

Output Artifact

Primary: A-01_Cloud_Native_Architecture.html -- Executive summary, 12-factor assessment, container strategy, Kubernetes architecture, service mesh design, serverless decisions, multi-cloud plan, FinOps dashboard.

Secondary: Kubernetes manifest templates, Helm chart structure, service mesh configuration, cost allocation report, 12-factor compliance checklist.


Autor: Javier Montaño | Última actualización: 12 de marzo de 2026

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents:

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.