404kidwiz

kubernetes-specialist

# Install this skill:
npx skills add 404kidwiz/claude-supercode-skills --skill "kubernetes-specialist"

# Description

Expert Kubernetes Specialist with deep expertise in container orchestration, cluster management, and cloud-native applications. Proficient in Kubernetes architecture, Helm charts, operators, and multi-cluster management across EKS, AKS, GKE, and on-premises deployments.

# SKILL.md


---
name: kubernetes-specialist
description: "Expert Kubernetes Specialist with deep expertise in container orchestration, cluster management, and cloud-native applications. Proficient in Kubernetes architecture, Helm charts, operators, and multi-cluster management across EKS, AKS, GKE, and on-premises deployments."
---


Kubernetes Specialist

Purpose

Provides Kubernetes orchestration and cloud-native application expertise, with a focus on cluster management and production-grade deployments. Covers Kubernetes architecture, Helm charts, operators, multi-cluster management, and GitOps workflows across EKS, AKS, GKE, and on-premises deployments.

When to Use

  • Designing Kubernetes cluster architecture for production workloads
  • Implementing Helm charts, operators, or GitOps workflows (ArgoCD, Flux)
  • Troubleshooting cluster issues (networking, storage, performance)
  • Planning Kubernetes upgrades or multi-cluster strategies
  • Optimizing resource utilization and cost in Kubernetes environments
  • Setting up service mesh (Istio, Linkerd) and observability
  • Implementing Kubernetes security and RBAC policies

Quick Start

Invoke this skill when:
- Designing Kubernetes cluster architecture for production workloads
- Implementing Helm charts, operators, or GitOps workflows
- Troubleshooting cluster issues (networking, storage, performance)
- Planning Kubernetes upgrades or multi-cluster strategies
- Optimizing resource utilization and cost in Kubernetes environments

Do NOT invoke when:
- Simple Docker container needs (use docker commands directly)
- Cloud infrastructure provisioning (use cloud-architect instead)
- Application code debugging (use backend-developer/frontend-developer)
- Database-specific issues (use database-administrator instead)

Decision Framework

Deployment Strategy Selection

├─ Zero downtime required?
│   ├─ Instant rollback needed → Blue-Green Deployment
│   │   Pros: Instant switch, easy rollback
│   │   Cons: 2x resources during deployment
│   │
│   ├─ Gradual rollout → Canary Deployment
│   │   Pros: Test with subset of traffic
│   │   Cons: Complex routing setup
│   │
│   └─ Simple updates → Rolling Update (default)
│       Pros: Built-in, no extra resources
│       Cons: Rollback takes time
│
├─ Stateful application?
│   ├─ Database → StatefulSet + PVC
│   │   Pros: Stable network IDs, ordered deployment
│   │   Cons: Complex scaling
│   │
│   └─ Stateless → Deployment
│       Pros: Easy scaling, self-healing
│
└─ Batch processing?
    ├─ One-time → Job
    ├─ Scheduled → CronJob
    └─ Parallel processing → Job with parallelism
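As a rough illustration of two branches of the tree above, the sketch below shows a Deployment using the default rolling update strategy and a Job that runs pods in parallel. The names, images, replica counts, and parallelism values are hypothetical placeholders, not recommendations from this skill.

```yaml
# Hypothetical names and images; adjust to your workload.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1         # at most one pod above the desired count during rollout
      maxUnavailable: 0   # never drop below the desired count during rollout
  template:
    metadata:
      labels:
        app: web-api
    spec:
      containers:
        - name: web-api
          image: registry.example.com/web-api:1.2.3   # placeholder image
          ports:
            - containerPort: 8080
---
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-import
spec:
  parallelism: 4     # pods allowed to run at the same time
  completions: 20    # total successful pods required for the Job to finish
  backoffLimit: 3    # retries before the Job is marked failed
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: registry.example.com/batch-import:1.0.0   # placeholder image
```

With a rolling update, `kubectl rollout undo deployment/web-api` reverts to the previous revision, but the rollback is itself a gradual rollout, which is why the tree lists "Rollback takes time" as a drawback.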

Resource Configuration Matrix

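| Workload Type | CPU Request | CPU Limit | Memory Request | Memory Limit |
|---------------|-------------|-----------|----------------|--------------|
| Web API       | 100m-500m   | 1000m     | 256Mi-512Mi    | 1Gi          |
| Worker        | 500m-1000m  | 2000m     | 512Mi-1Gi      | 2Gi          |
| Database      | 1000m-2000m | 4000m     | 2Gi-4Gi        | 8Gi          |
| Cache         | 100m-250m   | 500m      | 1Gi-4Gi        | 8Gi          |
| Batch Job     | 500m-2000m  | 4000m     | 1Gi-4Gi        | 8Gi          |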
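For example, the Web API row could translate into a container spec along these lines. This is a minimal sketch: the pod name and image are hypothetical, and the request values are simply picked from within the ranges in the matrix.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-api            # hypothetical name
spec:
  containers:
    - name: web-api
      image: registry.example.com/web-api:1.2.3   # placeholder image
      resources:
        requests:
          cpu: 250m        # within the 100m-500m range for Web API
          memory: 256Mi    # lower end of the 256Mi-512Mi range
        limits:
          cpu: 1000m
          memory: 1Gi
```

Requests drive scheduling decisions, while limits cap usage at runtime: a container that exceeds its memory limit is OOM-killed, and one that exceeds its CPU limit is throttled.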

Node Pool Strategy

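| Use Case      | Instance Type      | Scaling   | Cost     |
|---------------|--------------------|-----------|----------|
| System pods   | t3.large (3 nodes) | Fixed     | Low      |
| Applications  | m5.xlarge          | Auto 3-20 | Medium   |
| Batch/Spot    | m5.large-2xlarge   | Auto 0-50 | Very Low |
| GPU workloads | p3.2xlarge         | Manual    | High     |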
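Since the instance types above are AWS ones, one way to express the first three pools is an eksctl cluster config. This is only a sketch under that assumption: the source does not name eksctl, and the cluster name, region, node group names, label, and taint key are hypothetical. The GPU pool is left out because it is scaled manually.

```yaml
# Assumes EKS managed node groups created with eksctl; all names are placeholders.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: demo-cluster
  region: us-east-1
managedNodeGroups:
  - name: system
    instanceType: t3.large
    minSize: 3             # fixed size: min, max, and desired are equal
    maxSize: 3
    desiredCapacity: 3
    labels:
      pool: system
    taints:
      - key: CriticalAddonsOnly   # hypothetical taint key that keeps app pods off this pool
        value: "true"
        effect: NoSchedule
  - name: apps
    instanceType: m5.xlarge
    minSize: 3
    maxSize: 20
    desiredCapacity: 3
  - name: batch-spot
    instanceTypes: ["m5.large", "m5.xlarge", "m5.2xlarge"]
    spot: true
    minSize: 0
    maxSize: 50
    desiredCapacity: 0
```

The `minSize` and `maxSize` values only set bounds. Actual scaling within them still needs Cluster Autoscaler or Karpenter running in the cluster.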

Red Flags → Escalate

STOP and escalate if:
- Cluster upgrade involves breaking API changes (deprecated or removed API versions)
- Multi-region active-active requirements
- Compliance requirements (PCI-DSS, HIPAA) need validation
- Custom scheduler or controller development needed
- etcd corruption or cluster state issues

Quality Checklist

Cluster Configuration

  • [ ] Multi-AZ deployment (nodes spread across availability zones)
  • [ ] Node autoscaling configured (Cluster Autoscaler or Karpenter)
  • [ ] System node pool with taints (separate critical addons from apps)
  • [ ] Encryption enabled (secrets at rest with KMS)
  • [ ] Audit logging enabled (API server logs)

Security

  • [ ] Pod Security Standards enforced (restricted or baseline)
  • [ ] Network policies configured (default deny + explicit allow)
  • [ ] RBAC configured (least privilege for all service accounts)
  • [ ] Image scanning enabled (scan for vulnerabilities)
  • [ ] Private container registry configured
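Several of the items above can be sketched as manifests. The example below assumes a hypothetical `demo` namespace, a `web-api` workload with its own service account, and a `frontend` client. It enforces the restricted Pod Security Standard, applies default-deny network policies with one explicit allow rule, and grants a single narrow RBAC permission.

```yaml
# All names, labels, and ports here are hypothetical placeholders.
apiVersion: v1
kind: Namespace
metadata:
  name: demo
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: demo
spec:
  podSelector: {}          # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-web-api
  namespace: demo
spec:
  podSelector:
    matchLabels:
      app: web-api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: configmap-reader
  namespace: demo
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]   # read-only access
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: web-api-configmap-reader
  namespace: demo
subjects:
  - kind: ServiceAccount
    name: web-api
    namespace: demo
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: configmap-reader
```

Two caveats: network policies only take effect when the cluster's CNI plugin enforces them, and a default-deny egress policy also blocks DNS lookups until an explicit allow rule for DNS is added.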

Resource Management

  • [ ] All pods have resource requests and limits
  • [ ] HorizontalPodAutoscalers configured for scalable workloads
  • [ ] PodDisruptionBudgets defined (limit how many pods can be down during voluntary disruptions such as node drains)
  • [ ] ResourceQuotas set per namespace
  • [ ] LimitRanges defined (default limits for pods)
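A minimal sketch of the autoscaling, disruption, quota, and default-limit items above, assuming a hypothetical `web-api` Deployment in a `demo` namespace. All numeric values are illustrative placeholders to be tuned per workload.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api
  namespace: demo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of the CPU request, averaged across pods
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-api
  namespace: demo
spec:
  maxUnavailable: 1        # at most one pod evicted at a time by voluntary disruptions
  selector:
    matchLabels:
      app: web-api
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: demo-quota
  namespace: demo
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: demo-defaults
  namespace: demo
spec:
  limits:
    - type: Container
      defaultRequest:      # applied when a container sets no requests
        cpu: 100m
        memory: 256Mi
      default:             # applied when a container sets no limits
        cpu: 500m
        memory: 512Mi
```

The LimitRange and ResourceQuota work together: once a quota covers CPU and memory, pods that omit those values are rejected unless the LimitRange supplies defaults. CPU-utilization scaling in the HPA also depends on the containers having CPU requests set.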

High Availability

  • [ ] Deployments have ≥2 replicas
  • [ ] Anti-affinity rules prevent pod co-location
  • [ ] Readiness and liveness probes configured
  • [ ] PodDisruptionBudgets leave room for node drains and upgrades (not so strict that they block all evictions)
  • [ ] Multi-region cluster (if global scale required)
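The replica, anti-affinity, and probe items above might look like this in a Deployment. It is a sketch only: the names, image, port, and probe paths (`/ready`, `/healthz`) are hypothetical and depend on what the application actually exposes.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-api
  template:
    metadata:
      labels:
        app: web-api
    spec:
      affinity:
        podAntiAffinity:
          # Soft rule: prefer not to place two replicas on the same node.
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchLabels:
                    app: web-api
      topologySpreadConstraints:
        # Spread replicas across availability zones where possible.
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: web-api
      containers:
        - name: web-api
          image: registry.example.com/web-api:1.2.3   # placeholder image
          ports:
            - containerPort: 8080
          readinessProbe:          # gates traffic until the pod reports ready
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5
          livenessProbe:           # restarts the container if it stops responding
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 10
```

The preferred (soft) anti-affinity rule is used here so that pods can still schedule when the cluster has fewer nodes than replicas. A required rule gives a stricter guarantee but can leave pods pending.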

Observability

  • [ ] Metrics server installed (kubectl top works)
  • [ ] Prometheus monitoring application metrics
  • [ ] Centralized logging (CloudWatch, Elasticsearch, Loki)
  • [ ] Distributed tracing (Jaeger, Tempo)
  • [ ] Dashboards for cluster and application health
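If Prometheus is run through the Prometheus Operator (an assumption; the checklist does not say how Prometheus is deployed), application metrics are usually scraped by declaring a ServiceMonitor. The names, labels, and port name below are hypothetical.

```yaml
# Requires the Prometheus Operator CRDs to be installed in the cluster.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: web-api
  namespace: demo
  labels:
    release: prometheus    # placeholder; must match whatever selector your Prometheus uses
spec:
  selector:
    matchLabels:
      app: web-api         # selects the Service (not the pods) exposing metrics
  endpoints:
    - port: metrics        # name of the Service port serving metrics
      path: /metrics
      interval: 30s
```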

Disaster Recovery

  • [ ] Velero installed for cluster backups
  • [ ] Backup schedule configured (daily minimum)
  • [ ] Restore tested (annual drill)
  • [ ] etcd backups automated (self-managed clusters; managed control planes such as EKS, AKS, and GKE handle etcd for you)
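A daily backup schedule could be declared with a Velero Schedule resource, roughly as follows. This assumes Velero is installed in the `velero` namespace with a backup storage location already configured. The schedule name, backup time, and retention period are placeholders.

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-cluster-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"      # cron syntax: every day at 02:00
  template:
    includedNamespaces:
      - "*"                  # back up all namespaces
    ttl: 720h0m0s            # keep each backup for 30 days
```

A restore drill can then be run with `velero restore create --from-backup <backup-name>`, ideally against a scratch cluster rather than production.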
