Install a specific skill from this multi-skill repository:
npx skills add cosmix/loom --skill "kubernetes"
# Description
Kubernetes deployment, cluster architecture, security, and operations. Includes manifests, Helm charts, RBAC, network policies, troubleshooting, and production best practices. Trigger keywords: kubernetes, k8s, pod, deployment, statefulset, daemonset, service, ingress, configmap, secret, pvc, namespace, helm, chart, kubectl, cluster, rbac, networkpolicy, podsecurity, operator, crd, job, cronjob, hpa, pdb, kustomize.
# SKILL.md
name: kubernetes
description: Kubernetes deployment, cluster architecture, security, and operations. Includes manifests, Helm charts, RBAC, network policies, troubleshooting, and production best practices. Trigger keywords: kubernetes, k8s, pod, deployment, statefulset, daemonset, service, ingress, configmap, secret, pvc, namespace, helm, chart, kubectl, cluster, rbac, networkpolicy, podsecurity, operator, crd, job, cronjob, hpa, pdb, kustomize.
allowed-tools: Read, Grep, Glob, Edit, Write, Bash
Kubernetes
Overview
This skill covers Kubernetes resource configuration, deployment strategies, cluster architecture, security hardening, Helm chart development, and production operations. It helps create production-ready manifests, troubleshoot cluster issues, and implement security best practices.
Agent Delegation
software-engineer (Sonnet) - Writes K8s manifests, implements configs
senior-software-engineer (Opus) - Application architecture on K8s, multi-service design
security-engineer (Opus) - RBAC, NetworkPolicies, PodSecurityStandards, admission control
senior-infrastructure-engineer (Opus) - Cluster architecture AND implementation, operators, CRDs
Instructions
1. Design Resource Architecture
- Plan namespace organization and boundaries
- Define resource requests/limits based on workload
- Design service mesh topology (if applicable)
- Plan for high availability (replicas, affinity, PDBs; see the PodDisruptionBudget sketch after this list)
- Choose workload type (Deployment/StatefulSet/DaemonSet/Job)
- Design storage strategy (PVCs, StorageClasses, ephemeral volumes)
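For the high-availability item above, a minimal PodDisruptionBudget sketch; the name, namespace, and labels mirror the api-server example used later and are illustrative:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server
  namespace: production
spec:
  minAvailable: 2            # keep at least two replicas during voluntary disruptions (node drains, upgrades)
  selector:
    matchLabels:
      app: api-server

maxUnavailable is the alternative knob when you prefer to cap how many pods may be taken down at once.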
2. Create Resource Manifests
- Write Deployments/StatefulSets/DaemonSets with proper lifecycle
- Configure Services (ClusterIP/NodePort/LoadBalancer) and Ingress
- Set up ConfigMaps and Secrets (sealed secrets, external secrets; see the ExternalSecret sketch after this list)
- Define RBAC policies (ServiceAccounts, Roles, RoleBindings)
- Implement NetworkPolicies for pod-to-pod communication
- Configure PodDisruptionBudgets for availability guarantees
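For the secrets item above, one common pattern is the External Secrets Operator syncing entries from a cloud secret manager into a native Secret. A minimal sketch, assuming the operator and a ClusterSecretStore are already installed; the store name and remote key path are placeholders:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: api-secrets
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager      # placeholder ClusterSecretStore
    kind: ClusterSecretStore
  target:
    name: api-secrets              # native Secret created and kept in sync
  data:
    - secretKey: database-url
      remoteRef:
        key: prod/api/database-url # placeholder path in the external store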
3. Develop Helm Charts
- Structure charts (Chart.yaml, values.yaml, templates/)
- Use template functions (_helpers.tpl for common labels)
- Parameterize configurations (replicas, image tags, resources)
- Define dependencies (the dependencies field in Chart.yaml; requirements.yaml only in legacy Helm v2 charts)
- Implement hooks (pre-install, post-upgrade, etc.)
- Test charts locally (helm template, helm lint, helm install --dry-run)
- Package and publish charts to registries
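The local testing and publishing items above map to a few standard Helm commands; the chart path and OCI registry URL are placeholders:

# Static checks and local template rendering
helm lint ./mychart
helm template api-server ./mychart --values ./mychart/values.yaml

# Simulate an install against the cluster without creating resources
helm install api-server ./mychart --dry-run --debug -n production

# Package and push to an OCI registry (Helm 3.8+)
helm package ./mychart
helm push api-server-1.2.0.tgz oci://myregistry.io/charts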
4. Implement Security Best Practices
- Run as non-root user (runAsUser, runAsNonRoot)
- Drop all capabilities (securityContext.capabilities.drop: [ALL])
- Use read-only root filesystems (readOnlyRootFilesystem: true)
- Prevent privilege escalation (allowPrivilegeEscalation: false)
- Implement Pod Security Standards (restricted profile; see the namespace labels after this list)
- Use NetworkPolicies for zero-trust networking
- Scan images for vulnerabilities (integrate Trivy, Snyk)
- Rotate secrets regularly (external secrets operator)
- Audit RBAC permissions (principle of least privilege)
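Pod Security Standards are enforced per namespace via labels read by the built-in Pod Security admission controller (Kubernetes 1.25+); a sketch for the production namespace:

apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted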
5. Configure Observability
- Add Prometheus annotations for scraping
- Configure logging (stdout/stderr, fluentd/loki)
- Implement health probes (liveness, readiness, startup; see the startup probe snippet after this list)
- Export metrics (custom metrics for HPA)
- Set up distributed tracing (OpenTelemetry)
- Configure alerting rules
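The deployment example later in this document shows liveness and readiness probes; for slow-starting containers, a startup probe is added alongside them in the container spec. The path and timings below are illustrative:

startupProbe:
  httpGet:
    path: /health/live
    port: http
  failureThreshold: 30      # up to 30 * 10s = 5 minutes to finish starting
  periodSeconds: 10

While the startup probe has not yet succeeded, liveness and readiness checks are held off, so a slow boot does not trigger restart loops.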
6. Troubleshoot Issues
- Pod crashes: kubectl logs, kubectl describe pod, events (see the kubectl debug sketch after this list)
- Image pull failures: Check ImagePullSecrets, registry auth
- Networking: Test DNS resolution, check NetworkPolicies
- Resource constraints: Check node capacity, pod eviction
- Scheduling failures: Node selectors, taints/tolerations, affinity
- Performance: Use kubectl top, Prometheus metrics
- CrashLoopBackOff: Check startup probes, init containers
- Pending pods: Describe pod for scheduling failures
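For crash loops in minimal or distroless images that have no shell, an ephemeral debug container is often the fastest way in; pod, container, and node names below are placeholders:

# Attach an ephemeral debugging container to a running pod
kubectl debug -it api-server-abc123 -n production --image=busybox:1.36 --target=api

# Debug a node by running a pod on it with host access
kubectl debug node/node-1 -it --image=busybox:1.36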
7. Production Operations
- Rolling updates with proper rollout strategy
- Blue-green deployments (multiple services)
- Canary deployments (traffic splitting via service mesh)
- Backup and restore (Velero for cluster state)
- Disaster recovery planning (multi-region, etcd backups)
- Capacity planning (resource quotas, limit ranges; see the ResourceQuota/LimitRange sketch after this list)
- Cost optimization (right-sizing, spot instances, cluster autoscaler)
- Upgrade planning (test in staging, rolling node upgrades)
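Capacity planning is typically enforced per namespace with a ResourceQuota and a LimitRange; the figures below are placeholders to size per environment:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: production-defaults
  namespace: production
spec:
  limits:
    - type: Container
      default:              # applied when a container sets no limits
        cpu: 500m
        memory: 512Mi
      defaultRequest:       # applied when a container sets no requests
        cpu: 100m
        memory: 128Mi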
Best Practices
- Use Namespaces: Organize resources logically, enable RBAC boundaries
- Set Resource Limits: Prevent resource exhaustion, enable QoS classes
- Health Probes: Configure liveness, readiness, and startup probes
- Rolling Updates: Zero-downtime deployments with maxSurge/maxUnavailable
- Secrets Management: Use external secrets, never hardcode credentials
- Label Everything: Enable filtering, selection, and monitoring
- Use Helm/Kustomize: Template and manage manifests as code
- Security Contexts: Always run as non-root with minimal privileges
- NetworkPolicies: Default deny, explicit allow for zero-trust
- Pod Disruption Budgets: Protect availability during disruptions
- Resource Quotas: Prevent runaway resource consumption per namespace
- Admission Control: Use OPA/Kyverno for policy enforcement
- Immutable Infrastructure: Never SSH into pods, replace instead
- GitOps: Use ArgoCD/Flux for declarative deployment
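As a sketch of the GitOps practice above, an ArgoCD Application that keeps the production namespace in sync with a Git path (the repository URL and path are placeholders):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-server
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/k8s-manifests.git   # placeholder repository
    targetRevision: main
    path: apps/api-server/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual drift in the cluster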
Examples
Example 1: Production Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-server
namespace: production
labels:
app: api-server
version: v1.2.0
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
selector:
matchLabels:
app: api-server
template:
metadata:
labels:
app: api-server
version: v1.2.0
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
spec:
serviceAccountName: api-server
securityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 1000
containers:
- name: api
image: myregistry.io/api-server:v1.2.0
imagePullPolicy: IfNotPresent
ports:
- name: http
containerPort: 8080
protocol: TCP
env:
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: api-secrets
key: database-url
- name: LOG_LEVEL
valueFrom:
configMapKeyRef:
name: api-config
key: log-level
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
livenessProbe:
httpGet:
path: /health/live
port: http
initialDelaySeconds: 15
periodSeconds: 20
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /health/ready
port: http
initialDelaySeconds: 5
periodSeconds: 10
timeoutSeconds: 3
failureThreshold: 3
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
volumeMounts:
- name: tmp
mountPath: /tmp
- name: cache
mountPath: /app/cache
volumes:
- name: tmp
emptyDir: {}
- name: cache
emptyDir:
sizeLimit: 100Mi
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchLabels:
app: api-server
topologyKey: kubernetes.io/hostname
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: ScheduleAnyway
labelSelector:
matchLabels:
app: api-server
Example 2: Service and Ingress
apiVersion: v1
kind: Service
metadata:
name: api-server
namespace: production
labels:
app: api-server
spec:
type: ClusterIP
ports:
- port: 80
targetPort: http
protocol: TCP
name: http
selector:
app: api-server
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: api-server
namespace: production
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/limit-rpm: "100"   # ingress-nginx rate limit: requests per minute per client IP
spec:
  ingressClassName: nginx
tls:
- hosts:
- api.example.com
secretName: api-tls-cert
rules:
- host: api.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: api-server
port:
number: 80
Example 3: ConfigMap and Secret
apiVersion: v1
kind: ConfigMap
metadata:
name: api-config
namespace: production
data:
log-level: "info"
max-connections: "100"
cache-ttl: "3600"
feature-flags: |
{
"new-checkout": true,
"beta-features": false
}
---
apiVersion: v1
kind: Secret
metadata:
name: api-secrets
namespace: production
type: Opaque
stringData:
database-url: "postgresql://user:password@db-host:5432/myapp"
api-key: "super-secret-key"
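Note that a Secret only base64-encodes its values, so manifests like the one above should not be committed with real credentials. In practice the values are injected via sealed/external secrets (see the Instructions section) or created out of band, for example:

# Create the secret from literals or files instead of committing it to Git
kubectl create secret generic api-secrets -n production \
  --from-literal=database-url='postgresql://user:password@db-host:5432/myapp' \
  --from-literal=api-key='super-secret-key'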
Example 4: Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: api-server
namespace: production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: api-server
minReplicas: 3
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 10
periodSeconds: 60
scaleUp:
stabilizationWindowSeconds: 0
policies:
- type: Percent
value: 100
periodSeconds: 15
- type: Pods
value: 4
periodSeconds: 15
selectPolicy: Max
Example 5: Helm Chart Structure
mychart/
  Chart.yaml            # Chart metadata
  values.yaml           # Default configuration values
  templates/
    _helpers.tpl        # Template helpers
    deployment.yaml     # Deployment template
    service.yaml        # Service template
    ingress.yaml        # Ingress template
    configmap.yaml      # ConfigMap template
    secret.yaml         # Secret template
    NOTES.txt           # Post-install notes
  charts/               # Chart dependencies
  .helmignore           # Files to ignore
Chart.yaml:
apiVersion: v2
name: api-server
description: Production API server Helm chart
type: application
version: 1.2.0
appVersion: "1.2.0"
keywords:
- api
- backend
maintainers:
- name: Platform Team
email: [email protected]
dependencies:
- name: postgresql
version: "12.x.x"
repository: "https://charts.bitnami.com/bitnami"
condition: postgresql.enabled
values.yaml:
replicaCount: 3
image:
repository: myregistry.io/api-server
tag: "1.2.0"
pullPolicy: IfNotPresent
service:
type: ClusterIP
port: 80
targetPort: 8080
ingress:
enabled: true
className: nginx
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
hosts:
- host: api.example.com
paths:
- path: /
pathType: Prefix
tls:
- secretName: api-tls-cert
hosts:
- api.example.com
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
autoscaling:
enabled: true
minReplicas: 3
maxReplicas: 10
targetCPUUtilizationPercentage: 70
targetMemoryUtilizationPercentage: 80
postgresql:
enabled: true
auth:
username: apiuser
database: apidb
templates/_helpers.tpl:
{{- define "api-server.name" -}}
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }}
{{- end }}
{{- define "api-server.fullname" -}}
{{- if .Values.fullnameOverride }}
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- $name := default .Chart.Name .Values.nameOverride }}
{{- if contains $name .Release.Name }}
{{- .Release.Name | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }}
{{- end }}
{{- end }}
{{- end }}
{{- define "api-server.labels" -}}
helm.sh/chart: {{ .Chart.Name }}-{{ .Chart.Version }}
app.kubernetes.io/name: {{ include "api-server.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
{{- end }}
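A sketch of how templates/deployment.yaml might consume these helpers and values; only the core fields are shown, and the selector labels are an assumption about how the chart wires labels together rather than part of the original chart:

templates/deployment.yaml (excerpt):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "api-server.fullname" . }}
  labels:
    {{- include "api-server.labels" . | nindent 4 }}
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app.kubernetes.io/name: {{ include "api-server.name" . }}
      app.kubernetes.io/instance: {{ .Release.Name }}
  template:
    metadata:
      labels:
        app.kubernetes.io/name: {{ include "api-server.name" . }}
        app.kubernetes.io/instance: {{ .Release.Name }}
    spec:
      containers:
        - name: api
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          ports:
            - containerPort: {{ .Values.service.targetPort }}
          resources:
            {{- toYaml .Values.resources | nindent 12 }}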
Example 6: NetworkPolicy (Zero-Trust)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: api-server-netpol
namespace: production
spec:
podSelector:
matchLabels:
app: api-server
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: ingress-nginx
- podSelector:
matchLabels:
app: frontend
ports:
- protocol: TCP
port: 8080
egress:
- to:
- podSelector:
matchLabels:
app: postgresql
ports:
- protocol: TCP
port: 5432
- to:
- podSelector:
matchLabels:
k8s-app: kube-dns
namespaceSelector:
matchLabels:
name: kube-system
ports:
- protocol: UDP
port: 53
- to:
- namespaceSelector: {}
ports:
- protocol: TCP
port: 443
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
namespace: production
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
Example 7: RBAC Configuration
apiVersion: v1
kind: ServiceAccount
metadata:
name: api-server
namespace: production
labels:
app: api-server
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: api-server
namespace: production
rules:
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources: ["secrets"]
resourceNames: ["api-secrets", "database-creds"]
verbs: ["get"]
- apiGroups: [""]
resources: ["pods"]
verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: api-server
namespace: production
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: api-server
subjects:
- kind: ServiceAccount
name: api-server
namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: pod-reader
rules:
- apiGroups: [""]
resources: ["pods"]
verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: read-pods-global
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: pod-reader
subjects:
- kind: ServiceAccount
name: monitoring
namespace: observability
Example 8: StatefulSet with Persistent Storage
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: postgresql
namespace: production
spec:
serviceName: postgresql-headless
replicas: 3
selector:
matchLabels:
app: postgresql
template:
metadata:
labels:
app: postgresql
spec:
containers:
- name: postgres
image: postgres:15-alpine
ports:
- containerPort: 5432
name: postgres
env:
- name: POSTGRES_DB
value: myapp
- name: POSTGRES_USER
valueFrom:
secretKeyRef:
name: postgres-creds
key: username
- name: POSTGRES_PASSWORD
valueFrom:
secretKeyRef:
name: postgres-creds
key: password
- name: PGDATA
value: /var/lib/postgresql/data/pgdata
volumeMounts:
- name: data
mountPath: /var/lib/postgresql/data
resources:
requests:
cpu: 250m
memory: 512Mi
limits:
cpu: 1000m
memory: 2Gi
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: fast-ssd
resources:
requests:
storage: 50Gi
---
apiVersion: v1
kind: Service
metadata:
name: postgresql-headless
namespace: production
spec:
clusterIP: None
selector:
app: postgresql
ports:
- port: 5432
targetPort: postgres
name: postgres
Troubleshooting Commands
# Pod debugging
kubectl get pods -n production
kubectl describe pod api-server-abc123 -n production
kubectl logs api-server-abc123 -n production
kubectl logs api-server-abc123 -n production --previous
kubectl logs api-server-abc123 -c container-name -n production
kubectl exec -it api-server-abc123 -n production -- /bin/sh
# Events and status
kubectl get events -n production --sort-by='.lastTimestamp'
kubectl get events -n production --field-selector involvedObject.name=api-server-abc123
# Resource usage
kubectl top nodes
kubectl top pods -n production
kubectl describe node node-1
# Network debugging
kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never -- /bin/bash
kubectl exec -it api-server-abc123 -n production -- nslookup kubernetes.default
kubectl exec -it api-server-abc123 -n production -- curl -v http://other-service
# Configuration
kubectl get configmap api-config -n production -o yaml
kubectl get secret api-secrets -n production -o jsonpath='{.data}'
# Deployments and rollouts
kubectl rollout status deployment/api-server -n production
kubectl rollout history deployment/api-server -n production
kubectl rollout undo deployment/api-server -n production
kubectl rollout restart deployment/api-server -n production
# RBAC debugging
kubectl auth can-i get pods --as=system:serviceaccount:production:api-server -n production
kubectl get rolebindings,clusterrolebindings --all-namespaces -o json | jq '.items[] | select(.subjects[]?.name=="api-server")'
# NetworkPolicy debugging
kubectl describe networkpolicy api-server-netpol -n production
# Helm operations
helm list -n production
helm get values api-server -n production
helm history api-server -n production
helm upgrade api-server ./mychart -n production --values values.yaml
helm rollback api-server 2 -n production
helm uninstall api-server -n production
# Cluster info
kubectl cluster-info
kubectl get nodes -o wide
kubectl get componentstatuses   # deprecated since Kubernetes 1.19; output may be incomplete on newer clusters
kubectl api-resources
kubectl api-versions
Common Issues and Solutions
| Issue | Cause | Solution |
|---|---|---|
| ImagePullBackOff | Registry auth missing or image not found | Check imagePullSecrets, verify image exists |
| CrashLoopBackOff | App crashes on startup | Check logs, verify config, add startup probe |
| Pending pod | Insufficient resources or scheduling constraints | Check node capacity, taints, tolerations, affinity |
| OOMKilled | Memory limit exceeded | Increase memory limit, fix memory leak |
| Service unreachable | Wrong selector or port | Verify selector matches pod labels, check ports |
| DNS resolution fails | CoreDNS issues or NetworkPolicy blocking | Check CoreDNS pods, verify NetworkPolicy egress |
| PVC pending | StorageClass missing or no available volumes | Verify StorageClass, check provisioner |
| RBAC permission denied | ServiceAccount lacks required permissions | Add Role/RoleBinding with necessary verbs |
| Readiness probe failing | App not ready or probe misconfigured | Adjust initialDelaySeconds, check probe endpoint |
| HPA not scaling | Metrics server missing or no resource requests | Install metrics-server, set resource requests |
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents.
Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.