npx skills add luiscamaral/k8s-cell-platform-skills --skill "troubleshooting-pods"
Install specific skill from multi-skill repository
# Description
Debugs failing Kubernetes pods across all namespaces and layers. Use for pod crashes, CrashLoopBackOff, OOMKilled, ImagePullBackOff, pending pods, container errors, or when pods are not ready. Provides diagnostic workflows and common solutions.
# SKILL.md
name: troubleshooting-pods
description: Debugs failing Kubernetes pods across all namespaces and layers. Use for pod crashes, CrashLoopBackOff, OOMKilled, ImagePullBackOff, pending pods, container errors, or when pods are not ready. Provides diagnostic workflows and common solutions.
allowed-tools: Read, Glob, Grep, Bash(kubectl:get,describe,logs,exec,top,events)
Pod Troubleshooting
Diagnoses and debugs pod issues across the Kubernetes Cell Platform.
Quick Diagnosis
# Find problematic pods
kubectl get pods -A | grep -Ev "Running|Completed"
# Recent events
kubectl get events -A --sort-by='.lastTimestamp' | tail -20
# Resource pressure
kubectl top nodes
kubectl top pods -A --sort-by=memory | head -20
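The `grep -Ev` filter above drops every line containing `Running`, so it also hides pods that are running but not ready (for example `0/1 Running`). A minimal sketch of a stricter filter follows. It assumes the default column layout of `kubectl get pods -A` (NAMESPACE, NAME, READY, STATUS, ...), and `filter_unhealthy_pods` is a hypothetical helper name, not part of this skill.

```shell
# Hypothetical helper (not part of the skill): reads `kubectl get pods -A`
# output on stdin and prints the header plus every pod that needs attention.
# Assumes the default columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
filter_unhealthy_pods() {
  awk '
    NR == 1 { print; next }          # keep the header row
    $4 == "Completed" { next }       # finished Job pods are expected
    {
      split($3, ready, "/")          # READY column, e.g. "1/2"
      # Report anything that is not Running, or Running but not fully ready.
      if ($4 != "Running" || ready[1] != ready[2]) print
    }
  '
}

# Usage:
#   kubectl get pods -A | filter_unhealthy_pods
```

Parsing the READY column is a design choice for readability; a JSON-based query would be more robust if the column layout ever changes.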
Debugging Workflow
Step 1: Identify the Problem
kubectl get pods -n <namespace> -o wide
kubectl describe pod <pod-name> -n <namespace>
Step 2: Check Events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
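In a busy namespace the full event list can be noisy. The sketch below narrows it to `Warning` events, optionally for a single pod, using the `type` and `involvedObject.name` field selectors. `warning_events` is a hypothetical helper name introduced here for illustration.

```shell
# Hypothetical helper: list only Warning events in a namespace,
# optionally restricted to one pod.
warning_events() {
  we_ns="$1"
  we_pod="${2:-}"
  we_selector="type=Warning"
  if [ -n "$we_pod" ]; then
    we_selector="$we_selector,involvedObject.name=$we_pod"
  fi
  kubectl get events -n "$we_ns" --sort-by='.lastTimestamp' \
    --field-selector "$we_selector"
}

# Usage:
#   warning_events <namespace>
#   warning_events <namespace> <pod-name>
```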
Step 3: Check Logs
# Current container
kubectl logs <pod-name> -n <namespace>
# Previous container (after crash)
kubectl logs <pod-name> -n <namespace> --previous
# Specific container in multi-container pod
kubectl logs <pod-name> -n <namespace> -c <container-name>
Step 4: Interactive Debug
# Execute into running container
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
# Debug with ephemeral container
kubectl debug <pod-name> -n <namespace> --image=busybox -it
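Steps 1 to 3 are read-only, so they can be chained into a single pass before attempting an interactive session. The following is a sketch under that assumption; `triage_pod` is a hypothetical name, and the `--tail=100` limit is an arbitrary choice to keep the output short.

```shell
# Hypothetical helper: run the describe, events and logs steps for one pod.
triage_pod() {
  tp_ns="$1"
  tp_pod="$2"
  if [ -z "$tp_ns" ] || [ -z "$tp_pod" ]; then
    echo "usage: triage_pod <namespace> <pod-name>" >&2
    return 2
  fi

  echo "== Step 1: describe =="
  kubectl describe pod "$tp_pod" -n "$tp_ns"

  echo "== Step 2: events =="
  kubectl get events -n "$tp_ns" --sort-by='.lastTimestamp'

  echo "== Step 3: logs (current) =="
  kubectl logs "$tp_pod" -n "$tp_ns" --tail=100

  echo "== Step 3: logs (previous) =="
  # Fails harmlessly when the container has never restarted.
  kubectl logs "$tp_pod" -n "$tp_ns" --previous --tail=100 || true
}

# Usage:
#   triage_pod <namespace> <pod-name>
```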
Common Error Patterns
CrashLoopBackOff
Symptoms: A container in the pod crashes repeatedly, and the kubelet restarts it with an increasing back-off delay
Diagnosis:
kubectl logs <pod> -n <ns> --previous
kubectl describe pod <pod> -n <ns> | grep -A5 "Last State"
Common Causes:
- Application error (check logs)
- Missing config/secrets
- Liveness probe failing
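The exit code shown under "Last State" often narrows down which of these causes applies. The sketch below maps common codes to a likely reading, relying on the convention that a code above 128 means the process was killed by signal `code - 128`. `explain_exit_code` is a hypothetical helper, and the hints are rules of thumb rather than definitive diagnoses.

```shell
# Hypothetical helper: translate a container exit code into a likely cause.
explain_exit_code() {
  ec_code="$1"
  case "$ec_code" in
    0)   echo "exited cleanly; the command may not be meant to keep running" ;;
    1)   echo "application error; read the --previous logs" ;;
    126) echo "command found but not executable; check file permissions" ;;
    127) echo "command not found; check the image, command and args" ;;
    137) echo "SIGKILL (128+9); commonly OOMKilled or a forced kill" ;;
    143) echo "SIGTERM (128+15); the container was asked to stop, e.g. after a failed liveness probe" ;;
    *)
      # Non-numeric input makes the test fail and falls through to the else branch.
      if [ "$ec_code" -gt 128 ] 2>/dev/null; then
        echo "killed by signal $((ec_code - 128))"
      else
        echo "application-specific exit code; read the --previous logs"
      fi
      ;;
  esac
}

# Usage (the jsonpath reads the last terminated state of each container):
#   kubectl get pod <pod> -n <ns> \
#     -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}'
#   explain_exit_code 137
```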
ImagePullBackOff
Symptoms: Cannot pull container image
Diagnosis:
kubectl describe pod <pod> -n <ns> | grep -A10 "Events"
Common Causes:
- Wrong image name/tag
- Private registry without credentials
- Network issues
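The first two causes can be checked directly from the pod spec. The sketch below prints each container's image reference and the pull secrets attached to the pod; an empty secrets line for a private registry suggests missing credentials. `show_image_config` is a hypothetical helper name.

```shell
# Hypothetical helper: print image references and pull secrets for a pod.
show_image_config() {
  ic_ns="$1"
  ic_pod="$2"

  echo "Images:"
  kubectl get pod "$ic_pod" -n "$ic_ns" \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.image}{"\n"}{end}'

  echo "Image pull secrets:"
  kubectl get pod "$ic_pod" -n "$ic_ns" \
    -o jsonpath='{.spec.imagePullSecrets[*].name}'
  echo
}

# Usage:
#   show_image_config <ns> <pod>
```

Pull secrets can also come from the pod's service account, so an empty list here is a hint to look there next, not proof that credentials are missing.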
OOMKilled
Symptoms: Container killed by the OOM killer, typically for exceeding its memory limit (exit code 137)
Diagnosis:
kubectl describe pod <pod> -n <ns> | grep -i oom
kubectl top pod <pod> -n <ns>
Solution: Increase the container's memory limit, or reduce the application's memory usage if it grows without bound
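Before raising a limit, it helps to know how close current usage is to it. The sketch below compares a usage figure (as printed by `kubectl top pod`) with a configured limit and prints the percentage used. It handles only whole-number `Ki`, `Mi` and `Gi` quantities, which is a simplifying assumption; `to_kib` and `mem_percent_of_limit` are hypothetical helper names.

```shell
# Hypothetical helpers. Only whole-number Ki/Mi/Gi quantities are supported;
# other Kubernetes quantity forms (e.g. "1.5Gi", "500M") are rejected.
to_kib() {
  mq="$1"
  case "$mq" in
    *Ki) mq_n="${mq%Ki}"; mq_mult=1 ;;
    *Mi) mq_n="${mq%Mi}"; mq_mult=1024 ;;
    *Gi) mq_n="${mq%Gi}"; mq_mult=1048576 ;;
    *)   mq_n=""; mq_mult=0 ;;
  esac
  case "$mq_n" in
    ''|*[!0-9]*) echo "unsupported quantity: $mq" >&2; return 1 ;;
  esac
  echo $(( mq_n * mq_mult ))
}

# Print memory usage as an integer percentage of the limit (rounded down).
mem_percent_of_limit() {
  mp_used=$(to_kib "$1") || return 1
  mp_limit=$(to_kib "$2") || return 1
  [ "$mp_limit" -gt 0 ] || return 1
  echo $(( mp_used * 100 / mp_limit ))
}

# Usage:
#   kubectl get pod <pod> -n <ns> \
#     -o jsonpath='{.spec.containers[*].resources.limits.memory}'
#   mem_percent_of_limit 480Mi 512Mi
```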
Pending
Symptoms: Pod stuck in Pending state
Diagnosis:
kubectl describe pod <pod> -n <ns> | grep -A10 "Events"
Common Causes:
- Insufficient resources (check Karpenter)
- Node selector/affinity mismatch
- PVC not bound
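The "PVC not bound" cause can be confirmed without reading the full pod description. The sketch below filters `kubectl get pvc` output down to claims whose STATUS is not `Bound`. It assumes the default column layout with STATUS in the second column, and `unbound_pvcs` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `kubectl get pvc -n <ns>` output on stdin and
# prints the header plus every claim that is not Bound.
# Assumes the default columns, with STATUS as the second field.
unbound_pvcs() {
  awk 'NR == 1 || $2 != "Bound"'
}

# Usage:
#   kubectl get pvc -n <ns> | unbound_pvcs
#
# Scheduler decisions for the namespace can be listed with:
#   kubectl get events -n <ns> --field-selector reason=FailedScheduling
```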
CreateContainerConfigError
Symptoms: Container cannot start
Diagnosis:
kubectl describe pod <pod> -n <ns> | grep -A5 "Warning"
Common Causes:
- Missing ConfigMap/Secret
- Invalid environment variable reference
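Once the describe output names the ConfigMaps or Secrets the pod expects, their existence can be checked in one pass. The sketch below is illustrative only: `check_refs_exist` is a hypothetical helper, and the resource names passed to it are whatever the pod spec references.

```shell
# Hypothetical helper: report which of the named resources exist.
# Returns non-zero if at least one is missing.
check_refs_exist() {
  cr_ns="$1"
  cr_kind="$2"
  shift 2
  cr_missing=0
  for cr_name in "$@"; do
    if kubectl get "$cr_kind" "$cr_name" -n "$cr_ns" >/dev/null 2>&1; then
      echo "ok: $cr_kind/$cr_name"
    else
      echo "MISSING: $cr_kind/$cr_name"
      cr_missing=1
    fi
  done
  return "$cr_missing"
}

# Usage:
#   check_refs_exist <ns> configmap <name> [<name> ...]
#   check_refs_exist <ns> secret <name> [<name> ...]
```

A resource can exist and still lack the key the pod references, so a clean result here rules out only the missing-object case.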
Memory Files
meta/memory/troubleshooting-history.md - Past issues and solutions
Reference Documentation
reference/common-errors.md - Error pattern catalog
Diagnostic Script
Run scripts/collect-diagnostics.sh <namespace> <pod-name> for comprehensive diagnostics.
Layer-Specific Notes
| Layer | Common Issues |
|---|---|
| L0 (kube-system) | Cilium agent, CoreDNS |
| L1 (metallb, dns) | Speaker pods, external-dns permissions |
| L2 (argocd, kyverno) | Image pull, CRD issues |
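When several layers look unhealthy at once, checking them from L0 upward can help separate root causes from knock-on failures, assuming higher layers depend on lower ones. The sketch below walks the layers in that order. The namespace names are copied literally from the table and may differ from the real ones in a given cluster; `layer_namespaces` and `check_layers` are hypothetical helper names.

```shell
# Hypothetical helpers. Namespace names are taken verbatim from the table
# above and are an assumption; adjust them to the actual cluster.
layer_namespaces() {
  case "$1" in
    L0) echo "kube-system" ;;
    L1) echo "metallb dns" ;;
    L2) echo "argocd kyverno" ;;
    *)  echo "unknown layer: $1" >&2; return 1 ;;
  esac
}

# List pods layer by layer, lowest layer first.
check_layers() {
  for cl_layer in L0 L1 L2; do
    for cl_ns in $(layer_namespaces "$cl_layer"); do
      echo "== $cl_layer / $cl_ns =="
      kubectl get pods -n "$cl_ns" -o wide
    done
  done
}

# Usage:
#   check_layers
```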