npx skills add nik-kale/sre-skills --skill "kubernetes-troubleshooting"
Install specific skill from multi-skill repository
# Description
Systematic debugging workflows for Kubernetes issues including pod failures, resource problems, and networking. Use when debugging CrashLoopBackOff, OOMKilled, ImagePullBackOff, pod not starting, k8s issues, or any Kubernetes troubleshooting.
# SKILL.md
name: kubernetes-troubleshooting
description: Systematic debugging workflows for Kubernetes issues including pod failures, resource problems, and networking. Use when debugging CrashLoopBackOff, OOMKilled, ImagePullBackOff, pod not starting, k8s issues, or any Kubernetes troubleshooting.
Kubernetes Troubleshooting
Systematic approach to debugging Kubernetes issues.
When to Use This Skill
- Pod stuck in CrashLoopBackOff
- OOMKilled errors
- ImagePullBackOff failures
- Pod not starting or scheduling
- Service connectivity issues
- Resource constraint problems
Quick Diagnostic Commands
Start with these commands to understand the current state:
# Cluster overview
kubectl get nodes
kubectl get pods -A | grep -v Running
# Specific namespace
kubectl get pods -n <namespace>
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
# Resource usage
kubectl top nodes
kubectl top pods -n <namespace>
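Note that `grep -v Running` also lists `Completed` job pods and hides pods that are `Running` but not fully ready. A minimal sketch of a stricter filter follows, assuming the default column layout of `kubectl get pods -A` (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE). The function name `unhealthy_pods` is hypothetical, not part of kubectl.

```shell
# Hypothetical helper: reads `kubectl get pods -A` output on stdin and keeps
# the header plus every pod that is not fully healthy.
# Assumes the default columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  awk '
    NR == 1 { print; next }            # keep the header row
    {
      split($3, ready, "/")            # READY looks like "1/2"
      if ($4 == "Completed") next      # finished job pods are fine
      if ($4 != "Running" || ready[1] != ready[2]) print
    }
  '
}
```

Usage: `kubectl get pods -A | unhealthy_pods`.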
Pod Debugging Workflow
Step 1: Check Pod Status
kubectl get pod <pod-name> -n <namespace> -o wide
kubectl describe pod <pod-name> -n <namespace>
Look for:
- Status: What state is the pod in?
- Conditions: Ready, ContainersReady, PodScheduled
- Events: Recent events at the bottom of describe output
Step 2: Identify the Problem Category
| Symptom | Likely Cause | Go To Section |
|---|---|---|
| Pending | Scheduling issue | Scheduling Issues |
| CrashLoopBackOff | Application crash | CrashLoopBackOff |
| ImagePullBackOff | Image/registry issue | Image Pull Issues |
| OOMKilled | Memory exhaustion | OOMKilled |
| Running but not Ready | Health check failing | Readiness Issues |
| Error | Container error | Container Errors |
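The table above can be sketched as a small lookup, which is handy when triaging many pods in a loop. This is an illustration only: `next_section` is a hypothetical helper, and the optional second argument is the READY column (for example `0/1`) used to detect the "Running but not Ready" case.

```shell
# Hypothetical helper: maps a pod status (and optionally its READY column)
# to the section of this guide to read next.
next_section() {
  status=$1
  ready=${2:-}
  case "$status" in
    Pending)                       echo "Scheduling Issues" ;;
    CrashLoopBackOff)              echo "CrashLoopBackOff" ;;
    ImagePullBackOff|ErrImagePull) echo "Image Pull Issues" ;;
    OOMKilled)                     echo "OOMKilled" ;;
    Error)                         echo "Container Errors" ;;
    Running)
      # "0/1" means zero of one containers are ready
      if [ -n "$ready" ] && [ "${ready%/*}" != "${ready#*/}" ]; then
        echo "Readiness Issues"
      else
        echo "No issue detected"
      fi ;;
    *)                             echo "Check describe output and events" ;;
  esac
}
```

Usage: `next_section Running 0/1` prints `Readiness Issues`.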
Common Issues
Scheduling Issues
Pod stuck in Pending state.
Diagnostic:
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Events
Common Causes:
| Event Message | Cause | Fix |
|---|---|---|
| Insufficient cpu/memory | Not enough resources | Add nodes or reduce requests |
| node(s) had taints | Node taints | Add tolerations or remove taints |
| no nodes available | No matching nodes | Check node selector/affinity |
| persistentvolumeclaim not found | PVC missing | Create the PVC |
Fix Resource Issues:
# Check resource requests vs available
kubectl describe nodes | grep -A 5 "Allocated resources"
# Check pending pod requests
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 10 resources
CrashLoopBackOff
Container keeps crashing and restarting.
Diagnostic:
# Check container logs (current)
kubectl logs <pod-name> -n <namespace>
# Check previous container logs
kubectl logs <pod-name> -n <namespace> --previous
# Check exit code
kubectl describe pod <pod-name> -n <namespace> | grep -A 3 "Last State"
Common Exit Codes:
| Exit Code | Meaning | Common Cause |
|---|---|---|
| 0 | Success | Process exited cleanly (unexpected for a long-running service) |
| 1 | Application error | Check application logs |
| 137 | SIGKILL (often OOM) | Memory limit exceeded; confirm with the OOMKilled reason |
| 139 | SIGSEGV | Segmentation fault |
| 143 | SIGTERM | Graceful termination |
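Exit codes above 128 follow the shell convention of 128 plus the signal number, which is why 137, 139, and 143 correspond to signals 9, 11, and 15. A minimal sketch that decodes a code this way (the function name `explain_exit_code` is hypothetical):

```shell
# Hypothetical helper: turns a container exit code into a short explanation.
# Codes above 128 are decoded as 128 + signal number.
explain_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "exited cleanly"
  elif [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case "$sig" in
      9)  name=SIGKILL ;;
      11) name=SIGSEGV ;;
      15) name=SIGTERM ;;
      *)  name="signal $sig" ;;
    esac
    echo "terminated by $name"
  else
    echo "application error (exit $code)"
  fi
}
```

The code itself can be read with `kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'` and passed to the function.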
Common Fixes:
- Check application logs for startup errors
- Verify environment variables and secrets
- Check if dependencies are available
- Verify resource limits aren't too restrictive
Image Pull Issues
ImagePullBackOff or ErrImagePull.
Diagnostic:
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 Events
Common Causes:
| Error | Cause | Fix |
|---|---|---|
| repository does not exist | Wrong image name | Fix image name/tag |
| unauthorized | Auth failure | Check imagePullSecrets |
| manifest unknown | Tag doesn't exist | Verify tag exists |
| connection refused | Registry unreachable | Check network/firewall |
Fix Registry Auth:
# Create image pull secret
kubectl create secret docker-registry regcred \
  --docker-server=<registry> \
  --docker-username=<user> \
  --docker-password=<password> \
  -n <namespace>
# Reference in pod spec
spec:
  imagePullSecrets:
    - name: regcred
OOMKilled
Container killed due to memory exhaustion.
Diagnostic:
kubectl describe pod <pod-name> -n <namespace> | grep -i oom
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 5 lastState
Fix Options:
- Increase the memory limit (if the node has spare capacity):
resources:
  limits:
    memory: '512Mi' # Increase this
  requests:
    memory: '256Mi'
- Profile memory usage:
kubectl top pod <pod-name> -n <namespace> --containers
- Check for memory leaks in application code
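To judge how close a container runs to its limit, the usage reported by `kubectl top` can be compared with the configured limit. The sketch below assumes whole-number quantities with the binary suffixes `Ki`, `Mi`, or `Gi` only; both function names are hypothetical.

```shell
# Hypothetical helper: converts a memory quantity such as 512Mi or 1Gi to MiB.
# Only integer values with Ki/Mi/Gi suffixes are handled in this sketch.
to_mib() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *Ki) echo $(( ${1%Ki} / 1024 )) ;;
    *)   echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: prints usage as an integer percentage of the limit.
mem_usage_pct() {
  used=$(to_mib "$1") || return 1
  limit=$(to_mib "$2") || return 1
  echo $(( used * 100 / limit ))
}
```

Usage: `mem_usage_pct 480Mi 512Mi` prints `93`. A value that stays near 100 suggests raising the limit or investigating a leak.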
Readiness Issues
Pod is Running but not Ready.
Diagnostic:
# Check readiness probe
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Readiness
# Check probe endpoint manually
kubectl exec <pod-name> -n <namespace> -- wget -qO- localhost:<port>/health
Common Causes:
- Application not listening on expected port
- Readiness endpoint returning non-200
- Probe timeout too short
- Dependencies not available
Fix Readiness Probe:
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10 # Give app time to start
  periodSeconds: 5
  timeoutSeconds: 3 # Increase if needed
  failureThreshold: 3
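To reproduce the probe's behavior by hand, a check can be retried at the same period and for the same number of attempts as `periodSeconds` and `failureThreshold`. This is a rough manual approximation, not how the kubelet is implemented; `probe_until_ready` is a hypothetical helper.

```shell
# Hypothetical helper: runs a command up to <attempts> times, waiting
# <period> seconds between tries, and reports when it first succeeds.
# usage: probe_until_ready <attempts> <period_seconds> <command...>
probe_until_ready() {
  attempts=$1
  period=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$period"            # wait before the next try
    fi
    i=$((i + 1))
  done
  echo "not ready after $attempts attempt(s)"
  return 1
}
```

Usage: `probe_until_ready 3 5 kubectl exec <pod-name> -n <namespace> -- wget -qO- localhost:<port>/health`.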
Container Errors
Diagnostic:
# Get detailed container status
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[*]}'
# Check init containers
kubectl logs <pod-name> -n <namespace> -c <init-container-name>
Networking Troubleshooting
Service Not Reachable
# Check service endpoints
kubectl get endpoints <service-name> -n <namespace>
# Check service selector matches pod labels
kubectl get svc <service-name> -n <namespace> -o yaml | grep selector -A 5
kubectl get pods -n <namespace> --show-labels
# Test connectivity from another pod
kubectl run debug --rm -it --restart=Never --image=busybox -n <namespace> -- wget -qO- <service-name>:<port>
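An empty endpoints list usually means the service selector does not match any pod labels. The comparison can be sketched as below, assuming both the selector and the labels are given as comma-separated `key=value` pairs (the format printed by `--show-labels`); `selector_matches` is a hypothetical helper.

```shell
# Hypothetical helper: succeeds when every key=value pair in the selector
# also appears in the pod's label list. Both arguments are comma-separated.
selector_matches() {
  selector=$1
  labels=",$2,"                  # pad so each pair is delimited by commas
  old_ifs=$IFS
  IFS=,
  for pair in $selector; do
    case "$labels" in
      *",$pair,"*) ;;            # this pair is present on the pod
      *) IFS=$old_ifs; echo "missing label: $pair"; return 1 ;;
    esac
  done
  IFS=$old_ifs
  echo "selector matches"
}
```

Usage: `selector_matches 'app=web,tier=frontend' 'app=web'` prints `missing label: tier=frontend`. On a live cluster, `kubectl get pods -n <namespace> -l app=web,tier=frontend` performs the same match server-side.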
DNS Issues
# Check DNS resolution from pod
kubectl exec <pod> -n <namespace> -- nslookup <service-name>
kubectl exec <pod> -n <namespace> -- nslookup <service-name>.<namespace>.svc.cluster.local
# Check CoreDNS is running
kubectl get pods -n kube-system -l k8s-app=kube-dns
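The two `nslookup` forms above differ only in whether the name is fully qualified. A minimal sketch for building the fully qualified name, assuming the default cluster domain `cluster.local` unless a third argument overrides it (`svc_fqdn` is a hypothetical helper):

```shell
# Hypothetical helper: builds the in-cluster DNS name of a service.
# usage: svc_fqdn <service-name> <namespace> [cluster-domain]
svc_fqdn() {
  echo "$1.$2.svc.${3:-cluster.local}"
}
```

Usage: `kubectl exec <pod> -n <namespace> -- nslookup "$(svc_fqdn <service-name> <namespace>)"`. If the short name fails but the full name resolves, the pod's DNS search path is the likely culprit.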
Resource Analysis
Node Pressure
# Check node conditions
kubectl describe nodes | grep -A 5 Conditions
# Check node resource usage
kubectl top nodes
# Find resource-heavy pods
kubectl top pods -A --sort-by=memory | head -20
PVC Issues
# Check PVC status
kubectl get pvc -n <namespace>
# Check PV status
kubectl get pv
# Describe for events
kubectl describe pvc <pvc-name> -n <namespace>
Quick Reference Commands
# Pod debugging
kubectl logs <pod> -n <ns> # Current logs
kubectl logs <pod> -n <ns> --previous # Previous container logs
kubectl logs <pod> -n <ns> -c <container> # Specific container
kubectl logs <pod> -n <ns> --tail=100 -f # Follow logs
# Interactive debugging
kubectl exec -it <pod> -n <ns> -- /bin/sh # Shell into container
kubectl exec <pod> -n <ns> -- env # Check environment
kubectl exec <pod> -n <ns> -- cat /etc/resolv.conf # Check DNS config
# Resource inspection
kubectl get pod <pod> -n <ns> -o yaml # Full pod spec
kubectl describe pod <pod> -n <ns> # Events and status
kubectl get events -n <ns> --sort-by='.lastTimestamp'
# Cluster-wide
kubectl get pods -A | grep -v Running # Non-running pods
kubectl top pods -A --sort-by=cpu # CPU usage
kubectl top pods -A --sort-by=memory # Memory usage