kubernetes-troubleshooting

Name: kubernetes-troubleshooting
Author: nik-kale

by @nik-kale in AI & LLM

# Install this skill:

npx skills add nik-kale/sre-skills --skill "kubernetes-troubleshooting"

# Description

Systematic debugging workflows for Kubernetes issues including pod failures, resource problems, and networking. Use when debugging CrashLoopBackOff, OOMKilled, ImagePullBackOff, pod not starting, k8s issues, or any Kubernetes troubleshooting.

# SKILL.md

name: kubernetes-troubleshooting
description: Systematic debugging workflows for Kubernetes issues including pod failures, resource problems, and networking. Use when debugging CrashLoopBackOff, OOMKilled, ImagePullBackOff, pod not starting, k8s issues, or any Kubernetes troubleshooting.

Kubernetes Troubleshooting

Systematic approach to debugging Kubernetes issues.

When to Use This Skill

Pod stuck in CrashLoopBackOff
OOMKilled errors
ImagePullBackOff failures
Pod not starting or scheduling
Service connectivity issues
Resource constraint problems

Quick Diagnostic Commands

Start with these commands to understand the current state:

# Cluster overview
kubectl get nodes
kubectl get pods -A | grep -v Running

# Specific namespace
kubectl get pods -n <namespace>
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20

# Resource usage
kubectl top nodes
kubectl top pods -n <namespace>
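
One caveat with `grep -v Running`: it hides pods that are Running but not Ready, and it lists Completed Job pods as if they were failures. A minimal sketch of a stricter filter follows, assuming the default column layout of `kubectl get pods -A` (NAMESPACE, NAME, READY, STATUS, ...). The function name `unhealthy_pods` is hypothetical, not part of kubectl.

```shell
# Hypothetical helper: reads `kubectl get pods -A` output on stdin and prints
# only pods that are not fully ready, skipping Completed pods.
unhealthy_pods() {
  awk 'NR == 1 { print; next }        # keep the header row
       $4 == "Completed" { next }     # finished Jobs are not failures
       {
         split($3, ready, "/")        # READY column looks like "1/2"
         if ($4 != "Running" || ready[1] != ready[2]) print
       }'
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get pods -A | unhealthy_pods
```

The filter only looks at the READY and STATUS columns, so it works on saved output as well as on a live cluster.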

Pod Debugging Workflow

Step 1: Check Pod Status

kubectl get pod <pod-name> -n <namespace> -o wide
kubectl describe pod <pod-name> -n <namespace>

Look for:

Status: What state is the pod in?
Conditions: Ready, ContainersReady, PodScheduled
Events: Recent events at the bottom of describe output

Step 2: Identify the Problem Category

Symptom	Likely Cause	Go To Section
Pending	Scheduling issue	Scheduling Issues
CrashLoopBackOff	Application crash	CrashLoopBackOff
ImagePullBackOff	Image/registry issue	Image Pull Issues
OOMKilled	Memory exhaustion	OOMKilled
Running but not Ready	Health check failing	Readiness Issues
Error	Container error	Container Errors
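
The table above can be encoded as a small lookup, which is handy when triaging many pods in a loop. This is only a sketch: `next_section` is a hypothetical name, and the status strings are the ones listed in the table plus `ErrImagePull`, which this guide treats as an image pull issue.

```shell
# Hypothetical helper: map a pod STATUS value to the section of this guide
# that covers it.
next_section() {
  case "$1" in
    Pending)                       echo "Scheduling Issues" ;;
    CrashLoopBackOff)              echo "CrashLoopBackOff" ;;
    ImagePullBackOff|ErrImagePull) echo "Image Pull Issues" ;;
    OOMKilled)                     echo "OOMKilled" ;;
    Running)                       echo "Readiness Issues" ;;  # only relevant if the pod is not Ready
    Error)                         echo "Container Errors" ;;
    *)                             echo "unknown status: $1" ;;
  esac
}

# Usage against a live cluster (STATUS is the third column without -A):
#   next_section "$(kubectl get pod <pod-name> -n <namespace> --no-headers | awk '{print $3}')"
```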

Common Issues

Scheduling Issues

Pod stuck in Pending state.

Diagnostic:

kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Events

Common Causes:

Event Message	Cause	Fix
Insufficient cpu/memory	Not enough resources	Add nodes or reduce requests
node(s) had taints	Node taints	Add tolerations or remove taints
node(s) didn't match Pod's node affinity/selector	No node matches the selector or affinity rules	Check nodeSelector and affinity
persistentvolumeclaim not found	PVC missing	Create the PVC

Fix Resource Issues:

# Check resource requests vs available
kubectl describe nodes | grep -A 5 "Allocated resources"

# Check pending pod requests
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 10 resources

CrashLoopBackOff

Container keeps crashing and restarting.

Diagnostic:

# Check container logs (current)
kubectl logs <pod-name> -n <namespace>

# Check previous container logs
kubectl logs <pod-name> -n <namespace> --previous

# Check exit code
kubectl describe pod <pod-name> -n <namespace> | grep -A 3 "Last State"

Common Exit Codes:

Exit Code	Meaning	Common Cause
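0	Success	Process exited cleanly (unexpected for a long-running service)
1	Application error	Check application logs
137	SIGKILL (128 + 9)	Usually memory limit exceeded (OOM); also a forced kill after the grace period
139	SIGSEGV (128 + 11)	Segmentation fault
143	SIGTERM (128 + 15)	Graceful termination requested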
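
Exit codes above 128 follow the shell convention of 128 plus the signal number, so codes not listed in the table can be decoded the same way. A minimal sketch, assuming a POSIX shell whose `kill -l <number>` prints the signal name; `explain_exit_code` is a hypothetical helper name.

```shell
# Hypothetical helper: explain a container exit code.
explain_exit_code() {
  code=$1
  if [ "$code" -gt 128 ]; then
    sig=$((code - 128))                       # 137 -> 9, 143 -> 15
    echo "exit $code: killed by signal $sig ($(kill -l "$sig"))"
  elif [ "$code" -eq 0 ]; then
    echo "exit 0: process exited cleanly"
  else
    echo "exit $code: application error, check logs"
  fi
}

# Usage: pass the exit code shown under "Last State" in the describe output.
#   explain_exit_code 137
```

For 137 this reports signal 9, which is SIGKILL. Whether the kill came from the OOM killer still has to be confirmed from the pod's `Reason` field.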

Common Fixes:

Check application logs for startup errors
Verify environment variables and secrets
Check if dependencies are available
Verify resource limits aren't too restrictive

Image Pull Issues

ImagePullBackOff or ErrImagePull.

Diagnostic:

kubectl describe pod <pod-name> -n <namespace> | grep -A 5 Events

Common Causes:

Error	Cause	Fix
repository does not exist	Wrong image name	Fix image name/tag
unauthorized	Auth failure	Check imagePullSecrets
manifest unknown	Tag doesn't exist	Verify tag exists
connection refused	Registry unreachable	Check network/firewall

Fix Registry Auth:

# Create image pull secret
kubectl create secret docker-registry regcred \
  --docker-server=<registry> \
  --docker-username=<user> \
  --docker-password=<password> \
  -n <namespace>

# Reference in pod spec
spec:
  imagePullSecrets:
  - name: regcred

OOMKilled

Container killed due to memory exhaustion.

Diagnostic:

kubectl describe pod <pod-name> -n <namespace> | grep -i oom
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 5 lastState

Fix Options:

Increase memory limit (if available):

resources:
  limits:
    memory: '512Mi' # Increase this
  requests:
    memory: '256Mi'

Profile memory usage:

kubectl top pod <pod-name> -n <namespace> --containers

Check for memory leaks in application code
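
To judge how close a container runs to its limit, compare the `kubectl top` reading with the configured limit. The sketch below assumes both values are expressed in Mi, as in the example above. `mem_pct` is a hypothetical helper and does not handle other units such as Gi or plain bytes.

```shell
# Hypothetical helper: percentage of a memory limit currently in use.
# Assumes both arguments use the Mi suffix (e.g. 480Mi and 512Mi).
mem_pct() {
  used=${1%Mi}     # strip the unit suffix
  limit=${2%Mi}
  echo $(( used * 100 / limit ))   # integer percentage, rounded down
}

mem_pct 480Mi 512Mi   # prints 93
```

A container that sits above roughly 90% under normal load has little headroom, so either raise the limit or look for a leak.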

Readiness Issues

Pod is Running but not Ready.

Diagnostic:

# Check readiness probe
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Readiness

# Check probe endpoint manually (requires wget in the container image)
kubectl exec <pod-name> -n <namespace> -- wget -qO- localhost:<port>/health

Common Causes:

Application not listening on expected port
Readiness endpoint returning non-200
Probe timeout too short
Dependencies not available

Fix Readiness Probe:

readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10 # Give app time to start
  periodSeconds: 5
  timeoutSeconds: 3 # Increase if needed
  failureThreshold: 3

Container Errors

Diagnostic:

# Get detailed container status
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[*]}'

# Check init containers
kubectl logs <pod-name> -n <namespace> -c <init-container-name>

Networking Troubleshooting

Service Not Reachable

# Check service endpoints
kubectl get endpoints <service-name> -n <namespace>

# Check service selector matches pod labels
kubectl get svc <service-name> -n <namespace> -o yaml | grep selector -A 5
kubectl get pods -n <namespace> --show-labels

# Test connectivity from a temporary pod in the same namespace
kubectl run debug --rm -it --restart=Never --image=busybox -n <namespace> -- wget -qO- <service>:<port>
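
An empty endpoints list usually means the Service selector does not match any pod labels. The comparison can be done by eye from the two commands above, or scripted. The sketch below is a hypothetical helper, `selector_matches`, that takes a selector and a label set in the comma-separated `key=value` form printed by `--show-labels` and checks that every selector pair is present.

```shell
# Hypothetical helper: succeed if every key=value pair in the selector
# appears in the pod's label list. Both arguments are comma-separated.
selector_matches() {
  selector=$1
  labels=",$2,"            # pad with commas so each pair can be matched whole
  old_ifs=$IFS
  IFS=,
  for pair in $selector; do
    case "$labels" in
      *",$pair,"*) ;;                        # this pair is present
      *) IFS=$old_ifs; return 1 ;;           # missing pair: no match
    esac
  done
  IFS=$old_ifs
  return 0
}

# Usage:
#   selector_matches 'app=web' 'app=web,pod-template-hash=abc' && echo match
# On a live cluster, `kubectl get pods -n <namespace> -l app=web` applies the
# same selector directly.
```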

DNS Issues

# Check DNS resolution from pod
kubectl exec <pod> -n <namespace> -- nslookup <service-name>
kubectl exec <pod> -n <namespace> -- nslookup <service-name>.<namespace>.svc.cluster.local

# Check CoreDNS is running
kubectl get pods -n kube-system -l k8s-app=kube-dns
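
If the short name fails but the fully qualified name resolves, the pod's DNS search path is the likely culprit. The fully qualified name follows the pattern used in the second lookup above. The sketch below builds it, assuming the default cluster domain `cluster.local` unless a third argument overrides it. `svc_fqdn` is a hypothetical helper name.

```shell
# Hypothetical helper: build a Service's fully qualified DNS name.
# Arguments: service name, namespace, optional cluster domain.
svc_fqdn() {
  echo "$1.$2.svc.${3:-cluster.local}"
}

svc_fqdn web prod   # prints web.prod.svc.cluster.local

# Usage against a live cluster:
#   kubectl exec <pod> -n <namespace> -- nslookup "$(svc_fqdn <service-name> <namespace>)"
```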

Resource Analysis

Node Pressure

# Check node conditions
kubectl describe nodes | grep -A 5 Conditions

# Check node resource usage
kubectl top nodes

# Find resource-heavy pods
kubectl top pods -A --sort-by=memory | head -20

PVC Issues

# Check PVC status
kubectl get pvc -n <namespace>

# Check PV status
kubectl get pv

# Describe for events
kubectl describe pvc <pvc-name> -n <namespace>

Quick Reference Commands

# Pod debugging
kubectl logs <pod> -n <ns>                    # Current logs
kubectl logs <pod> -n <ns> --previous         # Previous container logs
kubectl logs <pod> -n <ns> -c <container>     # Specific container
kubectl logs <pod> -n <ns> --tail=100 -f      # Follow logs

# Interactive debugging
kubectl exec -it <pod> -n <ns> -- /bin/sh     # Shell into container
kubectl exec <pod> -n <ns> -- env             # Check environment
kubectl exec <pod> -n <ns> -- cat /etc/resolv.conf  # Check DNS config

# Resource inspection
kubectl get pod <pod> -n <ns> -o yaml         # Full pod spec
kubectl describe pod <pod> -n <ns>            # Events and status
kubectl get events -n <ns> --sort-by='.lastTimestamp'

# Cluster-wide
kubectl get pods -A | grep -v Running         # Non-running pods
kubectl top pods -A --sort-by=cpu             # CPU usage
kubectl top pods -A --sort-by=memory          # Memory usage
