nik-kale

kubernetes-troubleshooting

0
0
# Install this skill:
npx skills add nik-kale/sre-skills --skill "kubernetes-troubleshooting"

Install specific skill from multi-skill repository

# Description

Systematic debugging workflows for Kubernetes issues including pod failures, resource problems, and networking. Use when debugging CrashLoopBackOff, OOMKilled, ImagePullBackOff, pod not starting, k8s issues, or any Kubernetes troubleshooting.

# SKILL.md


name: kubernetes-troubleshooting
description: Systematic debugging workflows for Kubernetes issues including pod failures, resource problems, and networking. Use when debugging CrashLoopBackOff, OOMKilled, ImagePullBackOff, pod not starting, k8s issues, or any Kubernetes troubleshooting.


Kubernetes Troubleshooting

Systematic approach to debugging Kubernetes issues.

When to Use This Skill

  • Pod stuck in CrashLoopBackOff
  • OOMKilled errors
  • ImagePullBackOff failures
  • Pod not starting or scheduling
  • Service connectivity issues
  • Resource constraint problems

Quick Diagnostic Commands

Start with these commands to understand the current state:

# Cluster overview
kubectl get nodes
kubectl get pods -A | grep -v Running

# Specific namespace
kubectl get pods -n <namespace>
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20

# Resource usage
kubectl top nodes
kubectl top pods -n <namespace>

Pod Debugging Workflow

Step 1: Check Pod Status

kubectl get pod <pod-name> -n <namespace> -o wide
kubectl describe pod <pod-name> -n <namespace>

Look for:

  • Status: What state is the pod in?
  • Conditions: Ready, ContainersReady, PodScheduled
  • Events: Recent events at the bottom of describe output

Step 2: Identify the Problem Category

Symptom Likely Cause Go To Section
Pending Scheduling issue Scheduling Issues
CrashLoopBackOff Application crash CrashLoopBackOff
ImagePullBackOff Image/registry issue Image Pull Issues
OOMKilled Memory exhaustion OOMKilled
Running but not Ready Health check failing Readiness Issues
Error Container error Container Errors

Common Issues

Scheduling Issues

Pod stuck in Pending state.

Diagnostic:

kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Events

Common Causes:

Event Message Cause Fix
Insufficient cpu/memory Not enough resources Add nodes or reduce requests
node(s) had taints Node taints Add tolerations or remove taints
no nodes available No matching nodes Check node selector/affinity
persistentvolumeclaim not found PVC missing Create the PVC

Fix Resource Issues:

# Check resource requests vs available
kubectl describe nodes | grep -A 5 "Allocated resources"

# Check pending pod requests
kubectl get pod <pod> -o yaml | grep -A 10 resources

CrashLoopBackOff

Container keeps crashing and restarting.

Diagnostic:

# Check container logs (current)
kubectl logs <pod-name> -n <namespace>

# Check previous container logs
kubectl logs <pod-name> -n <namespace> --previous

# Check exit code
kubectl describe pod <pod-name> -n <namespace> | grep -A 3 "Last State"

Common Exit Codes:

Exit Code Meaning Common Cause
0 Success Process completed (might be wrong for long-running)
1 Application error Check application logs
137 SIGKILL (OOM) Memory limit exceeded
139 SIGSEGV Segmentation fault
143 SIGTERM Graceful termination

Common Fixes:

  • Check application logs for startup errors
  • Verify environment variables and secrets
  • Check if dependencies are available
  • Verify resource limits aren't too restrictive

Image Pull Issues

ImagePullBackOff or ErrImagePull.

Diagnostic:

kubectl describe pod <pod-name> -n <namespace> | grep -A 5 Events

Common Causes:

Error Cause Fix
repository does not exist Wrong image name Fix image name/tag
unauthorized Auth failure Check imagePullSecrets
manifest unknown Tag doesn't exist Verify tag exists
connection refused Registry unreachable Check network/firewall

Fix Registry Auth:

# Create image pull secret
kubectl create secret docker-registry regcred \
  --docker-server=<registry> \
  --docker-username=<user> \
  --docker-password=<password> \
  -n <namespace>

# Reference in pod spec
spec:
  imagePullSecrets:
  - name: regcred

OOMKilled

Container killed due to memory exhaustion.

Diagnostic:

kubectl describe pod <pod-name> -n <namespace> | grep -i oom
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 5 lastState

Fix Options:

  1. Increase memory limit (if available):
resources:
  limits:
    memory: '512Mi' # Increase this
  requests:
    memory: '256Mi'
  1. Profile memory usage:
kubectl top pod <pod-name> -n <namespace> --containers
  1. Check for memory leaks in application code

Readiness Issues

Pod is Running but not Ready.

Diagnostic:

# Check readiness probe
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Readiness

# Check probe endpoint manually
kubectl exec <pod-name> -n <namespace> -- wget -qO- localhost:<port>/health

Common Causes:

  • Application not listening on expected port
  • Readiness endpoint returning non-200
  • Probe timeout too short
  • Dependencies not available

Fix Readiness Probe:

readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10 # Give app time to start
  periodSeconds: 5
  timeoutSeconds: 3 # Increase if needed
  failureThreshold: 3

Container Errors

Diagnostic:

# Get detailed container status
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[*]}'

# Check init containers
kubectl logs <pod-name> -n <namespace> -c <init-container-name>

Networking Troubleshooting

Service Not Reachable

# Check service endpoints
kubectl get endpoints <service-name> -n <namespace>

# Check service selector matches pod labels
kubectl get svc <service-name> -n <namespace> -o yaml | grep selector -A 5
kubectl get pods -n <namespace> --show-labels

# Test connectivity from another pod
kubectl run debug --rm -it --image=busybox -- wget -qO- <service>:<port>

DNS Issues

# Check DNS resolution from pod
kubectl exec <pod> -n <namespace> -- nslookup <service-name>
kubectl exec <pod> -n <namespace> -- nslookup <service-name>.<namespace>.svc.cluster.local

# Check CoreDNS is running
kubectl get pods -n kube-system -l k8s-app=kube-dns

Resource Analysis

Node Pressure

# Check node conditions
kubectl describe nodes | grep -A 5 Conditions

# Check node resource usage
kubectl top nodes

# Find resource-heavy pods
kubectl top pods -A --sort-by=memory | head -20

PVC Issues

# Check PVC status
kubectl get pvc -n <namespace>

# Check PV status
kubectl get pv

# Describe for events
kubectl describe pvc <pvc-name> -n <namespace>

Quick Reference Commands

# Pod debugging
kubectl logs <pod> -n <ns>                    # Current logs
kubectl logs <pod> -n <ns> --previous         # Previous container logs
kubectl logs <pod> -n <ns> -c <container>     # Specific container
kubectl logs <pod> -n <ns> --tail=100 -f      # Follow logs

# Interactive debugging
kubectl exec -it <pod> -n <ns> -- /bin/sh     # Shell into container
kubectl exec <pod> -n <ns> -- env             # Check environment
kubectl exec <pod> -n <ns> -- cat /etc/hosts  # Check DNS

# Resource inspection
kubectl get pod <pod> -n <ns> -o yaml         # Full pod spec
kubectl describe pod <pod> -n <ns>            # Events and status
kubectl get events -n <ns> --sort-by='.lastTimestamp'

# Cluster-wide
kubectl get pods -A | grep -v Running         # Non-running pods
kubectl top pods -A --sort-by=cpu             # CPU usage
kubectl top pods -A --sort-by=memory          # Memory usage

Additional Resources

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents:

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.