luiscamaral

troubleshooting-pods

# Install this skill:
npx skills add luiscamaral/k8s-cell-platform-skills --skill "troubleshooting-pods"

This installs a single skill from a multi-skill repository.

# Description

Debugs failing Kubernetes pods across all namespaces and layers. Use for pod crashes, CrashLoopBackOff, OOMKilled, ImagePullBackOff, pending pods, container errors, or when pods are not ready. Provides diagnostic workflows and common solutions.

# SKILL.md


name: troubleshooting-pods
description: Debugs failing Kubernetes pods across all namespaces and layers. Use for pod crashes, CrashLoopBackOff, OOMKilled, ImagePullBackOff, pending pods, container errors, or when pods are not ready. Provides diagnostic workflows and common solutions.
allowed-tools: Read, Glob, Grep, Bash(kubectl:get,describe,logs,exec,top,events)


Pod Troubleshooting

Diagnoses and debugs pod issues across the Kubernetes Cell Platform.

Quick Diagnosis

# Find problematic pods (Running pods that are not Ready still pass this filter; check the READY column)
kubectl get pods -A | grep -Ev "Running|Completed"

# Recent events
kubectl get events -A --sort-by='.lastTimestamp' | tail -20

# Resource pressure
kubectl top nodes
kubectl top pods -A --sort-by=memory | head -20
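
The grep filter above looks only at the STATUS text, so a pod that is Running but failing readiness (READY 0/1) slips through. A minimal sketch of a stricter filter, assuming the default column layout of `kubectl get pods -A --no-headers` (NAMESPACE, NAME, READY, STATUS, ...). The function name `unhealthy_pods` is hypothetical and not part of this skill:

```shell
# Hypothetical helper: reads `kubectl get pods -A --no-headers` output on stdin
# and prints every pod that is not Completed and is either not Running
# or does not have all of its containers ready.
unhealthy_pods() {
  awk '
    $4 == "Completed" { next }            # finished Job pods are healthy
    {
      split($3, ready, "/")               # READY column, e.g. "1/2"
      if ($4 != "Running" || ready[1] != ready[2])
        print $1 "/" $2, $4, $3           # namespace/name STATUS READY
    }
  '
}

# Usage (requires cluster access):
#   kubectl get pods -A --no-headers | unhealthy_pods
```

The filter is kept as a pure stdin-to-stdout function so it can be tried on saved output without touching a cluster.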

Debugging Workflow

Step 1: Identify the Problem

kubectl get pods -n <namespace> -o wide
kubectl describe pod <pod-name> -n <namespace>

Step 2: Check Events

kubectl get events -n <namespace> --sort-by='.lastTimestamp'

Step 3: Check Logs

# Current container
kubectl logs <pod-name> -n <namespace>

# Previous container (after crash)
kubectl logs <pod-name> -n <namespace> --previous

# Specific container in multi-container pod
kubectl logs <pod-name> -n <namespace> -c <container-name>

Step 4: Interactive Debug

# Execute into running container
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh

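# Debug with an ephemeral container (useful when the image has no shell);
# add --target=<container-name> to share that container's process namespace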
kubectl debug <pod-name> -n <namespace> --image=busybox -it

Common Error Patterns

CrashLoopBackOff

Symptoms: Container starts, exits, and is restarted with an increasing back-off delay
Diagnosis:

kubectl logs <pod> -n <ns> --previous
kubectl describe pod <pod> -n <ns> | grep -A5 "Last State"

Common Causes:
- Application error (check logs)
- Missing config/secrets
- Liveness probe failing
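
The "Last State" block printed by the describe command above includes an exit code, which often narrows the cause before any logs are read. By shell convention, a code above 128 means the process was terminated by signal number (code - 128). A minimal sketch of that mapping follows. The function name `explain_exit_code` and the wording of the hints are illustrative assumptions:

```shell
# Hypothetical helper: turn a container exit code (from "Last State" in
# `kubectl describe pod`) into a first guess at what happened.
explain_exit_code() {
  code=$1
  case "$code" in
    0)   echo "exited cleanly; the command may not be meant to run long-term" ;;
    137) echo "SIGKILL (128+9); commonly the OOM killer or a forced kill after the grace period" ;;
    139) echo "SIGSEGV (128+11); the process crashed with a segmentation fault" ;;
    143) echo "SIGTERM (128+15); the container was asked to stop" ;;
    *)
      # Any other code above 128 is 128 + signal number; non-numeric input
      # falls through to the generic branch.
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "terminated by signal $((code - 128))"
      else
        echo "application error; read the logs from the previous container"
      fi
      ;;
  esac
}
```

For example, `explain_exit_code 137` points toward the OOMKilled section, while a small code such as 1 usually means the application itself failed.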

ImagePullBackOff

Symptoms: Kubelet cannot pull the container image and backs off between retries
Diagnosis:

kubectl describe pod <pod> -n <ns> | grep -A10 "Events"

Common Causes:
- Wrong image name/tag
- Private registry without credentials
- Network issues

OOMKilled

Symptoms: Container killed for exceeding its memory limit
Diagnosis:

kubectl describe pod <pod> -n <ns> | grep -i oom
kubectl top pod <pod> -n <ns>

Solution: Raise the container's memory limit, or fix the leak if usage keeps growing
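
When picking a new limit, the usage reported by `kubectl top pod` can serve as a starting point. It shows current usage rather than the peak before the kill, so treat it as a lower bound. A minimal sketch of the arithmetic, assuming memory is reported in Mi and that the input comes from `kubectl top pod <pod> -n <ns> --no-headers`. The function name `suggest_memory_limit` and the default 50% headroom are arbitrary assumptions, not platform policy:

```shell
# Hypothetical helper: reads `kubectl top pod ... --no-headers` on stdin
# (columns: NAME CPU MEMORY, memory assumed to be in Mi) and prints a
# candidate limit with the given headroom percentage (default 50).
suggest_memory_limit() {
  headroom=${1:-50}
  awk -v h="$headroom" '
    $3 ~ /^[0-9]+Mi$/ {
      used = $3 + 0                                # "345Mi" -> 345
      limit = int((used * (100 + h) + 99) / 100)   # add headroom, round up
      print $1 ": using " used "Mi, candidate limit " limit "Mi"
    }
  '
}

# Usage (requires cluster access and metrics-server):
#   kubectl top pod <pod> -n <ns> --no-headers | suggest_memory_limit 50
```

Lines whose memory column is not in Mi are skipped rather than guessed at.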

Pending

Symptoms: Pod stuck in Pending state
Diagnosis:

kubectl describe pod <pod> -n <ns> | grep -A10 "Events"

Common Causes:
- Insufficient resources (check Karpenter)
- Node selector/affinity mismatch
- PVC not bound
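
The FailedScheduling event message usually names which of these causes applies. A minimal sketch of a classifier over that message follows. The function name `classify_pending` is hypothetical, and the matched substrings are assumptions about typical scheduler wording that may differ between Kubernetes versions:

```shell
# Hypothetical helper: map a FailedScheduling event message to one of the
# causes listed above. The first matching pattern wins, so a message that
# mentions several problems reports only the first one checked here.
classify_pending() {
  case "$1" in
    *Insufficient*)                      echo "insufficient resources" ;;
    *"node affinity"*|*"node selector"*) echo "node selector/affinity mismatch" ;;
    *PersistentVolumeClaim*)             echo "PVC not bound" ;;
    *)                                   echo "unknown; read the full event" ;;
  esac
}
```

For example, `classify_pending "0/3 nodes are available: 3 Insufficient memory."` prints `insufficient resources`, which points at node capacity or Karpenter provisioning.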

CreateContainerConfigError

Symptoms: Kubelet cannot build the container's configuration, so the container never starts
Diagnosis:

kubectl describe pod <pod> -n <ns> | grep -A5 "Warning"

Common Causes:
- Missing ConfigMap/Secret
- Invalid environment variable reference
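
The warning text typically names the missing object, which can then be checked directly. A minimal sketch that extracts the kind and name, assuming the message has the form `configmap "<name>" not found` or `secret "<name>" not found`. This wording is an assumption and may vary by version, and the function name `missing_config_ref` is hypothetical:

```shell
# Hypothetical helper: pull "<kind> <name>" out of a warning such as
#   Error: configmap "app-config" not found
# Prints nothing if the message does not match the assumed wording.
missing_config_ref() {
  printf '%s\n' "$1" |
    sed -nE 's/.*(configmap|secret) "([^"]*)" not found.*/\1 \2/p'
}

# Usage (requires cluster access), with the warning text in $msg;
# the unquoted substitution is split into kind and name on purpose:
#   kubectl get $(missing_config_ref "$msg") -n <ns>
```

If the object exists but the pod still fails, the reference is more likely to a missing key inside it than to the object itself.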

Memory Files

  • meta/memory/troubleshooting-history.md - Past issues and solutions

Reference Documentation

  • reference/common-errors.md - Error pattern catalog

Diagnostic Script

Run scripts/collect-diagnostics.sh <namespace> <pod-name> for comprehensive diagnostics.

Layer-Specific Notes

| Layer | Common Issues |
|-------|---------------|
| L0 (kube-system) | Cilium agent, CoreDNS |
| L1 (metallb, dns) | Speaker pods, external-dns permissions |
| L2 (argocd, kyverno) | Image pull, CRD issues |
