zonghui1968

kubernetes

# Install this skill:
npx skills add zonghui1968/clawd-skills --skill "kubernetes"

Install specific skill from multi-skill repository


# SKILL.md


---
name: kubernetes
description: |
  Comprehensive Kubernetes and OpenShift cluster management skill covering operations, troubleshooting, manifest generation, security, and GitOps. Use this skill when:
  (1) Cluster operations: upgrades, backups, node management, scaling, monitoring setup
  (2) Troubleshooting: pod failures, networking issues, storage problems, performance analysis
  (3) Creating manifests: Deployments, StatefulSets, Services, Ingress, NetworkPolicies, RBAC
  (4) Security: audits, Pod Security Standards, RBAC, secrets management, vulnerability scanning
  (5) GitOps: ArgoCD, Flux, Kustomize, Helm, CI/CD pipelines, progressive delivery
  (6) OpenShift-specific: SCCs, Routes, Operators, Builds, ImageStreams
  (7) Multi-cloud: AKS, EKS, GKE, ARO, ROSA operations
metadata:
  author: cluster-skills
  version: "1.0.0"
---


Kubernetes & OpenShift Cluster Management

Comprehensive skill for Kubernetes and OpenShift clusters covering operations, troubleshooting, manifests, security, and GitOps.

Current Versions (January 2026)

| Platform | Version | Documentation |
|---|---|---|
| Kubernetes | 1.31.x | https://kubernetes.io/docs/ |
| OpenShift | 4.17.x | https://docs.openshift.com/ |
| EKS | 1.31 | https://docs.aws.amazon.com/eks/ |
| AKS | 1.31 | https://learn.microsoft.com/azure/aks/ |
| GKE | 1.31 | https://cloud.google.com/kubernetes-engine/docs |

Key Tools

| Tool | Version | Purpose |
|---|---|---|
| ArgoCD | v2.13.x | GitOps deployments |
| Flux | v2.4.x | GitOps toolkit |
| Kustomize | v5.5.x | Manifest customization |
| Helm | v3.16.x | Package management |
| Velero | 1.15.x | Backup/restore |
| Trivy | 0.58.x | Security scanning |
| Kyverno | 1.13.x | Policy engine |

Command Convention

IMPORTANT: Use kubectl for standard Kubernetes. Use oc for OpenShift/ARO.
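
The convention above can be wrapped in a small helper so that later commands do not hard-code the binary. This is a minimal sketch, not part of the skill: the function name `k8s_cli` and the `K8S_CLI` override variable are hypothetical, and it assumes that a cluster serving the `route.openshift.io` API group is OpenShift.

```shell
# Pick the CLI for the current cluster: `oc` on OpenShift/ARO, `kubectl` elsewhere.
# K8S_CLI is a hypothetical override variable, not something the skill defines.
k8s_cli() {
  if [ -n "${K8S_CLI:-}" ]; then
    printf '%s\n' "$K8S_CLI"
    return 0
  fi
  # Assumption: a cluster that serves route.openshift.io is OpenShift.
  if command -v oc >/dev/null 2>&1 &&
     oc api-versions 2>/dev/null | grep -q '^route\.openshift\.io/'; then
    printf 'oc\n'
  else
    printf 'kubectl\n'
  fi
}
```

A caller could then write `CLI=$(k8s_cli)` once and run `"$CLI" get nodes -o wide`.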


1. CLUSTER OPERATIONS

Node Management

# View nodes
kubectl get nodes -o wide

# Drain node for maintenance
kubectl drain ${NODE} --ignore-daemonsets --delete-emptydir-data --grace-period=60

# Uncordon after maintenance
kubectl uncordon ${NODE}

# View node resources
kubectl top nodes
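
The drain step above can block indefinitely when a pod cannot be evicted, so it may help to preview the node's pods first and bound the wait. The sketch below is illustrative only: `run` and `drain_node` are hypothetical names, the 300-second timeout is an arbitrary choice, and it is not the bundled node-maintenance.sh script.

```shell
# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Show what is running on the node, then drain it with a bounded wait.
# --timeout makes drain give up instead of waiting forever on a pod that
# cannot be evicted (for example, one protected by a PodDisruptionBudget).
drain_node() {
  node=$1
  run kubectl get pods -A -o wide --field-selector "spec.nodeName=$node" &&
  run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data \
    --grace-period=60 --timeout=300s
}
```

Running `DRY_RUN=1 drain_node worker-1` prints both commands without touching the cluster.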

Cluster Upgrades

AKS:

az aks get-upgrades -g ${RG} -n ${CLUSTER} -o table
az aks upgrade -g ${RG} -n ${CLUSTER} --kubernetes-version ${VERSION}

EKS:

aws eks update-cluster-version --name ${CLUSTER} --kubernetes-version ${VERSION}

GKE:

gcloud container clusters upgrade ${CLUSTER} --master --cluster-version ${VERSION}

OpenShift:

oc adm upgrade --to=${VERSION}
oc get clusterversion
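
Upstream Kubernetes, and managed services such as AKS and EKS, expect control-plane upgrades to move one minor version at a time, so a quick check before calling the commands above can catch a skipped minor. The following is a rough sketch with a hypothetical function name; it only compares version strings and does not apply to OpenShift, which follows its own update graph.

```shell
# Return 0 when TARGET is the same minor as CURRENT or exactly one minor ahead.
# Both arguments are MAJOR.MINOR[.PATCH] strings such as 1.30.4 (a leading "v" is allowed).
minor_skew_ok() {
  cur=${1#v}; tgt=${2#v}
  cur_major=${cur%%.*}; tgt_major=${tgt%%.*}
  cur_rest=${cur#*.};   tgt_rest=${tgt#*.}
  cur_minor=${cur_rest%%.*}; tgt_minor=${tgt_rest%%.*}
  [ "$cur_major" = "$tgt_major" ] || return 1
  diff=$((tgt_minor - cur_minor))
  [ "$diff" -ge 0 ] && [ "$diff" -le 1 ]
}
```

For example, `minor_skew_ok 1.30.4 1.31.0` succeeds, while `minor_skew_ok 1.29.2 1.31.0` fails because it skips 1.30.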

Backup with Velero

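# Install Velero (provider support ships as a separate plugin image, passed with --plugins)
velero install --provider ${PROVIDER} --plugins ${PLUGIN_IMAGE} --bucket ${BUCKET} --secret-file ${CREDS}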

# Create backup
velero backup create ${BACKUP_NAME} --include-namespaces ${NS}

# Restore
velero restore create --from-backup ${BACKUP_NAME}
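
Velero backup names must be unique, so recurring backups need a generated `${BACKUP_NAME}`. One possible approach, assuming a namespace-plus-UTC-timestamp naming scheme (the function name and scheme are illustrative, not from the skill):

```shell
# Build a timestamped backup name such as "prod-20260115-093000" (UTC).
backup_name() {
  printf '%s-%s\n' "$1" "$(date -u +%Y%m%d-%H%M%S)"
}
```

It could be used as `velero backup create "$(backup_name "$NS")" --include-namespaces "$NS" --ttl 720h --wait`, where `--ttl` sets how long the backup is retained and `--wait` blocks until the backup finishes.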

2. TROUBLESHOOTING

Health Assessment

Run the bundled script for a comprehensive health check:

bash scripts/cluster-health-check.sh

Pod Status Interpretation

| Status | Meaning | Action |
|---|---|---|
| Pending | Scheduling issue | Check resources, nodeSelector, tolerations |
| CrashLoopBackOff | Container crashing | Check logs: kubectl logs ${POD} --previous |
| ImagePullBackOff | Image unavailable | Verify image name, registry access |
| OOMKilled | Out of memory | Increase memory limits |
| Evicted | Node pressure | Check node resources |
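
The table above can be turned into a lookup that prints a first diagnostic command for a given status. This is only a sketch: the function name is hypothetical and the command chosen for each status is one reasonable starting point, not the only one.

```shell
# Print a suggested first diagnostic command for a pod status.
# Usage: suggest_action STATUS POD [NAMESPACE]
suggest_action() {
  status=$1; pod=$2; ns=${3:-default}
  case "$status" in
    CrashLoopBackOff)
      # Logs of the crashed container instance
      echo "kubectl logs $pod -n $ns --previous" ;;
    OOMKilled)
      # Current per-container usage, to compare against the memory limit
      echo "kubectl top pod $pod -n $ns --containers" ;;
    Evicted)
      # The eviction message names the resource that was under pressure
      echo "kubectl get pod $pod -n $ns -o jsonpath={.status.message}" ;;
    *)
      # Pending, ImagePullBackOff and anything else: start from the events
      echo "kubectl describe pod $pod -n $ns" ;;
  esac
}
```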

Debugging Commands

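# Pod logs (current container instance)
kubectl logs ${POD} -c ${CONTAINER}

# Pod logs from the previous (crashed) container instance
kubectl logs ${POD} -c ${CONTAINER} --previous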

# Multi-pod logs with stern (select pods by label)
stern -l ${LABEL_SELECTOR} -n ${NS}

# Exec into pod
kubectl exec -it ${POD} -- /bin/sh

# Pod events
kubectl describe pod ${POD} | grep -A 20 Events

# Cluster events (sorted by time)
kubectl get events -A --sort-by='.lastTimestamp' | tail -50
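
When a namespace has many pods, it can be quicker to filter for the ones that keep restarting before reading logs. A small sketch, assuming the default column layout of `kubectl get pods --no-headers` for a single namespace (NAME, READY, STATUS, RESTARTS, AGE); the function name and default threshold are made up for illustration.

```shell
# Read `kubectl get pods --no-headers` output on stdin and print the name and
# restart count of pods whose RESTARTS column (4th field) is at least the
# given threshold (default 5).
high_restarts() {
  awk -v min="${1:-5}" '$4 + 0 >= min { print $1, $4 }'
}
```

Typical use would be `kubectl get pods -n ${NS} --no-headers | high_restarts 3`. With `-A` the namespace becomes the first column, so the field numbers would need to shift by one.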

Network Troubleshooting

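# Test DNS (busybox is pinned to 1.28 because nslookup in later busybox images is known to give misleading results)
kubectl run -it --rm debug --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default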

# Test service connectivity
kubectl run -it --rm debug --image=curlimages/curl -- curl -v http://${SVC}.${NS}:${PORT}

# Check endpoints
kubectl get endpoints ${SVC}
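
The short form `${SVC}.${NS}` used above relies on the pod's DNS search path. When that is in doubt, testing the fully qualified name removes one variable. A minimal sketch, assuming the default cluster domain `cluster.local` (the function name is hypothetical):

```shell
# Build the fully qualified DNS name of a Service.
# Usage: svc_fqdn SERVICE NAMESPACE [CLUSTER_DOMAIN]
svc_fqdn() {
  printf '%s.%s.svc.%s\n' "$1" "$2" "${3:-cluster.local}"
}
```

For example, `svc_fqdn api prod` prints `api.prod.svc.cluster.local`, which can be passed to the nslookup or curl tests above.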

3. MANIFEST GENERATION

Production Deployment Template

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ${APP_NAME}
  namespace: ${NAMESPACE}
  labels:
    app.kubernetes.io/name: ${APP_NAME}
    app.kubernetes.io/version: "${VERSION}"
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app.kubernetes.io/name: ${APP_NAME}
  template:
    metadata:
      labels:
        app.kubernetes.io/name: ${APP_NAME}
    spec:
      serviceAccountName: ${APP_NAME}
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: ${APP_NAME}
          image: ${IMAGE}:${TAG}
          ports:
            - name: http
              containerPort: 8080
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
          livenessProbe:
            httpGet:
              path: /healthz
              port: http
            initialDelaySeconds: 10
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: http
            initialDelaySeconds: 5
            periodSeconds: 5
          volumeMounts:
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: tmp
          emptyDir: {}
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app.kubernetes.io/name: ${APP_NAME}
                topologyKey: kubernetes.io/hostname
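
Kubernetes does not expand the `${VAR}` placeholders in the template above; they have to be substituted (for example with `envsubst`) before the manifest is applied. Because `envsubst` silently replaces unset variables with an empty string, a pre-check for missing values can be useful. The sketch below uses a hypothetical function name and is not the bundled generate-manifest.sh script.

```shell
# Print the names of ${VAR} placeholders in a template that are unset or empty
# in the current shell. Prints nothing when every placeholder has a value.
# Note: this looks at shell variables; envsubst only sees exported ones.
missing_vars() {
  grep -o '\${[A-Za-z_][A-Za-z0-9_]*}' "$1" | sort -u | tr -d '${}' |
  while read -r name; do
    # Names are restricted to [A-Za-z0-9_] by the grep above, so eval is safe here.
    eval "[ -n \"\${$name:-}\" ]" || printf '%s\n' "$name"
  done
}
```

A possible workflow: export the variables, confirm that `missing_vars deployment.tpl.yaml` prints nothing, render with `envsubst < deployment.tpl.yaml > deployment.yaml`, then validate with `kubectl apply --dry-run=client -f deployment.yaml`. The file names here are placeholders.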

Service & Ingress

apiVersion: v1
kind: Service
metadata:
  name: ${APP_NAME}
spec:
  selector:
    app.kubernetes.io/name: ${APP_NAME}
  ports:
    - name: http
      port: 80
      targetPort: http
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ${APP_NAME}
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - ${HOST}
      secretName: ${APP_NAME}-tls
  rules:
    - host: ${HOST}
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ${APP_NAME}
                port:
                  name: http

OpenShift Route

apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: ${APP_NAME}
spec:
  to:
    kind: Service
    name: ${APP_NAME}
  port:
    targetPort: http
  tls:
    termination: edge
    insecureEdgeTerminationPolicy: Redirect

Use the bundled script for manifest generation:

bash scripts/generate-manifest.sh deployment myapp production

4. SECURITY

Security Audit

Run the bundled script:

bash scripts/security-audit.sh [namespace]

Pod Security Standards

apiVersion: v1
kind: Namespace
metadata:
  name: ${NAMESPACE}
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: baseline
    pod-security.kubernetes.io/warn: restricted
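
Before enforcing a level on a namespace that already has workloads, a server-side dry run of the label change reports which existing pods would violate it. A minimal sketch that validates the level and prints the command to run; the function name is hypothetical.

```shell
# Print a server-side dry-run command that previews a Pod Security Standards level.
# Usage: pss_dry_run_cmd NAMESPACE LEVEL   (LEVEL: privileged | baseline | restricted)
pss_dry_run_cmd() {
  case "$2" in
    privileged|baseline|restricted) ;;
    *) echo "unknown level: $2" >&2; return 1 ;;
  esac
  printf 'kubectl label --dry-run=server --overwrite ns %s pod-security.kubernetes.io/enforce=%s\n' "$1" "$2"
}
```

For example, `pss_dry_run_cmd production restricted` prints a command whose warnings list the pods that would be rejected under the restricted level.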

NetworkPolicy (Zero Trust)

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ${APP_NAME}-policy
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: ${APP_NAME}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: frontend
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: database
      ports:
        - protocol: TCP
          port: 5432
    # Allow DNS (UDP, plus TCP for large responses)
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
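
The policy above only restricts the pods it selects; pods that no NetworkPolicy selects remain open to all traffic. A zero-trust setup therefore usually pairs it with a namespace-wide default-deny policy. The helper below is a sketch that emits such a policy; the function name and the policy name `default-deny-all` are illustrative choices.

```shell
# Emit a default-deny NetworkPolicy for a namespace on stdout.
# The empty podSelector selects every pod in the namespace, and listing both
# policy types with no rules denies all ingress and egress.
default_deny_policy() {
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: $1
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
EOF
}
```

It can be applied with `default_deny_policy ${NAMESPACE} | kubectl apply -f -`. Allow-list policies such as the one above then open only the required paths.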

RBAC Best Practices

apiVersion: v1
kind: ServiceAccount
metadata:
  name: ${APP_NAME}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ${APP_NAME}-role
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ${APP_NAME}-binding
subjects:
  - kind: ServiceAccount
    name: ${APP_NAME}
    namespace: ${NAMESPACE}  # required for ServiceAccount subjects
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: ${APP_NAME}-role
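
After applying the Role and RoleBinding, the grant can be checked by impersonating the service account with `kubectl auth can-i`. The helpers below only build the command strings; their names are hypothetical.

```shell
# Build the user name that Kubernetes assigns to a service account.
# Usage: sa_user NAMESPACE SERVICEACCOUNT
sa_user() {
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}

# Print the `kubectl auth can-i` command for one verb/resource check.
# Usage: can_i_cmd NAMESPACE SERVICEACCOUNT VERB RESOURCE
can_i_cmd() {
  printf 'kubectl auth can-i %s %s -n %s --as=%s\n' "$3" "$4" "$1" "$(sa_user "$1" "$2")"
}
```

With the Role above, the command printed by `can_i_cmd ${NAMESPACE} ${APP_NAME} get configmaps` should answer `yes`, and the same check for `delete configmaps` should answer `no`.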

Image Scanning

# Scan image with Trivy
trivy image ${IMAGE}:${TAG}

# Scan with severity filter
trivy image --severity HIGH,CRITICAL ${IMAGE}:${TAG}

# Generate SBOM
trivy image --format spdx-json -o sbom.json ${IMAGE}:${TAG}
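
By default Trivy exits with status 0 even when it reports vulnerabilities, so the scans above will not fail a pipeline on their own. A possible CI gate, using Trivy's `--exit-code` and `--ignore-unfixed` flags (the wrapper name `scan_gate` is made up for this example):

```shell
# Exit non-zero when an image has HIGH or CRITICAL findings that have a fix.
# --exit-code 1 sets the status returned when findings exist;
# --ignore-unfixed skips vulnerabilities with no available patch.
scan_gate() {
  trivy image --exit-code 1 --ignore-unfixed --severity HIGH,CRITICAL "$1"
}
```

In a pipeline, `scan_gate ${IMAGE}:${TAG} || exit 1` would stop the build before the image is deployed.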

5. GITOPS

ArgoCD Application

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ${APP_NAME}
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  source:
    repoURL: ${GIT_REPO}
    targetRevision: main
    path: k8s/overlays/${ENV}
  destination:
    server: https://kubernetes.default.svc
    namespace: ${NAMESPACE}
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

Kustomize Structure

k8s/
├── base/
│   ├── kustomization.yaml
│   ├── deployment.yaml
│   └── service.yaml
└── overlays/
    ├── dev/
    │   └── kustomization.yaml
    ├── staging/
    │   └── kustomization.yaml
    └── prod/
        └── kustomization.yaml

base/kustomization.yaml:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml

overlays/prod/kustomization.yaml:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
namePrefix: prod-
namespace: production
replicas:
  - name: myapp
    count: 5
images:
  - name: myregistry/myapp
    newTag: v1.2.3
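
Each overlay in the layout above should render on its own, so a quick loop over the overlays can catch a broken kustomization.yaml before ArgoCD does. The sketch below only discovers the overlays; the function name is hypothetical.

```shell
# Print the overlay names (directory names) under ROOT/overlays that contain
# a kustomization.yaml, one per line, sorted.
list_overlays() {
  for dir in "$1"/overlays/*/; do
    [ -f "${dir}kustomization.yaml" ] && basename "$dir"
  done | sort
}
```

A render check could then be `for env in $(list_overlays k8s); do kubectl kustomize "k8s/overlays/$env" >/dev/null || echo "overlay $env failed to build"; done`.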

GitHub Actions CI/CD

name: Build and Deploy
on:
  push:
    branches: [main]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Build and push image
        uses: docker/build-push-action@v5
        with:
          push: true
          tags: ${{ secrets.REGISTRY }}/${{ github.event.repository.name }}:${{ github.sha }}

      - name: Update Kustomize image
        run: |
          cd k8s/overlays/prod
          kustomize edit set image myregistry/myapp=${{ secrets.REGISTRY }}/${{ github.event.repository.name }}:${{ github.sha }}

      - name: Commit and push
        run: |
          git config user.name "github-actions"
          git config user.email "[email protected]"
          git add .
          git commit -m "Update image to ${{ github.sha }}"
          git push

Use the bundled script for ArgoCD sync:

bash scripts/argocd-app-sync.sh ${APP_NAME} --prune

Helper Scripts

This skill includes automation scripts in the scripts/ directory:

| Script | Purpose |
|---|---|
| cluster-health-check.sh | Comprehensive cluster health assessment with scoring |
| security-audit.sh | Security posture audit (privileged, root, RBAC, NetworkPolicy) |
| node-maintenance.sh | Safe node drain and maintenance prep |
| pre-upgrade-check.sh | Pre-upgrade validation checklist |
| generate-manifest.sh | Generate production-ready K8s manifests |
| argocd-app-sync.sh | ArgoCD application sync helper |

Run any script:

bash scripts/<script-name>.sh [arguments]

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.