Deploy and manage Kubernetes workloads: manifests, RBAC, Helm charts, service mesh, GitOps, and troubleshooting.
references/troubleshooting.md
# Kubernetes Troubleshooting

## Essential kubectl Commands

### Pod Inspection

```bash
# Get pods with details
kubectl get pods -n production -o wide
kubectl get pods --all-namespaces
kubectl get pods --field-selector status.phase=Running
kubectl get pods --selector app=web-app

# Describe pod (shows events)
kubectl describe pod web-app-7d5c8b9f4-xk2pm -n production

# Get pod logs
kubectl logs web-app-7d5c8b9f4-xk2pm -n production
kubectl logs web-app-7d5c8b9f4-xk2pm -n production --previous  # Previous container
kubectl logs web-app-7d5c8b9f4-xk2pm -n production -c init-container
kubectl logs -f web-app-7d5c8b9f4-xk2pm -n production  # Follow logs
kubectl logs --tail=100 web-app-7d5c8b9f4-xk2pm -n production
kubectl logs --since=1h web-app-7d5c8b9f4-xk2pm -n production

# Get all pod logs from deployment
kubectl logs deployment/web-app -n production --all-containers=true

# Execute commands in pod
kubectl exec -it web-app-7d5c8b9f4-xk2pm -n production -- /bin/sh
kubectl exec web-app-7d5c8b9f4-xk2pm -n production -- env
kubectl exec web-app-7d5c8b9f4-xk2pm -n production -- cat /etc/config/app.yaml

# Copy files to/from pod
kubectl cp web-app-7d5c8b9f4-xk2pm:/app/logs/app.log ./app.log -n production
kubectl cp ./config.yaml web-app-7d5c8b9f4-xk2pm:/tmp/config.yaml -n production

# Port forward
kubectl port-forward web-app-7d5c8b9f4-xk2pm 8080:8080 -n production
kubectl port-forward service/web-app 8080:80 -n production
```

### Deployment Debugging

```bash
# Check deployment status
kubectl get deployment web-app -n production
kubectl describe deployment web-app -n production
kubectl rollout status deployment/web-app -n production
kubectl rollout history deployment/web-app -n production

# Check replica sets
kubectl get rs -n production
kubectl describe rs web-app-7d5c8b9f4 -n production

# Scale deployment
kubectl scale deployment web-app --replicas=5 -n production

# Rollback deployment
kubectl rollout undo deployment/web-app -n production
kubectl rollout undo deployment/web-app --to-revision=2 -n production

# Restart deployment (recreate pods)
kubectl rollout restart deployment/web-app -n production
```

### Service and Network Debugging

```bash
# Get services
kubectl get svc -n production
kubectl describe svc web-app -n production

# Get endpoints
kubectl get endpoints web-app -n production
kubectl describe endpoints web-app -n production

# Get ingress
kubectl get ingress -n production
kubectl describe ingress web-app -n production

# Get network policies
kubectl get networkpolicy -n production
kubectl describe networkpolicy frontend-to-backend -n production
```

### Resource and Configuration

```bash
# Get ConfigMaps and Secrets
kubectl get configmap -n production
kubectl describe configmap app-config -n production
kubectl get configmap app-config -n production -o yaml

kubectl get secret -n production
kubectl describe secret app-secrets -n production
kubectl get secret app-secrets -n production -o jsonpath='{.data.password}' | base64 -d

# Get PVCs and PVs
kubectl get pvc -n production
kubectl describe pvc database-pvc -n production
kubectl get pv

# Get events (sorted by timestamp)
kubectl get events -n production --sort-by='.lastTimestamp'
kubectl get events -n production --field-selector involvedObject.name=web-app-7d5c8b9f4-xk2pm
```

## Debug Pod

### Ephemeral Debug Container

```bash
# Attach debug container to running pod
kubectl debug -it web-app-7d5c8b9f4-xk2pm -n production \
  --image=busybox:latest \
  --target=web-app

# Create copy of pod with debug tools
kubectl debug web-app-7d5c8b9f4-xk2pm -n production \
  -it \
  --image=ubuntu:latest \
  --share-processes \
  --copy-to=web-app-debug

# Debug with different image
kubectl debug web-app-7d5c8b9f4-xk2pm -n production \
  -it \
  --image=nicolaka/netshoot:latest \
  --target=web-app
```

### Debug on Node

```bash
# Create privileged pod on specific node
kubectl debug node/node-01 -it --image=ubuntu:latest

# Access node filesystem
kubectl debug node/node-01 -it --image=ubuntu:latest -- chroot /host
```

## Common Issues and Solutions

### Issue 1: Pod in Pending State

```bash
# Check pod status and events
kubectl describe pod web-app-7d5c8b9f4-xk2pm -n production

# Common causes:
# 1. Insufficient resources
kubectl top nodes
kubectl describe nodes

# 2. PVC not bound
kubectl get pvc -n production
kubectl describe pvc database-pvc -n production

# 3. ImagePullBackOff
kubectl describe pod web-app-7d5c8b9f4-xk2pm -n production | grep -A 10 Events

# 4. Node selector/affinity issues
kubectl get pod web-app-7d5c8b9f4-xk2pm -n production -o yaml | grep -A 5 nodeSelector
```

### Issue 2: CrashLoopBackOff

```bash
# Check logs from crashed container
kubectl logs web-app-7d5c8b9f4-xk2pm -n production --previous

# Check if liveness probe is failing
kubectl describe pod web-app-7d5c8b9f4-xk2pm -n production | grep -A 10 "Liveness"

# Debug with different command
kubectl run debug-pod --image=myapp:latest -it --rm --restart=Never -- /bin/sh

# Check resource limits
kubectl describe pod web-app-7d5c8b9f4-xk2pm -n production | grep -A 10 "Limits"
```

### Issue 3: ImagePullBackOff

```bash
# Check image pull secret
kubectl get secret registry-credentials -n production -o yaml

# Test image pull manually
kubectl run test-pull --image=myregistry.io/myapp:v1.2.0 \
  --image-pull-policy=Always \
  --restart=Never \
  -n production

# Create/update image pull secret
kubectl create secret docker-registry registry-credentials \
  --docker-server=myregistry.io \
  --docker-username=myuser \
  --docker-password=mypassword \
  --docker-email="[email protected]" \
  -n production
```

### Issue 4: Service Not Accessible

```bash
# Check service endpoints
kubectl get endpoints web-app -n production
kubectl describe endpoints web-app -n production

# Verify pod labels match service selector
kubectl get pod web-app-7d5c8b9f4-xk2pm -n production --show-labels
kubectl get service web-app -n production -o yaml | grep -A 3 selector

# Test service connectivity from debug pod
kubectl run debug --image=nicolaka/netshoot:latest -it --rm -n production -- bash
# Inside pod:
curl http://web-app.production.svc.cluster.local
nslookup web-app.production.svc.cluster.local
telnet web-app.production.svc.cluster.local 80
```

### Issue 5: DNS Resolution Issues

```bash
# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns

# Test DNS resolution
kubectl run dnsutils --image=tutum/dnsutils -it --rm -- bash
# Inside pod:
nslookup kubernetes.default
nslookup web-app.production.svc.cluster.local
dig web-app.production.svc.cluster.local

# Check DNS config in pod
kubectl exec web-app-7d5c8b9f4-xk2pm -n production -- cat /etc/resolv.conf
```

### Issue 6: NetworkPolicy Blocking Traffic

```bash
# List network policies
kubectl get networkpolicy -n production
kubectl describe networkpolicy default-deny-all -n production

# Test connectivity
kubectl run test-connectivity --image=nicolaka/netshoot:latest -it --rm -n production -- bash
# Inside pod:
curl -v http://web-app:80
nc -zv web-app 80

# Temporarily allow all traffic (testing only; this deletes every NetworkPolicy in the namespace)
kubectl delete networkpolicy --all -n production
```

### Issue 7: High Resource Usage

```bash
# Check resource usage
kubectl top nodes
kubectl top pods -n production
kubectl top pod web-app-7d5c8b9f4-xk2pm -n production --containers

# Check resource requests and limits
kubectl describe pod web-app-7d5c8b9f4-xk2pm -n production | grep -A 10 "Limits"

# Get pods sorted by CPU/memory usage
kubectl top pods -n production --sort-by=cpu
kubectl top pods -n production --sort-by=memory

# Check node capacity
kubectl describe node node-01 | grep -A 10 "Allocated resources"
```

### Issue 8: PersistentVolumeClaim Issues

```bash
# Check PVC status
kubectl get pvc -n production
kubectl describe pvc database-pvc -n production

# Check PV status
kubectl get pv
kubectl describe pv pvc-abc123

# Check storage class
kubectl get storageclass
kubectl describe storageclass fast-ssd

# Events related to PVC
kubectl get events -n production --field-selector involvedObject.name=database-pvc
```

## Advanced Debugging

### API Server Debugging

```bash
# Enable verbose output
kubectl get pods -n production -v=9

# Check API server logs (on a control-plane node, if kube-apiserver runs as a systemd unit)
journalctl -u kube-apiserver -f

# Check cluster info
kubectl cluster-info
kubectl cluster-info dump > cluster-dump.txt
```

### RBAC Debugging

```bash
# Check if ServiceAccount can perform action
kubectl auth can-i get pods --as=system:serviceaccount:production:web-app-sa -n production

# List permissions for ServiceAccount
kubectl describe sa web-app-sa -n production
kubectl describe role web-app-role -n production
kubectl describe rolebinding web-app-rolebinding -n production

# Check all permissions
kubectl auth can-i --list --as=system:serviceaccount:production:web-app-sa -n production
```

### Performance Debugging

```bash
# Get resource metrics
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes
kubectl get --raw /apis/metrics.k8s.io/v1beta1/pods

# Check pod overhead
kubectl get pod web-app-7d5c8b9f4-xk2pm -n production -o json | jq '.spec.overhead'

# Check priority classes
kubectl get priorityclasses
kubectl describe priorityclass high-priority
```

## Diagnostic Tools

### Network Tools Container

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: netshoot
  namespace: production
spec:
  containers:
  - name: netshoot
    image: nicolaka/netshoot:latest
    command: ["/bin/sleep", "3600"]
  restartPolicy: Never
```

### Database Client Container

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: postgres-client
  namespace: production
spec:
  containers:
  - name: postgres
    image: postgres:15-alpine
    command: ["/bin/sleep", "3600"]
    env:
    - name: PGHOST
      value: postgres-service
    - name: PGUSER
      value: myapp
    - name: PGPASSWORD
      valueFrom:
        secretKeyRef:
          name: postgres-secrets
          key: password
  restartPolicy: Never
```

## Quick Reference

### Pod States
- **Pending**: Waiting to be scheduled
- **ContainerCreating**: Pulling image / creating container
- **Running**: Pod is running
- **Succeeded**: All containers exited successfully
- **Failed**: At least one container failed
- **CrashLoopBackOff**: Container keeps crashing
- **ImagePullBackOff**: Cannot pull image
- **ErrImagePull**: Image pull error
- **Unknown**: Cannot get pod status

### Common Exit Codes
- **0**: Success
- **1**: General error
- **137**: SIGKILL (commonly OOMKilled, i.e. out of memory)
- **139**: SIGSEGV (segmentation fault)
- **143**: SIGTERM (graceful termination)

## Best Practices

1. **Logs**: Always check logs first with `kubectl logs`
2. **Events**: Use `kubectl describe` to see events
3. **Labels**: Use consistent labels for easier debugging
4. **Resources**: Set appropriate requests and limits
5. **Health Checks**: Implement proper liveness and readiness probes
6. **Monitoring**: Set up comprehensive monitoring and alerting
7. **Debug Tools**: Keep debug containers ready
8. **Documentation**: Document common issues and solutions
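
As a companion to the Common Exit Codes list in the Quick Reference, the lookup can be sketched as a small shell helper. This is a minimal sketch, not part of kubectl: the function name `explain_exit_code` and its output strings are hypothetical, and the fallback for codes above 128 assumes the usual shell convention that such a code means "terminated by signal (code - 128)".

```bash
#!/usr/bin/env bash
# Hypothetical helper (not a kubectl feature): map a container exit code
# to the meaning listed under "Common Exit Codes".
explain_exit_code() {
  local code="$1"
  case "$code" in
    0)   echo "success" ;;
    1)   echo "general error" ;;
    137) echo "SIGKILL (often OOMKilled)" ;;
    139) echo "SIGSEGV (segmentation fault)" ;;
    143) echo "SIGTERM (graceful termination)" ;;
    *)
      # Assumption: codes above 128 conventionally encode 128 + signal number.
      # Non-numeric input fails the numeric test and falls through to the else branch.
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "terminated by signal $((code - 128))"
      else
        echo "application-specific error"
      fi
      ;;
  esac
}
```

One way to feed it a real value is to read the last terminated state of the first container, for example `explain_exit_code "$(kubectl get pod web-app-7d5c8b9f4-xk2pm -n production -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')"`. The container index `0` is an assumption; multi-container pods may need a different index.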