troubleshooting/aks/pod-failures.md
# Pod Failures & Application Issues

## Common Pod Diagnostic Commands

```bash
# List unhealthy pods across all namespaces
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
# All pods wide view
kubectl get pods -A -o wide
# Detailed pod status - events section is critical
kubectl describe pod <pod-name> -n <namespace>
# Pod logs (current and previous crash)
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
```

---

## CrashLoopBackOff

Pod starts, crashes, restarts with exponential backoff (10s, 20s, 40s... up to 5m).

**Diagnostics:**

```bash
kubectl describe pod <pod-name> -n <namespace>
# Check: Exit Code, Reason, Last State, Events

kubectl logs <pod-name> -n <namespace> --previous
# Shows stdout/stderr from the last crashed container
```

**Decision tree:**

| Exit Code | Meaning | Fix Path |
| --------- | ------- | -------- |
| `0` | App exited successfully (unexpected for long-running) | Check if entrypoint/command is correct; app may be a one-shot |
| `1` | Application error | Read logs - unhandled exception, missing config, bad startup |
| `137` | OOMKilled (SIGKILL) | Increase `resources.limits.memory`; check for memory leaks |
| `139` | Segfault (SIGSEGV) | Binary compatibility issue or native code bug |
| `143` | SIGTERM - graceful shutdown | Pod was terminated; check if liveness probe killed it |

**OOMKilled specifically:**

```bash
kubectl describe pod <pod-name> -n <namespace> | grep -A2 "Last State"
# Reason: OOMKilled -> container exceeded memory limit
```

Fix: increase `resources.limits.memory` or optimize application memory usage. Check `kubectl top pod <pod-name> -n <namespace>` for actual usage.

**OOM kill tracing with Inspektor Gadget:** Use `trace_oomkill` (timeout 30) with `--k8s-namespace <namespace> --k8s-podname <pod-name>` to see which process was killed and memory at kill time. See [references/inspektor-gadget.md](references/inspektor-gadget.md).

**Deep diagnostics with Inspektor Gadget** (when logs and describe are inconclusive):

Use the [IG base command pattern](references/inspektor-gadget.md) with `--k8s-namespace <namespace> --k8s-podname <pod-name>` and these gadgets:

- `trace_exec` (timeout 30) - see what the container executes at startup
- `trace_open` (timeout 30) - find missing configs/secrets (retval -2 = ENOENT, -13 = EACCES)
- `snapshot_process` (timeout 5) - list running processes in the pod

See [references/inspektor-gadget.md](references/inspektor-gadget.md).

---

## ImagePullBackOff

Pod can't pull the container image.

**Diagnostics:**

```bash
kubectl describe pod <pod-name> -n <namespace>
# Events section shows the exact pull error
```

| Error Message | Cause | Fix |
| ------------- | ----- | --- |
| `ErrImagePull` / `ImagePullBackOff` | Image name or tag is wrong | Verify image name and tag exist in the registry |
| `unauthorized: authentication required` | Missing or wrong pull secret | Create/update `imagePullSecrets` on the pod or service account |
| `manifest unknown` | Tag doesn't exist | Check available tags in the registry |
| `context deadline exceeded` | Registry unreachable | Check network/firewall; for ACR, verify AKS -> ACR integration |

**ACR integration check:**

```bash
# Verify AKS is attached to ACR
az aks check-acr -g <rg> -n <cluster> --acr <acr-name>.azurecr.io
```

---

## Pending Pods

Pod stays in `Pending` - scheduler can't place it.

**Diagnostics:**

```bash
kubectl describe pod <pod-name> -n <namespace>
# Events section shows why scheduling failed
```

| Event Message | Cause | Fix |
| ------------- | ----- | --- |
| `Insufficient cpu` / `Insufficient memory` | No node has enough resources | Scale node pool; reduce resource requests; check for overcommit |
| `node(s) had taint ... that the pod didn't tolerate` | Taint/toleration mismatch | Add matching toleration or use a different node pool |
| `node(s) didn't match Pod's node affinity/selector` | Affinity rule unsatisfiable | Check `nodeSelector` or `nodeAffinity` rules |
| `persistentvolumeclaim ... not found` / `unbound` | PVC not ready | Check PVC status; verify storage class exists |
| `0/N nodes are available: N node(s) had volume node affinity conflict` | Zonal disk vs pod in different zone | Use ZRS storage class or ensure same zone |

---

## Readiness & Liveness Probe Failures

**Readiness probe failure** -> pod removed from Service endpoints (no traffic). **Liveness probe failure** -> pod killed and restarted.

**Diagnostics:**

```bash
kubectl describe pod <pod-name> -n <namespace>
# Look for: "Readiness probe failed" or "Liveness probe failed" in Events

# Check the pod's READY column - must show n/n
kubectl get pod <pod-name> -n <namespace>
```

| Symptom | Cause | Fix |
| ------- | ----- | --- |
| READY shows `0/1` but pod is Running | Readiness probe failing | Check probe path, port, and app health endpoint |
| Pod restarts repeatedly | Liveness probe failing | Increase `initialDelaySeconds`; check if app starts slowly |
| Probe timeout errors | App responds too slowly | Increase `timeoutSeconds`; check app performance |

> 💡 **Tip:** Set `initialDelaySeconds` on liveness probes to be longer than your app's startup time. A common mistake is killing pods before they finish initializing.

---

## Resource Constraints (CPU/Memory)

**Check actual usage vs limits:**

```bash
kubectl top pod <pod-name> -n <namespace>
kubectl top pod -n <namespace> --sort-by=memory

# Compare with requests/limits
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}'
```

| Symptom | Cause | Fix |
| ------- | ----- | --- |
| OOMKilled (exit code 137) | Container exceeded memory limit | Increase `limits.memory` or fix memory leak |
| CPU throttling (slow responses) | Container hitting CPU limit | Increase `limits.cpu` or remove CPU limits |
| Pending - insufficient resources | Requests exceed available node capacity | Lower requests, scale nodes, or use larger VM sizes |

> ⚠️ **Warning:** Setting CPU limits can cause unnecessary throttling even when the node has spare capacity. Many teams set CPU requests but not limits. Memory limits should always be set.
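
The exit-code decision tree in the CrashLoopBackOff section can be sketched as a small shell helper. This is only an illustration: `explain_exit_code` and `last_exit_code` are hypothetical names, not part of kubectl or any AKS tooling, and the helper assumes the first container in the pod is the one that crashed. For codes above 128 that are not in the table, it relies on the common convention that the exit code is 128 plus the signal number.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, not part of kubectl or AKS tooling.

# Map a container exit code to the meaning and fix path from the decision tree.
explain_exit_code() {
  local code="$1"
  case "$code" in
    0)   echo "exit 0: app exited successfully; check entrypoint/command (may be a one-shot)" ;;
    1)   echo "exit 1: application error; read logs for exceptions or missing config" ;;
    137) echo "exit 137: SIGKILL, typically OOMKilled; raise resources.limits.memory or fix leaks" ;;
    139) echo "exit 139: SIGSEGV; binary compatibility issue or native code bug" ;;
    143) echo "exit 143: SIGTERM; pod was terminated, check whether a liveness probe killed it" ;;
    *)
      # Assumption: codes above 128 follow the 128 + signal-number convention.
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "exit $code: killed by signal $((code - 128)); check describe output and logs"
      else
        echo "exit $code: not in the table; check describe output and logs"
      fi
      ;;
  esac
}

# Read the exit code from the last terminated state of the pod's first container.
# Prints nothing if that container has no previous termination recorded.
last_exit_code() {
  local pod="$1" namespace="$2"
  kubectl get pod "$pod" -n "$namespace" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
}
```

With the placeholders filled in, `explain_exit_code "$(last_exit_code <pod-name> <namespace>)"` prints the matching row for the most recent crash. Treat the result as a starting point and confirm it against the `Reason` and `Events` fields in `kubectl describe pod`.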