Diagnose Azure service issues, query logs, and troubleshoot failures using GitHub Copilot for Azure
troubleshooting/aks/pod-failures.md
# Pod Failures & Application Issues

## Common Pod Diagnostic Commands

```bash
# List unhealthy pods across all namespaces
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

# All pods, wide view
kubectl get pods -A -o wide

# Detailed pod status - the Events section is critical
kubectl describe pod <pod-name> -n <namespace>

# Pod logs (current and previous crash)
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
```

---

## CrashLoopBackOff

The pod starts, crashes, and restarts with exponential backoff (10s, 20s, 40s... up to 5m).

**Diagnostics:**

```bash
kubectl describe pod <pod-name> -n <namespace>
# Check: Exit Code, Reason, Last State, Events

kubectl logs <pod-name> -n <namespace> --previous
# Shows stdout/stderr from the last crashed container
```

**Decision tree:**

| Exit Code | Meaning | Fix Path |
| --- | --- | --- |
| `0` | App exited successfully (unexpected for long-running) | Check if entrypoint/command is correct; app may be a one-shot |
| `1` | Application error | Read logs - unhandled exception, missing config, bad startup |
| `137` | OOMKilled (SIGKILL) | Increase `resources.limits.memory`; check for memory leaks |
| `139` | Segfault (SIGSEGV) | Binary compatibility issue or native code bug |
| `143` | SIGTERM - graceful shutdown | Pod was terminated; check if a liveness probe killed it |

**OOMKilled specifically:**

```bash
kubectl describe pod <pod-name> -n <namespace> | grep -A2 "Last State"
# Reason: OOMKilled -> container exceeded its memory limit
```

Fix: increase `resources.limits.memory` or optimize the application's memory usage.
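The memory-limit fix can be sketched as a `resources` stanza in the container spec. The values below are hypothetical starting points, not recommendations; size them from observed usage:

```yaml
# Hypothetical container spec fragment - tune values to your workload
resources:
  requests:
    memory: "256Mi"   # what the scheduler reserves on a node
    cpu: "250m"
  limits:
    memory: "512Mi"   # exceeding this -> OOMKilled (exit code 137)
```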
Check `kubectl top pod <pod-name> -n <namespace>` for actual usage.

---

## ImagePullBackOff

The pod can't pull its container image.

**Diagnostics:**

```bash
kubectl describe pod <pod-name> -n <namespace>
# The Events section shows the exact pull error
```

| Error Message | Cause | Fix |
| --- | --- | --- |
| `ErrImagePull` / `ImagePullBackOff` | Image name or tag is wrong | Verify the image name and tag exist in the registry |
| `unauthorized: authentication required` | Missing or wrong pull secret | Create/update `imagePullSecrets` on the pod or service account |
| `manifest unknown` | Tag doesn't exist | Check available tags in the registry |
| `context deadline exceeded` | Registry unreachable | Check network/firewall; for ACR, verify AKS -> ACR integration |

**ACR integration check:**

```bash
# Verify AKS is attached to ACR
az aks check-acr -g <rg> -n <cluster> --acr <acr-name>.azurecr.io
```

---

## Pending Pods

The pod stays in `Pending` - the scheduler can't place it.

**Diagnostics:**

```bash
kubectl describe pod <pod-name> -n <namespace>
# The Events section shows why scheduling failed
```

| Event Message | Cause | Fix |
| --- | --- | --- |
| `Insufficient cpu` / `Insufficient memory` | No node has enough free resources | Scale the node pool; reduce resource requests; check for overcommit |
| `node(s) had taint ... that the pod didn't tolerate` | Taint/toleration mismatch | Add a matching toleration or use a different node pool |
| `node(s) didn't match Pod's node affinity/selector` | Affinity rule unsatisfiable | Check `nodeSelector` and `nodeAffinity` rules |
| `persistentvolumeclaim ... not found` / `unbound` | PVC not ready | Check PVC status; verify the storage class exists |
| `0/N nodes are available: N node(s) had volume node affinity conflict` | Zonal disk and pod in different zones | Use a ZRS storage class or ensure the pod and disk are in the same zone |

---

## Readiness & Liveness Probe Failures

**Readiness probe failure** -> pod removed from Service endpoints (receives no traffic). **Liveness probe failure** -> pod killed and restarted.

**Diagnostics:**

```bash
kubectl describe pod <pod-name> -n <namespace>
# Look for "Readiness probe failed" or "Liveness probe failed" in Events

# Check the pod's READY column - it must show n/n
kubectl get pod <pod-name> -n <namespace>
```

| Symptom | Cause | Fix |
| --- | --- | --- |
| READY shows `0/1` but pod is Running | Readiness probe failing | Check the probe path, port, and app health endpoint |
| Pod restarts repeatedly | Liveness probe failing | Increase `initialDelaySeconds`; check whether the app starts slowly |
| Probe timeout errors | App responds too slowly | Increase `timeoutSeconds`; check app performance |

> 💡 **Tip:** Set `initialDelaySeconds` on liveness probes to be longer than your app's startup time.
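The probe fixes above can be sketched in the container spec. The endpoint path, port, and timings below are assumptions for illustration; set them from your app's real health endpoint and startup behavior:

```yaml
# Hypothetical liveness probe - /healthz and port 8080 are illustrative
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30   # longer than app startup, so pods aren't killed mid-init
  periodSeconds: 10
  timeoutSeconds: 5         # raise if the endpoint is legitimately slow
  failureThreshold: 3       # restart only after 3 consecutive failures
```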
> A common mistake is killing pods before they finish initializing.

---

## Resource Constraints (CPU/Memory)

**Check actual usage vs limits:**

```bash
kubectl top pod <pod-name> -n <namespace>
kubectl top pod -n <namespace> --sort-by=memory

# Compare with requests/limits
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}'
```

| Symptom | Cause | Fix |
| --- | --- | --- |
| OOMKilled (exit code 137) | Container exceeded its memory limit | Increase `limits.memory` or fix the memory leak |
| CPU throttling (slow responses) | Container hitting its CPU limit | Increase `limits.cpu` or remove CPU limits |
| Pending - insufficient resources | Requests exceed available node capacity | Lower requests, scale out nodes, or use larger VM sizes |

> ⚠️ **Warning:** Setting CPU limits can cause unnecessary throttling even when the node has spare capacity. Many teams set CPU requests but not limits. Memory limits should always be set.
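For scripting, the CrashLoopBackOff exit-code table can be wrapped in a small helper. This is a sketch; `explain_exit_code` is a hypothetical function name, and the mapping simply mirrors the table above:

```bash
#!/usr/bin/env bash
# Map a container exit code to a likely cause, per the decision tree above.
explain_exit_code() {
  case "$1" in
    0)   echo "clean exit - check entrypoint/command; app may be a one-shot" ;;
    1)   echo "application error - read logs" ;;
    137) echo "OOMKilled (SIGKILL) - raise resources.limits.memory" ;;
    139) echo "segfault (SIGSEGV) - binary/native code issue" ;;
    143) echo "SIGTERM - terminated; check liveness probe" ;;
    *)   echo "unknown exit code $1 - see kubectl describe pod" ;;
  esac
}

# Example: explain the last exit code of a pod's first container
# explain_exit_code "$(kubectl get pod <pod-name> -n <namespace> \
#   -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')"
```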