Debug and troubleshoot Azure Container Apps and Function Apps using logs, KQL, and health checks.
troubleshooting/aks/pod-failures.md
# Pod Failures & Application Issues

## Common Pod Diagnostic Commands

```bash
# List unhealthy pods across all namespaces
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

# All pods, wide view
kubectl get pods -A -o wide

# Detailed pod status - the Events section is critical
kubectl describe pod <pod-name> -n <namespace>

# Pod logs (current and previous crash)
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
```

---

## CrashLoopBackOff

Pod starts, crashes, and restarts with exponential backoff (10s, 20s, 40s... up to 5m).

**Diagnostics:**

```bash
kubectl describe pod <pod-name> -n <namespace>
# Check: Exit Code, Reason, Last State, Events

kubectl logs <pod-name> -n <namespace> --previous
# Shows stdout/stderr from the last crashed container
```

**Decision tree:**

| Exit Code | Meaning | Fix Path |
| --------- | ------- | -------- |
| `0` | App exited successfully (unexpected for long-running) | Check if entrypoint/command is correct; app may be a one-shot |
| `1` | Application error | Read logs - unhandled exception, missing config, bad startup |
| `137` | OOMKilled (SIGKILL) | Increase `resources.limits.memory`; check for memory leaks |
| `139` | Segfault (SIGSEGV) | Binary compatibility issue or native code bug |
| `143` | SIGTERM - graceful shutdown | Pod was terminated; check if liveness probe killed it |

**OOMKilled specifically:**

```bash
kubectl describe pod <pod-name> -n <namespace> | grep -A2 "Last State"
# Reason: OOMKilled -> container exceeded memory limit
```

Fix: increase `resources.limits.memory` or optimize application memory usage.
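For quick triage, the exit-code decision tree above can be wrapped in a small helper. This is a sketch, and `diagnose_exit` is a hypothetical name (not a kubectl feature):

```shell
# Hypothetical helper mirroring the exit-code table above.
# Usage: diagnose_exit <exit-code-from-kubectl-describe>
diagnose_exit() {
  case "$1" in
    0)   echo "Clean exit - verify entrypoint/command; app may be a one-shot job" ;;
    1)   echo "Application error - read logs for exceptions or missing config" ;;
    137) echo "OOMKilled (SIGKILL) - raise resources.limits.memory or fix a leak" ;;
    139) echo "Segfault (SIGSEGV) - binary compatibility issue or native code bug" ;;
    143) echo "SIGTERM - graceful shutdown; check if a liveness probe killed it" ;;
    *)   echo "Exit code $1 - see 'kubectl describe pod' Events for details" ;;
  esac
}
```

For example, `diagnose_exit 137` prints the OOMKilled hint. As a rule of thumb, codes above 128 mean the container died from signal `code - 128` (137 = 128 + 9 = SIGKILL).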
Check `kubectl top pod <pod-name> -n <namespace>` for actual usage.

---

## ImagePullBackOff

Pod can't pull the container image.

**Diagnostics:**

```bash
kubectl describe pod <pod-name> -n <namespace>
# Events section shows the exact pull error
```

| Error Message | Cause | Fix |
| ------------- | ----- | --- |
| `ErrImagePull` / `ImagePullBackOff` | Image name or tag is wrong | Verify the image name and tag exist in the registry |
| `unauthorized: authentication required` | Missing or wrong pull secret | Create/update `imagePullSecrets` on the pod or service account |
| `manifest unknown` | Tag doesn't exist | Check available tags in the registry |
| `context deadline exceeded` | Registry unreachable | Check network/firewall; for ACR, verify AKS -> ACR integration |

**ACR integration check:**

```bash
# Verify AKS is attached to ACR
az aks check-acr -g <rg> -n <cluster> --acr <acr-name>.azurecr.io
```

---

## Pending Pods

Pod stays in `Pending` - the scheduler can't place it.

**Diagnostics:**

```bash
kubectl describe pod <pod-name> -n <namespace>
# Events section shows why scheduling failed
```

| Event Message | Cause | Fix |
| ------------- | ----- | --- |
| `Insufficient cpu` / `Insufficient memory` | No node has enough resources | Scale the node pool; reduce resource requests; check for overcommit |
| `node(s) had taint ... that the pod didn't tolerate` | Taint/toleration mismatch | Add a matching toleration or use a different node pool |
| `node(s) didn't match Pod's node affinity/selector` | Affinity rule unsatisfiable | Check `nodeSelector` or `nodeAffinity` rules |
| `persistentvolumeclaim ... not found` / `unbound` | PVC not ready | Check PVC status; verify the storage class exists |
| `0/N nodes are available: N node(s) had volume node affinity conflict` | Zonal disk and pod in different zones | Use a ZRS storage class or ensure the same zone |

---

## Readiness & Liveness Probe Failures

**Readiness probe failure** -> pod removed from Service endpoints (no traffic). **Liveness probe failure** -> pod killed and restarted.

**Diagnostics:**

```bash
kubectl describe pod <pod-name> -n <namespace>
# Look for: "Readiness probe failed" or "Liveness probe failed" in Events

# Check the pod's READY column - must show n/n
kubectl get pod <pod-name> -n <namespace>
```

| Symptom | Cause | Fix |
| ------- | ----- | --- |
| READY shows `0/1` but pod is Running | Readiness probe failing | Check the probe path, port, and app health endpoint |
| Pod restarts repeatedly | Liveness probe failing | Increase `initialDelaySeconds`; check whether the app starts slowly |
| Probe timeout errors | App responds too slowly | Increase `timeoutSeconds`; check app performance |

> 💡 **Tip:** Set `initialDelaySeconds` on liveness probes to be longer than your app's startup time. A common mistake is killing pods before they finish initializing.

---

## Resource Constraints (CPU/Memory)

**Check actual usage vs limits:**

```bash
kubectl top pod <pod-name> -n <namespace>
kubectl top pod -n <namespace> --sort-by=memory

# Compare with requests/limits
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}'
```

| Symptom | Cause | Fix |
| ------- | ----- | --- |
| OOMKilled (exit code 137) | Container exceeded the memory limit | Increase `limits.memory` or fix the memory leak |
| CPU throttling (slow responses) | Container hitting the CPU limit | Increase `limits.cpu` or remove CPU limits |
| Pending - insufficient resources | Requests exceed available node capacity | Lower requests, scale nodes, or use larger VM sizes |

> ⚠️ **Warning:** Setting CPU limits can cause unnecessary throttling even when the node has spare capacity. Many teams set CPU requests but not limits. Memory limits should always be set.
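As one concrete way to apply the pattern in the warning above, `kubectl set resources` can set requests on both CPU and memory while enforcing a hard limit only for memory. The values below are illustrative, not recommendations; size them from observed `kubectl top` usage:

```shell
# Requests guide scheduling; the only hard limit set here is memory.
# <name>/<namespace> placeholders follow the convention used elsewhere in this doc.
kubectl set resources deployment <name> -n <namespace> \
  --requests=cpu=250m,memory=256Mi \
  --limits=memory=512Mi
```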
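When comparing `kubectl top` output against configured limits, it can help to normalize Kubernetes binary-suffix quantities (`Ki`, `Mi`, `Gi`) to bytes. A minimal sketch, assuming integer values; `k8s_mem_to_bytes` is a hypothetical helper, not part of kubectl:

```shell
# Convert a Kubernetes memory quantity with a binary suffix (Ki/Mi/Gi) to bytes.
# Plain numbers are treated as already being bytes; other suffixes are rejected.
k8s_mem_to_bytes() {
  local qty=$1 num unit
  num=${qty%[KMGTP]i}   # strip a trailing binary suffix, if any
  unit=${qty#"$num"}    # whatever was stripped: Ki, Mi, Gi, or empty
  case $unit in
    Ki) echo $(( num * 1024 )) ;;
    Mi) echo $(( num * 1024 * 1024 )) ;;
    Gi) echo $(( num * 1024 * 1024 * 1024 )) ;;
    "") echo "$num" ;;
    *)  echo "unsupported suffix: $unit" >&2; return 1 ;;
  esac
}
```

For example, `k8s_mem_to_bytes 512Mi` yields `536870912`, which can then be compared directly against actual usage once both sides are expressed in bytes.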