Azure Diagnostics

Diagnose Azure service issues, query logs, and troubleshoot failures using GitHub Copilot for Azure
Publisher: microsoft (official)
troubleshooting/aks/pod-failures.md

# Pod Failures & Application Issues

## Common Pod Diagnostic Commands

```bash
# List unhealthy pods across all namespaces
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
# All pods wide view
kubectl get pods -A -o wide
# Detailed pod status - events section is critical
kubectl describe pod <pod-name> -n <namespace>
# Pod logs (current and previous crash)
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
```

---

## CrashLoopBackOff

The container starts, crashes, and is restarted by the kubelet with an exponential backoff delay (10s, 20s, 40s, ... capped at 5m).

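The backoff sequence above can be reproduced with a little shell arithmetic, which helps when estimating how long a pod will sit idle before its next restart attempt. This is only a sketch: `backoff_delay` is a hypothetical helper (not a kubectl feature), and it assumes the 10s base, doubling, and 5m cap stated above.

```bash
# Hypothetical helper: approximate delay (seconds) before the Nth restart.
# Assumes a 10s base that doubles per restart and is capped at 300s (5m).
backoff_delay() {
  local restart=$1 delay=10 i=1
  while [ "$i" -lt "$restart" ] && [ "$delay" -lt 300 ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  # Apply the 5m cap
  if [ "$delay" -gt 300 ]; then
    delay=300
  fi
  echo "$delay"
}

# Prints delays of 10, 20, 40, 80, 160, 300, 300 seconds
for n in 1 2 3 4 5 6 7; do
  echo "restart $n: wait $(backoff_delay "$n")s"
done
```

Under these assumptions, a pod with six or more restarts is likely waiting the full five minutes between attempts, so a fix may take that long to show up.
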
**Diagnostics:**

```bash
kubectl describe pod <pod-name> -n <namespace>
# Check: Exit Code, Reason, Last State, Events

kubectl logs <pod-name> -n <namespace> --previous
# Shows stdout/stderr from the last crashed container
```

**Decision tree:**

| Exit Code | Meaning                                               | Fix Path                                                      |
| --------- | ----------------------------------------------------- | ------------------------------------------------------------- |
| `0`       | App exited successfully (unexpected for long-running) | Check if entrypoint/command is correct; app may be a one-shot |
| `1`       | Application error                                     | Read logs - unhandled exception, missing config, bad startup  |
| `137`     | SIGKILL, usually OOMKilled                            | Increase `resources.limits.memory`; check for memory leaks    |
| `139`     | Segfault (SIGSEGV)                                    | Binary compatibility issue or native code bug                 |
| `143`     | SIGTERM - graceful shutdown                           | Pod was terminated; check if liveness probe killed it         |

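The three-digit codes in the table follow the usual shell convention that an exit code above 128 means "terminated by signal (code - 128)", so 137 is signal 9 and 143 is signal 15. A minimal sketch of that decoding follows; `explain_exit_code` is a hypothetical helper name, and the wording of its messages is illustrative.

```bash
# Hypothetical helper: decode a container exit code.
# Assumes the common convention: code > 128 means killed by signal (code - 128).
explain_exit_code() {
  local code=$1 signum
  if [ "$code" -eq 0 ]; then
    echo "exited cleanly"
  elif [ "$code" -gt 128 ]; then
    signum=$((code - 128))
    # kill -l <number> prints the signal name for that number (9 -> KILL)
    echo "killed by SIG$(kill -l "$signum") (signal $signum)"
  else
    echo "application error (exit code $code)"
  fi
}

explain_exit_code 137   # killed by SIGKILL (signal 9)
explain_exit_code 143   # killed by SIGTERM (signal 15)
```

A SIGKILL does not prove an out-of-memory kill on its own, so it is still worth confirming `Reason: OOMKilled` in the `kubectl describe pod` output.
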
**OOMKilled specifically:**

```bash
kubectl describe pod <pod-name> -n <namespace> | grep -A2 "Last State"
# Reason: OOMKilled -> container exceeded memory limit
```

Fix: increase `resources.limits.memory` or optimize application memory usage. Check `kubectl top pod <pod-name> -n <namespace>` for actual usage.

---

## ImagePullBackOff

The kubelet can't pull the container image, so the container never starts.

**Diagnostics:**

```bash
kubectl describe pod <pod-name> -n <namespace>
# Events section shows the exact pull error
```

| Error Message                           | Cause                        | Fix                                                            |
| --------------------------------------- | ---------------------------- | -------------------------------------------------------------- |
| `ErrImagePull` / `ImagePullBackOff`     | Image name or tag is wrong   | Verify image name and tag exist in the registry                |
| `unauthorized: authentication required` | Missing or wrong pull secret | Create/update `imagePullSecrets` on the pod or service account |
| `manifest unknown`                      | Tag doesn't exist            | Check available tags in the registry                           |
| `context deadline exceeded`             | Registry unreachable         | Check network/firewall; for ACR, verify AKS -> ACR integration |

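When triaging many pods, the table above can be turned into a small lookup. The sketch below matches an event message against the substrings listed in the table; `classify_pull_error` is a hypothetical helper, and the sample message passed to it is made up for illustration rather than captured from a real cluster.

```bash
# Hypothetical helper: map a pull-error event message to the likely fix path.
# The patterns mirror the table above; anything else falls through to the generic case.
classify_pull_error() {
  case "$1" in
    *"unauthorized"*)              echo "auth: check imagePullSecrets" ;;
    *"manifest unknown"*)          echo "tag: check available tags" ;;
    *"context deadline exceeded"*) echo "network: check registry reachability" ;;
    *)                             echo "generic: verify image name and tag" ;;
  esac
}

# Illustrative message only; real text comes from the Events section
classify_pull_error 'Failed to pull image: manifest unknown'
# tag: check available tags
```
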
**ACR integration check:**

```bash
# Validate that the ACR is reachable and pullable from the AKS cluster
az aks check-acr -g <rg> -n <cluster> --acr <acr-name>.azurecr.io
```

---

## Pending Pods

The pod stays in `Pending` because the scheduler can't place it on any node.

**Diagnostics:**

```bash
kubectl describe pod <pod-name> -n <namespace>
# Events section shows why scheduling failed
```

| Event Message                                                          | Cause                               | Fix                                                             |
| ---------------------------------------------------------------------- | ----------------------------------- | --------------------------------------------------------------- |
| `Insufficient cpu` / `Insufficient memory`                             | No node has enough resources        | Scale node pool; reduce resource requests; check for overcommit |
| `node(s) had taint ... that the pod didn't tolerate`                   | Taint/toleration mismatch           | Add matching toleration or use a different node pool            |
| `node(s) didn't match Pod's node affinity/selector`                    | Affinity rule unsatisfiable         | Check `nodeSelector` or `nodeAffinity` rules                    |
| `persistentvolumeclaim ... not found` / `unbound`                      | PVC not ready                       | Check PVC status; verify storage class exists                   |
| `0/N nodes are available: N node(s) had volume node affinity conflict` | Zonal disk vs pod in different zone | Use ZRS storage class or ensure same zone                       |

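For the `Insufficient cpu` case, the scheduler's check boils down to comparing the pod's request with what is left on each node (allocatable minus already-requested CPU, both visible in `kubectl describe node <node-name>`). The sketch below shows that arithmetic; `cpu_to_millicores` and `cpu_fits` are hypothetical helpers, the numbers are illustrative, and fractional values such as `0.5` are deliberately not handled.

```bash
# Hypothetical helper: convert a CPU quantity ("500m" or whole cores like "2") to millicores.
cpu_to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  echo $(( $1 * 1000 )) ;;
  esac
}

# Hypothetical helper: succeeds if the request fits in (allocatable - already requested).
# $1 = pod CPU request, $2 = node allocatable CPU, $3 = CPU already requested on the node
cpu_fits() {
  local request allocatable used
  request=$(cpu_to_millicores "$1")
  allocatable=$(cpu_to_millicores "$2")
  used=$(cpu_to_millicores "$3")
  [ $((allocatable - used)) -ge "$request" ]
}

# Illustrative values: 1900m allocatable, 1500m already requested, pod wants 500m
if cpu_fits 500m 1900m 1500m; then
  echo "fits"
else
  echo "Insufficient cpu on this node"
fi
```

Note that the comparison uses requests, not live usage, which is why a node can look idle in `kubectl top` and still reject a pod.
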
---

## Readiness & Liveness Probe Failures

**Readiness probe failure** -> pod removed from Service endpoints (no traffic). **Liveness probe failure** -> container killed and restarted.

**Diagnostics:**

```bash
kubectl describe pod <pod-name> -n <namespace>
# Look for: "Readiness probe failed" or "Liveness probe failed" in Events

# Check the pod's READY column - must show n/n
kubectl get pod <pod-name> -n <namespace>
```

| Symptom                              | Cause                   | Fix                                                        |
| ------------------------------------ | ----------------------- | ---------------------------------------------------------- |
| READY shows `0/1` but pod is Running | Readiness probe failing | Check probe path, port, and app health endpoint            |
| Pod restarts repeatedly              | Liveness probe failing  | Increase `initialDelaySeconds`; check if app starts slowly |
| Probe timeout errors                 | App responds too slowly | Increase `timeoutSeconds`; check app performance           |

> 💡 **Tip:** Set `initialDelaySeconds` on liveness probes to be longer than your app's startup time. A common mistake is killing pods before they finish initializing.

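To judge whether a liveness probe leaves enough room for startup, it can help to estimate when the last tolerated probe fires. The sketch below is a rough approximation that ignores `timeoutSeconds` and kubelet jitter: it assumes the first probe runs at `initialDelaySeconds` and each subsequent one `periodSeconds` later, so the app must be healthy by the time of probe number `failureThreshold`. `liveness_startup_budget` is a hypothetical helper.

```bash
# Hypothetical helper: approximate seconds an app has to become healthy
# before the liveness probe triggers a restart.
# $1 = initialDelaySeconds, $2 = periodSeconds, $3 = failureThreshold
liveness_startup_budget() {
  echo $(( $1 + ($3 - 1) * $2 ))
}

# Kubernetes defaults (initialDelaySeconds=0, periodSeconds=10, failureThreshold=3)
liveness_startup_budget 0 10 3    # 20
# Same probe with initialDelaySeconds raised to 30
liveness_startup_budget 30 10 3   # 50
```

If the estimate is shorter than the observed startup time in the pod logs, the restart loop is probably caused by the probe settings rather than by the application.
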
---

## Resource Constraints (CPU/Memory)

**Check actual usage vs limits:**

```bash
kubectl top pod <pod-name> -n <namespace>
kubectl top pod -n <namespace> --sort-by=memory

# Compare with requests/limits
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}'
```

| Symptom                          | Cause                                   | Fix                                                 |
| -------------------------------- | --------------------------------------- | --------------------------------------------------- |
| OOMKilled (exit code 137)        | Container exceeded memory limit         | Increase `limits.memory` or fix memory leak         |
| CPU throttling (slow responses)  | Container hitting CPU limit             | Increase `limits.cpu` or remove CPU limits          |
| Pending - insufficient resources | Requests exceed available node capacity | Lower requests, scale nodes, or use larger VM sizes |

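Comparing the `kubectl top` figure with the configured limit is easier as a percentage. The following sketch does that for memory values expressed in `Mi` or `Gi`; `mem_to_mib` and `mem_usage_pct` are hypothetical helpers, other unit suffixes are intentionally rejected, and the sample numbers are illustrative.

```bash
# Hypothetical helper: convert a memory quantity in Mi or Gi to MiB.
mem_to_mib() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *)   echo "unsupported unit: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: usage as an integer percentage of the limit (rounded down).
# $1 = current usage (e.g. from kubectl top), $2 = limits.memory
mem_usage_pct() {
  local used limit
  used=$(mem_to_mib "$1") || return 1
  limit=$(mem_to_mib "$2") || return 1
  echo $(( used * 100 / limit ))
}

mem_usage_pct 450Mi 512Mi   # 87
mem_usage_pct 1536Mi 2Gi    # 75
```

A container that sits close to 100% under normal load is a likely OOMKilled candidate on the next traffic spike.
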
> ⚠️ **Warning:** Setting CPU limits can cause unnecessary throttling even when the node has spare capacity. Many teams set CPU requests but not limits. Memory limits should always be set.