Diagnose Azure service issues, query logs, and troubleshoot failures using GitHub Copilot for Azure
troubleshooting/aks/node-issues.md
# Node & Cluster Troubleshooting

## Node NotReady

**Diagnostics:**

```bash
kubectl get nodes -o wide
kubectl describe node <node-name>
# Look for: Conditions, Taints, Events, resource usage, kubelet status
```

**Condition decision tree:**

| Condition | Value | Meaning | Fix Path |
| -------------------- | ------- | --------------------------------- | ------------------------------------------------------------------- |
| `Ready` | `False` | kubelet stopped reporting | SSH to node; if unrecoverable, consider cordon/drain/delete\* |
| `MemoryPressure` | `True` | Node running out of memory | Evict pods; scale out pool; reduce pod density |
| `DiskPressure` | `True` | OS disk or ephemeral storage full | Check logs and images; clean up or increase disk |
| `PIDPressure` | `True` | Too many processes | App spawning excessive threads/processes; use IG `snapshot_process` |
| `NetworkUnavailable` | `True` | CNI plugin issue | Check CNI pods in kube-system; node network config |

\*Only after an explicit user request for remediation and confirmation of workload impact.

**AKS-specific - SSH to a node:**

> ⚠️ **Warning:** `kubectl debug node/...` creates a privileged debug pod on the node and is not a read-only diagnostic step. Default to read-only evidence gathering first. Only suggest or run this after the user explicitly asks for remediation or approves a privileged diagnostic action and understands the change-control impact.

```bash
# Create a privileged debug pod on the node
kubectl debug node/<node-name> -it --image=mcr.microsoft.com/cbl-mariner/base/core:2.0

# Check kubelet status inside the node
chroot /host systemctl status kubelet
chroot /host journalctl -u kubelet -n 50
```

**Optional remediation if kubelet can't recover (after confirmation):** cordon -> drain -> delete. AKS auto-replaces the node via the node pool's VMSS.

> ⚠️ **Warning:** These commands are disruptive. By default, stay in read-only diagnostic mode.
Only suggest or run them if the user has explicitly requested remediation and confirmed they understand the workload and PodDisruptionBudget impact.

```bash
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <node-name>
```

---

## Node Pool Not Scaling

### Cluster Autoscaler Not Triggering

**Diagnostics:**

```bash
# Autoscaler logs
kubectl logs -n kube-system -l app=cluster-autoscaler --tail=100

# Autoscaler status
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml

# Verify autoscaler is enabled on the node pool
az aks nodepool show -g <rg> --cluster-name <cluster> -n <nodepool> \
  --query "{autoscaleEnabled:enableAutoScaling, min:minCount, max:maxCount}"
```

**Autoscaler won't scale up - common reasons:**

- Node pool already at `maxCount`
- VM quota exhausted: `az vm list-usage -l <region> -o table | grep -i "DSv3\|quota"`
- Pod `nodeAffinity` is unsatisfiable on any new node template
- 10-minute cooldown period still active after the last scale event

**Autoscaler won't scale down - common reasons:**

- Pods with `emptyDir` local storage (configure `--skip-nodes-with-local-storage=false` if safe)
- Standalone pods with no controller (not managed by a ReplicaSet)
- `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"` annotation on a pod

### Manual Scaling

```bash
az aks nodepool scale -g <rg> --cluster-name <cluster> -n <nodepool> --node-count <n>
```

---

## Resource Pressure & Capacity Planning

**Check actual vs allocatable:**

```bash
kubectl describe node <node> | grep -A6 "Allocated resources:"
```

See [AKS resource reservations](https://learn.microsoft.com/azure/aks/concepts-clusters-workloads#resource-reservations) for the allocatable math.

**Ephemeral storage pressure:**

```bash
# Check what's consuming ephemeral storage on a node
kubectl debug node/<node> -it --image=mcr.microsoft.com/cbl-mariner/base/core:2.0
```

Common culprit: high-volume container logs accumulating in `/var/log/containers`.

**Deep diagnostics with Inspektor Gadget** (PID pressure or unknown process load):

Use `snapshot_process` (timeout 5) to list all processes on the node. For node-wide scope, omit pod filters. See [references/inspektor-gadget.md](references/inspektor-gadget.md).

---

## Detailed Node And Cluster Guides

- [Upgrade Operations](upgrade-operations.md) for node images, Kubernetes version upgrades, surge settings, and PDB-related drain blockers.
- [Spot And Zone Issues](spot-and-zone-issues.md) for spot evictions, tolerations, zone skew, and zonal storage or service behavior.
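The NotReady decision tree earlier in this guide can be sketched as a small read-only triage helper. This is a minimal sketch, not part of the skill: the `triage_condition` function name is a hypothetical, and the final call uses a canned condition in place of live output from `kubectl get node -o jsonpath=...`.

```shell
#!/bin/sh
# Sketch: map a node condition that is reporting True (as extracted with e.g.
#   kubectl get node <node> -o jsonpath='{range .status.conditions[?(@.status=="True")]}{.type}{"\n"}{end}'
# ) to the fix path from the condition decision tree. Read-only: it only
# prints guidance and never mutates the cluster.
# triage_condition is a hypothetical helper name used for illustration.
triage_condition() {
  case "$1" in
    MemoryPressure)     echo "Evict pods; scale out pool; reduce pod density" ;;
    DiskPressure)       echo "Check logs and images; clean up or increase disk" ;;
    PIDPressure)        echo "Excessive threads/processes; use IG snapshot_process" ;;
    NetworkUnavailable) echo "Check CNI pods in kube-system; node network config" ;;
    *)                  echo "Unmapped condition: $1 - inspect kubectl describe node" ;;
  esac
}

# Example with a canned condition instead of live cluster output:
triage_condition DiskPressure
```

Keeping the mapping in a pure function like this makes it trivial to run against either live `jsonpath` output or captured diagnostics, without granting the helper any cluster credentials.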