Diagnose Azure service issues, query logs, and troubleshoot failures using GitHub Copilot for Azure
troubleshooting/aks/node-issues.md
# Node & Cluster Troubleshooting

## Node NotReady

**Diagnostics:**

```bash
kubectl get nodes -o wide
kubectl describe node <node-name>
# Look for: Conditions, Taints, Events, resource usage, kubelet status
```

**Condition decision tree:**

| Condition | Value | Meaning | Fix Path |
| -------------------- | ------- | --------------------------------- | ------------------------------------------------------------------- |
| `Ready` | `False` | kubelet stopped reporting | SSH to node; if unrecoverable, consider cordon/drain/delete\* |
| `MemoryPressure` | `True` | Node running out of memory | Evict pods; scale out pool; reduce pod density |
| `DiskPressure` | `True` | OS disk or ephemeral storage full | Check logs and images; clean up or increase disk |
| `PIDPressure` | `True` | Too many processes | App spawning excessive threads/processes; use IG `snapshot_process` |
| `NetworkUnavailable` | `True` | CNI plugin issue | Check CNI pods in kube-system; node network config |

\*Only after an explicit user request for remediation and confirmation of workload impact.

**AKS-specific - SSH to a node:**

> ⚠️ **Warning:** `kubectl debug node/...` creates a privileged debug pod on the node and is not a read-only diagnostic step. Default to read-only evidence gathering first. Only suggest or run this after the user explicitly asks for remediation or approves a privileged diagnostic action and understands the change-control impact.

```bash
# Create a privileged debug pod on the node
kubectl debug node/<node-name> -it --image=mcr.microsoft.com/cbl-mariner/base/core:2.0

# Check kubelet status inside the node
chroot /host systemctl status kubelet
chroot /host journalctl -u kubelet -n 50
```

**Optional remediation if kubelet can't recover (after confirmation):** cordon -> drain -> delete. AKS auto-replaces the node via the node pool's VMSS.

> ⚠️ **Warning:** These commands are disruptive. By default, stay in read-only diagnostic mode.
Only suggest or run them if the user has explicitly requested remediation and confirmed they understand the workload and PodDisruptionBudget impact.

```bash
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <node-name>
```

---

## Node Pool Not Scaling

### Cluster Autoscaler Not Triggering

**Diagnostics:**

```bash
# Autoscaler logs
kubectl logs -n kube-system -l app=cluster-autoscaler --tail=100

# Autoscaler status
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml

# Verify autoscaler is enabled on the node pool
az aks nodepool show -g <rg> --cluster-name <cluster> -n <nodepool> \
  --query "{autoscaleEnabled:enableAutoScaling, min:minCount, max:maxCount}"
```

**Autoscaler won't scale up - common reasons:**

- Node pool already at `maxCount`
- VM quota exhausted: `az vm list-usage -l <region> -o table | grep -i "DSv3\|quota"`
- Pod `nodeAffinity` is unsatisfiable on any new node template
- 10-minute cooldown period still active after the last scale event

**Autoscaler won't scale down - common reasons:**

- Pods with `emptyDir` local storage (configure `--skip-nodes-with-local-storage=false` if safe)
- Standalone pods with no controller (not managed by a ReplicaSet)
- `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"` annotation on a pod

### Manual Scaling

```bash
az aks nodepool scale -g <rg> --cluster-name <cluster> -n <nodepool> --node-count <n>
```

---

## Resource Pressure & Capacity Planning

**Check actual vs allocatable:**

```bash
kubectl describe node <node> | grep -A6 "Allocated resources:"
```

See [AKS resource reservations](https://learn.microsoft.com/azure/aks/concepts-clusters-workloads#resource-reservations) for the allocatable math.

**Ephemeral storage pressure:**

```bash
# Check what's consuming ephemeral storage on a node
kubectl debug node/<node> -it --image=mcr.microsoft.com/cbl-mariner/base/core:2.0
```

Common culprit: high-volume container logs accumulating in `/var/log/containers`.

**Deep diagnostics with Inspektor Gadget** (PID pressure or unknown process load):

Use `snapshot_process` (timeout 5) to list all processes on the node. For node-wide scope, omit pod filters. See [references/inspektor-gadget.md](references/inspektor-gadget.md).

---

## Detailed Node And Cluster Guides

- [Upgrade Operations](upgrade-operations.md) for node images, Kubernetes version upgrades, surge settings, and PDB-related drain blockers.
- [Spot And Zone Issues](spot-and-zone-issues.md) for spot evictions, tolerations, zone skew, and zonal storage or service behavior.
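The NotReady decision tree earlier in this guide can be sketched as a small read-only triage helper. This is a minimal sketch, not part of the skill: the `triage_condition` function name is a hypothetical, and the final call uses a canned condition in place of live output from `kubectl get node -o jsonpath=...`.

```shell
#!/bin/sh
# Sketch: map a node condition that is reporting True (as extracted with e.g.
#   kubectl get node <node> -o jsonpath='{range .status.conditions[?(@.status=="True")]}{.type}{"\n"}{end}'
# ) to the fix path from the condition decision tree. Read-only: it only
# prints guidance and never mutates the cluster.
# triage_condition is a hypothetical helper name used for illustration.
triage_condition() {
  case "$1" in
    MemoryPressure)     echo "Evict pods; scale out pool; reduce pod density" ;;
    DiskPressure)       echo "Check logs and images; clean up or increase disk" ;;
    PIDPressure)        echo "Excessive threads/processes; use IG snapshot_process" ;;
    NetworkUnavailable) echo "Check CNI pods in kube-system; node network config" ;;
    *)                  echo "Unmapped condition: $1 - inspect kubectl describe node" ;;
  esac
}

# Example with a canned condition instead of live cluster output:
triage_condition DiskPressure
```

Keeping the mapping in a pure function like this makes it trivial to run against either live `jsonpath` output or captured diagnostics, without granting the helper any cluster credentials.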