troubleshooting/aks/node-issues.md
# Node & Cluster Troubleshooting

## Node NotReady

**Diagnostics:**

```bash
kubectl get nodes -o wide
kubectl describe node <node-name>
# Look for: Conditions, Taints, Events, resource usage, kubelet status
```

**Condition decision tree:**

| Condition | Value | Meaning | Fix Path |
| -------------------- | ------- | --------------------------------- | ------------------------------------------------------------- |
| `Ready` | `False` | kubelet stopped reporting | SSH to node; if unrecoverable, consider cordon/drain/delete\* |
| `MemoryPressure` | `True` | Node running out of memory | Evict pods; scale out pool; reduce pod density |
| `DiskPressure` | `True` | OS disk or ephemeral storage full | Check logs and images; clean up or increase disk |
| `PIDPressure` | `True` | Too many processes | App spawning excessive threads/processes |
| `NetworkUnavailable` | `True` | CNI plugin issue | Check CNI pods in kube-system; node network config |

\*Only after explicit user request for remediation and confirmation of workload impact.

**AKS-specific - SSH to a node:**

> ⚠️ **Warning:** `kubectl debug node/...` creates a privileged debug pod on the node and is not a read-only diagnostic step. Default to read-only evidence gathering first. Only suggest or run this after the user explicitly asks for remediation or approves a privileged diagnostic action and understands the change-control impact.

```bash
# Create a privileged debug pod on the node
kubectl debug node/<node-name> -it --image=mcr.microsoft.com/cbl-mariner/base/core:2.0

# Check kubelet status inside the node
chroot /host systemctl status kubelet
chroot /host journalctl -u kubelet -n 50
```

**Optional remediation if kubelet can't recover (after confirmation):** cordon -> drain -> delete. AKS auto-replaces via node pool VMSS.

> ⚠️ **Warning:** These commands are disruptive. By default, stay in read-only diagnostic mode. Only suggest or run them if the user has explicitly requested remediation and confirmed they understand the workload and PodDisruptionBudget impact.

```bash
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <node-name>
```

---

## Node Pool Not Scaling

### Cluster Autoscaler Not Triggering

**Diagnostics:**

```bash
# Autoscaler logs
kubectl logs -n kube-system -l app=cluster-autoscaler --tail=100

# Autoscaler status
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml

# Verify autoscaler is enabled on the node pool
az aks nodepool show -g <rg> --cluster-name <cluster> -n <nodepool> \
  --query "{autoscaleEnabled:enableAutoScaling, min:minCount, max:maxCount}"
```

**Autoscaler won't scale up - common reasons:**

- Node pool already at `maxCount`
- VM quota exhausted: `az vm list-usage -l <region> -o table | grep -i "DSv3\|quota"`
- Pod `nodeAffinity` is unsatisfiable on any new node template
- 10-minute cooldown period still active after last scale event

**Autoscaler won't scale down - common reasons:**

- Pods with `emptyDir` local storage (configure `--skip-nodes-with-local-storage=false` if safe)
- Standalone pods with no controller (not in a ReplicaSet)
- `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"` annotation on a pod

### Manual Scaling

```bash
az aks nodepool scale -g <rg> --cluster-name <cluster> -n <nodepool> --node-count <n>
```

---

## Resource Pressure & Capacity Planning

**Check actual vs allocatable:**

```bash
kubectl describe node <node> | grep -A6 "Allocated resources:"
```

See [AKS resource reservations](https://learn.microsoft.com/azure/aks/concepts-clusters-workloads#resource-reservations) for allocatable math.

**Ephemeral storage pressure:**

```bash
# Check what's consuming ephemeral storage on a node
kubectl debug node/<node> -it --image=mcr.microsoft.com/cbl-mariner/base/core:2.0
```

Common culprit: high-volume container logs accumulating in `/var/log/containers`.

---

## Detailed Node And Cluster Guides

- [Upgrade Operations](upgrade-operations.md) for node images, Kubernetes version upgrades, surge settings, and PDB-related drain blockers.
- [Spot And Zone Issues](spot-and-zone-issues.md) for spot evictions, tolerations, zone skew, and zonal storage or service behavior.
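As a read-only supplement to the Node NotReady condition decision tree, a minimal sketch of a filter that flags abnormal conditions across all nodes in one pass: `Ready != True`, or any other condition (pressure, network) set to `True`. This assumes `jq` is installed on the operator workstation; the inline JSON is illustrative sample data, and the node name in it is hypothetical. Against a live cluster, pipe `kubectl get nodes -o json` into the same filter.

```shell
# Flag abnormal node conditions: Ready != True, or any other condition == True.
# Live cluster:  kubectl get nodes -o json | jq -r "$FILTER"
FILTER='
  .items[]
  | . as $n
  | .status.conditions[]
  | select((.type == "Ready" and .status != "True")
           or (.type != "Ready" and .status == "True"))
  | "\($n.metadata.name)\t\(.type)=\(.status)\t\(.reason)"'

# Inline sample standing in for the cluster output (hypothetical node)
cat <<'EOF' | jq -r "$FILTER"
{"items":[{"metadata":{"name":"aks-np-1"},
  "status":{"conditions":[
    {"type":"MemoryPressure","status":"False","reason":"KubeletHasSufficientMemory"},
    {"type":"Ready","status":"False","reason":"KubeletNotReady"}]}}]}
EOF
```

Healthy nodes produce no output, so anything printed is a lead worth following up with `kubectl describe node <node-name>`.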
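One of the scale-down blockers listed under "Autoscaler won't scale down" can be surfaced directly: pods carrying the `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"` annotation. A minimal sketch, again assuming `jq` is available; the inline pod list is illustrative sample data with hypothetical pod names. Against a live cluster, pipe `kubectl get pods -A -o json` into the same filter.

```shell
# List pods that pin their node to the cluster via the safe-to-evict annotation.
# Live cluster:  kubectl get pods -A -o json | jq -r "$FILTER"
FILTER='
  .items[]
  | select(.metadata.annotations["cluster-autoscaler.kubernetes.io/safe-to-evict"] == "false")
  | "\(.metadata.namespace)/\(.metadata.name)"'

# Inline sample standing in for the cluster output (hypothetical pods)
cat <<'EOF' | jq -r "$FILTER"
{"items":[
  {"metadata":{"namespace":"default","name":"pinned-pod",
    "annotations":{"cluster-autoscaler.kubernetes.io/safe-to-evict":"false"}}},
  {"metadata":{"namespace":"kube-system","name":"coredns-abc","annotations":{}}}]}
EOF
```

Using a `jq` key lookup rather than a jsonpath expression sidesteps the escaping needed for dots and slashes in annotation keys. The other scale-down blockers (local `emptyDir` storage, uncontrolled pods) still require inspecting pod specs.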