Debug and troubleshoot Azure Container Apps and Function Apps using logs, KQL, and health checks.
troubleshooting/aks/spot-and-zone-issues.md
# Spot And Zone Issues

Use this guide when workload placement, evictions, or zonal behavior is causing node-pool instability.

## Spot Node Pool Evictions

AKS spot nodes run on Azure Spot VMs, which can be evicted with 30 seconds' notice when Azure needs the capacity back.

**Diagnose spot eviction:**

```bash
# Spot nodes carry this taint automatically
kubectl describe node <node> | grep "Taint"
# kubernetes.azure.com/scalesetpriority=spot:NoSchedule

# Check eviction events
kubectl get events -A --field-selector reason=SpotEviction
kubectl get events -A | grep -i "evict\|spot\|preempt"
```

**Spot workload pattern:** pods must tolerate the spot taint to be scheduled on spot nodes. Prefer PodDisruptionBudgets (PDBs), and avoid running stateful PVC-backed workloads on spot.

```yaml
tolerations:
  - key: "kubernetes.azure.com/scalesetpriority"
    operator: Equal
    value: spot
    effect: NoSchedule
```

Add this preferred node affinity when you want the workload to bias toward spot nodes:

```yaml
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
            - key: kubernetes.azure.com/scalesetpriority
              operator: In
              values: ["spot"]
```

---

## Multi-AZ Node Pool & Zone-Related Failures

**Check zone distribution:**

```bash
kubectl get nodes -L topology.kubernetes.io/zone
```

**Zone-related failure patterns:**

| Symptom | Cause | Fix |
| ------------------------------------------------ | ---------------------------------------------------- | ------------------------------------------------------------ |
| Pods stack on one zone after node failures | Scheduling imbalance after zone failure | `kubectl rollout restart deployment/<n>` to rebalance |
| PVC pending with `volume node affinity conflict` | Azure Disk is zonal; pod scheduled in different zone | Use ZRS storage class or ensure PVC and pod are in same zone |
| Service endpoints unreachable from one zone | Topology-aware routing misconfigured | Check `service.spec.trafficDistribution` or TopologyKeys |
| Upgrade causing zone imbalance | Surge nodes in one zone | Configure `maxSurge` in node pool upgrade settings |

Use `Premium_ZRS` or `StandardSSD_ZRS` in custom StorageClasses to reduce zonal PVC conflicts. See [AKS storage best practices](https://learn.microsoft.com/azure/aks/operator-best-practices-storage).
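
As an illustration of the ZRS recommendation, a custom StorageClass might look like the sketch below. It assumes the cluster uses the Azure Disk CSI driver (`disk.csi.azure.com`); the class name `managed-premium-zrs` and the reclaim and expansion settings are placeholders chosen for this example, not values taken from this guide.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium-zrs        # hypothetical name; pick your own
provisioner: disk.csi.azure.com    # assumes the Azure Disk CSI driver
parameters:
  skuName: Premium_ZRS             # or StandardSSD_ZRS
reclaimPolicy: Delete              # assumption; use Retain to keep disks after PVC deletion
volumeBindingMode: WaitForFirstConsumer  # delay provisioning until the pod is scheduled
allowVolumeExpansion: true
```

`WaitForFirstConsumer` delays disk creation until a pod using the claim has been scheduled. That also helps with zonal (non-ZRS) disks, because the disk is then created in the zone the pod was placed in.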