troubleshooting/aks/spot-and-zone-issues.md
# Spot And Zone Issues

Use this guide when workload placement, evictions, or zonal behavior is causing node-pool instability.

## Spot Node Pool Evictions

AKS spot nodes use Azure Spot VMs - they can be evicted with 30 seconds notice when Azure needs capacity.

**Diagnose spot eviction:**

```bash
# Spot nodes carry this taint automatically
kubectl describe node <node> | grep "Taint"
# kubernetes.azure.com/scalesetpriority=spot:NoSchedule

# Check eviction events
kubectl get events -A --field-selector reason=SpotEviction
kubectl get events -A | grep -i "evict\|spot\|preempt"
```

**Spot workload pattern:** pods must tolerate the spot taint. Prefer PDBs and avoid stateful PVC workloads on spot.

```yaml
tolerations:
  - key: "kubernetes.azure.com/scalesetpriority"
    operator: Equal
    value: spot
    effect: NoSchedule
```

Add this preferred node affinity when you want the workload to bias toward spot nodes:

```yaml
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
            - key: kubernetes.azure.com/scalesetpriority
              operator: In
              values: ["spot"]
```

---

## Multi-AZ Node Pool & Zone-Related Failures

**Check zone distribution:**

```bash
kubectl get nodes -L topology.kubernetes.io/zone
```

**Zone-related failure patterns:**

| Symptom                                          | Cause                                                | Fix                                                          |
| ------------------------------------------------ | ---------------------------------------------------- | ------------------------------------------------------------ |
| Pods stack on one zone after node failures       | Scheduling imbalance after zone failure              | `kubectl rollout restart deployment/<n>` to rebalance        |
| PVC pending with `volume node affinity conflict` | Azure Disk is zonal; pod scheduled in different zone | Use ZRS storage class or ensure PVC and pod are in same zone |
| Service endpoints unreachable from one zone      | Topology-aware routing misconfigured                 | Check `service.spec.trafficDistribution` or TopologyKeys     |
| Upgrade causing zone imbalance                   | Surge nodes in one zone                              | Configure `maxSurge` in node pool upgrade settings           |

Use `Premium_ZRS` or `StandardSSD_ZRS` in custom StorageClasses to reduce zonal PVC conflicts. See [AKS storage best practices](https://learn.microsoft.com/azure/aks/operator-best-practices-storage).
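
As an illustration of the ZRS recommendation, a custom StorageClass might look like the sketch below. It assumes the cluster uses the Azure Disk CSI driver (`disk.csi.azure.com`). The class name `managed-csi-zrs` is a hypothetical placeholder, and the reclaim and expansion settings are illustrative defaults, not values taken from this guide.

```yaml
# Minimal sketch of a zone-redundant StorageClass (assumes the Azure Disk CSI driver).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-csi-zrs          # hypothetical name; choose your own
provisioner: disk.csi.azure.com
parameters:
  skuName: StandardSSD_ZRS       # or Premium_ZRS for premium SSD
reclaimPolicy: Delete            # illustrative; use Retain if data must outlive the PVC
volumeBindingMode: WaitForFirstConsumer  # delay provisioning until the pod is scheduled
allowVolumeExpansion: true
```

`WaitForFirstConsumer` defers disk creation until a pod using the PVC is scheduled, which also helps zonal (non-ZRS) disks land in the same zone as the pod. Check that ZRS disks are available in your region before relying on this class.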