Diagnose Azure service issues, query logs, and troubleshoot failures using GitHub Copilot for Azure
troubleshooting/aks/aks-troubleshooting.md
# AKS Troubleshooting Guide

Primary AKS troubleshooting guide for incidents routed from [../../SKILL.md](../../SKILL.md).

## When to Use This Guide

- lifecycle, access, node, `kube-system`, workload, ingress, DNS, or scaling issues
- `kubectl` cannot connect, nodes are `NotReady`, or pods are unhealthy

## Scenario Playbooks

| Scenario                                                      | Reference                                        |
| ------------------------------------------------------------- | ------------------------------------------------ |
| broad cluster investigation                                   | [general-diagnostics.md](general-diagnostics.md) |
| workload, crash, image pull, readiness, or pending pod issues | [pod-failures.md](pod-failures.md)               |
| node health, scaling, pressure, upgrade, or zone issues       | [node-issues.md](node-issues.md)                 |
| service, ingress, DNS, or network policy issues               | [networking.md](networking.md)                   |

## Tool Selection For Diagnostics

When gathering AKS diagnostic evidence, prefer `mcp_azure_mcp_aks`, then the smallest discovered AKS-MCP tool that fits the read, then supporting Azure tools such as `mcp_azure_mcp_applens`, `mcp_azure_mcp_monitor`, or `mcp_azure_mcp_resourcehealth`. Use raw `az aks` and `kubectl` only when the AKS-MCP surface cannot perform the needed check.

When standard diagnostics do not reveal root cause, use **Inspektor Gadget** for real-time, low-level node and pod observability (DNS traces, TCP traces, process snapshots, file access traces). See [references/inspektor-gadget.md](references/inspektor-gadget.md) for the gadget catalog, command patterns, and symptom-to-gadget mapping.

See [references/aks-mcp.md](references/aks-mcp.md), [references/structured-input-modes.md](references/structured-input-modes.md), and [references/command-flows.md](references/command-flows.md).

## Required Inputs

- subscription or active Azure context
- resource group and cluster name
- symptom summary
- first observed time or recent change window
- impacted namespace, workload, service, or ingress when known

If cluster identity is missing, stop and ask for it.

## Scope Buckets

- Lifecycle: create, update, start, stop, upgrade, or provisioning failures
- API access: kubeconfig, auth, private endpoint, DNS, or reachability problems
- Nodes: missing nodes, `NotReady`, pressure, CNI, kubelet, certificate, or VMSS drift
- `kube-system`: CoreDNS, metrics-server, konnectivity, ingress, CNI, CSI, or add-on failures
- Workloads: `Pending`, `CrashLoopBackOff`, `OOMKilled`, PVC, quota, secret, readiness, or dependency issues
- Connectivity and DNS: pod -> service -> endpoints -> ingress/load balancer -> DNS -> network controls
- Scaling: node pool sizing, pending pods, autoscaler config, metrics, quota, or subnet constraints

## Evidence Order

1. Azure-side state first: cluster state, resource health, recent operations, node pool state, detector or monitoring output.
2. Kubernetes-side state second: cluster reachability, nodes, `kube-system`, events, affected namespace, pod detail, logs.
3. Use detector, warning-event, or metrics modes when the incoming data already matches them.
4. Deep diagnostics: when steps 1–3 do not reveal root cause, use [inspektor-gadget.md](references/inspektor-gadget.md) for real-time tracing and snapshots on the affected node.

## Workflow

1. Get cluster context.
2. Classify the problem by scope bucket.
3. Prefer Azure-side evidence before Kubernetes-side evidence.
4. Use the matching AKS-MCP path first, then the documented CLI fallback if MCP cannot perform that read.
5. Return evidence, failure domain, confidence, next checks, remediation, and escalation.

## Error Patterns

- No cluster context: ask for subscription, resource group, and cluster name.
- MCP unavailable: fall back to safe `az aks` and `kubectl` reads.
- `kubectl` blocked: separate auth problems from network reachability.
- Logs or metrics missing: use events, node state, and resource descriptions.
- Detector noise: ignore `emergingIssues`, prefer critical findings, rank the most actionable signal first.

## Safe Fallback Checks

```bash
az aks show -g <resource-group> -n <cluster-name>
az aks nodepool list -g <resource-group> --cluster-name <cluster-name>
kubectl cluster-info
kubectl get nodes -o wide
kubectl get pods -n kube-system
kubectl get events -A --sort-by=.lastTimestamp
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
```

Keep these read-only unless the user explicitly asks for remediation.

## Guardrails

- default to read-only diagnostics
- do not restart, delete, cordon, drain, scale, upgrade, or reconfigure resources unless the user explicitly asks for remediation
- do not conclude root cause without quoting the evidence that supports it

## Output Checklist

Return scope and impact, evidence, failure domain, root cause, confidence, next checks, remediation, and escalation.
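
The read-only guardrail above could be enforced with a small pre-flight check before any fallback command runs. The sketch below is illustrative only: `is_read_only` is a hypothetical helper, not part of `az`, `kubectl`, or AKS-MCP. Its verb list takes `restart`, `delete`, `cordon`, `drain`, `scale`, and `upgrade` from the Guardrails section and assumes `uncordon`, `apply`, `patch`, `edit`, and `update` as examples of "reconfigure". It matches whole words only, so it is deliberately conservative and may reject a harmless command whose arguments happen to equal one of those verbs.

```bash
#!/usr/bin/env bash

# Hypothetical guardrail helper (an assumption, not an existing tool).
# Returns 0 when the command line contains none of the mutating verbs,
# and 1 as soon as one of them appears as a standalone word.
is_read_only() {
  local -a words
  local word
  # Split on whitespace without triggering filename globbing.
  read -ra words <<< "$1"
  for word in "${words[@]}"; do
    case "$word" in
      # Verbs named in the Guardrails section.
      restart|delete|cordon|drain|scale|upgrade)
        return 1 ;;
      # Assumed examples of "reconfigure"; extend as needed.
      uncordon|apply|patch|edit|update)
        return 1 ;;
    esac
  done
  return 0
}

# Usage: label each candidate command before deciding whether to run it.
for cmd in "kubectl get nodes -o wide" "kubectl drain <node-name>"; do
  if is_read_only "$cmd"; then
    echo "allow: $cmd"
  else
    echo "block: $cmd"
  fi
done
```

A word-level deny list is a coarse filter. It suits the default read-only posture, and a blocked command should trigger a request for explicit remediation approval from the user, not a silent failure.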