Debug and troubleshoot Azure Container Apps and Function Apps using logs, KQL, and health checks.
troubleshooting/aks/aks-troubleshooting.md
# AKS Troubleshooting Guide

Primary AKS troubleshooting guide for incidents routed from [../../SKILL.md](../../SKILL.md).

## When to Use This Guide

- lifecycle, access, node, `kube-system`, workload, ingress, DNS, or scaling issues
- `kubectl` cannot connect, nodes are `NotReady`, or pods are unhealthy

## Scenario Playbooks

| Scenario                                                      | Reference                                        |
| ------------------------------------------------------------- | ------------------------------------------------ |
| broad cluster investigation                                   | [general-diagnostics.md](general-diagnostics.md) |
| workload, crash, image pull, readiness, or pending pod issues | [pod-failures.md](pod-failures.md)               |
| node health, scaling, pressure, upgrade, or zone issues       | [node-issues.md](node-issues.md)                 |
| service, ingress, DNS, or network policy issues               | [networking.md](networking.md)                   |

## Tool Selection For Diagnostics

When gathering AKS diagnostic evidence, prefer `mcp_azure_mcp_aks`, then the smallest discovered AKS-MCP tool that fits the read, then supporting Azure tools such as `mcp_azure_mcp_applens`, `mcp_azure_mcp_monitor`, or `mcp_azure_mcp_resourcehealth`. Use raw `az aks` and `kubectl` only when the AKS-MCP surface cannot perform the needed check.

When standard diagnostics do not reveal root cause, use **Inspektor Gadget** for real-time, low-level node and pod observability (DNS traces, TCP traces, process snapshots, file access traces). See [references/inspektor-gadget.md](references/inspektor-gadget.md) for the gadget catalog, command patterns, and symptom-to-gadget mapping.

See [references/aks-mcp.md](references/aks-mcp.md), [references/structured-input-modes.md](references/structured-input-modes.md), [references/command-flows.md](references/command-flows.md)

## Required Inputs

- subscription or active Azure context
- resource group and cluster name
- symptom summary
- first observed time or recent change window
- impacted namespace, workload, service, or ingress when known

If cluster identity is missing, stop and ask for it.

## Scope Buckets

- Lifecycle: create, update, start, stop, upgrade, or provisioning failures
- API access: kubeconfig, auth, private endpoint, DNS, or reachability problems
- Nodes: missing nodes, `NotReady`, pressure, CNI, kubelet, certificate, or VMSS drift
- `kube-system`: CoreDNS, metrics-server, konnectivity, ingress, CNI, CSI, or add-on failures
- Workloads: `Pending`, `CrashLoopBackOff`, `OOMKilled`, PVC, quota, secret, readiness, or dependency issues
- Connectivity and DNS: pod -> service -> endpoints -> ingress/load balancer -> DNS -> network controls
- Scaling: node pool sizing, pending pods, autoscaler config, metrics, quota, or subnet constraints

## Evidence Order

1. Azure-side state first: cluster state, resource health, recent operations, node pool state, detector or monitoring output.
2. Kubernetes-side state second: cluster reachability, nodes, `kube-system`, events, affected namespace, pod detail, logs.
3. Use detector, warning-event, or metrics modes when the incoming data already matches them.
4. Deep diagnostics; when steps 1–3 do not reveal root cause, use [inspektor-gadget.md](references/inspektor-gadget.md) for real-time tracing and snapshots on the affected node.

## Workflow

1. Get cluster context.
2. Classify the problem by scope bucket.
3. Prefer Azure-side evidence before Kubernetes-side evidence.
4. Use the matching AKS-MCP path first, then the documented CLI fallback if MCP cannot perform that read.
5. Return evidence, failure domain, confidence, next checks, remediation, and escalation.

## Error Patterns

- No cluster context: ask for subscription, resource group, and cluster name.
- MCP unavailable: fall back to safe `az aks` and `kubectl` reads.
- `kubectl` blocked: separate auth problems from network reachability.
- Logs or metrics missing: use events, node state, and resource descriptions.
- Detector noise: ignore `emergingIssues`, prefer critical findings, rank the most actionable signal first.

## Safe Fallback Checks

```bash
az aks show -g <resource-group> -n <cluster-name>
az aks nodepool list -g <resource-group> --cluster-name <cluster-name>
kubectl cluster-info
kubectl get nodes -o wide
kubectl get pods -n kube-system
kubectl get events -A --sort-by=.lastTimestamp
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
```

Keep these read-only unless the user explicitly asks for remediation.

## Guardrails

- default to read-only diagnostics
- do not restart, delete, cordon, drain, scale, upgrade, or reconfigure resources unless the user explicitly asks for remediation
- do not conclude root cause without quoting the evidence that supports it

## Output Checklist

Return scope and impact, evidence, failure domain, root cause, confidence, next checks, remediation, and escalation.
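
The read-only guardrail described above can also be checked mechanically before a fallback command runs. The following is a minimal sketch, assuming two hypothetical helpers, `is_read_only` and `run_check`, that are not part of this skill or of any Azure tooling. The allowlist covers only the command shapes listed under Safe Fallback Checks and is an illustrative assumption, not a complete policy.

```bash
#!/usr/bin/env bash
# Hypothetical guard (not part of the skill): permit only the read-only
# command shapes from the Safe Fallback Checks list, refuse everything else.

is_read_only() {
  # $1 is the tool (kubectl or az); remaining arguments are its arguments.
  [ "$#" -gt 0 ] || return 1
  local tool="$1"
  shift
  case "$tool" in
    kubectl)
      # Verbs that only read cluster state.
      case "${1:-}" in
        get|describe|logs|cluster-info) return 0 ;;
        *) return 1 ;;
      esac
      ;;
    az)
      # Only `az aks show ...` and `az aks nodepool list ...` are allowed.
      case "${1:-} ${2:-} ${3:-}" in
        "aks show "*) return 0 ;;
        "aks nodepool list") return 0 ;;
        *) return 1 ;;
      esac
      ;;
    *)
      return 1
      ;;
  esac
}

run_check() {
  # Run the command only if it passes the allowlist; otherwise refuse
  # with exit status 2 and leave the cluster untouched.
  if is_read_only "$@"; then
    "$@"
  else
    echo "refused (not read-only): $*" >&2
    return 2
  fi
}
```

With these definitions, `run_check kubectl get nodes -o wide` executes the command, while `run_check kubectl drain <node-name>` is refused with status 2. An allowlist is used rather than a denylist of mutating verbs so that unrecognized commands are refused by default.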