Debug and troubleshoot Azure Container Apps and Function Apps using logs, KQL, and health checks.
troubleshooting/aks/aks-troubleshooting.md
# AKS Troubleshooting Guide

Primary AKS troubleshooting guide for incidents routed from [../../SKILL.md](../../SKILL.md).

## When to Use This Guide

- lifecycle, access, node, `kube-system`, workload, ingress, DNS, or scaling issues
- `kubectl` cannot connect, nodes are `NotReady`, or pods are unhealthy

## Scenario Playbooks

| Scenario                                                      | Reference                                        |
| ------------------------------------------------------------- | ------------------------------------------------ |
| broad cluster investigation                                   | [general-diagnostics.md](general-diagnostics.md) |
| workload, crash, image pull, readiness, or pending pod issues | [pod-failures.md](pod-failures.md)               |
| node health, scaling, pressure, upgrade, or zone issues       | [node-issues.md](node-issues.md)                 |
| service, ingress, DNS, or network policy issues               | [networking.md](networking.md)                   |

## Tool Selection For Diagnostics

When gathering AKS diagnostic evidence, prefer `mcp_azure_mcp_aks`, then the smallest discovered AKS-MCP tool that fits the read, then supporting Azure tools such as `mcp_azure_mcp_applens`, `mcp_azure_mcp_monitor`, or `mcp_azure_mcp_resourcehealth`. Use raw `az aks` and `kubectl` only when the AKS-MCP surface cannot perform the needed check.

See [references/aks-mcp.md](references/aks-mcp.md), [references/structured-input-modes.md](references/structured-input-modes.md), [references/command-flows.md](references/command-flows.md)

## Required Inputs

- subscription or active Azure context
- resource group and cluster name
- symptom summary
- first observed time or recent change window
- impacted namespace, workload, service, or ingress when known

If cluster identity is missing, stop and ask for it.

## Scope Buckets

- Lifecycle: create, update, start, stop, upgrade, or provisioning failures
- API access: kubeconfig, auth, private endpoint, DNS, or reachability problems
- Nodes: missing nodes, `NotReady`, pressure, CNI, kubelet, certificate, or VMSS drift
- `kube-system`: CoreDNS, metrics-server, konnectivity, ingress, CNI, CSI, or add-on failures
- Workloads: `Pending`, `CrashLoopBackOff`, `OOMKilled`, PVC, quota, secret, readiness, or dependency issues
- Connectivity and DNS: pod -> service -> endpoints -> ingress/load balancer -> DNS -> network controls
- Scaling: node pool sizing, pending pods, autoscaler config, metrics, quota, or subnet constraints

## Evidence Order

1. Azure-side state first: cluster state, resource health, recent operations, node pool state, detector or monitoring output.
2. Kubernetes-side state second: cluster reachability, nodes, `kube-system`, events, affected namespace, pod detail, logs.
3. Use detector, warning-event, or metrics modes when the incoming data already matches them.

## Workflow

1. Get cluster context.
2. Classify the problem by scope bucket.
3. Prefer Azure-side evidence before Kubernetes-side evidence.
4. Use the matching AKS-MCP path first, then the documented CLI fallback if MCP cannot perform that read.
5. Return evidence, failure domain, confidence, next checks, remediation, and escalation.

## Error Patterns

- No cluster context: ask for subscription, resource group, and cluster name.
- MCP unavailable: fall back to safe `az aks` and `kubectl` reads.
- `kubectl` blocked: separate auth problems from network reachability.
- Logs or metrics missing: use events, node state, and resource descriptions.
- Detector noise: ignore `emergingIssues`, prefer critical findings, rank the most actionable signal first.

## Safe Fallback Checks

```bash
az aks show -g <resource-group> -n <cluster-name>
az aks nodepool list -g <resource-group> --cluster-name <cluster-name>
kubectl cluster-info
kubectl get nodes -o wide
kubectl get pods -n kube-system
kubectl get events -A --sort-by=.lastTimestamp
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
```

Keep these read-only unless the user explicitly asks for remediation.

## Guardrails

- default to read-only diagnostics
- do not restart, delete, cordon, drain, scale, upgrade, or reconfigure resources unless the user explicitly asks for remediation
- do not conclude root cause without quoting the evidence that supports it

## Output Checklist

Return scope and impact, evidence, failure domain, root cause, confidence, next checks, remediation, and escalation.
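
One way to apply the read-only guardrail from this guide when falling back to the CLI is to route each command through an allowlist check before it runs. The sketch below is illustrative only: the helper names `is_read_only` and `run_diagnostic` are hypothetical, and the allowlist is an assumption based on the fallback commands listed in this guide, not an exhaustive or official list of safe operations.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) only for commands on a small
# read-only allowlist. The allowlist is an assumption for illustration.
is_read_only() {
  local tool="${1:-}" verb="${2:-}" sub="${3:-}" action="${4:-}"
  case "$tool" in
    kubectl)
      # The verb must be the first argument. A leading global flag such as
      # `-n prod` is rejected, which keeps the check conservative.
      case "$verb" in
        get|describe|logs|cluster-info|top|version) return 0 ;;
      esac
      ;;
    az)
      # Only `az aks show|list` and `az aks nodepool show|list` pass.
      [ "$verb" = "aks" ] || return 1
      case "$sub" in
        show|list) return 0 ;;
        nodepool)
          case "$action" in
            show|list) return 0 ;;
          esac
          ;;
      esac
      ;;
  esac
  return 1
}

# Hypothetical wrapper: run the command only if it is on the allowlist,
# otherwise print a refusal to stderr and return exit code 2.
run_diagnostic() {
  if is_read_only "$@"; then
    "$@"
  else
    echo "refused (not on read-only allowlist): $*" >&2
    return 2
  fi
}
```

For example, `run_diagnostic kubectl get pods -n kube-system` executes normally, while `run_diagnostic kubectl drain <node-name>` is refused without calling `kubectl`. Remediation commands would bypass this wrapper only after the user explicitly asks for them.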