Azure Diagnostics

Diagnose Azure service issues, query logs, and troubleshoot failures using GitHub Copilot for Azure
troubleshooting/aks/aks-troubleshooting.md

# AKS Troubleshooting Guide

Primary AKS troubleshooting guide for incidents routed from [../../SKILL.md](../../SKILL.md).

## When to Use This Guide

- lifecycle, access, node, `kube-system`, workload, ingress, DNS, or scaling issues
- `kubectl` cannot connect, nodes are `NotReady`, or pods are unhealthy

## Scenario Playbooks

| Scenario                                                      | Reference                                        |
| ------------------------------------------------------------- | ------------------------------------------------ |
| broad cluster investigation                                   | [general-diagnostics.md](general-diagnostics.md) |
| workload, crash, image pull, readiness, or pending pod issues | [pod-failures.md](pod-failures.md)               |
| node health, scaling, pressure, upgrade, or zone issues       | [node-issues.md](node-issues.md)                 |
| service, ingress, DNS, or network policy issues               | [networking.md](networking.md)                   |

## Tool Selection For Diagnostics

When gathering AKS diagnostic evidence, prefer `mcp_azure_mcp_aks`, then the smallest discovered AKS-MCP tool that fits the read, then supporting Azure tools such as `mcp_azure_mcp_applens`, `mcp_azure_mcp_monitor`, or `mcp_azure_mcp_resourcehealth`. Use raw `az aks` and `kubectl` only when the AKS-MCP surface cannot perform the needed check.

When standard diagnostics do not reveal root cause, use **Inspektor Gadget** for real-time, low-level node and pod observability (DNS traces, TCP traces, process snapshots, file access traces). See [references/inspektor-gadget.md](references/inspektor-gadget.md) for the gadget catalog, command patterns, and symptom-to-gadget mapping.

See also [references/aks-mcp.md](references/aks-mcp.md), [references/structured-input-modes.md](references/structured-input-modes.md), and [references/command-flows.md](references/command-flows.md).

## Required Inputs

- subscription or active Azure context
- resource group and cluster name
- symptom summary
- first observed time or recent change window
- impacted namespace, workload, service, or ingress when known

If cluster identity is missing, stop and ask for it.

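The stop-and-ask rule can be sketched as a small pre-flight check. This is a minimal illustration, not part of the skill: the function name `require_cluster_identity` and the variables `RESOURCE_GROUP` and `CLUSTER_NAME` are hypothetical, and the subscription or Azure context check is left out because it would need a live `az` session.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check: refuse to continue without cluster identity.
# RESOURCE_GROUP and CLUSTER_NAME are assumed variable names for illustration.
require_cluster_identity() {
  local missing=""
  [ -n "${RESOURCE_GROUP:-}" ] || missing="resource group"
  if [ -z "${CLUSTER_NAME:-}" ]; then
    # Append with a comma only when something is already listed.
    missing="${missing:+$missing, }cluster name"
  fi
  if [ -n "$missing" ]; then
    echo "Missing required input: $missing" >&2
    return 1
  fi
  echo "Cluster: $CLUSTER_NAME (resource group: $RESOURCE_GROUP)"
}
```

A caller would run `require_cluster_identity || exit 1` before any diagnostic read, so a missing value turns into a question for the user rather than a guessed cluster.
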
## Scope Buckets

- Lifecycle: create, update, start, stop, upgrade, or provisioning failures
- API access: kubeconfig, auth, private endpoint, DNS, or reachability problems
- Nodes: missing nodes, `NotReady`, pressure, CNI, kubelet, certificate, or VMSS drift
- `kube-system`: CoreDNS, metrics-server, konnectivity, ingress, CNI, CSI, or add-on failures
- Workloads: `Pending`, `CrashLoopBackOff`, `OOMKilled`, PVC, quota, secret, readiness, or dependency issues
- Connectivity and DNS: pod -> service -> endpoints -> ingress/load balancer -> DNS -> network controls
- Scaling: node pool sizing, pending pods, autoscaler config, metrics, quota, or subnet constraints

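As a rough sketch of how a symptom summary might be mapped onto these buckets, the function below matches a few keywords per bucket. The function name `classify_scope` and the keyword choices are hypothetical and far from exhaustive. Several terms in the list above (DNS, ingress, pending pods) legitimately belong to more than one bucket, so the first match here is a starting point only.

```bash
#!/usr/bin/env bash
# Hypothetical keyword-to-bucket mapping; patterns are illustrative, not exhaustive.
classify_scope() {
  local symptom
  # Lowercase the symptom so matching is case-insensitive.
  symptom=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$symptom" in
    *upgrade*|*provision*)                            echo "Lifecycle" ;;
    *kubeconfig*|*unauthorized*|*"private endpoint"*) echo "API access" ;;
    *notready*|*kubelet*|*vmss*)                      echo "Nodes" ;;
    # Checked before the DNS pattern so "coredns" lands in kube-system.
    *coredns*|*metrics-server*|*konnectivity*)        echo "kube-system" ;;
    *crashloopbackoff*|*oomkilled*|*pvc*)             echo "Workloads" ;;
    *ingress*|*dns*|*"load balancer"*)                echo "Connectivity and DNS" ;;
    *autoscaler*|*"node pool"*|*subnet*)              echo "Scaling" ;;
    *)                                                echo "Unclassified" ;;
  esac
}
```

For example, `classify_scope "Nodes are NotReady after reboot"` prints `Nodes`. An `Unclassified` result suggests starting from the broad cluster investigation playbook.
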
## Evidence Order

1. Azure-side state first: cluster state, resource health, recent operations, node pool state, detector or monitoring output.
2. Kubernetes-side state second: cluster reachability, nodes, `kube-system`, events, affected namespace, pod detail, logs.
3. Use detector, warning-event, or metrics modes when the incoming data already matches them.
4. Deep diagnostics: when steps 1–3 do not reveal root cause, use [inspektor-gadget.md](references/inspektor-gadget.md) for real-time tracing and snapshots on the affected node.

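The first two steps can be sketched as a dry-run plan that prints the CLI fallback reads in evidence order without executing them. The function name `evidence_plan` is hypothetical. The commands are the read-only ones this guide already lists, and the sketch assumes the MCP path was unavailable.

```bash
#!/usr/bin/env bash
# Hypothetical dry-run planner: prints, but does not run, read-only checks
# in evidence order for the given resource group and cluster.
evidence_plan() {
  local rg="$1" cluster="$2"
  # Step 1: Azure-side state first.
  echo "az aks show -g $rg -n $cluster"
  echo "az aks nodepool list -g $rg --cluster-name $cluster"
  # Step 2: Kubernetes-side state second.
  echo "kubectl cluster-info"
  echo "kubectl get nodes -o wide"
  echo "kubectl get pods -n kube-system"
  echo "kubectl get events -A --sort-by=.lastTimestamp"
}
```

Calling `evidence_plan my-rg my-aks` prints six commands to review. Printing rather than executing keeps the ordering visible and leaves the choice to run each read with the operator.
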
## Workflow

1. Get cluster context.
2. Classify the problem by scope bucket.
3. Prefer Azure-side evidence before Kubernetes-side evidence.
4. Use the matching AKS-MCP path first, then the documented CLI fallback if MCP cannot perform that read.
5. Return evidence, failure domain, confidence, next checks, remediation, and escalation.

## Error Patterns

- No cluster context: ask for subscription, resource group, and cluster name.
- MCP unavailable: fall back to safe `az aks` and `kubectl` reads.
- `kubectl` blocked: separate auth problems from network reachability.
- Logs or metrics missing: use events, node state, and resource descriptions.
- Detector noise: ignore `emergingIssues`, prefer critical findings, rank the most actionable signal first.

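For the blocked-`kubectl` pattern, one way to separate auth from reachability is to look at the client's error text. The sketch below is a heuristic under stated assumptions: the function name `classify_kubectl_error` is hypothetical, and the matched substrings are commonly seen client messages rather than a complete or guaranteed list.

```bash
#!/usr/bin/env bash
# Hypothetical triage of kubectl error text into auth vs. network failure.
# The substrings are assumed examples of common messages, not a full catalog.
classify_kubectl_error() {
  case "$1" in
    *Unauthorized*|*Forbidden*|*forbidden*|*"must be logged in"*)
      echo "auth" ;;
    *"i/o timeout"*|*"no such host"*|*"connection refused"*)
      echo "network" ;;
    *)
      echo "unknown" ;;
  esac
}
```

The input could be captured with `err=$(kubectl cluster-info 2>&1 >/dev/null)` and passed as `classify_kubectl_error "$err"`. An `unknown` result means the message needs to be read directly rather than bucketed.
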
## Safe Fallback Checks

```bash
# Azure-side state
az aks show -g <resource-group> -n <cluster-name>
az aks nodepool list -g <resource-group> --cluster-name <cluster-name>

# Kubernetes-side state
kubectl cluster-info
kubectl get nodes -o wide
kubectl get pods -n kube-system
kubectl get events -A --sort-by=.lastTimestamp
kubectl describe pod <pod-name> -n <namespace>
# --previous returns logs from the prior container instance, useful after a crash
kubectl logs <pod-name> -n <namespace> --previous
```

Keep these read-only unless the user explicitly asks for remediation.

## Guardrails

- default to read-only diagnostics
- do not restart, delete, cordon, drain, scale, upgrade, or reconfigure resources unless the user explicitly asks for remediation
- do not conclude root cause without quoting the evidence that supports it

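The read-only default could be enforced with a thin wrapper that checks the `kubectl` verb against an allowlist before running anything. This is a sketch, not part of the skill: `is_read_only_kubectl` and `safe_kubectl` are hypothetical names, and the allowlist is an assumption to adjust to local policy.

```bash
#!/usr/bin/env bash
# Hypothetical guard: succeed only when the kubectl verb is on a read-only allowlist.
is_read_only_kubectl() {
  case "$1" in
    get|describe|logs|top|cluster-info|version|api-resources) return 0 ;;
    *) return 1 ;;
  esac
}

# Run kubectl only for allowlisted verbs; refuse everything else.
# Only the first argument is inspected, so a leading global flag is refused too.
safe_kubectl() {
  if is_read_only_kubectl "$1"; then
    kubectl "$@"
  else
    echo "Refusing non-read-only kubectl verb: $1" >&2
    return 1
  fi
}
```

With this in place, `safe_kubectl get nodes -o wide` runs normally, while `safe_kubectl drain <node-name>` is refused with a message on stderr. Remediation verbs would be run directly, and only after the user explicitly asks for them.
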
## Output Checklist

Return scope and impact, evidence, failure domain, root cause, confidence, next checks, remediation, and escalation.