AKS Troubleshooting Guide

Primary AKS troubleshooting guide for incidents routed from ../../SKILL.md.

When to Use This Guide

lifecycle, access, node, kube-system, workload, ingress, DNS, or scaling issues
kubectl cannot connect, nodes are NotReady, or pods are unhealthy

Scenario Playbooks

Scenario	Reference
broad cluster investigation	general-diagnostics.md
workload, crash, image pull, readiness, or pending pod issues	pod-failures.md
node health, scaling, pressure, upgrade, or zone issues	node-issues.md
service, ingress, DNS, or network policy issues	networking.md

Tool Selection For Diagnostics

When gathering AKS diagnostic evidence, prefer mcp_azure_mcp_aks, then the smallest discovered AKS-MCP tool that fits the read, then supporting Azure tools such as mcp_azure_mcp_applens, mcp_azure_mcp_monitor, or mcp_azure_mcp_resourcehealth. Use raw az aks and kubectl only when the AKS-MCP surface cannot perform the needed check.

When standard diagnostics do not reveal root cause, use Inspektor Gadget for real-time, low-level node and pod observability (DNS traces, TCP traces, process snapshots, file access traces). See references/inspektor-gadget.md for the gadget catalog, command patterns, and symptom-to-gadget mapping.

See references/aks-mcp.md, references/structured-input-modes.md, references/command-flows.md

Required Inputs

subscription or active Azure context
resource group and cluster name
symptom summary
first observed time or recent change window
impacted namespace, workload, service, or ingress when known

If cluster identity is missing, stop and ask for it.

Scope Buckets

Lifecycle: create, update, start, stop, upgrade, or provisioning failures
API access: kubeconfig, auth, private endpoint, DNS, or reachability problems
Nodes: missing nodes, NotReady, pressure, CNI, kubelet, certificate, or VMSS drift
kube-system: CoreDNS, metrics-server, konnectivity, ingress, CNI, CSI, or add-on failures
Workloads: Pending, CrashLoopBackOff, OOMKilled, PVC, quota, secret, readiness, or dependency issues
Connectivity and DNS: pod -> service -> endpoints -> ingress/load balancer -> DNS -> network controls
Scaling: node pool sizing, pending pods, autoscaler config, metrics, quota, or subnet constraints

Evidence Order

Azure-side state first: cluster state, resource health, recent operations, node pool state, detector or monitoring output.
Kubernetes-side state second: cluster reachability, nodes, kube-system, events, affected namespace, pod detail, logs.
Use detector, warning-event, or metrics modes when the incoming data already matches them.
Deep diagnostics; when steps 1–3 do not reveal root cause, use inspektor-gadget.md for real-time tracing and snapshots on the affected node.

Workflow

Get cluster context.
Classify the problem by scope bucket.
Prefer Azure-side evidence before Kubernetes-side evidence.
Use the matching AKS-MCP path first, then the documented CLI fallback if MCP cannot perform that read.
Return evidence, failure domain, confidence, next checks, remediation, and escalation.

Error Patterns

No cluster context: ask for subscription, resource group, and cluster name.
MCP unavailable: fall back to safe az aks and kubectl reads.
kubectl blocked: separate auth problems from network reachability.
Logs or metrics missing: use events, node state, and resource descriptions.
Detector noise: ignore emergingIssues, prefer critical findings, rank the most actionable signal first.

Safe Fallback Checks

az aks show -g <resource-group> -n <cluster-name>
az aks nodepool list -g <resource-group> --cluster-name <cluster-name>
kubectl cluster-info
kubectl get nodes -o wide
kubectl get pods -n kube-system
kubectl get events -A --sort-by=.lastTimestamp
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous

Keep these read-only unless the user explicitly asks for remediation.

Guardrails

default to read-only diagnostics
do not restart, delete, cordon, drain, scale, upgrade, or reconfigure resources unless the user explicitly asks for remediation
do not conclude root cause without quoting the evidence that supports it

Output Checklist

Return scope and impact, evidence, failure domain, root cause, confidence, next checks, remediation, and escalation.

AKS Troubleshooting Guide

Primary AKS troubleshooting guide for incidents routed from ../../SKILL.md.

When to Use This Guide

lifecycle, access, node, kube-system, workload, ingress, DNS, or scaling issues
kubectl cannot connect, nodes are NotReady, or pods are unhealthy

Scenario Playbooks

Scenario	Reference
broad cluster investigation	general-diagnostics.md
workload, crash, image pull, readiness, or pending pod issues	pod-failures.md
node health, scaling, pressure, upgrade, or zone issues	node-issues.md
service, ingress, DNS, or network policy issues	networking.md

Tool Selection For Diagnostics

See references/aks-mcp.md, references/structured-input-modes.md, references/command-flows.md

Required Inputs

subscription or active Azure context
resource group and cluster name
symptom summary
first observed time or recent change window
impacted namespace, workload, service, or ingress when known

If cluster identity is missing, stop and ask for it.

Scope Buckets

Lifecycle: create, update, start, stop, upgrade, or provisioning failures
API access: kubeconfig, auth, private endpoint, DNS, or reachability problems
Nodes: missing nodes, NotReady, pressure, CNI, kubelet, certificate, or VMSS drift
kube-system: CoreDNS, metrics-server, konnectivity, ingress, CNI, CSI, or add-on failures
Workloads: Pending, CrashLoopBackOff, OOMKilled, PVC, quota, secret, readiness, or dependency issues
Connectivity and DNS: pod -> service -> endpoints -> ingress/load balancer -> DNS -> network controls
Scaling: node pool sizing, pending pods, autoscaler config, metrics, quota, or subnet constraints

Evidence Order

Azure-side state first: cluster state, resource health, recent operations, node pool state, detector or monitoring output.
Kubernetes-side state second: cluster reachability, nodes, kube-system, events, affected namespace, pod detail, logs.
Use detector, warning-event, or metrics modes when the incoming data already matches them.
Deep diagnostics; when steps 1–3 do not reveal root cause, use inspektor-gadget.md for real-time tracing and snapshots on the affected node.

Workflow

Get cluster context.
Classify the problem by scope bucket.
Prefer Azure-side evidence before Kubernetes-side evidence.
Use the matching AKS-MCP path first, then the documented CLI fallback if MCP cannot perform that read.
Return evidence, failure domain, confidence, next checks, remediation, and escalation.

Error Patterns

No cluster context: ask for subscription, resource group, and cluster name.
MCP unavailable: fall back to safe az aks and kubectl reads.
kubectl blocked: separate auth problems from network reachability.
Logs or metrics missing: use events, node state, and resource descriptions.
Detector noise: ignore emergingIssues, prefer critical findings, rank the most actionable signal first.

Safe Fallback Checks

az aks show -g <resource-group> -n <cluster-name>
az aks nodepool list -g <resource-group> --cluster-name <cluster-name>
kubectl cluster-info
kubectl get nodes -o wide
kubectl get pods -n kube-system
kubectl get events -A --sort-by=.lastTimestamp
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous

Keep these read-only unless the user explicitly asks for remediation.

Guardrails

default to read-only diagnostics
do not restart, delete, cordon, drain, scale, upgrade, or reconfigure resources unless the user explicitly asks for remediation
do not conclude root cause without quoting the evidence that supports it

Output Checklist

Return scope and impact, evidence, failure domain, root cause, confidence, next checks, remediation, and escalation.

Azure Diagnostics

troubleshooting/aks/aks-troubleshooting.md

AKS Troubleshooting Guide

When to Use This Guide

Scenario Playbooks

Tool Selection For Diagnostics

Required Inputs

Scope Buckets

Evidence Order

Workflow

Error Patterns

Safe Fallback Checks

Guardrails

Output Checklist

Preparing the source view

Azure Diagnostics

troubleshooting/aks/aks-troubleshooting.md

AKS Troubleshooting Guide

When to Use This Guide

Scenario Playbooks

Tool Selection For Diagnostics

Required Inputs

Scope Buckets

Evidence Order

Workflow

Error Patterns

Safe Fallback Checks

Guardrails

Output Checklist