Azure Diagnostics

Debug and troubleshoot Azure Container Apps and Function Apps using logs, KQL, and health checks.

microsoftGitHub microsoftOfficialSource repo Original GitHub link Publisher page

Files

Skill

n/a

Size

95.0 KB

Entrypoint

SKILL.md

Format

git-repo

troubleshooting/aks/aks-troubleshooting.md

# AKS Troubleshooting Guide

Primary AKS troubleshooting guide for incidents routed from [../../SKILL.md](../../SKILL.md).

## When to Use This Guide

- lifecycle, access, node, `kube-system`, workload, ingress, DNS, or scaling issues
- `kubectl` cannot connect, nodes are `NotReady`, or pods are unhealthy

## Scenario Playbooks

| Scenario                                                      | Reference                                        |
| ------------------------------------------------------------- | ------------------------------------------------ |
| broad cluster investigation                                   | [general-diagnostics.md](general-diagnostics.md) |
| workload, crash, image pull, readiness, or pending pod issues | [pod-failures.md](pod-failures.md)               |
| node health, scaling, pressure, upgrade, or zone issues       | [node-issues.md](node-issues.md)                 |
| service, ingress, DNS, or network policy issues               | [networking.md](networking.md)                   |

## Tool Selection For Diagnostics

When gathering AKS diagnostic evidence, prefer `mcp_azure_mcp_aks`, then the smallest discovered AKS-MCP tool that fits the read, then supporting Azure tools such as `mcp_azure_mcp_applens`, `mcp_azure_mcp_monitor`, or `mcp_azure_mcp_resourcehealth`. Use raw `az aks` and `kubectl` only when the AKS-MCP surface cannot perform the needed check.

See [references/aks-mcp.md](references/aks-mcp.md), [references/structured-input-modes.md](references/structured-input-modes.md), and [references/command-flows.md](references/command-flows.md).

## Required Inputs

- subscription or active Azure context
- resource group and cluster name
- symptom summary
- first observed time or recent change window
- impacted namespace, workload, service, or ingress when known

If cluster identity is missing, stop and ask for it.

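The stop-and-ask rule above can be enforced before any read is attempted. The following is a minimal sketch, not part of the skill itself: the function name `require_cluster_identity` and the variable names `RESOURCE_GROUP` and `CLUSTER_NAME` are assumptions chosen for illustration.

```bash
# Hypothetical helper: refuse to continue until cluster identity is known.
# RESOURCE_GROUP and CLUSTER_NAME are assumed variable names for this sketch.
require_cluster_identity() {
  local missing=""
  [ -n "${RESOURCE_GROUP:-}" ] || missing="${missing} resource-group"
  [ -n "${CLUSTER_NAME:-}" ] || missing="${missing} cluster-name"
  if [ -n "$missing" ]; then
    # Report what to ask the user for, then stop.
    echo "missing required input:${missing}" >&2
    return 1
  fi
}
```

The subscription is deliberately not checked here, since the list above accepts the active Azure context in its place.
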
## Scope Buckets

- Lifecycle: create, update, start, stop, upgrade, or provisioning failures
- API access: kubeconfig, auth, private endpoint, DNS, or reachability problems
- Nodes: missing nodes, `NotReady`, pressure, CNI, kubelet, certificate, or VMSS drift
- `kube-system`: CoreDNS, metrics-server, konnectivity, ingress, CNI, CSI, or add-on failures
- Workloads: `Pending`, `CrashLoopBackOff`, `OOMKilled`, PVC, quota, secret, readiness, or dependency issues
- Connectivity and DNS: pod -> service -> endpoints -> ingress/load balancer -> DNS -> network controls
- Scaling: node pool sizing, pending pods, autoscaler config, metrics, quota, or subnet constraints

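As a rough illustration of how a symptom string might be routed to one of these buckets, the sketch below uses a first-match-wins `case` statement. The function name `classify_scope`, the bucket labels, and the keyword selection are assumptions for this example; several terms in the list above (such as quota or DNS) legitimately belong to more than one bucket, so a real classifier would need more context than a keyword match.

```bash
# Hypothetical helper: map a symptom description to a scope bucket.
# Keyword list is illustrative, not exhaustive; the first matching arm wins.
classify_scope() {
  local symptom
  # Lowercase the input so matching is case-insensitive.
  symptom=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$symptom" in
    *upgrade*|*provision*)                            echo "lifecycle" ;;
    *kubeconfig*|*unauthorized*|*"private endpoint"*) echo "api-access" ;;
    *notready*|*kubelet*|*pressure*|*vmss*)           echo "nodes" ;;
    *coredns*|*metrics-server*|*konnectivity*)        echo "kube-system" ;;
    *crashloopbackoff*|*oomkilled*|*pvc*)             echo "workloads" ;;
    *ingress*|*dns*|*"load balancer"*)                echo "connectivity-dns" ;;
    *autoscaler*|*"node pool"*|*subnet*)              echo "scaling" ;;
    *)                                                echo "unclassified" ;;
  esac
}
```

The `kube-system` arm is placed before the connectivity arm so that a CoreDNS symptom is not captured by the broader `dns` keyword.
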
## Evidence Order

1. Azure-side state first: cluster state, resource health, recent operations, node pool state, detector or monitoring output.
2. Kubernetes-side state second: cluster reachability, nodes, `kube-system`, events, affected namespace, pod detail, logs.
3. Use detector, warning-event, or metrics modes when the incoming data already matches them.

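When the CLI fallback is in use, the Azure-first ordering could look like the sketch below. It assumes `az` is already logged in and that the current kubeconfig context points at the target cluster; the function name `collect_evidence` and the decision to skip Kubernetes reads for a non-running cluster are choices made for this example, not something the guide prescribes.

```bash
# Hypothetical helper: gather read-only evidence, Azure side before Kubernetes side.
# Usage: collect_evidence <resource-group> <cluster-name>
collect_evidence() {
  local rg="$1" name="$2" power state

  # 1. Azure-side state first.
  power=$(az aks show -g "$rg" -n "$name" --query powerState.code -o tsv) || return 1
  state=$(az aks show -g "$rg" -n "$name" --query provisioningState -o tsv) || return 1
  echo "azure: powerState=$power provisioningState=$state"
  az aks nodepool list -g "$rg" --cluster-name "$name" -o table

  # A cluster that is not running has no reachable API server,
  # so Kubernetes-side reads would only add noise.
  if [ "$power" != "Running" ]; then
    echo "kubernetes: skipped (cluster not running)"
    return 0
  fi

  # 2. Kubernetes-side state second.
  kubectl get nodes -o wide
  kubectl get pods -n kube-system
  kubectl get events -A --sort-by=.lastTimestamp
}
```

Checking the Azure-side power state first explains a whole class of `kubectl` connection failures before any Kubernetes command is run.
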
## Workflow

1. Get cluster context.
2. Classify the problem by scope bucket.
3. Prefer Azure-side evidence before Kubernetes-side evidence.
4. Use the matching AKS-MCP path first, then the documented CLI fallback if MCP cannot perform that read.
5. Return evidence, failure domain, confidence, next checks, remediation, and escalation.

## Error Patterns

- No cluster context: ask for subscription, resource group, and cluster name.
- MCP unavailable: fall back to safe `az aks` and `kubectl` reads.
- `kubectl` blocked: separate auth problems from network reachability.
- Logs or metrics missing: use events, node state, and resource descriptions.
- Detector noise: ignore `emergingIssues`, prefer critical findings, rank the most actionable signal first.

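For the `kubectl` blocked pattern, one way to separate auth problems from network reachability is to look at the error text `kubectl` printed. The sketch below is an assumption-laden illustration: the function name `classify_kubectl_error` is hypothetical, and the substrings are common examples of such messages rather than a complete list.

```bash
# Hypothetical helper: sort a kubectl error message into auth vs network.
# Substrings are illustrative examples of messages seen in practice.
classify_kubectl_error() {
  local msg
  msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$msg" in
    *unauthorized*|*forbidden*|*"must be logged in"*)
      # The API server answered, so the network path works.
      echo "auth" ;;
    *"i/o timeout"*|*"no such host"*|*"connection refused"*|*"was refused"*|*"tls handshake timeout"*)
      # No usable response from the API server endpoint.
      echo "network" ;;
    *)
      echo "unknown" ;;
  esac
}
```

An `auth` result points toward kubeconfig and role assignments, while a `network` result points toward private endpoint, DNS, or firewall checks.
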
## Safe Fallback Checks

```bash
# Azure-side: cluster and node pool state
az aks show -g <resource-group> -n <cluster-name>
az aks nodepool list -g <resource-group> --cluster-name <cluster-name>

# Kubernetes-side: reachability, nodes, system pods, recent events
kubectl cluster-info
kubectl get nodes -o wide
kubectl get pods -n kube-system
kubectl get events -A --sort-by=.lastTimestamp

# Affected pod: detail and logs from the previous container instance
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
```

Keep these read-only unless the user explicitly asks for remediation.

## Guardrails

- default to read-only diagnostics
- do not restart, delete, cordon, drain, scale, upgrade, or reconfigure resources unless the user explicitly asks for remediation
- do not conclude root cause without quoting the evidence that supports it

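The read-only default can be made mechanical with a small wrapper around `kubectl`. This is a sketch under stated assumptions: `is_read_only_kubectl`, `guarded_kubectl`, and the `ALLOW_REMEDIATION` variable are hypothetical names, and the allowlist covers only a handful of well-known read-only subcommands.

```bash
# Hypothetical guard: allow only read-only kubectl subcommands by default.
is_read_only_kubectl() {
  case "$1" in
    get|describe|logs|top|cluster-info|version|api-resources|explain) return 0 ;;
    *) return 1 ;;
  esac
}

# ALLOW_REMEDIATION is an assumed variable name; set it to "yes" only after
# the user has explicitly asked for remediation.
guarded_kubectl() {
  if is_read_only_kubectl "$1" || [ "${ALLOW_REMEDIATION:-no}" = "yes" ]; then
    kubectl "$@"
  else
    echo "blocked: 'kubectl $1' may change cluster state; ask the user first" >&2
    return 1
  fi
}
```

An allowlist is used rather than a denylist so that any subcommand not known to be read-only is blocked by default.
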
## Output Checklist

Return scope and impact, evidence, failure domain, root cause, confidence, next checks, remediation, and escalation.

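To keep reports consistent, the checklist fields can be emitted as a fixed skeleton and filled in afterwards. The function name `print_report_skeleton` and the `<fill in>` placeholder are assumptions for this sketch; the field names come straight from the checklist above.

```bash
# Hypothetical helper: print one line per checklist field, in checklist order.
print_report_skeleton() {
  local field
  for field in "Scope and impact" "Evidence" "Failure domain" "Root cause" \
               "Confidence" "Next checks" "Remediation" "Escalation"; do
    printf '%s: <fill in>\n' "$field"
  done
}
```
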
