Azure Diagnostics

Diagnose Azure service issues, query logs, and troubleshoot failures using GitHub Copilot for Azure

Files

- Skill: n/a
- Size: 105.0 KB
- Entrypoint: SKILL.md
- Format: git-repo

troubleshooting/aks/networking.md

# Networking Troubleshooting

For CNI-specific issues, check CNI pod health and review [AKS networking concepts](https://learn.microsoft.com/azure/aks/concepts-network).

## Service Unreachable / Connection Refused

**Diagnostics - always start here:**

```bash
# 1. Verify service exists and has endpoints (read-only)
kubectl get svc <service-name> -n <ns>
kubectl get endpoints <service-name> -n <ns>

# 2. Optional connectivity test from inside the namespace
# This creates a temporary pod. Prefer read-only checks first.
# Only use it after the user explicitly approves a mutating test.
kubectl run netdebug --image=curlimages/curl -it --rm -n <ns> -- \
  curl -sv http://<service>.<ns>.svc.cluster.local:<port>/healthz
```

**Decision tree:**

| Observation                             | Cause                              | Fix                                                                           |
| --------------------------------------- | ---------------------------------- | ----------------------------------------------------------------------------- |
| Endpoints shows `<none>`                | Label selector mismatch            | Align selector with pod labels; check for typos                               |
| Endpoints has IPs but unreachable       | Port mismatch or app not listening | Confirm `targetPort` matches the port the container listens on                |
| Works from some pods, fails from others | Network policy blocking            | See [Network Policy Troubleshooting](network-policy.md)                       |
| Works inside cluster, fails externally  | Load balancer issue                | See [Load Balancer And Ingress Troubleshooting](load-balancer-and-ingress.md) |
| `ECONNREFUSED` immediately              | App not listening on that port     | Check listening ports in the pod                                              |

32 
33**Deep diagnostics with Inspektor Gadget** (when the above checks are inconclusive):
34 
35Use the [IG base command pattern](references/inspektor-gadget.md) with `--k8s-namespace <ns> --k8s-podname <pod-name>` and these gadgets:
36 
37- `snapshot_socket` (timeout 5) — check what ports the pod is listening on
38- `trace_tcp` (timeout 30) — trace connect/accept/close events
39- `trace_tcpretrans` (timeout 30) — packet retransmissions
40 
41See [references/inspektor-gadget.md](references/inspektor-gadget.md).
42 
43---
44 
## DNS Resolution Failures

**Diagnostics:**

The live DNS test below creates a temporary pod. Prefer `get`, `describe`, `logs`, or `exec` into an existing pod first, and run the live test only after the user explicitly approves creating the test pod.

```bash
# Confirm CoreDNS is running and healthy (read-only)
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
kubectl top pod -n kube-system -l k8s-app=kube-dns

# Optional live DNS test from the same namespace as the failing pod
kubectl run dnstest --image=busybox:1.28 -it --rm -n <ns> -- \
  nslookup <service-name>.<ns>.svc.cluster.local

# CoreDNS logs - errors show here first
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100
```

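Where an existing pod already has DNS tools, the same lookup can be run through `exec` without creating a test pod. The following is a sketch under assumptions: `dns_check_in_pod` is a hypothetical helper name, and it only works if the pod's image ships `cat` and `nslookup`, which many minimal images do not.

```bash
# Hypothetical helper: inspect resolver config and run a lookup inside an existing pod.
dns_check_in_pod() {
  local pod="$1" ns="$2" name="$3"

  # Shows the nameserver and search domains the pod was given.
  kubectl exec "$pod" -n "$ns" -- cat /etc/resolv.conf

  # Resolves the name using the pod's own resolver settings.
  kubectl exec "$pod" -n "$ns" -- nslookup "$name"
}

# Usage: dns_check_in_pod <pod> <ns> <service-name>.<ns>.svc.cluster.local
```
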
**DNS failure patterns:**

| Symptom                               | Cause                                        | Fix                                                                                                                        |
| ------------------------------------- | -------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------- |
| `NXDOMAIN` for `svc.cluster.local`    | CoreDNS down or pod network broken           | After confirming the diagnostics above, coordinate with the cluster operator to restart or redeploy CoreDNS and verify CNI |
| Internal resolves, external NXDOMAIN  | Custom DNS not forwarding to `168.63.129.16` | Fix upstream forwarder                                                                                                     |
| Intermittent SERVFAIL under load      | CoreDNS CPU throttled                        | Remove CPU limits or add replicas                                                                                          |
| Private cluster - external names fail | Custom DNS missing privatelink forwarder     | Add conditional forwarder to Azure DNS                                                                                     |
| `i/o timeout` not `NXDOMAIN`          | Port 53 blocked by NetworkPolicy or NSG      | Allow UDP/TCP 53 from pods to kube-dns ClusterIP                                                                           |

> ⚠️ **Warning:** The fixes in this table can change cluster state. Use them only after performing the read-only diagnostics above, and only with explicit confirmation from the cluster owner or operator.

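Before acting on the CPU-throttling row, the current CoreDNS replica count and resource settings can be read without changing anything. This is a sketch only: `coredns_capacity` is a hypothetical helper name, and the default Deployment name `coredns` in `kube-system` is an assumption that should be confirmed with `kubectl get deploy -n kube-system`.

```bash
# Hypothetical read-only helper: show CoreDNS replica counts and container resources.
# The Deployment name defaults to "coredns" (an assumption); pass another name to override.
coredns_capacity() {
  local deploy="${1:-coredns}"

  # Ready vs. desired replicas.
  kubectl get deploy "$deploy" -n kube-system \
    -o jsonpath='{"replicas: "}{.status.readyReplicas}{"/"}{.spec.replicas}{"\n"}'

  # Requests and limits of the first container; throttling requires a CPU limit to be set.
  kubectl get deploy "$deploy" -n kube-system \
    -o jsonpath='{"resources: "}{.spec.template.spec.containers[0].resources}{"\n"}'
}

# Usage: coredns_capacity            (uses the assumed name "coredns")
#        coredns_capacity <deployment-name>
```
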
Get the kube-dns ClusterIP (read-only):

```bash
kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}'
```

A custom VNet DNS server must forward `.cluster.local` to the CoreDNS ClusterIP and all other lookups to `168.63.129.16` (Azure DNS).

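A related consistency check is whether a failing pod is actually pointed at the CoreDNS ClusterIP. The sketch below assumes the pod uses the default `ClusterFirst` DNS policy and that its image includes `cat`. With a node-local DNS cache or another DNS policy the two addresses may legitimately differ. `check_pod_nameserver` is a hypothetical helper name.

```bash
# Hypothetical helper: compare a pod's first nameserver with the kube-dns ClusterIP.
# Returns 0 when they match, non-zero otherwise.
check_pod_nameserver() {
  local pod="$1" ns="$2" cluster_dns pod_dns

  cluster_dns="$(kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}')"

  # First "nameserver" entry in the pod's resolver config.
  pod_dns="$(kubectl exec "$pod" -n "$ns" -- cat /etc/resolv.conf |
    awk '$1 == "nameserver" { print $2; exit }')"

  echo "kube-dns ClusterIP: $cluster_dns"
  echo "pod nameserver:     $pod_dns"
  [ -n "$cluster_dns" ] && [ "$cluster_dns" = "$pod_dns" ]
}

# Usage: check_pod_nameserver <pod> <ns> || echo "pod is not using the kube-dns ClusterIP"
```
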
**Deep diagnostics with Inspektor Gadget** (when the above checks are inconclusive):

Use the [IG base command pattern](references/inspektor-gadget.md) with `--k8s-namespace <ns> --k8s-podname <pod-name>` and `trace_dns` (timeout 30). Key signals: `rcode=3` (NXDOMAIN), `rcode=2` (SERVFAIL), high `latency` values, queries going to unexpected destinations.

See [references/inspektor-gadget.md](references/inspektor-gadget.md).

---

## Detailed Networking Guides

- [Load Balancer And Ingress Troubleshooting](load-balancer-and-ingress.md) for pending services, ingress controller state, backend routing, and TLS failures.
- [Network Policy Troubleshooting](network-policy.md) for default-deny checks, Azure NPM or Calico validation, and ingress or egress rule audits.
