Azure Diagnostics

Diagnose Azure service issues, query logs, and troubleshoot failures using GitHub Copilot for Azure

Publisher: microsoft (official, source repo on GitHub)

Size: 105.0 KB

Entrypoint: SKILL.md

Format: git-repo

troubleshooting/aks/references/inspektor-gadget.md

# Inspektor Gadget (IG) Reference

Use Inspektor Gadget for real-time, low-level node/pod diagnostics when `kubectl` is insufficient.

## IG Version

`<ig-version>` = `v0.51.0`. Substitute this exact tag (including the `v` prefix) wherever `<ig-version>` appears. To upgrade IG, change only this line.

## Base Command Pattern

```bash
kubectl debug --profile=sysadmin node/<node-name> --attach --quiet \
  --image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:<ig-version> \
  -- ig run <gadget>:<ig-version> -o json --timeout <seconds> [filters...]
```

Always pass `--timeout` after the `--` separator so that it reaches `ig run` rather than `kubectl debug`; this caps the gadget's runtime. Use `--timeout 5` for snapshot/top gadgets and `--timeout 30` for trace/profile gadgets.

> **Note:** IG runs through `kubectl debug --profile=sysadmin`, which creates a privileged debug pod on the node. Only run it with explicit user approval and appropriate RBAC.

**Required:** Resolve the node name first:

```bash
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.nodeName}'
```
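
You can wrap the two steps above (resolve the node, then run a gadget on it) in a pair of shell functions. This is a minimal sketch, not part of IG. The function names `ig_node_for_pod` and `ig_run` are hypothetical conveniences layered on the command pattern above, as are the `IG_VERSION` variable and the `DRY_RUN=1` switch, which prints the command instead of executing it.

```bash
# Hypothetical helpers wrapping the base command pattern; source this file, then call them.

# Pinned IG tag (mirrors the IG Version line); override by exporting IG_VERSION.
IG_VERSION="${IG_VERSION:-v0.51.0}"
IG_IMAGE="mcr.microsoft.com/oss/v2/inspektor-gadget/ig:${IG_VERSION}"

# ig_node_for_pod <pod> <namespace>: print the node that hosts the pod.
ig_node_for_pod() {
  kubectl get pod "$1" -n "$2" -o jsonpath='{.spec.nodeName}'
}

# ig_run <node> <gadget> <timeout-seconds> [filters...]
# Runs one gadget in a privileged debug pod on <node>; --timeout is placed after "--".
# With DRY_RUN=1 it only prints the command that would run.
ig_run() {
  local node="$1" gadget="$2" timeout="$3"
  shift 3
  local cmd=(kubectl debug --profile=sysadmin "node/${node}" --attach --quiet
    "--image=${IG_IMAGE}"
    -- ig run "${gadget}:${IG_VERSION}" -o json --timeout "${timeout}" "$@")
  if [[ "${DRY_RUN:-0}" == "1" ]]; then
    echo "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}

# Example (placeholder names): trace DNS for one pod for 30 seconds.
#   node="$(ig_node_for_pod my-pod my-namespace)"
#   ig_run "$node" trace_dns 30 --k8s-namespace my-namespace --k8s-podname my-pod
```

Building the command as an array keeps filter arguments such as `--pf "port 80"` intact without extra quoting.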

## Common Filters

| Filter | Description |
|---|---|
| `--k8s-namespace <ns>` | Scope to a Kubernetes namespace |
| `--k8s-podname <pod>` | Scope to a specific pod |
| `--k8s-containername <ctr>` | Scope to a specific container |
| `--timeout <seconds>` | Cap how long the gadget runs (applies to every gadget type) |
| `--max-entries <n>` | Max entries per batch for top/profile gadgets |
| `--map-fetch-interval <dur>` | Map fetch interval for top gadgets (except `top_process`) and profile gadgets (default `1000ms`) |
| `--interval <dur>` | Reporting interval for `top_process` only (e.g., `5s`) |
| `--syscall-filters <list>` | Comma-separated syscalls for `traceloop` (e.g., `open,connect,accept`). **Always specify** to limit data volume |

> **Tip:** For top/profile gadgets, set `--map-fetch-interval` to at most half of `--timeout` so that at least one batch is collected. E.g., `--timeout 2 --map-fetch-interval 1000ms --max-entries 20`.
>
> **Note:** `top_process` uses `--interval` instead of `--map-fetch-interval`. E.g., `--timeout 10 --interval 5s --max-entries 20`.
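
The half-of-timeout rule in the tip is plain arithmetic. Convert the fetch (or reporting) interval to milliseconds and double it; the result must not exceed the timeout in milliseconds. A small sketch of that check follows. `ig_duration_ms` and `ig_interval_ok` are hypothetical helper names. The parser deliberately handles only integer values with the `ms` and `s` suffixes used in the examples here.

```bash
# ig_duration_ms <dur>: convert an integer duration like "1000ms" or "5s" to milliseconds.
# Only the ms and s suffixes are handled; anything else is rejected.
ig_duration_ms() {
  case "$1" in
    *ms) echo "${1%ms}" ;;
    *s)  echo $(( ${1%s} * 1000 )) ;;
    *)   echo "unsupported duration: $1" >&2; return 1 ;;
  esac
}

# ig_interval_ok <timeout-seconds> <interval>: succeed when
# 2 * interval <= timeout, i.e. the run is long enough for at least one batch.
ig_interval_ok() {
  local timeout_ms interval_ms
  timeout_ms=$(( $1 * 1000 ))
  interval_ms="$(ig_duration_ms "$2")" || return 1
  (( interval_ms * 2 <= timeout_ms ))
}

# The tip and note examples pass:
#   ig_interval_ok 2 1000ms   # 2000 <= 2000
#   ig_interval_ok 10 5s      # 10000 <= 10000
# A 5s reporting interval with --timeout 5 does not:
#   ig_interval_ok 5 5s       # 10000 > 5000
```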

## Gadget Catalog

### Networking

| Gadget | Type | What It Does | When To Use |
|---|---|---|---|
| `trace_dns` | trace | Trace DNS queries and responses with latency | DNS failures, NXDOMAIN, SERVFAIL, slow resolution, intermittent DNS |
| `trace_tcp` | trace | Trace TCP connect/accept/close events | Connection refused, timeouts, unexpected drops, mapping pod connectivity |
| `trace_tcpretrans` | trace | Trace TCP retransmissions | Network congestion, lossy links, high latency between pods/services |
| `trace_bind` | trace | Trace socket bind calls | Port conflicts, address-already-in-use errors |
| `trace_sni` | trace | Trace TLS SNI (Server Name Indication) values | HTTPS routing issues, ingress TLS debugging, mTLS problems |
| `snapshot_socket` | snapshot | List open sockets (TCP/UDP/Unix) | Port conflicts, listening ports, connection leaks, ECONNREFUSED |
| `tcpdump` | special | Capture raw packets in pcap-ng format | Deep packet inspection, protocol-level debugging, reproducing network issues |

#### tcpdump gadget

The `tcpdump` gadget outputs raw pcap-ng data. For readable output, pipe it into `tcpdump` on your local machine (it must be installed wherever `kubectl` runs):

```bash
kubectl debug --profile=sysadmin node/<node-name> --attach --quiet \
  --image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:<ig-version> \
  -- ig run tcpdump:<ig-version> -o pcap-ng --k8s-namespace <ns> --k8s-podname <pod> \
     --timeout 30 --pf "port 80" \
  | tcpdump -nvr -
```

Use `--pf "<expr>"` to pass a tcpdump filter expression (e.g., `port 80`, `host 10.0.0.1`). The output format must be `-o pcap-ng`, not `-o json`.
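
The pipeline above decodes packets as they arrive. To keep a capture for later analysis (for example in Wireshark or with `tcpdump -r`), the same command can redirect its pcap-ng output to a local file instead. Here is a minimal sketch. `ig_capture` is a hypothetical function name. The `KUBECTL` override exists only so the command line can be inspected without a cluster.

```bash
# ig_capture <node> <namespace> <pod> <pcap-filter> <timeout-seconds> <outfile>
# Same tcpdump gadget invocation as above, but writes the pcap-ng stream to <outfile>.
# KUBECTL defaults to kubectl; IG_VERSION defaults to the pinned tag.
ig_capture() {
  local node="$1" ns="$2" pod="$3" pf="$4" timeout="$5" out="$6"
  local ver="${IG_VERSION:-v0.51.0}"
  "${KUBECTL:-kubectl}" debug --profile=sysadmin "node/${node}" --attach --quiet \
    --image="mcr.microsoft.com/oss/v2/inspektor-gadget/ig:${ver}" \
    -- ig run "tcpdump:${ver}" -o pcap-ng \
       --k8s-namespace "$ns" --k8s-podname "$pod" \
       --timeout "$timeout" --pf "$pf" > "$out"
}

# Example (placeholders): capture 30 seconds of port-80 traffic, then read it back.
#   ig_capture <node-name> <ns> <pod> "port 80" 30 capture.pcapng
#   tcpdump -nr capture.pcapng
```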

### Process & Workload

| Gadget | Type | What It Does | When To Use |
|---|---|---|---|
| `snapshot_process` | snapshot | List running processes in pod/node | PID pressure, unknown processes, verifying entrypoint, CrashLoopBackOff |
| `trace_exec` | trace | Trace process execution (execve calls) | CrashLoopBackOff (what actually runs), unexpected child processes, security audit |
| `trace_oomkill` | trace | Trace OOM kill events with victim details | OOMKilled pods: see which process was killed and its memory usage at kill time |
| `trace_signal` | trace | Trace signals delivered to processes | Unexpected SIGKILL/SIGTERM, liveness probe kills, graceful shutdown issues |
| `top_process` | top | Rank processes by CPU/memory usage | Identifying resource-hungry processes inside a pod or across a node |
| `profile_cpu` | profile | CPU profiling via stack sampling | High CPU usage investigation, finding hot code paths |
| `traceloop` | trace | Record syscalls as a flight recorder | Catch-all for intermittent issues. **Always use `--syscall-filters`** (e.g., `open,connect,accept`) to limit data volume |
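
The guidance is to always limit `traceloop` with `--syscall-filters`, so a wrapper script can refuse to start it without that flag. Below is a minimal sketch of such a guard. `ig_check_traceloop_args` is a hypothetical name. It only inspects the argument list and never calls IG.

```bash
# ig_check_traceloop_args <gadget> [args...]
# Succeeds for any gadget other than traceloop. For traceloop, succeeds only if
# --syscall-filters is present (as "--syscall-filters x,y" or "--syscall-filters=x,y").
ig_check_traceloop_args() {
  local gadget="$1"
  shift
  if [[ "$gadget" != traceloop ]]; then
    return 0
  fi
  local arg
  for arg in "$@"; do
    if [[ "$arg" == --syscall-filters || "$arg" == --syscall-filters=* ]]; then
      return 0
    fi
  done
  echo "traceloop: pass --syscall-filters (e.g. open,connect,accept) to limit data volume" >&2
  return 1
}

# Example: check the arguments before launching the gadget.
#   ig_check_traceloop_args traceloop --k8s-podname my-pod --syscall-filters open,connect,accept
```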

### File & Storage

| Gadget | Type | What It Does | When To Use |
|---|---|---|---|
| `trace_open` | trace | Trace openat syscall | Missing config/secret files (ENOENT), permission denied (EACCES), startup failures |
| `trace_fsslower` | trace | Trace slow filesystem operations | Slow disk I/O, PVC performance issues, NFS/Azure Disk latency |
| `top_file` | top | Rank files by read/write activity | Identifying I/O-heavy files, noisy log writers, disk pressure diagnosis |

### Security & Audit

| Gadget | Type | What It Does | When To Use |
|---|---|---|---|
| `trace_capabilities` | trace | Trace Linux capability checks | Permission denied from dropped capabilities, SecurityContext debugging |

## Symptom-to-Gadget Map

| Symptom | Gadget(s) |
|---|---|
| DNS resolution failures | `trace_dns` |
| Connection refused / timeout | `trace_tcp` + `snapshot_socket` |
| Silent connection drops | `trace_tcpretrans` |
| High network latency | `trace_tcpretrans` |
| TLS / HTTPS routing issues | `trace_sni` |
| Port already in use | `trace_bind` + `snapshot_socket` |
| CrashLoopBackOff (unknown cause) | `trace_exec` + `trace_open` |
| OOMKilled pods | `trace_oomkill` + `top_process` |
| Pod killed unexpectedly | `trace_signal` |
| PID pressure on node | `snapshot_process` + `top_process` |
| "Too many open files" | `top_file` |
| Missing config / secret mount | `trace_open` |
| Slow disk / PVC performance | `trace_fsslower` + `top_file` |
| Permission denied (capabilities) | `trace_capabilities` |
| High CPU (unknown cause) | `profile_cpu` + `top_process` |
| Deep packet inspection | `tcpdump` |
| Catch-all / intermittent issues | `traceloop` (use `--syscall-filters`) |

## Gadget Type Reference

| Type | Behavior | Suggested `--timeout` |
|---|---|---|
| `snapshot` | Point-in-time data, returns immediately | `--timeout 5` |
| `top` | Aggregated view, returns quickly | `--timeout 5` |
| `trace` | Streams events in real time | `--timeout 30` |
| `profile` | Samples over a duration | `--timeout 30` |
| `tcpdump` (special) | Streams pcap-ng data; pipe to `tcpdump -nvr -` | `--timeout 30` |
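
In the catalog above, each gadget's name prefix matches its type (`snapshot_`, `top_`, `trace_`, `profile_`). `traceloop` is a trace gadget, and `tcpdump` is handled separately. This means the suggested `--timeout` can be derived from the name alone. Below is a minimal sketch, assuming that naming convention holds for the gadgets you use. `ig_default_timeout` is a hypothetical helper.

```bash
# ig_default_timeout <gadget>: print the suggested --timeout (seconds) from the table above.
ig_default_timeout() {
  case "$1" in
    snapshot_*|top_*)         echo 5 ;;   # point-in-time / aggregated views
    trace*|profile_*|tcpdump) echo 30 ;;  # streaming or sampling (trace* also covers traceloop)
    *) echo "unknown gadget type: $1" >&2; return 1 ;;
  esac
}

# Example:
#   ig_default_timeout snapshot_socket   # 5
#   ig_default_timeout trace_dns         # 30
```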

## Guardrails

- IG gadgets are **read-only**: they observe, but do not modify, cluster or application state. The one side effect is the privileged debug pod that `kubectl debug` creates on the node.
- Resolve the correct node name before running any IG command.
- Always set `--timeout` to cap runtime. Prefer snapshot/top gadgets for quick checks and trace/profile gadgets for observing behavior over time.
- To capture a problem as it happens, launch a trace gadget first, then reproduce the problem. The debug pod persists after the gadget exits. Afterward, run `kubectl logs <debug-pod>` to retrieve the captured output, then `kubectl delete pod <debug-pod>` to remove the pod.
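
Because debug pods outlive the gadget, you can script the steps of collecting their output and removing them. This is a minimal sketch, assuming the pods were created by `kubectl debug node/...` in the current namespace, where kubectl names them `node-debugger-<node>-<suffix>`. The function names are hypothetical. It deletes every matching pod, so review the list first on shared clusters. `KUBECTL` can be overridden (for example with a stub) to exercise the logic without a cluster.

```bash
# ig_debug_pods: list debug pods created by `kubectl debug node/...`
# in the current namespace (kubectl names them node-debugger-<node>-<suffix>).
ig_debug_pods() {
  "${KUBECTL:-kubectl}" get pods -o name | grep '^pod/node-debugger-' || true
}

# ig_collect_and_clean: print each debug pod's captured output, then delete the pod.
# Deletes every match, so check ig_debug_pods output first on shared clusters.
ig_collect_and_clean() {
  local pod
  for pod in $(ig_debug_pods); do
    "${KUBECTL:-kubectl}" logs "$pod"
    "${KUBECTL:-kubectl}" delete "$pod"
  done
}
```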