Diagnose Azure service issues, query logs, and troubleshoot failures using GitHub Copilot for Azure
troubleshooting/aks/references/inspektor-gadget.md
# Inspektor Gadget (IG) Reference

Use Inspektor Gadget for real-time, low-level node/pod diagnostics when `kubectl` is insufficient.

## IG Version

`<ig-version>` = `v0.51.0`. Substitute this exact tag (with `v` prefix) wherever `<ig-version>` appears. Bump this line only.

## Base Command Pattern

```bash
kubectl debug --profile=sysadmin node/<node-name> --attach --quiet \
  --image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:<ig-version> \
  -- ig run <gadget>:<ig-version> -o json --timeout <seconds> [filters...]
```

Always set `--timeout` after `--` to cap runtime. Use `--timeout 5` for snapshot/top, `--timeout 30` for trace/profile.

> **Note:** IG uses `kubectl debug --profile=sysadmin` (privileged debug pod). Only run with explicit user approval and appropriate RBAC.

**Required:** Resolve the node name first:

```bash
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.nodeName}'
```

## Common Filters

| Filter | Description |
|---|---|
| `--k8s-namespace <ns>` | Scope to a Kubernetes namespace |
| `--k8s-podname <pod>` | Scope to a specific pod |
| `--k8s-containername <ctr>` | Scope to a specific container |
| `--timeout <seconds>` | Cap streaming duration for trace/profile gadgets |
| `--max-entries <n>` | Max entries per batch for top/profile gadgets |
| `--map-fetch-interval <dur>` | Map fetch interval for top (except `top_process`) and profile gadgets (default `1000ms`) |
| `--interval <dur>` | Reporting interval for `top_process` only (e.g. `5s`) |
| `--syscall-filters <list>` | Comma-separated syscalls for `traceloop` (e.g. `open,connect,accept`). **Always specify** to limit data volume |

> **Tip:** For top/profile, set `--map-fetch-interval` ≤ half of `--timeout` to collect at least one batch. E.g. `--timeout 2 --map-fetch-interval 1000ms --max-entries 20`.
>
> **Note:** `top_process` uses `--interval` instead of `--map-fetch-interval`. E.g. `--timeout 10 --interval 5s --max-entries 20`.

## Gadget Catalog

### Networking

| Gadget | Type | What It Does | When To Use |
|---|---|---|---|
| `trace_dns` | trace | Trace DNS queries and responses with latency | DNS failures, NXDOMAIN, SERVFAIL, slow resolution, intermittent DNS |
| `trace_tcp` | trace | Trace TCP connect/accept/close events | Connection refused, timeouts, unexpected drops, mapping pod connectivity |
| `trace_tcpretrans` | trace | Trace TCP retransmissions | Network congestion, lossy links, high latency between pods/services |
| `trace_bind` | trace | Trace socket bind calls | Port conflicts, address-already-in-use errors |
| `trace_sni` | trace | Trace TLS SNI (Server Name Indication) values | HTTPS routing issues, ingress TLS debugging, mTLS problems |
| `snapshot_socket` | snapshot | List open sockets (TCP/UDP/Unix) | Port conflicts, listening ports, connection leaks, ECONNREFUSED |
| `tcpdump` | special | Capture raw packets in pcap-ng format | Deep packet inspection, protocol-level debugging, reproducing network issues |

#### tcpdump gadget

Outputs raw pcap-ng data. Pipe to `tcpdump` for readable output:

```bash
kubectl debug --profile=sysadmin node/<node-name> --attach --quiet \
  --image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:<ig-version> \
  -- ig run tcpdump:<ig-version> -o pcap-ng --k8s-namespace <ns> --k8s-podname <pod> \
  --timeout 30 --pf "port 80" \
  | tcpdump -nvr -
```

Use `--pf "<expr>"` for tcpdump filters (e.g., `port 80`, `host 10.0.0.1`). Output must be `-o pcap-ng` (not `-o json`).

### Process & Workload

| Gadget | Type | What It Does | When To Use |
|---|---|---|---|
| `snapshot_process` | snapshot | List running processes in pod/node | PID pressure, unknown processes, verifying entrypoint, CrashLoopBackOff |
| `trace_exec` | trace | Trace process execution (execve calls) | CrashLoopBackOff (what actually runs), unexpected child processes, security audit |
| `trace_oomkill` | trace | Trace OOM kill events with victim details | OOMKilled pods: see which process was killed, memory usage at kill time |
| `trace_signal` | trace | Trace signals delivered to processes | Unexpected SIGKILL/SIGTERM, liveness probe kills, graceful shutdown issues |
| `top_process` | top | Rank processes by CPU/memory usage | Identifying resource-hungry processes inside a pod or across a node |
| `profile_cpu` | profile | CPU profiling via stack sampling | High CPU usage investigation, finding hot code paths |
| `traceloop` | trace | Record syscalls as a flight recorder | Catch-all for intermittent issues. **Always use `--syscall-filters`** (e.g., `open,connect,accept`) to limit data volume |

### File & Storage

| Gadget | Type | What It Does | When To Use |
|---|---|---|---|
| `trace_open` | trace | Trace openat syscall | Missing config/secret files (ENOENT), permission denied (EACCES), startup failures |
| `trace_fsslower` | trace | Trace slow filesystem operations | Slow disk I/O, PVC performance issues, NFS/Azure Disk latency |
| `top_file` | top | Rank files by read/write activity | Identifying I/O-heavy files, noisy log writers, disk pressure diagnosis |

### Security & Audit

| Gadget | Type | What It Does | When To Use |
|---|---|---|---|
| `trace_capabilities` | trace | Trace Linux capability checks | Permission denied from dropped capabilities, SecurityContext debugging |

## Symptom-to-Gadget Map

| Symptom | Gadget(s) |
|---|---|
| DNS resolution failures | `trace_dns` |
| Connection refused / timeout | `trace_tcp` + `snapshot_socket` |
| Silent connection drops | `trace_tcpretrans` |
| High network latency | `trace_tcpretrans` |
| TLS / HTTPS routing issues | `trace_sni` |
| Port already in use | `trace_bind` + `snapshot_socket` |
| CrashLoopBackOff (unknown cause) | `trace_exec` + `trace_open` |
| OOMKilled pods | `trace_oomkill` + `top_process` |
| Pod killed unexpectedly | `trace_signal` |
| PID pressure on node | `snapshot_process` + `top_process` |
| "Too many open files" | `top_file` |
| Missing config / secret mount | `trace_open` |
| Slow disk / PVC performance | `trace_fsslower` + `top_file` |
| Permission denied (capabilities) | `trace_capabilities` |
| High CPU (unknown cause) | `profile_cpu` + `top_process` |
| Deep packet inspection | `tcpdump` |
| Catch-all / intermittent issues | `traceloop` (use `--syscall-filters`) |

## Gadget Type Reference

| Type | Behavior | IG --timeout |
|---|---|---|
| `snapshot` | Point-in-time data, returns immediately | `--timeout 5` |
| `top` | Aggregated view, returns quickly | `--timeout 5` |
| `trace` | Streams events in real-time | `--timeout 30` |
| `profile` | Samples over a duration | `--timeout 30` |
| `tcpdump` | Streams pcap-ng data, pipe to `tcpdump -nvr -` | `--timeout 30` |

## Guardrails

- IG gadgets are **read-only**: they do not modify cluster or application state.
- Resolve the correct node name before running any IG command.
- Always set `--timeout` to cap runtime. Prefer snapshot/top for quick checks; trace/profile for behavior over time.
- For reproduction: launch a trace gadget first, then reproduce the problem. The debug pod persists after the gadget exits, so run `kubectl logs <debug-pod>` to retrieve the captured output afterward.
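
To illustrate how the base command pattern, the pinned IG version, and the per-type timeout guidance fit together, here is a minimal dry-run sketch. The function names `ig_cmd` and `default_timeout` and the node, namespace, and pod names are hypothetical, not part of IG or this skill. The helper only prints the command so it can be reviewed and approved before anyone runs it, and it covers JSON-output gadgets only (not `tcpdump`, which needs `-o pcap-ng`).

```bash
#!/usr/bin/env bash
# Hypothetical dry-run helpers: they print an IG command, they never execute it.

IG_VERSION="v0.51.0"   # the value of <ig-version> stated in this reference

# Suggested --timeout per gadget type, following the Gadget Type Reference table.
default_timeout() {
  case "$1" in
    snapshot|top) echo 5 ;;
    *)            echo 30 ;;   # trace and profile gadgets
  esac
}

# Usage: ig_cmd <node-name> <gadget> <timeout-seconds> [filters...]
ig_cmd() {
  local node="$1" gadget="$2" timeout="$3"
  shift 3
  printf 'kubectl debug --profile=sysadmin node/%s --attach --quiet' "$node"
  printf ' --image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:%s' "$IG_VERSION"
  printf ' -- ig run %s:%s -o json --timeout %s' "$gadget" "$IG_VERSION" "$timeout"
  # Remaining arguments are passed through unchanged as filters.
  local arg
  for arg in "$@"; do
    printf ' %s' "$arg"
  done
  printf '\n'
}

# Example with placeholder names: trace DNS for one pod using the trace default.
ig_cmd my-node trace_dns "$(default_timeout trace)" \
  --k8s-namespace my-namespace --k8s-podname my-pod
```

Printing instead of executing keeps the privileged `kubectl debug --profile=sysadmin` step behind explicit user approval, as the note under the base command pattern requires.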