Debug and troubleshoot Azure Container Apps and Function Apps using logs, KQL, and health checks.
troubleshooting/aks/references/inspektor-gadget.md
# Inspektor Gadget (IG) Reference

Use Inspektor Gadget for real-time, low-level node/pod diagnostics when `kubectl` is insufficient.

## IG Version

`<ig-version>` = `v0.51.0`. Substitute this exact tag (with `v` prefix) wherever `<ig-version>` appears. Bump this line only.

## Base Command Pattern

```bash
kubectl debug --profile=sysadmin node/<node-name> --attach --quiet \
  --image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:<ig-version> \
  -- ig run <gadget>:<ig-version> -o json --timeout <seconds> [filters...]
```

Always set `--timeout` after `--` to cap runtime. Use `--timeout 5` for snapshot/top, `--timeout 30` for trace/profile.

> **Note:** IG uses `kubectl debug --profile=sysadmin` (privileged debug pod). Only run with explicit user approval and appropriate RBAC.

**Required:** Resolve the node name first:

```bash
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.nodeName}'
```

## Common Filters

| Filter | Description |
|---|---|
| `--k8s-namespace <ns>` | Scope to a Kubernetes namespace |
| `--k8s-podname <pod>` | Scope to a specific pod |
| `--k8s-containername <ctr>` | Scope to a specific container |
| `--timeout <seconds>` | Cap streaming duration for trace/profile gadgets |
| `--max-entries <n>` | Max entries per batch for top/profile gadgets |
| `--map-fetch-interval <dur>` | Map fetch interval for top (except `top_process`) and profile gadgets (default `1000ms`) |
| `--interval <dur>` | Reporting interval for `top_process` only (e.g. `5s`) |
| `--syscall-filters <list>` | Comma-separated syscalls for `traceloop` (e.g. `open,connect,accept`). **Always specify** to limit data volume |

> **Tip:** For top/profile, set `--map-fetch-interval` ≤ half of `--timeout` to collect at least one batch. E.g. `--timeout 2 --map-fetch-interval 1000ms --max-entries 20`.
>
> **Note:** `top_process` uses `--interval` instead of `--map-fetch-interval`. E.g. `--timeout 10 --interval 5s --max-entries 20`.

## Gadget Catalog

### Networking

| Gadget | Type | What It Does | When To Use |
|---|---|---|---|
| `trace_dns` | trace | Trace DNS queries and responses with latency | DNS failures, NXDOMAIN, SERVFAIL, slow resolution, intermittent DNS |
| `trace_tcp` | trace | Trace TCP connect/accept/close events | Connection refused, timeouts, unexpected drops, mapping pod connectivity |
| `trace_tcpretrans` | trace | Trace TCP retransmissions | Network congestion, lossy links, high latency between pods/services |
| `trace_bind` | trace | Trace socket bind calls | Port conflicts, address-already-in-use errors |
| `trace_sni` | trace | Trace TLS SNI (Server Name Indication) values | HTTPS routing issues, ingress TLS debugging, mTLS problems |
| `snapshot_socket` | snapshot | List open sockets (TCP/UDP/Unix) | Port conflicts, listening ports, connection leaks, ECONNREFUSED |
| `tcpdump` | special | Capture raw packets in pcap-ng format | Deep packet inspection, protocol-level debugging, reproducing network issues |

#### tcpdump gadget

Outputs raw pcap-ng data. Pipe to `tcpdump` for readable output:

```bash
kubectl debug --profile=sysadmin node/<node-name> --attach --quiet \
  --image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:<ig-version> \
  -- ig run tcpdump:<ig-version> -o pcap-ng --k8s-namespace <ns> --k8s-podname <pod> \
  --timeout 30 --filter "port 80" \
  | tcpdump -nvr -
```

Use `--filter "<expr>"` for tcpdump filters (e.g., `port 80`, `host 10.0.0.1`). Output must be `-o pcap-ng` (not `-o json`).

### Process & Workload

| Gadget | Type | What It Does | When To Use |
|---|---|---|---|
| `snapshot_process` | snapshot | List running processes in pod/node | PID pressure, unknown processes, verifying entrypoint, CrashLoopBackOff |
| `trace_exec` | trace | Trace process execution (execve calls) | CrashLoopBackOff (what actually runs), unexpected child processes, security audit |
| `trace_oomkill` | trace | Trace OOM kill events with victim details | OOMKilled pods: see which process was killed, memory usage at kill time |
| `trace_signal` | trace | Trace signals delivered to processes | Unexpected SIGKILL/SIGTERM, liveness probe kills, graceful shutdown issues |
| `top_process` | top | Rank processes by CPU/memory usage | Identifying resource-hungry processes inside a pod or across a node |
| `profile_cpu` | profile | CPU profiling via stack sampling | High CPU usage investigation, finding hot code paths |
| `traceloop` | trace | Record syscalls as a flight recorder | Catch-all for intermittent issues. **Always use `--syscall-filters`** (e.g., `open,connect,accept`) to limit data volume |

### File & Storage

| Gadget | Type | What It Does | When To Use |
|---|---|---|---|
| `trace_open` | trace | Trace openat syscall | Missing config/secret files (ENOENT), permission denied (EACCES), startup failures |
| `trace_fsslower` | trace | Trace slow filesystem operations | Slow disk I/O, PVC performance issues, NFS/Azure Disk latency |
| `top_file` | top | Rank files by read/write activity | Identifying I/O-heavy files, noisy log writers, disk pressure diagnosis |

### Security & Audit

| Gadget | Type | What It Does | When To Use |
|---|---|---|---|
| `trace_capabilities` | trace | Trace Linux capability checks | Permission denied from dropped capabilities, SecurityContext debugging |

## Symptom-to-Gadget Map

| Symptom | Gadget(s) |
|---|---|
| DNS resolution failures | `trace_dns` |
| Connection refused / timeout | `trace_tcp` + `snapshot_socket` |
| Silent connection drops | `trace_tcpretrans` |
| High network latency | `trace_tcpretrans` |
| TLS / HTTPS routing issues | `trace_sni` |
| Port already in use | `trace_bind` + `snapshot_socket` |
| CrashLoopBackOff (unknown cause) | `trace_exec` + `trace_open` |
| OOMKilled pods | `trace_oomkill` + `top_process` |
| Pod killed unexpectedly | `trace_signal` |
| PID pressure on node | `snapshot_process` + `top_process` |
| "Too many open files" | `top_file` |
| Missing config / secret mount | `trace_open` |
| Slow disk / PVC performance | `trace_fsslower` + `top_file` |
| Permission denied (capabilities) | `trace_capabilities` |
| High CPU (unknown cause) | `profile_cpu` + `top_process` |
| Deep packet inspection | `tcpdump` |
| Catch-all / intermittent issues | `traceloop` (use `--syscall-filters`) |

## Gadget Type Reference

| Type | Behavior | IG `--timeout` |
|---|---|---|
| `snapshot` | Point-in-time data, returns immediately | `--timeout 5` |
| `top` | Aggregated view, returns quickly | `--timeout 5` |
| `trace` | Streams events in real-time | `--timeout 30` |
| `profile` | Samples over a duration | `--timeout 30` |
| `tcpdump` | Streams pcap-ng data, pipe to `tcpdump -nvr -` | `--timeout 30` |

## Guardrails

- IG gadgets are **read-only**: they do not modify cluster or application state.
- Resolve the correct node name before running any IG command.
- Always set `--timeout` to cap runtime. Prefer snapshot/top for quick checks; trace/profile for behavior over time.
- For reproduction: launch a trace gadget first, then reproduce the problem. The debug pod persists after the gadget exits, so run `kubectl logs <debug-pod>` to retrieve the captured output afterward.
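
The base command pattern and the per-type timeouts can be combined in a small shell helper. The sketch below is illustrative only: the function names `default_timeout` and `build_ig_cmd` are hypothetical (not part of IG or `kubectl`), and the helper only prints the command rather than running it, so the privileged debug pod is still launched only after explicit approval. It covers the `-o json` gadget types; `tcpdump` is left out because it needs `-o pcap-ng`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints (does not execute) an IG command line.

IG_VERSION="v0.51.0"  # pinned tag from the "IG Version" section

# default_timeout <type>
# Echoes the timeout in seconds from the Gadget Type Reference table.
default_timeout() {
  case "$1" in
    snapshot|top)  echo 5 ;;
    trace|profile) echo 30 ;;
    *) echo "unsupported gadget type: $1" >&2; return 1 ;;
  esac
}

# build_ig_cmd <node-name> <gadget> <type> [filters...]
# Assembles the base command pattern with the default timeout for <type>.
build_ig_cmd() {
  local node="$1" gadget="$2" type="$3"
  shift 3
  local timeout
  timeout="$(default_timeout "$type")" || return 1

  printf '%s' "kubectl debug --profile=sysadmin node/${node} --attach --quiet"
  printf ' %s' "--image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:${IG_VERSION}"
  printf ' %s' "-- ig run ${gadget}:${IG_VERSION} -o json --timeout ${timeout}"
  # Append any scoping filters (e.g. --k8s-namespace <ns>) unchanged.
  if [ "$#" -gt 0 ]; then
    printf ' %s' "$@"
  fi
  printf '\n'
}
```

For example, `build_ig_cmd <node-name> trace_dns trace --k8s-namespace <ns>` prints a `trace_dns` command with `--timeout 30`, which can be reviewed before it is pasted into a terminal.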