# Trace-to-Dataset Pipeline — Harvest Production Traces as Test Cases

Extract production traces from App Insights using KQL, transform them into evaluation dataset format, and persist them as versioned datasets. This is the core workflow for turning real-world agent failures into reproducible test cases.

## ⛔ Do NOT

- Do NOT use `parse_json(customDimensions)` — `customDimensions` is already a `dynamic` column in App Insights KQL. Access properties directly: `customDimensions["gen_ai.response.id"]`.

## Related References

- [Eval Correlation](../../trace/references/eval-correlation.md) (in `foundry-agent/trace/references/`) — look up eval scores by response/conversation ID via `customEvents`
- [KQL Templates](../../trace/references/kql-templates.md) (in `foundry-agent/trace/references/`) — general trace query patterns and attribute mappings

## Prerequisites

- App Insights resource resolved (see [trace skill](../../trace/trace.md) Before Starting)
- Agent root, selected metadata file, environment, and project endpoint available from `.foundry/agent-metadata*.yaml`
- Time range confirmed with user (default: last 7 days)

When a repo contains multiple agent roots, this workflow updates only the selected agent root's `.foundry/datasets/`, `.foundry/results/`, and metadata files. Do **not** merge sibling agent folders.

> 💡 **Run all KQL queries** using **`monitor_resource_log_query`** (Azure MCP tool) against the App Insights resource. This is preferred over delegating to the `azure-kusto` skill.

> ⚠️ **Always pass `subscription` explicitly** to Azure MCP tools — they don't extract it from resource IDs.

## Overview

```
App Insights traces
        │
        ▼
[1] KQL Harvest Query (filter by error/latency/eval score)
        │
        ▼
[2] Schema Transform (trace → JSONL format)
        │
        ▼
[3] Human Review (show candidates, let user approve/edit/reject)
        │
        ▼
[4] Persist Dataset (local JSONL files)
        │
        ▼
[5] Sync to Foundry (optional — upload to project-connected storage)
```

## Key Concept: Linking Evaluation Results to Traces

> 💡 **Evaluation results live in `customEvents`, not in `dependencies`.** Foundry writes eval scores to App Insights as `customEvents` with `name == "gen_ai.evaluation.result"`. Agent traces (spans) live in `dependencies`. The link between them is **`gen_ai.response.id`** — this field appears on both tables.

| Table | Contains | Join Key |
|-------|----------|----------|
| `dependencies` | Agent traces (spans, tool calls, LLM calls) | `customDimensions["gen_ai.response.id"]` |
| `customEvents` | Evaluation results (scores, labels, explanations) | `customDimensions["gen_ai.response.id"]` |

**To harvest traces with eval scores**, join `customEvents` → `dependencies` on `responseId`. The [Low-Eval Harvest](#low-eval-harvest--traces-with-poor-evaluation-scores) template below shows this pattern. For standalone eval lookups, see [Eval Correlation](../../trace/references/eval-correlation.md) (in `foundry-agent/trace/references/`).

## Step 1 — Choose a Harvest Template

Select the appropriate KQL template based on user intent. These templates mirror common LangSmith "run rules" but offer more power through KQL's query language.

> ⚠️ **Hosted agents:** The Foundry agent name (e.g., `hosted-agent-022-001`) only appears on `requests`, NOT on `dependencies`. For hosted agents, use the [Hosted Agent Harvest](#hosted-agent-harvest--two-step-join-pattern) template, which joins via `requests.id` → `dependencies.operation_ParentId`. The templates below work directly for **prompt agents**, where `gen_ai.agent.name` on `dependencies` matches the Foundry name.

### Error Harvest — Failed Traces

Captures all traces where the agent returned errors. Equivalent to LangSmith's `eq(error, True)` run rule.

```kql
dependencies
| where timestamp > ago(7d)
| where success == false
| where isnotempty(customDimensions["gen_ai.operation.name"])
| where customDimensions["gen_ai.agent.name"] == "<agent-name>"
| extend
    conversationId = tostring(customDimensions["gen_ai.conversation.id"]),
    responseId = tostring(customDimensions["gen_ai.response.id"]),
    operation = tostring(customDimensions["gen_ai.operation.name"]),
    model = tostring(customDimensions["gen_ai.request.model"]),
    errorType = tostring(customDimensions["error.type"]),
    inputTokens = toint(customDimensions["gen_ai.usage.input_tokens"]),
    outputTokens = toint(customDimensions["gen_ai.usage.output_tokens"])
| summarize
    errorCount = count(),
    errors = make_set(errorType, 5),
    firstSeen = min(timestamp),
    lastSeen = max(timestamp)
    by conversationId, responseId, operation, model
| order by lastSeen desc
| take 100
```

### Low-Eval Harvest — Traces with Poor Evaluation Scores

Captures traces where evaluator scores fell below a threshold. Equivalent to LangSmith's `and(eq(feedback_key, "quality"), lt(feedback_score, 0.3))` run rule.

```kql
let lowEvalResponses = customEvents
| where timestamp > ago(7d)
| where name == "gen_ai.evaluation.result"
| extend
    score = todouble(customDimensions["gen_ai.evaluation.score.value"]),
    evalName = tostring(customDimensions["gen_ai.evaluation.name"]),
    responseId = tostring(customDimensions["gen_ai.response.id"]),
    conversationId = tostring(customDimensions["gen_ai.conversation.id"])
| where score < <threshold>
| project responseId, conversationId, evalName, score;
lowEvalResponses
| join kind=inner (
    dependencies
    | where timestamp > ago(7d)
    | where isnotempty(customDimensions["gen_ai.response.id"])
    | extend responseId = tostring(customDimensions["gen_ai.response.id"])
) on responseId
| extend
    operation = tostring(customDimensions["gen_ai.operation.name"]),
    model = tostring(customDimensions["gen_ai.request.model"]),
    inputTokens = toint(customDimensions["gen_ai.usage.input_tokens"]),
    outputTokens = toint(customDimensions["gen_ai.usage.output_tokens"])
| project timestamp, conversationId, responseId, evalName, score, operation, model, duration
| order by score asc
| take 100
```

> 💡 **Tip:** Replace `<threshold>` with the pass threshold from your evaluator config. Common values: `3.0` for 1–5 ordinal scales, `0.5` for 0–1 continuous scales.

### Latency Harvest — Slow Responses

Captures traces where response latency exceeds a threshold. Equivalent to LangSmith's `gt(latency, 5000)` run rule.

```kql
dependencies
| where timestamp > ago(7d)
| where duration > <threshold_ms>
| where isnotempty(customDimensions["gen_ai.operation.name"])
| where customDimensions["gen_ai.agent.name"] == "<agent-name>"
| extend
    conversationId = tostring(customDimensions["gen_ai.conversation.id"]),
    responseId = tostring(customDimensions["gen_ai.response.id"]),
    operation = tostring(customDimensions["gen_ai.operation.name"]),
    model = tostring(customDimensions["gen_ai.request.model"]),
    inputTokens = toint(customDimensions["gen_ai.usage.input_tokens"]),
    outputTokens = toint(customDimensions["gen_ai.usage.output_tokens"])
| summarize
    avgDuration = avg(duration),
    maxDuration = max(duration),
    spanCount = count()
    by conversationId, responseId, operation, model
| order by maxDuration desc
| take 100
```

> 💡 **Tip:** Replace `<threshold_ms>` with the latency threshold in milliseconds. Common values: `5000` (5s), `10000` (10s), `30000` (30s).

### Combined Harvest — Multi-Criteria Filter

Combines multiple filters in a single query. Equivalent to LangSmith's compound rule: `and(gt(latency, 2000), eq(error, true), has(tags, "prod"))`.

```kql
dependencies
| where timestamp > ago(7d)
| where customDimensions["gen_ai.agent.name"] == "<agent-name>"
| where isnotempty(customDimensions["gen_ai.operation.name"])
| where success == false or duration > <threshold_ms>
| extend
    conversationId = tostring(customDimensions["gen_ai.conversation.id"]),
    responseId = tostring(customDimensions["gen_ai.response.id"]),
    operation = tostring(customDimensions["gen_ai.operation.name"]),
    model = tostring(customDimensions["gen_ai.request.model"]),
    errorType = tostring(customDimensions["error.type"]),
    inputTokens = toint(customDimensions["gen_ai.usage.input_tokens"]),
    outputTokens = toint(customDimensions["gen_ai.usage.output_tokens"])
| summarize
    errorCount = countif(success == false),
    avgDuration = avg(duration),
    maxDuration = max(duration),
    spanCount = count()
    by conversationId, responseId, operation, model
| order by errorCount desc, maxDuration desc
| take 100
```

### Sampling — Control Dataset Size

Add `| sample <N>` or `| take <N>` to any harvest query to control the number of traces extracted. Equivalent to LangSmith's `sampling_rate` parameter. For a stratified sample, run each harvest separately and combine the results — see the sketch after this section.

```kql
// Random sample of 50 traces from the harvest
... | sample 50

// Top 50 most recent traces
... | order by timestamp desc | take 50

// Stratified sample: 20 errors + 20 slow + 10 low-eval
// Run each harvest separately and combine
```
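A minimal sketch of that combine step, assuming each harvest query's rows were exported locally as a JSON array keyed by `responseId`; the file names and quota values are illustrative, not part of the skill:

```python
import json
from pathlib import Path

# Hypothetical local exports of the three harvest queries (one JSON array each).
HARVESTS = {
    "error": Path("harvest-errors.json"),
    "latency": Path("harvest-slow.json"),
    "low-eval": Path("harvest-low-eval.json"),
}
QUOTAS = {"error": 20, "latency": 20, "low-eval": 10}  # stratified sample sizes

combined, seen = [], set()
for rule, path in HARVESTS.items():
    kept = 0
    for row in json.loads(path.read_text()):
        rid = row.get("responseId")
        if not rid or rid in seen or kept >= QUOTAS[rule]:
            # Dedupe across strata: a trace that is both slow and failing counts once.
            continue
        row["harvestRule"] = rule
        combined.append(row)
        seen.add(rid)
        kept += 1

print(f"Combined {len(combined)} unique candidates across {len(HARVESTS)} strata")
```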
### Hosted Agent Harvest — Two-Step Join Pattern

For hosted agents, the Foundry agent name lives on `requests`, not `dependencies`. Use this two-step pattern:

```kql
let reqIds = requests
| where timestamp > ago(7d)
| where customDimensions["gen_ai.agent.name"] == "<foundry-agent-name>"
| distinct id;
dependencies
| where timestamp > ago(7d)
| where operation_ParentId in (reqIds)
| where customDimensions["gen_ai.operation.name"] == "invoke_agent"
| extend
    conversationId = tostring(customDimensions["gen_ai.conversation.id"]),
    responseId = tostring(customDimensions["gen_ai.response.id"]),
    operation = tostring(customDimensions["gen_ai.operation.name"]),
    model = tostring(customDimensions["gen_ai.request.model"]),
    inputTokens = toint(customDimensions["gen_ai.usage.input_tokens"]),
    outputTokens = toint(customDimensions["gen_ai.usage.output_tokens"])
| project timestamp, duration, success, conversationId, responseId, operation, model, inputTokens, outputTokens
| order by timestamp desc
| take 100
```

> 💡 **When to use this pattern:** If the direct `dependencies` filter by `gen_ai.agent.name` returns no results, the agent is likely a hosted agent where `gen_ai.agent.name` on `dependencies` holds the code-level class name (e.g., `BingSearchAgent`), not the Foundry name. Switch to this `requests` → `dependencies` join.

## Step 2 — Schema Transform

Transform harvested traces into JSONL dataset format. Each line in the JSONL file must contain:

| Field | Required | Source |
|-------|----------|--------|
| `query` | ✅ | User input — extract from `gen_ai.input.messages` on `invoke_agent` dependency spans |
| `response` | Optional | Agent output — extract from `gen_ai.output.messages` on `invoke_agent` dependency spans |
| `context` | Optional | Tool results or retrieved documents from the trace |
| `ground_truth` | Optional | Expected correct answer (add during curation) |
| `metadata` | Optional | Source info: `{"source": "trace", "conversationId": "...", "harvestRule": "error"}` |

### Extracting Input/Output from Traces

The full input/output content lives on `invoke_agent` dependency spans in `gen_ai.input.messages` and `gen_ai.output.messages`. These contain complete message arrays:

```json
// gen_ai.input.messages structure:
[{"role": "user", "parts": [{"type": "text", "content": "How do I reset my password?"}]}]

// gen_ai.output.messages structure:
[{"role": "assistant", "parts": [{"type": "text", "content": "To reset your password..."}]}]
```

Query to extract input/output for a specific conversation (note the `tostring()` around the operation name — the `in` operator needs a scalar, not a `dynamic` value):

```kql
dependencies
| where customDimensions["gen_ai.conversation.id"] == "<conversation-id>"
| where tostring(customDimensions["gen_ai.operation.name"]) in ("invoke_agent", "execute_agent", "chat", "create_response")
| extend
    responseId = tostring(customDimensions["gen_ai.response.id"]),
    operation = tostring(customDimensions["gen_ai.operation.name"]),
    inputMessages = tostring(customDimensions["gen_ai.input.messages"]),
    outputMessages = tostring(customDimensions["gen_ai.output.messages"])
| order by timestamp asc
| take 10
```

Extract the `query` from the last user-role entry in `gen_ai.input.messages` and the `response` from `gen_ai.output.messages`. Save the extracted data to a local JSONL file:

```
.foundry/datasets/<agent-name>-traces-candidates-<date>.jsonl
```
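A minimal transform sketch, assuming the harvest rows were exported locally with `inputMessages`/`outputMessages` as JSON strings (column names from the query above); the helper names, file paths, and agent name are illustrative:

```python
import json
from pathlib import Path

def last_text(messages_json: str, role: str) -> str | None:
    """Return the text content of the last entry with the given role
    from a gen_ai.input/output.messages JSON string."""
    for msg in reversed(json.loads(messages_json or "[]")):
        if msg.get("role") == role:
            parts = msg.get("parts", [])
            text = " ".join(p.get("content", "") for p in parts if p.get("type") == "text")
            return text or None
    return None

def to_dataset_line(row: dict, harvest_rule: str) -> dict | None:
    """Map one harvested trace row to the JSONL dataset schema; `query` is required."""
    query = last_text(row.get("inputMessages", ""), "user")
    if not query:
        return None  # skip rows with no recoverable user query
    line = {
        "query": query,
        "metadata": {
            "source": "trace",
            "conversationId": row.get("conversationId"),
            "harvestRule": harvest_rule,
        },
    }
    response = last_text(row.get("outputMessages", ""), "assistant")
    if response:
        line["response"] = response
    return line

# Illustrative I/O: rows exported from the extraction query above as a JSON array.
rows = json.loads(Path("harvest-rows.json").read_text())
out = Path(".foundry/datasets/support-bot-traces-candidates-2025-02-08.jsonl")  # example name
with out.open("w") as f:
    for row in rows:
        if (line := to_dataset_line(row, "error")):
            f.write(json.dumps(line) + "\n")
```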
## Step 3 — Human Review (Curation)

> ⚠️ **MANDATORY:** Never auto-commit harvested traces to a dataset. Always show candidates to the user first.

Present the harvested candidates as a table:

| # | Conversation ID | Error Type | Duration | Eval Score | Query (preview) |
|---|----------------|------------|----------|------------|----------------|
| 1 | conv-abc-123 | TimeoutError | 12.3s | 2.0 | "How do I reset my..." |
| 2 | conv-def-456 | None | 8.7s | 1.5 | "What's the status of..." |
| 3 | conv-ghi-789 | ValidationError | 0.4s | 3.0 | "Can you help me with..." |

Ask the user:
- *"Which candidates should I include in the dataset? (all / select by number / filter by criteria)"*
- *"Would you like to add ground_truth reference answers for any of these?"*
- *"What should I name this dataset version?"*
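If scripting the review table helps, here is a sketch that renders candidates from the JSONL file; it assumes the candidate metadata also carries `errorType`, `durationMs`, and `evalScore` — illustrative extras beyond the required schema:

```python
import json
from pathlib import Path

# Example candidate file from Step 2 (path is illustrative).
candidates_path = Path(".foundry/datasets/support-bot-traces-candidates-2025-02-08.jsonl")
candidates = [json.loads(l) for l in candidates_path.read_text().splitlines() if l.strip()]

print("| # | Conversation ID | Error Type | Duration | Eval Score | Query (preview) |")
print("|---|----------------|------------|----------|------------|----------------|")
for i, c in enumerate(candidates, 1):
    meta = c.get("metadata", {})
    # Truncate the query so the review table stays readable.
    preview = c["query"][:20] + "..." if len(c["query"]) > 20 else c["query"]
    print(f"| {i} | {meta.get('conversationId', '?')} | {meta.get('errorType', 'None')} "
          f"| {meta.get('durationMs', '?')} | {meta.get('evalScore', '?')} | \"{preview}\" |")
```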
## Step 4 — Persist Dataset (Local JSONL)

Save approved candidates to `.foundry/datasets/<agent-name>-<source>-v<N>.jsonl`:

```json
{"query": "How do I reset my password?", "context": "User account management", "metadata": {"source": "trace", "conversationId": "conv-abc-123", "harvestRule": "error"}}
{"query": "What's the status of my order?", "response": "...", "ground_truth": "Order #12345 shipped on...", "metadata": {"source": "trace", "conversationId": "conv-def-456", "harvestRule": "latency"}}
```

### Update Manifest

After persisting, update `.foundry/datasets/manifest.json` with lineage information:

```json
{
  "datasets": [
    {
      "name": "support-bot-prod-traces",
      "file": "support-bot-prod-traces-v3.jsonl",
      "version": "v3",
      "source": "trace-harvest",
      "harvestRule": "error+latency",
      "timeRange": "2025-02-01 to 2025-02-07",
      "exampleCount": 47,
      "createdAt": "2025-02-08T10:00:00Z",
      "reviewedBy": "user"
    }
  ]
}
```
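A sketch of that manifest update, assuming the manifest follows the shape above; the read-modify-write and entry values mirror the example, and the specific field values are placeholders to fill from the approved run:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

manifest_path = Path(".foundry/datasets/manifest.json")
# Start a fresh manifest if none exists yet.
manifest = json.loads(manifest_path.read_text()) if manifest_path.exists() else {"datasets": []}

manifest["datasets"].append({
    "name": "support-bot-prod-traces",           # example values — fill from the run
    "file": "support-bot-prod-traces-v3.jsonl",
    "version": "v3",
    "source": "trace-harvest",
    "harvestRule": "error+latency",
    "timeRange": "2025-02-01 to 2025-02-07",
    "exampleCount": 47,
    "createdAt": datetime.now(timezone.utc).isoformat(),
    "reviewedBy": "user",
})
manifest_path.write_text(json.dumps(manifest, indent=2))
```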
## Next Steps

After creating a dataset:
- **Sync to Foundry** → Step 5 below (recommended for shared/CI use)
- **Run evaluation** → [observe skill Step 2](../../observe/references/evaluate-step.md)
- **Version and tag** → [Dataset Versioning](dataset-versioning.md)
- **Organize into splits** → [Dataset Organization](dataset-organization.md)

## Step 5 — Sync Local Cache with Foundry (Optional)

Refresh or register the local cache in Foundry so it is available for server-side evaluations, shared access, and CI/CD pipelines. Reuse the local cache when it is current, and only refresh or push after user confirmation.

### 5a. Discover Storage Connection

Use `project_connection_list` to find an existing `AzureStorageAccount` connection on the Foundry project:

```
project_connection_list(foundryProjectResourceId, category: "AzureStorageAccount")
```

- **Found** → use its `connectionName` and `target` (storage account URL)
- **Not found** → proceed to 5b

### 5b. Create Storage Connection (if needed)

Ask the user for a storage account, then create a project connection:

```
project_connection_create(
    foundryProjectResourceId,
    connectionName: "datasets-storage",
    category: "AzureStorageAccount",
    target: "https://<storage-account>.blob.core.windows.net",
    authType: "AAD"
)
```

> 💡 **Tip:** The storage account must be in the same subscription, or the user must have access. AAD auth is preferred — it uses the caller's identity.

### 5c. Upload JSONL to Blob Storage

Upload the local dataset file to the same `eval-datasets` container used for seed datasets, so all Foundry-registered eval datasets follow one storage pattern:

```bash
az storage blob upload \
  --account-name <storage-account> \
  --container-name eval-datasets \
  --name <agent-name>/<agent-name>-<source>-v<N>.jsonl \
  --file .foundry/datasets/<agent-name>-<source>-v<N>.jsonl \
  --auth-mode login
```

The local dataset filename should start with the selected Foundry agent name, followed by the source/stage/version suffixes, so trace-derived datasets stay grouped with the owning agent (see the naming sketch below).

> ⚠️ **Always pass `--auth-mode login`** to use AAD credentials. If the container doesn't exist, create it first with `az storage container create`.
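To keep the local file, blob path, and Foundry registration fields consistent, a small helper can derive all of them from one (agent, source, version) triple. This is a convention sketch matching the patterns in 5c and 5d; the function itself is hypothetical:

```python
def dataset_names(agent: str, source: str, version: int, account: str) -> dict:
    """Derive the local path, blob name, content URI, and registration fields
    used in 5c-5d from one (agent, source, version) triple."""
    filename = f"{agent}-{source}-v{version}.jsonl"
    return {
        "local": f".foundry/datasets/{filename}",
        "blob": f"{agent}/{filename}",  # blob name inside the eval-datasets container
        "contentUri": f"https://{account}.blob.core.windows.net/eval-datasets/{agent}/{filename}",
        "datasetName": f"{agent}-{source}",
        "datasetVersion": f"v{version}",
    }

# Example: all names for version 3 of support-bot's prod-traces dataset.
print(dataset_names("support-bot", "prod-traces", 3, "mystorageacct"))
```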
### 5d. Register Dataset in Foundry

Use `evaluation_dataset_create` with the blob URI and the `AzureStorageAccount` `connectionName` discovered in 5a or created in 5b. While `connectionName` can be optional in other MCP flows, include it in this workflow so the dataset is bound to the project-connected storage account:

```
evaluation_dataset_create(
    projectEndpoint: "<project-endpoint>",
    datasetContentUri: "https://<storage-account>.blob.core.windows.net/eval-datasets/<agent-name>/<agent-name>-<source>-v<N>.jsonl",
    connectionName: "datasets-storage",
    datasetName: "<agent-name>-<source>",
    datasetVersion: "v<N>"
)
```

### 5e. Verify

Confirm the dataset is registered:

```
evaluation_dataset_get(projectEndpoint, datasetName: "<agent-name>-<source>", datasetVersion: "v<N>")
```

Display the registered dataset details to the user. Update `.foundry/datasets/manifest.json` with `"synced": true` and the server-side dataset name/version — a bookkeeping sketch follows below.
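A sketch of that final manifest update, assuming the entry shape from Step 4; the field names beyond `"synced"` are illustrative, not a fixed schema:

```python
import json
from pathlib import Path

manifest_path = Path(".foundry/datasets/manifest.json")
manifest = json.loads(manifest_path.read_text())

for entry in manifest["datasets"]:
    if entry["name"] == "support-bot-prod-traces" and entry["version"] == "v3":  # example keys
        entry["synced"] = True
        # Server-side identifiers returned by evaluation_dataset_create in 5d.
        entry["foundryDatasetName"] = "support-bot-prod-traces"
        entry["foundryDatasetVersion"] = "v3"

manifest_path.write_text(json.dumps(manifest, indent=2))
```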