# Trace-to-Dataset Pipeline — Harvest Production Traces as Test Cases

Extract production traces from App Insights using KQL, transform them into evaluation dataset format, and persist them as versioned datasets. This is the core workflow for turning real-world agent failures into reproducible test cases.

## ⛔ Do NOT

- Do NOT use `parse_json(customDimensions)` — `customDimensions` is already a `dynamic` column in App Insights KQL. Access properties directly: `customDimensions["gen_ai.response.id"]`.

## Related References

- [Eval Correlation](../../trace/references/eval-correlation.md) (in `foundry-agent/trace/references/`) — look up eval scores by response/conversation ID via `customEvents`
- [KQL Templates](../../trace/references/kql-templates.md) (in `foundry-agent/trace/references/`) — general trace query patterns and attribute mappings

## Prerequisites

- App Insights resource resolved (see [trace skill](../../trace/trace.md) Before Starting)
- Agent root, selected metadata file, environment, and project endpoint available from `.foundry/agent-metadata*.yaml`
- Time range confirmed with user (default: last 7 days)

When a repo contains multiple agent roots, this workflow updates only the selected agent root's `.foundry/datasets/`, `.foundry/results/`, and metadata files. Do **not** merge sibling agent folders.

> 💡 **Run all KQL queries** using **`monitor_resource_log_query`** (Azure MCP tool) against the App Insights resource. This is preferred over delegating to the `azure-kusto` skill.

> ⚠️ **Always pass `subscription` explicitly** to Azure MCP tools — they don't extract it from resource IDs.

## Overview

```
App Insights traces
        │
        ▼
[1] KQL Harvest Query (filter by error/latency/eval score)
        │
        ▼
[2] Schema Transform (trace → JSONL format)
        │
        ▼
[3] Human Review (show candidates, let user approve/edit/reject)
        │
        ▼
[4] Persist Dataset (local JSONL files)
        │
        ▼
[5] Sync to Foundry (optional — upload to project-connected storage)
```

## Key Concept: Linking Evaluation Results to Traces

> 💡 **Evaluation results live in `customEvents`, not in `dependencies`.** Foundry writes eval scores to App Insights as `customEvents` with `name == "gen_ai.evaluation.result"`. Agent traces (spans) live in `dependencies`. The link between them is **`gen_ai.response.id`** — this field appears on both tables.

| Table | Contains | Join Key |
|-------|----------|----------|
| `dependencies` | Agent traces (spans, tool calls, LLM calls) | `customDimensions["gen_ai.response.id"]` |
| `customEvents` | Evaluation results (scores, labels, explanations) | `customDimensions["gen_ai.response.id"]` |

**To harvest traces with eval scores**, join `customEvents` → `dependencies` on `responseId`. The [Low-Eval Harvest](#low-eval-harvest--traces-with-poor-evaluation-scores) template below shows this pattern. For standalone eval lookups, see [Eval Correlation](../../trace/references/eval-correlation.md) (in `foundry-agent/trace/references/`).

## Step 1 — Choose a Harvest Template

Select the appropriate KQL template based on user intent. These templates mirror common LangSmith "run rules" but offer more power through KQL's query language.

> ⚠️ **Hosted agents:** The Foundry agent name (e.g., `hosted-agent-022-001`) only appears on `requests`, NOT on `dependencies`. For hosted agents, use the [Hosted Agent Harvest](#hosted-agent-harvest--two-step-join-pattern) template, which joins via `requests.id` → `dependencies.operation_ParentId`. The templates below work directly for **prompt agents**, where `gen_ai.agent.name` on `dependencies` matches the Foundry name.

### Error Harvest — Failed Traces

Captures all traces where the agent returned errors. Equivalent to LangSmith's `eq(error, True)` run rule.

```kql
dependencies
| where timestamp > ago(7d)
| where success == false
| where isnotempty(customDimensions["gen_ai.operation.name"])
| where customDimensions["gen_ai.agent.name"] == "<agent-name>"
| extend
    conversationId = tostring(customDimensions["gen_ai.conversation.id"]),
    responseId = tostring(customDimensions["gen_ai.response.id"]),
    operation = tostring(customDimensions["gen_ai.operation.name"]),
    model = tostring(customDimensions["gen_ai.request.model"]),
    errorType = tostring(customDimensions["error.type"]),
    inputTokens = toint(customDimensions["gen_ai.usage.input_tokens"]),
    outputTokens = toint(customDimensions["gen_ai.usage.output_tokens"])
| summarize
    errorCount = count(),
    errors = make_set(errorType, 5),
    firstSeen = min(timestamp),
    lastSeen = max(timestamp)
    by conversationId, responseId, operation, model
| order by lastSeen desc
| take 100
```

### Low-Eval Harvest — Traces with Poor Evaluation Scores

Captures traces where evaluator scores fell below a threshold. Equivalent to LangSmith's `and(eq(feedback_key, "quality"), lt(feedback_score, 0.3))` run rule.

```kql
let lowEvalResponses = customEvents
| where timestamp > ago(7d)
| where name == "gen_ai.evaluation.result"
| extend
    score = todouble(customDimensions["gen_ai.evaluation.score.value"]),
    evalName = tostring(customDimensions["gen_ai.evaluation.name"]),
    responseId = tostring(customDimensions["gen_ai.response.id"]),
    conversationId = tostring(customDimensions["gen_ai.conversation.id"])
| where score < <threshold>
| project responseId, conversationId, evalName, score;
lowEvalResponses
| join kind=inner (
    dependencies
    | where timestamp > ago(7d)
    | where isnotempty(customDimensions["gen_ai.response.id"])
    | extend responseId = tostring(customDimensions["gen_ai.response.id"])
) on responseId
| extend
    operation = tostring(customDimensions["gen_ai.operation.name"]),
    model = tostring(customDimensions["gen_ai.request.model"]),
    inputTokens = toint(customDimensions["gen_ai.usage.input_tokens"]),
    outputTokens = toint(customDimensions["gen_ai.usage.output_tokens"])
| project timestamp, conversationId, responseId, evalName, score, operation, model, duration
| order by score asc
| take 100
```

> 💡 **Tip:** Replace `<threshold>` with the pass threshold from your evaluator config. Common values: `3.0` for 1–5 ordinal scales, `0.5` for 0–1 continuous scales.

### Latency Harvest — Slow Responses

Captures traces where response latency exceeds a threshold. Equivalent to LangSmith's `gt(latency, 5000)` run rule.

```kql
dependencies
| where timestamp > ago(7d)
| where duration > <threshold_ms>
| where isnotempty(customDimensions["gen_ai.operation.name"])
| where customDimensions["gen_ai.agent.name"] == "<agent-name>"
| extend
    conversationId = tostring(customDimensions["gen_ai.conversation.id"]),
    responseId = tostring(customDimensions["gen_ai.response.id"]),
    operation = tostring(customDimensions["gen_ai.operation.name"]),
    model = tostring(customDimensions["gen_ai.request.model"]),
    inputTokens = toint(customDimensions["gen_ai.usage.input_tokens"]),
    outputTokens = toint(customDimensions["gen_ai.usage.output_tokens"])
| summarize
    avgDuration = avg(duration),
    maxDuration = max(duration),
    spanCount = count()
    by conversationId, responseId, operation, model
| order by maxDuration desc
| take 100
```

> 💡 **Tip:** Replace `<threshold_ms>` with the latency threshold in milliseconds. Common values: `5000` (5s), `10000` (10s), `30000` (30s).

### Combined Harvest — Multi-Criteria Filter

Combines multiple filters in a single query. Equivalent to LangSmith's compound rule: `and(gt(latency, 2000), eq(error, true), has(tags, "prod"))`.

```kql
dependencies
| where timestamp > ago(7d)
| where customDimensions["gen_ai.agent.name"] == "<agent-name>"
| where isnotempty(customDimensions["gen_ai.operation.name"])
| where success == false or duration > <threshold_ms>
| extend
    conversationId = tostring(customDimensions["gen_ai.conversation.id"]),
    responseId = tostring(customDimensions["gen_ai.response.id"]),
    operation = tostring(customDimensions["gen_ai.operation.name"]),
    model = tostring(customDimensions["gen_ai.request.model"]),
    errorType = tostring(customDimensions["error.type"]),
    inputTokens = toint(customDimensions["gen_ai.usage.input_tokens"]),
    outputTokens = toint(customDimensions["gen_ai.usage.output_tokens"])
| summarize
    errorCount = countif(success == false),
    avgDuration = avg(duration),
    maxDuration = max(duration),
    spanCount = count()
    by conversationId, responseId, operation, model
| order by errorCount desc, maxDuration desc
| take 100
```

### Sampling — Control Dataset Size

Add `| sample <N>` or `| take <N>` to any harvest query to control the number of traces extracted. Equivalent to LangSmith's `sampling_rate` parameter. For a stratified sample, run each harvest separately and combine the results — see the sketch after this section.

```kql
// Random sample of 50 traces from the harvest
... | sample 50

// Top 50 most recent traces
... | order by timestamp desc | take 50

// Stratified sample: 20 errors + 20 slow + 10 low-eval
// Run each harvest separately and combine
```
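A minimal sketch of that combine step, assuming each harvest query's rows were exported locally as a JSON array keyed by `responseId`; the file names and quota values are illustrative, not part of the skill:

```python
import json
from pathlib import Path

# Hypothetical local exports of the three harvest queries (one JSON array each).
HARVESTS = {
    "error": Path("harvest-errors.json"),
    "latency": Path("harvest-slow.json"),
    "low-eval": Path("harvest-low-eval.json"),
}
QUOTAS = {"error": 20, "latency": 20, "low-eval": 10}  # stratified sample sizes

combined, seen = [], set()
for rule, path in HARVESTS.items():
    kept = 0
    for row in json.loads(path.read_text()):
        rid = row.get("responseId")
        if not rid or rid in seen or kept >= QUOTAS[rule]:
            # Dedupe across strata: a trace that is both slow and failing counts once.
            continue
        row["harvestRule"] = rule
        combined.append(row)
        seen.add(rid)
        kept += 1

print(f"Combined {len(combined)} unique candidates across {len(HARVESTS)} strata")
```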
### Hosted Agent Harvest — Two-Step Join Pattern

For hosted agents, the Foundry agent name lives on `requests`, not `dependencies`. Use this two-step pattern:

```kql
let reqIds = requests
| where timestamp > ago(7d)
| where customDimensions["gen_ai.agent.name"] == "<foundry-agent-name>"
| distinct id;
dependencies
| where timestamp > ago(7d)
| where operation_ParentId in (reqIds)
| where customDimensions["gen_ai.operation.name"] == "invoke_agent"
| extend
    conversationId = tostring(customDimensions["gen_ai.conversation.id"]),
    responseId = tostring(customDimensions["gen_ai.response.id"]),
    operation = tostring(customDimensions["gen_ai.operation.name"]),
    model = tostring(customDimensions["gen_ai.request.model"]),
    inputTokens = toint(customDimensions["gen_ai.usage.input_tokens"]),
    outputTokens = toint(customDimensions["gen_ai.usage.output_tokens"])
| project timestamp, duration, success, conversationId, responseId, operation, model, inputTokens, outputTokens
| order by timestamp desc
| take 100
```

> 💡 **When to use this pattern:** If the direct `dependencies` filter by `gen_ai.agent.name` returns no results, the agent is likely a hosted agent where `gen_ai.agent.name` on `dependencies` holds the code-level class name (e.g., `BingSearchAgent`), not the Foundry name. Switch to this `requests` → `dependencies` join.

## Step 2 — Schema Transform

Transform harvested traces into JSONL dataset format. Each line in the JSONL file must contain:

| Field | Required | Source |
|-------|----------|--------|
| `query` | ✅ | User input — extract from `gen_ai.input.messages` on `invoke_agent` dependency spans |
| `response` | Optional | Agent output — extract from `gen_ai.output.messages` on `invoke_agent` dependency spans |
| `context` | Optional | Tool results or retrieved documents from the trace |
| `ground_truth` | Optional | Expected correct answer (add during curation) |
| `metadata` | Optional | Source info: `{"source": "trace", "conversationId": "...", "harvestRule": "error"}` |

### Extracting Input/Output from Traces

The full input/output content lives on `invoke_agent` dependency spans in `gen_ai.input.messages` and `gen_ai.output.messages`. These contain complete message arrays:

```json
// gen_ai.input.messages structure:
[{"role": "user", "parts": [{"type": "text", "content": "How do I reset my password?"}]}]

// gen_ai.output.messages structure:
[{"role": "assistant", "parts": [{"type": "text", "content": "To reset your password..."}]}]
```

Query to extract input/output for a specific conversation (note the `tostring()` around the operation name — the `in` operator needs a scalar, not a `dynamic` value):

```kql
dependencies
| where customDimensions["gen_ai.conversation.id"] == "<conversation-id>"
| where tostring(customDimensions["gen_ai.operation.name"]) in ("invoke_agent", "execute_agent", "chat", "create_response")
| extend
    responseId = tostring(customDimensions["gen_ai.response.id"]),
    operation = tostring(customDimensions["gen_ai.operation.name"]),
    inputMessages = tostring(customDimensions["gen_ai.input.messages"]),
    outputMessages = tostring(customDimensions["gen_ai.output.messages"])
| order by timestamp asc
| take 10
```

Extract the `query` from the last user-role entry in `gen_ai.input.messages` and the `response` from `gen_ai.output.messages`. Save the extracted data to a local JSONL file:

```
.foundry/datasets/<agent-name>-traces-candidates-<date>.jsonl
```
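A minimal transform sketch, assuming the harvest rows were exported locally with `inputMessages`/`outputMessages` as JSON strings (column names from the query above); the helper names, file paths, and agent name are illustrative:

```python
import json
from pathlib import Path

def last_text(messages_json: str, role: str) -> str | None:
    """Return the text content of the last entry with the given role
    from a gen_ai.input/output.messages JSON string."""
    for msg in reversed(json.loads(messages_json or "[]")):
        if msg.get("role") == role:
            parts = msg.get("parts", [])
            text = " ".join(p.get("content", "") for p in parts if p.get("type") == "text")
            return text or None
    return None

def to_dataset_line(row: dict, harvest_rule: str) -> dict | None:
    """Map one harvested trace row to the JSONL dataset schema; `query` is required."""
    query = last_text(row.get("inputMessages", ""), "user")
    if not query:
        return None  # skip rows with no recoverable user query
    line = {
        "query": query,
        "metadata": {
            "source": "trace",
            "conversationId": row.get("conversationId"),
            "harvestRule": harvest_rule,
        },
    }
    response = last_text(row.get("outputMessages", ""), "assistant")
    if response:
        line["response"] = response
    return line

# Illustrative I/O: rows exported from the extraction query above as a JSON array.
rows = json.loads(Path("harvest-rows.json").read_text())
out = Path(".foundry/datasets/support-bot-traces-candidates-2025-02-08.jsonl")  # example name
with out.open("w") as f:
    for row in rows:
        if (line := to_dataset_line(row, "error")):
            f.write(json.dumps(line) + "\n")
```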
## Step 3 — Human Review (Curation)

> ⚠️ **MANDATORY:** Never auto-commit harvested traces to a dataset. Always show candidates to the user first.

Present the harvested candidates as a table:

| # | Conversation ID | Error Type | Duration | Eval Score | Query (preview) |
|---|----------------|------------|----------|------------|----------------|
| 1 | conv-abc-123 | TimeoutError | 12.3s | 2.0 | "How do I reset my..." |
| 2 | conv-def-456 | None | 8.7s | 1.5 | "What's the status of..." |
| 3 | conv-ghi-789 | ValidationError | 0.4s | 3.0 | "Can you help me with..." |

Ask the user:
- *"Which candidates should I include in the dataset? (all / select by number / filter by criteria)"*
- *"Would you like to add ground_truth reference answers for any of these?"*
- *"What should I name this dataset version?"*
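If scripting the review table helps, here is a sketch that renders candidates from the JSONL file; it assumes the candidate metadata also carries `errorType`, `durationMs`, and `evalScore` — illustrative extras beyond the required schema:

```python
import json
from pathlib import Path

# Example candidate file from Step 2 (path is illustrative).
candidates_path = Path(".foundry/datasets/support-bot-traces-candidates-2025-02-08.jsonl")
candidates = [json.loads(l) for l in candidates_path.read_text().splitlines() if l.strip()]

print("| # | Conversation ID | Error Type | Duration | Eval Score | Query (preview) |")
print("|---|----------------|------------|----------|------------|----------------|")
for i, c in enumerate(candidates, 1):
    meta = c.get("metadata", {})
    # Truncate the query so the review table stays readable.
    preview = c["query"][:20] + "..." if len(c["query"]) > 20 else c["query"]
    print(f"| {i} | {meta.get('conversationId', '?')} | {meta.get('errorType', 'None')} "
          f"| {meta.get('durationMs', '?')} | {meta.get('evalScore', '?')} | \"{preview}\" |")
```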
## Step 4 — Persist Dataset (Local JSONL)

Save approved candidates to `.foundry/datasets/<agent-name>-<source>-v<N>.jsonl`:

```json
{"query": "How do I reset my password?", "context": "User account management", "metadata": {"source": "trace", "conversationId": "conv-abc-123", "harvestRule": "error"}}
{"query": "What's the status of my order?", "response": "...", "ground_truth": "Order #12345 shipped on...", "metadata": {"source": "trace", "conversationId": "conv-def-456", "harvestRule": "latency"}}
```

### Update Manifest

After persisting, update `.foundry/datasets/manifest.json` with lineage information:

```json
{
  "datasets": [
    {
      "name": "support-bot-prod-traces",
      "file": "support-bot-prod-traces-v3.jsonl",
      "version": "v3",
      "source": "trace-harvest",
      "harvestRule": "error+latency",
      "timeRange": "2025-02-01 to 2025-02-07",
      "exampleCount": 47,
      "createdAt": "2025-02-08T10:00:00Z",
      "reviewedBy": "user"
    }
  ]
}
```
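A sketch of that manifest update, assuming the manifest follows the shape above; the read-modify-write and entry values mirror the example, and the specific field values are placeholders to fill from the approved run:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

manifest_path = Path(".foundry/datasets/manifest.json")
# Start a fresh manifest if none exists yet.
manifest = json.loads(manifest_path.read_text()) if manifest_path.exists() else {"datasets": []}

manifest["datasets"].append({
    "name": "support-bot-prod-traces",           # example values — fill from the run
    "file": "support-bot-prod-traces-v3.jsonl",
    "version": "v3",
    "source": "trace-harvest",
    "harvestRule": "error+latency",
    "timeRange": "2025-02-01 to 2025-02-07",
    "exampleCount": 47,
    "createdAt": datetime.now(timezone.utc).isoformat(),
    "reviewedBy": "user",
})
manifest_path.write_text(json.dumps(manifest, indent=2))
```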
## Next Steps

After creating a dataset:
- **Sync to Foundry** → Step 5 below (recommended for shared/CI use)
- **Run evaluation** → [observe skill Step 2](../../observe/references/evaluate-step.md)
- **Version and tag** → [Dataset Versioning](dataset-versioning.md)
- **Organize into splits** → [Dataset Organization](dataset-organization.md)

## Step 5 — Sync Local Cache with Foundry (Optional)

Refresh or register the local cache in Foundry so it is available for server-side evaluations, shared access, and CI/CD pipelines. Reuse the local cache when it is current, and only refresh or push after user confirmation.

### 5a. Discover Storage Connection

Use `project_connection_list` to find an existing `AzureStorageAccount` connection on the Foundry project:

```
project_connection_list(foundryProjectResourceId, category: "AzureStorageAccount")
```

- **Found** → use its `connectionName` and `target` (storage account URL)
- **Not found** → proceed to 5b

### 5b. Create Storage Connection (if needed)

Ask the user for a storage account, then create a project connection:

```
project_connection_create(
    foundryProjectResourceId,
    connectionName: "datasets-storage",
    category: "AzureStorageAccount",
    target: "https://<storage-account>.blob.core.windows.net",
    authType: "AAD"
)
```

> 💡 **Tip:** The storage account must be in the same subscription, or the user must have access. AAD auth is preferred — it uses the caller's identity.

### 5c. Upload JSONL to Blob Storage

Upload the local dataset file to the same `eval-datasets` container used for seed datasets, so all Foundry-registered eval datasets follow one storage pattern:

```bash
az storage blob upload \
  --account-name <storage-account> \
  --container-name eval-datasets \
  --name <agent-name>/<agent-name>-<source>-v<N>.jsonl \
  --file .foundry/datasets/<agent-name>-<source>-v<N>.jsonl \
  --auth-mode login
```

The local dataset filename should start with the selected Foundry agent name, followed by the source/stage/version suffixes, so trace-derived datasets stay grouped with the owning agent (see the naming sketch below).

> ⚠️ **Always pass `--auth-mode login`** to use AAD credentials. If the container doesn't exist, create it first with `az storage container create`.
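To keep the local file, blob path, and Foundry registration fields consistent, a small helper can derive all of them from one (agent, source, version) triple. This is a convention sketch matching the patterns in 5c and 5d; the function itself is hypothetical:

```python
def dataset_names(agent: str, source: str, version: int, account: str) -> dict:
    """Derive the local path, blob name, content URI, and registration fields
    used in 5c-5d from one (agent, source, version) triple."""
    filename = f"{agent}-{source}-v{version}.jsonl"
    return {
        "local": f".foundry/datasets/{filename}",
        "blob": f"{agent}/{filename}",  # blob name inside the eval-datasets container
        "contentUri": f"https://{account}.blob.core.windows.net/eval-datasets/{agent}/{filename}",
        "datasetName": f"{agent}-{source}",
        "datasetVersion": f"v{version}",
    }

# Example: all names for version 3 of support-bot's prod-traces dataset.
print(dataset_names("support-bot", "prod-traces", 3, "mystorageacct"))
```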
### 5d. Register Dataset in Foundry

Use `evaluation_dataset_create` with the blob URI and the `AzureStorageAccount` `connectionName` discovered in 5a or created in 5b. While `connectionName` can be optional in other MCP flows, include it in this workflow so the dataset is bound to the project-connected storage account:

```
evaluation_dataset_create(
    projectEndpoint: "<project-endpoint>",
    datasetContentUri: "https://<storage-account>.blob.core.windows.net/eval-datasets/<agent-name>/<agent-name>-<source>-v<N>.jsonl",
    connectionName: "datasets-storage",
    datasetName: "<agent-name>-<source>",
    datasetVersion: "v<N>"
)
```

### 5e. Verify

Confirm the dataset is registered:

```
evaluation_dataset_get(projectEndpoint, datasetName: "<agent-name>-<source>", datasetVersion: "v<N>")
```

Display the registered dataset details to the user. Update `.foundry/datasets/manifest.json` with `"synced": true` and the server-side dataset name/version — a bookkeeping sketch follows below.
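A sketch of that final manifest update, assuming the entry shape from Step 4; the field names beyond `"synced"` are illustrative, not a fixed schema:

```python
import json
from pathlib import Path

manifest_path = Path(".foundry/datasets/manifest.json")
manifest = json.loads(manifest_path.read_text())

for entry in manifest["datasets"]:
    if entry["name"] == "support-bot-prod-traces" and entry["version"] == "v3":  # example keys
        entry["synced"] = True
        # Server-side identifiers returned by evaluation_dataset_create in 5d.
        entry["foundryDatasetName"] = "support-bot-prod-traces"
        entry["foundryDatasetVersion"] = "v3"

manifest_path.write_text(json.dumps(manifest, indent=2))
```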