foundry-agent/eval-datasets/references/dataset-versioning.md
# Dataset Versioning — Version Management & Tagging

Manage dataset versions with naming conventions, tagging, and version pinning for reproducible evaluations. This workflow formalizes dataset lifecycle management using existing MCP tools and local conventions.

## Naming Convention

Use the pattern `<agent-name>-<source>-v<N>`:

| Component | Values | Example |
|-----------|--------|---------|
| `<agent-name>` | Selected environment's `agentName` from the selected metadata file | `support-bot-prod` |
| `<source>` | `traces`, `synthetic`, `manual`, `combined` | `traces` |
| `v<N>` | Incremental version number | `v3` |

`<agent-name>` already refers to the environment-specific deployed Foundry agent name. If that value includes the environment key, do **not** append the environment again.

**Full examples:**
- `support-bot-prod-traces-v1` — first production dataset from trace harvesting
- `support-bot-dev-synthetic-v2` — second synthetic dataset
- `support-bot-prod-combined-v5` — fifth production dataset combining traces + manual examples

## Tagging Conventions

Tags are stored in `.foundry/datasets/manifest.json` alongside dataset metadata:

| Tag | Meaning | When to Apply |
|-----|---------|---------------|
| `baseline` | Reference dataset for comparison | When establishing a new evaluation baseline |
| `prod` | Dataset used for current production evaluation | After successful deployment |
| `canary` | Dataset for canary/staging evaluation | During staged rollout |
| `regression-<date>` | Dataset that caught a regression | When a regression is detected |
| `deprecated` | Dataset no longer in active use | When replaced by a newer version |
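Because tags live in the local manifest rather than on the server, retagging is just an edit to `manifest.json`. A minimal sketch of one way to rotate tags when a new version is promoted, assuming `jq` is available and the manifest layout shown under **Manifest File** below (the dataset name and versions are illustrative):

```bash
# Hypothetical example: promote v3 to `prod` and demote v2 to `deprecated`
# for the support-bot-prod-traces dataset family.
jq '(.datasets[] | select(.name == "support-bot-prod-traces" and .version == "v3") | .tag) = "prod"
  | (.datasets[] | select(.name == "support-bot-prod-traces" and .version == "v2") | .tag) = "deprecated"' \
  .foundry/datasets/manifest.json > /tmp/manifest.json \
  && mv /tmp/manifest.json .foundry/datasets/manifest.json
```

Writing to a temporary file and moving it back avoids truncating the manifest mid-edit, since `jq` does not modify files in place.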
## Version Pinning

Pin evaluations to a specific dataset version to ensure reproducible, comparable results:

### Local Pinning (JSONL Datasets)

When using local JSONL files, reference the exact filename in evaluation runs:

```
.foundry/datasets/support-bot-prod-traces-v3.jsonl   ← pinned by filename
```

Pass the contents via the `inputData` parameter of **`evaluation_agent_batch_eval_create`**.

### Server-Side Version Discovery

Use `evaluation_dataset_versions_get` to list all versions of a dataset registered in Foundry:

```
evaluation_dataset_versions_get(projectEndpoint, datasetName: "<agent-name>-<source>")
```

Use `evaluation_dataset_get` without a name to list all datasets in the project:

```
evaluation_dataset_get(projectEndpoint)
```

> 💡 **Tip:** Server-side versions are available after syncing via [Trace-to-Dataset → Step 5](trace-to-dataset.md#step-5--sync-local-cache-with-foundry-optional). Local `manifest.json` remains useful for lineage metadata (source, harvestRule, reviewedBy) not stored server-side.

## Manifest File

Track all dataset versions, required dataset metadata, tags, and lineage in `.foundry/datasets/manifest.json`:

```json
{
  "datasets": [
    {
      "name": "support-bot-prod-traces",
      "file": "support-bot-prod-traces-v1.jsonl",
      "version": "v1",
      "agent": "support-bot-prod",
      "stage": "traces",
      "datasetUri": "<foundry-dataset-uri-v1>",
      "tag": "deprecated",
      "source": "trace-harvest",
      "harvestRule": "error",
      "timeRange": "2025-01-01 to 2025-01-07",
      "exampleCount": 32,
      "createdAt": "2025-01-08T10:00:00Z",
      "evalRunIds": ["run-abc-123"]
    },
    {
      "name": "support-bot-prod-traces",
      "file": "support-bot-prod-traces-v2.jsonl",
      "version": "v2",
      "agent": "support-bot-prod",
      "stage": "traces",
      "datasetUri": "<foundry-dataset-uri-v2>",
      "tag": "baseline",
      "source": "trace-harvest",
      "harvestRule": "error+latency",
      "timeRange": "2025-01-15 to 2025-01-21",
      "exampleCount": 47,
      "createdAt": "2025-01-22T10:00:00Z",
      "evalRunIds": ["run-def-456", "run-ghi-789"]
    },
    {
      "name": "support-bot-prod-traces",
      "file": "support-bot-prod-traces-v3.jsonl",
      "version": "v3",
      "agent": "support-bot-prod",
      "stage": "traces",
      "datasetUri": "<foundry-dataset-uri-v3>",
      "tag": "prod",
      "source": "trace-harvest",
      "harvestRule": "error+latency+low-eval",
      "timeRange": "2025-02-01 to 2025-02-07",
      "exampleCount": 63,
      "createdAt": "2025-02-08T10:00:00Z",
      "evalRunIds": []
    }
  ]
}
```

Keep `stage` stable for the dataset family (`seed`, `traces`, `curated`, or `prod`) and use `tag` for mutable lifecycle labels such as `baseline`, `prod`, or `deprecated`. Persist `datasetUri` as the Foundry-returned dataset reference so deploy and observe workflows can resolve the registered dataset directly.

## Creating a New Version

1. **Check existing versions**: Read `.foundry/datasets/manifest.json` to find the latest version number
2. **Increment version**: Use `v<N+1>` as the new version (see the sketch after this list)
3. **Create dataset**: Via [Trace-to-Dataset](trace-to-dataset.md) or manual JSONL creation
4. **Update manifest**: Add the new entry with metadata
5. **Tag appropriately**: Apply `baseline`, `prod`, or other tags as needed
6. **Deprecate old**: Optionally mark previous versions as `deprecated`
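Steps 1–2 can be scripted against the manifest. A minimal sketch, assuming `jq` is available and the manifest layout shown above; the `support-bot-prod-traces` family name is illustrative:

```bash
# Hypothetical helper: find the highest existing version for a dataset family
# and derive the next version number and filename.
NAME="support-bot-prod-traces"

# Highest existing version number for this family (0 if none registered yet).
LATEST=$(jq -r --arg name "$NAME" \
  '[.datasets[] | select(.name == $name) | .version | ltrimstr("v") | tonumber] | max // 0' \
  .foundry/datasets/manifest.json)

NEXT="v$((LATEST + 1))"
echo "Next dataset file: .foundry/datasets/${NAME}-${NEXT}.jsonl"
```

Deriving `v<N+1>` from the manifest rather than from the files on disk keeps the manifest as the single source of truth for version numbering.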
> ⚠️ **DO NOT stop here.** After creating a new dataset version, continue to the Dataset Update Loop below.

## Dataset Update Loop — Eval → Analyze → Optimize → Re-Eval

When a dataset is updated (new rows, better coverage, new failure modes), run this loop to validate the agent against the harder test suite:

```
[1] Eval with new dataset (v2) using same agent version
     │
     ▼
[2] Compare: eval on v1 vs eval on v2 (same agent, different datasets)
     │
     ▼
[3] Analyze score changes — expect some drops (harder tests ≠ worse agent)
     │
     ▼
[4] Optimize agent prompt based on NEW failure patterns only
     │
     ▼
[5] Re-eval optimized agent on v2 dataset → compare to pre-optimization
     │
     ▼
[6] If satisfied → tag v2 as `prod`, archive v1
```

### ⛔ Guardrails for This Loop

- **Never remove dataset rows to recover scores.** If eval scores drop after a dataset update, the dataset is likely exposing real gaps. Removing hard cases defeats the purpose.
- **Never weaken evaluators to recover scores.** Do not lower thresholds, remove evaluators, or switch to easier scoring when scores drop on an expanded dataset.
- **Distinguish dataset difficulty from agent regression.** A score drop on a harder dataset is expected and healthy — it means test coverage improved. Only flag as regression when the same dataset + same evaluators produce worse scores on a new agent version.
- **Optimize for NEW failure patterns only.** When optimizing the agent prompt after a dataset update, target the newly added test cases. Do not re-optimize for cases that were already passing.

## Comparing Versions

To understand how a dataset evolved between versions:

```bash
# Count examples per version
wc -l .foundry/datasets/support-bot-prod-traces-v*.jsonl

# Diff example queries between versions
jq -r '.query' .foundry/datasets/support-bot-prod-traces-v2.jsonl | sort > /tmp/v2-queries.txt
jq -r '.query' .foundry/datasets/support-bot-prod-traces-v3.jsonl | sort > /tmp/v3-queries.txt
diff /tmp/v2-queries.txt /tmp/v3-queries.txt
```

## Next Steps

- **Organize into splits** → [Dataset Organization](dataset-organization.md)
- **Run evaluation with pinned version** → [observe skill Step 2](../../observe/references/evaluate-step.md)
- **Track lineage** → [Eval Lineage](eval-lineage.md)