Microsoft Foundry Skill

Deploy, evaluate, and manage AI agents end-to-end on Microsoft Azure AI Foundry
foundry-agent/eval-datasets/references/dataset-versioning.md

# Dataset Versioning — Version Management & Tagging

Manage dataset versions with naming conventions, tagging, and version pinning for reproducible evaluations. This workflow formalizes dataset lifecycle management using existing MCP tools and local conventions.

## Naming Convention

Use the pattern `<agent-name>-<source>-v<N>`:

| Component | Values | Example |
|-----------|--------|---------|
| `<agent-name>` | Selected environment's `agentName` from the selected metadata file | `support-bot-prod` |
| `<source>` | `traces`, `synthetic`, `manual`, `combined` | `traces` |
| `v<N>` | Incremental version number | `v3` |

`<agent-name>` already refers to the environment-specific deployed Foundry agent name. If that value includes the environment key, do **not** append the environment again.

**Full examples:**
- `support-bot-prod-traces-v1` — first production dataset from trace harvesting
- `support-bot-dev-synthetic-v2` — second synthetic dataset
- `support-bot-prod-combined-v5` — fifth production dataset combining traces + manual examples

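The naming pattern above can be checked mechanically. The following is a minimal sketch, not part of any Foundry tool: `build_dataset_name` and `parse_dataset_name` are hypothetical helpers, and the assumption that versions start at 1 comes only from the `v1` example above.

```python
import re

# Allowed <source> values, taken from the naming table above.
SOURCES = ("traces", "synthetic", "manual", "combined")

# <agent-name>-<source>-v<N>; the agent name may itself contain hyphens.
_NAME_RE = re.compile(
    r"^(?P<agent>.+)-(?P<source>" + "|".join(SOURCES) + r")-v(?P<version>[1-9]\d*)$"
)


def build_dataset_name(agent_name: str, source: str, version: int) -> str:
    """Compose a dataset name (hypothetical helper for illustration)."""
    if source not in SOURCES:
        raise ValueError(f"unknown source {source!r}; expected one of {SOURCES}")
    if version < 1:
        raise ValueError("version must be a positive integer")
    return f"{agent_name}-{source}-v{version}"


def parse_dataset_name(name: str) -> tuple[str, str, int]:
    """Split a dataset name back into (agent_name, source, version)."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"{name!r} does not follow <agent-name>-<source>-v<N>")
    return match["agent"], match["source"], int(match["version"])
```

Parsing before writing a manifest entry is one way to catch a name that accidentally appends the environment twice or omits the source component.
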
## Tagging Conventions

Tags are stored in `.foundry/datasets/manifest.json` alongside dataset metadata:

| Tag | Meaning | When to Apply |
|-----|---------|---------------|
| `baseline` | Reference dataset for comparison | When establishing a new evaluation baseline |
| `prod` | Dataset used for current production evaluation | After successful deployment |
| `canary` | Dataset for canary/staging evaluation | During staged rollout |
| `regression-<date>` | Dataset that caught a regression | When a regression is detected |
| `deprecated` | Dataset no longer in active use | When replaced by a newer version |

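A small validator can keep tags in the manifest consistent with the table above. This is a sketch under one loud assumption: the table does not specify the format of `<date>`, so ISO `YYYY-MM-DD` is assumed here. `is_valid_tag` and `regression_tag` are hypothetical helper names.

```python
import re
from datetime import date

# Fixed tags from the table above.
FIXED_TAGS = {"baseline", "prod", "canary", "deprecated"}

# ASSUMPTION: <date> is written as ISO YYYY-MM-DD; the source does not say.
_REGRESSION_RE = re.compile(r"^regression-(\d{4}-\d{2}-\d{2})$")


def is_valid_tag(tag: str) -> bool:
    """Return True for a fixed tag or a well-formed regression-<date> tag."""
    if tag in FIXED_TAGS:
        return True
    match = _REGRESSION_RE.match(tag)
    if match is None:
        return False
    try:
        date.fromisoformat(match.group(1))  # rejects impossible dates
    except ValueError:
        return False
    return True


def regression_tag(detected_on: date) -> str:
    """Build a regression tag for the day a regression was detected."""
    return f"regression-{detected_on.isoformat()}"
```
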
## Version Pinning

Pin evaluations to a specific dataset version to ensure reproducible, comparable results.

### Local Pinning (JSONL Datasets)

When using local JSONL files, reference the exact filename in evaluation runs:

```
.foundry/datasets/support-bot-prod-traces-v3.jsonl  ← pinned by filename
```

Pass the file's contents via the `inputData` parameter of **`evaluation_agent_batch_eval_create`**.

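Reading the pinned file might look like the sketch below. `load_pinned_dataset` is a hypothetical helper; the exact shape that `inputData` expects is not specified here, so the function only returns the parsed rows and leaves the hand-off to the tool call.

```python
import json
from pathlib import Path


def load_pinned_dataset(path: str) -> list[dict]:
    """Read a JSONL file into a list of row dicts, failing loudly on bad lines."""
    rows = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines between records
            try:
                rows.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(
                    f"{path}:{line_number}: invalid JSON ({exc.msg})"
                ) from exc
    return rows
```

For example, `load_pinned_dataset(".foundry/datasets/support-bot-prod-traces-v3.jsonl")` would return the rows of the v3 file and nothing else, which is what makes the run reproducible.
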
### Server-Side Version Discovery

Use `evaluation_dataset_versions_get` to list all versions of a dataset registered in Foundry:

```
evaluation_dataset_versions_get(projectEndpoint, datasetName: "<agent-name>-<source>")
```

Use `evaluation_dataset_get` without a name to list all datasets in the project:

```
evaluation_dataset_get(projectEndpoint)
```

> 💡 **Tip:** Server-side versions are available after syncing via [Trace-to-Dataset → Step 5](trace-to-dataset.md#step-5--sync-local-cache-with-foundry-optional). Local `manifest.json` remains useful for lineage metadata (source, harvestRule, reviewedBy) not stored server-side.

## Manifest File

Track all dataset versions, required dataset metadata, tags, and lineage in `.foundry/datasets/manifest.json`:

```json
{
  "datasets": [
    {
      "name": "support-bot-prod-traces",
      "file": "support-bot-prod-traces-v1.jsonl",
      "version": "v1",
      "agent": "support-bot-prod",
      "stage": "traces",
      "datasetUri": "<foundry-dataset-uri-v1>",
      "tag": "deprecated",
      "source": "trace-harvest",
      "harvestRule": "error",
      "timeRange": "2025-01-01 to 2025-01-07",
      "exampleCount": 32,
      "createdAt": "2025-01-08T10:00:00Z",
      "evalRunIds": ["run-abc-123"]
    },
    {
      "name": "support-bot-prod-traces",
      "file": "support-bot-prod-traces-v2.jsonl",
      "version": "v2",
      "agent": "support-bot-prod",
      "stage": "traces",
      "datasetUri": "<foundry-dataset-uri-v2>",
      "tag": "baseline",
      "source": "trace-harvest",
      "harvestRule": "error+latency",
      "timeRange": "2025-01-15 to 2025-01-21",
      "exampleCount": 47,
      "createdAt": "2025-01-22T10:00:00Z",
      "evalRunIds": ["run-def-456", "run-ghi-789"]
    },
    {
      "name": "support-bot-prod-traces",
      "file": "support-bot-prod-traces-v3.jsonl",
      "version": "v3",
      "agent": "support-bot-prod",
      "stage": "traces",
      "datasetUri": "<foundry-dataset-uri-v3>",
      "tag": "prod",
      "source": "trace-harvest",
      "harvestRule": "error+latency+low-eval",
      "timeRange": "2025-02-01 to 2025-02-07",
      "exampleCount": 63,
      "createdAt": "2025-02-08T10:00:00Z",
      "evalRunIds": []
    }
  ]
}
```

Keep `stage` stable for the dataset family (`seed`, `traces`, `curated`, or `prod`) and use `tag` for mutable lifecycle labels such as `baseline`, `prod`, or `deprecated`. Persist `datasetUri` as the Foundry-returned dataset reference so deploy and observe workflows can resolve the registered dataset directly.

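Resolving a dataset from the manifest by its lifecycle tag could be sketched as follows. The helper names (`load_manifest`, `version_number`, `find_by_tag`) are hypothetical, and the code assumes each entry carries the single string `tag` field shown in the example manifest.

```python
import json
from pathlib import Path
from typing import Optional

# Manifest location as given above.
MANIFEST_PATH = Path(".foundry/datasets/manifest.json")


def load_manifest(path: Path = MANIFEST_PATH) -> list[dict]:
    """Return the list under the manifest's top-level "datasets" key."""
    return json.loads(path.read_text(encoding="utf-8"))["datasets"]


def version_number(entry: dict) -> int:
    """Convert a manifest version string such as "v3" to the integer 3."""
    return int(entry["version"].removeprefix("v"))


def find_by_tag(datasets: list[dict], name: str, tag: str) -> Optional[dict]:
    """Return the highest-versioned entry of family `name` carrying `tag`, or None."""
    matches = [d for d in datasets if d["name"] == name and d.get("tag") == tag]
    return max(matches, key=version_number, default=None)
```

With the example manifest, `find_by_tag(load_manifest(), "support-bot-prod-traces", "baseline")` would return the v2 entry, whose `datasetUri` can then be passed on to the deploy or observe workflow.
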
## Creating a New Version

1. **Check existing versions**: Read `.foundry/datasets/manifest.json` to find the latest version number
2. **Increment version**: Use `v<N+1>` as the new version
3. **Create dataset**: Via [Trace-to-Dataset](trace-to-dataset.md) or manual JSONL creation
4. **Update manifest**: Add the new entry with metadata
5. **Tag appropriately**: Apply `baseline`, `prod`, or other tags as needed
6. **Deprecate old**: Optionally mark previous versions as `deprecated`

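Steps 1, 2, 4, and 6 above amount to a small manifest update, sketched below. `add_version` is a hypothetical helper that works on the in-memory `datasets` list; it assumes `version` strings of the form `v<N>` and the `<name>-v<N>.jsonl` filename pattern seen in the example manifest. Creating the JSONL file itself (step 3) is out of scope here.

```python
import copy


def add_version(
    datasets: list[dict],
    new_entry: dict,
    deprecate_versions: tuple[str, ...] = (),
) -> list[dict]:
    """Return a copy of `datasets` with `new_entry` appended as the next version.

    `new_entry` must contain "name"; "version" and "file" are filled in here.
    Versions listed in `deprecate_versions` (e.g. ("v1",)) are re-tagged.
    """
    updated = copy.deepcopy(datasets)  # never mutate the caller's manifest
    family = [d for d in updated if d["name"] == new_entry["name"]]

    # Steps 1-2: find the latest version in this family and increment it.
    latest = max((int(d["version"].removeprefix("v")) for d in family), default=0)
    next_version = f"v{latest + 1}"

    # Step 6 (optional): mark the requested older versions as deprecated.
    for entry in family:
        if entry["version"] in deprecate_versions:
            entry["tag"] = "deprecated"

    # Step 4: append the new entry with its derived version and filename.
    entry = dict(new_entry)
    entry["version"] = next_version
    entry["file"] = f"{entry['name']}-{next_version}.jsonl"
    updated.append(entry)
    return updated
```

Taking an explicit list of versions to deprecate, rather than deprecating everything older, leaves a `baseline` entry untouched unless the caller names it.
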
> ⚠️ **DO NOT stop here.** After creating a new dataset version, continue to the Dataset Update Loop below.

## Dataset Update Loop — Eval → Analyze → Optimize → Re-Eval

When a dataset is updated (new rows, better coverage, new failure modes), run this loop to validate the agent against the harder test suite:

```
[1] Eval with new dataset (v2) using same agent version
    │
    ▼
[2] Compare: eval on v1 vs eval on v2 (same agent, different datasets)
    │
    ▼
[3] Analyze score changes — expect some drops (harder tests ≠ worse agent)
    │
    ▼
[4] Optimize agent prompt based on NEW failure patterns only
    │
    ▼
[5] Re-eval optimized agent on v2 dataset → compare to pre-optimization
    │
    ▼
[6] If satisfied → tag v2 as `prod`, archive v1
```

### ⛔ Guardrails for This Loop

- **Never remove dataset rows to recover scores.** If eval scores drop after a dataset update, the dataset is likely exposing real gaps. Removing hard cases defeats the purpose.
- **Never weaken evaluators to recover scores.** Do not lower thresholds, remove evaluators, or switch to easier scoring when scores drop on an expanded dataset.
- **Distinguish dataset difficulty from agent regression.** A score drop on a harder dataset is expected and healthy — it means test coverage improved. Only flag as regression when the same dataset + same evaluators produce worse scores on a new agent version.
- **Optimize for NEW failure patterns only.** When optimizing the agent prompt after a dataset update, target the newly added test cases. Do not re-optimize for cases that were already passing.

164 
165To understand how a dataset evolved between versions:
166 
167```bash
168# Count examples per version
169wc -l .foundry/datasets/support-bot-prod-traces-v*.jsonl
170 
171# Diff example queries between versions
172jq -r '.query' .foundry/datasets/support-bot-prod-traces-v2.jsonl | sort > /tmp/v2-queries.txt
173jq -r '.query' .foundry/datasets/support-bot-prod-traces-v3.jsonl | sort > /tmp/v3-queries.txt
174diff /tmp/v2-queries.txt /tmp/v3-queries.txt
175```
176 
177## Next Steps
178 
179- **Organize into splits** → [Dataset Organization](dataset-organization.md)
180- **Run evaluation with pinned version** → [observe skill Step 2](../../observe/references/evaluate-step.md)
181- **Track lineage** → [Eval Lineage](eval-lineage.md)