foundry-agent/eval-datasets/references/dataset-versioning.md
# Dataset Versioning — Version Management & Tagging

Manage dataset versions with naming conventions, tagging, and version pinning for reproducible evaluations. This workflow formalizes dataset lifecycle management using existing MCP tools and local conventions.

## Naming Convention

Use the pattern `<agent-name>-<source>-v<N>`:

| Component | Values | Example |
|-----------|--------|---------|
| `<agent-name>` | Selected environment's `agentName` from the selected metadata file | `support-bot-prod` |
| `<source>` | `traces`, `synthetic`, `manual`, `combined` | `traces` |
| `v<N>` | Incremental version number | `v3` |

`<agent-name>` already refers to the environment-specific deployed Foundry agent name. If that value includes the environment key, do **not** append the environment again.

**Full examples:**
- `support-bot-prod-traces-v1` — first production dataset from trace harvesting
- `support-bot-dev-synthetic-v2` — second synthetic dataset
- `support-bot-prod-combined-v5` — fifth production dataset combining traces + manual examples

## Tagging Conventions

Tags are stored in `.foundry/datasets/manifest.json` alongside dataset metadata:

| Tag | Meaning | When to Apply |
|-----|---------|---------------|
| `baseline` | Reference dataset for comparison | When establishing a new evaluation baseline |
| `prod` | Dataset used for current production evaluation | After successful deployment |
| `canary` | Dataset for canary/staging evaluation | During staged rollout |
| `regression-<date>` | Dataset that caught a regression | When a regression is detected |
| `deprecated` | Dataset no longer in active use | When replaced by a newer version |

## Version Pinning

Pin evaluations to a specific dataset version to ensure reproducible, comparable results:

### Local Pinning (JSONL Datasets)

When using local JSONL files, reference the exact filename in evaluation runs:

```
.foundry/datasets/support-bot-prod-traces-v3.jsonl  ← pinned by filename
```

Pass the contents via `inputData` parameter in **`evaluation_agent_batch_eval_create`**.

### Server-Side Version Discovery

Use `evaluation_dataset_versions_get` to list all versions of a dataset registered in Foundry:

```
evaluation_dataset_versions_get(projectEndpoint, datasetName: "<agent-name>-<source>")
```

Use `evaluation_dataset_get` without a name to list all datasets in the project:

```
evaluation_dataset_get(projectEndpoint)
```

> 💡 **Tip:** Server-side versions are available after syncing via [Trace-to-Dataset → Step 5](trace-to-dataset.md#step-5--sync-local-cache-with-foundry-optional). Local `manifest.json` remains useful for lineage metadata (source, harvestRule, reviewedBy) not stored server-side.

## Manifest File

Track all dataset versions, required dataset metadata, tags, and lineage in `.foundry/datasets/manifest.json`:

```json
{
  "datasets": [
    {
      "name": "support-bot-prod-traces",
      "file": "support-bot-prod-traces-v1.jsonl",
      "version": "v1",
      "agent": "support-bot-prod",
      "stage": "traces",
      "datasetUri": "<foundry-dataset-uri-v1>",
      "tag": "deprecated",
      "source": "trace-harvest",
      "harvestRule": "error",
      "timeRange": "2025-01-01 to 2025-01-07",
      "exampleCount": 32,
      "createdAt": "2025-01-08T10:00:00Z",
      "evalRunIds": ["run-abc-123"]
    },
    {
      "name": "support-bot-prod-traces",
      "file": "support-bot-prod-traces-v2.jsonl",
      "version": "v2",
      "agent": "support-bot-prod",
      "stage": "traces",
      "datasetUri": "<foundry-dataset-uri-v2>",
      "tag": "baseline",
      "source": "trace-harvest",
      "harvestRule": "error+latency",
      "timeRange": "2025-01-15 to 2025-01-21",
      "exampleCount": 47,
      "createdAt": "2025-01-22T10:00:00Z",
      "evalRunIds": ["run-def-456", "run-ghi-789"]
    },
    {
      "name": "support-bot-prod-traces",
      "file": "support-bot-prod-traces-v3.jsonl",
      "version": "v3",
      "agent": "support-bot-prod",
      "stage": "traces",
      "datasetUri": "<foundry-dataset-uri-v3>",
      "tag": "prod",
      "source": "trace-harvest",
      "harvestRule": "error+latency+low-eval",
      "timeRange": "2025-02-01 to 2025-02-07",
      "exampleCount": 63,
      "createdAt": "2025-02-08T10:00:00Z",
      "evalRunIds": []
    }
  ]
}
```

Keep `stage` stable for the dataset family (`seed`, `traces`, `curated`, or `prod`) and use `tag` for mutable lifecycle labels such as `baseline`, `prod`, or `deprecated`. Persist `datasetUri` as the Foundry-returned dataset reference so deploy and observe workflows can resolve the registered dataset directly.

## Creating a New Version

1. **Check existing versions**: Read `.foundry/datasets/manifest.json` to find the latest version number
2. **Increment version**: Use `v<N+1>` as the new version
3. **Create dataset**: Via [Trace-to-Dataset](trace-to-dataset.md) or manual JSONL creation
4. **Update manifest**: Add the new entry with metadata
5. **Tag appropriately**: Apply `baseline`, `prod`, or other tags as needed
6. **Deprecate old**: Optionally mark previous versions as `deprecated`

> ⚠️ **DO NOT stop here.** After creating a new dataset version, continue to the Dataset Update Loop below.

## Dataset Update Loop — Eval → Analyze → Optimize → Re-Eval

When a dataset is updated (new rows, better coverage, new failure modes), run this loop to validate the agent against the harder test suite:

```
[1] Eval with new dataset (v2) using same agent version
     │
     ▼
[2] Compare: eval on v1 vs eval on v2 (same agent, different datasets)
     │
     ▼
[3] Analyze score changes — expect some drops (harder tests ≠ worse agent)
     │
     ▼
[4] Optimize agent prompt based on NEW failure patterns only
     │
     ▼
[5] Re-eval optimized agent on v2 dataset → compare to pre-optimization
     │
     ▼
[6] If satisfied → tag v2 as `prod`, archive v1
```

### ⛔ Guardrails for This Loop

- **Never remove dataset rows to recover scores.** If eval scores drop after a dataset update, the dataset is likely exposing real gaps. Removing hard cases defeats the purpose.
- **Never weaken evaluators to recover scores.** Do not lower thresholds, remove evaluators, or switch to easier scoring when scores drop on an expanded dataset.
- **Distinguish dataset difficulty from agent regression.** A score drop on a harder dataset is expected and healthy — it means test coverage improved. Only flag as regression when the same dataset + same evaluators produce worse scores on a new agent version.
- **Optimize for NEW failure patterns only.** When optimizing the agent prompt after a dataset update, target the newly added test cases. Do not re-optimize for cases that were already passing.

## Comparing Versions

To understand how a dataset evolved between versions:

```bash
# Count examples per version
wc -l .foundry/datasets/support-bot-prod-traces-v*.jsonl

# Diff example queries between versions
jq -r '.query' .foundry/datasets/support-bot-prod-traces-v2.jsonl | sort > /tmp/v2-queries.txt
jq -r '.query' .foundry/datasets/support-bot-prod-traces-v3.jsonl | sort > /tmp/v3-queries.txt
diff /tmp/v2-queries.txt /tmp/v3-queries.txt
```

## Next Steps

- **Organize into splits** → [Dataset Organization](dataset-organization.md)
- **Run evaluation with pinned version** → [observe skill Step 2](../../observe/references/evaluate-step.md)
- **Track lineage** → [Eval Lineage](eval-lineage.md)
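
As a companion to the "Creating a New Version" steps, the manifest bookkeeping (steps 1, 2, 4, 5, and 6) can be scripted. The following is a minimal sketch, not part of the skill itself: the helper names `load_manifest`, `latest_version`, `add_version`, and `save_manifest` are hypothetical, and the code assumes the manifest layout shown in the Manifest File section. It does not call any Foundry or MCP tool, and it does not create the JSONL file (step 3).

```python
import json
import re
from pathlib import Path

# Manifest location taken from this document.
MANIFEST_PATH = Path(".foundry/datasets/manifest.json")


def load_manifest(path: Path = MANIFEST_PATH) -> dict:
    """Read the manifest, or start an empty one if the file does not exist yet."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"datasets": []}


def latest_version(manifest: dict, family: str) -> int:
    """Return the highest N among `v<N>` entries for a dataset family (0 if none)."""
    highest = 0
    for entry in manifest.get("datasets", []):
        match = re.fullmatch(r"v(\d+)", str(entry.get("version", "")))
        if entry.get("name") == family and match:
            highest = max(highest, int(match.group(1)))
    return highest


def add_version(manifest: dict, family: str, tag=None, deprecate=(), **metadata) -> dict:
    """Append the next version entry; optionally mark listed versions as deprecated."""
    next_n = latest_version(manifest, family) + 1
    # Step 6 (optional): only versions named explicitly in `deprecate` are retagged.
    for entry in manifest.get("datasets", []):
        if entry.get("name") == family and entry.get("version") in deprecate:
            entry["tag"] = "deprecated"
    # Steps 2 and 4: derive the new version and file name, then record the entry.
    entry = {
        "name": family,
        "file": f"{family}-v{next_n}.jsonl",
        "version": f"v{next_n}",
        **metadata,
    }
    # Step 5: apply a tag only when one was requested.
    if tag is not None:
        entry["tag"] = tag
    manifest.setdefault("datasets", []).append(entry)
    return entry


def save_manifest(manifest: dict, path: Path = MANIFEST_PATH) -> None:
    """Write the manifest back to disk, creating the directory if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(manifest, indent=2) + "\n", encoding="utf-8")


if __name__ == "__main__":
    # In-memory example so the sketch runs without touching a real manifest.
    manifest = {"datasets": [
        {"name": "support-bot-prod-traces", "version": "v2", "tag": "baseline"},
        {"name": "support-bot-prod-traces", "version": "v3", "tag": "prod"},
    ]}
    entry = add_version(manifest, "support-bot-prod-traces", tag="canary",
                        agent="support-bot-prod", stage="traces")
    print(entry["file"])  # support-bot-prod-traces-v4.jsonl
```

In real use, `load_manifest()` and `save_manifest()` would wrap the call to `add_version`. Deprecation is deliberately explicit (a list of version strings) so that an existing `baseline` entry is not retagged by accident; whether that matches your team's lifecycle policy is a judgment call.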