foundry-agent/eval-datasets/references/dataset-versioning.md
# Dataset Versioning — Version Management & Tagging

Manage dataset versions with naming conventions, tagging, and version pinning for reproducible evaluations. This workflow formalizes dataset lifecycle management using existing MCP tools and local conventions.

## Naming Convention

Use the pattern `<agent-name>-<source>-v<N>`:

| Component | Values | Example |
|-----------|--------|---------|
| `<agent-name>` | Selected environment's `agentName` from the selected metadata file | `support-bot-prod` |
| `<source>` | `traces`, `synthetic`, `manual`, `combined` | `traces` |
| `v<N>` | Incremental version number | `v3` |

`<agent-name>` already refers to the environment-specific deployed Foundry agent name. If that value includes the environment key, do **not** append the environment again.

**Full examples:**
- `support-bot-prod-traces-v1` — first production dataset from trace harvesting
- `support-bot-dev-synthetic-v2` — second synthetic dataset
- `support-bot-prod-combined-v5` — fifth production dataset combining traces + manual examples

## Tagging Conventions

Tags are stored in `.foundry/datasets/manifest.json` alongside dataset metadata:

| Tag | Meaning | When to Apply |
|-----|---------|---------------|
| `baseline` | Reference dataset for comparison | When establishing a new evaluation baseline |
| `prod` | Dataset used for current production evaluation | After successful deployment |
| `canary` | Dataset for canary/staging evaluation | During staged rollout |
| `regression-<date>` | Dataset that caught a regression | When a regression is detected |
| `deprecated` | Dataset no longer in active use | When replaced by a newer version |
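Because tags live in the local manifest rather than on the server, retagging is just an edit to `manifest.json`. A minimal sketch of one way to rotate tags when a new version is promoted, assuming `jq` is available and the manifest layout shown under **Manifest File** below (the dataset name and versions are illustrative):

```bash
# Hypothetical example: promote v3 to `prod` and demote v2 to `deprecated`
# for the support-bot-prod-traces dataset family.
jq '(.datasets[] | select(.name == "support-bot-prod-traces" and .version == "v3") | .tag) = "prod"
  | (.datasets[] | select(.name == "support-bot-prod-traces" and .version == "v2") | .tag) = "deprecated"' \
  .foundry/datasets/manifest.json > /tmp/manifest.json \
  && mv /tmp/manifest.json .foundry/datasets/manifest.json
```

Writing to a temporary file and moving it back avoids truncating the manifest mid-edit, since `jq` does not modify files in place.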
## Version Pinning

Pin evaluations to a specific dataset version to ensure reproducible, comparable results:

### Local Pinning (JSONL Datasets)

When using local JSONL files, reference the exact filename in evaluation runs:

```
.foundry/datasets/support-bot-prod-traces-v3.jsonl   ← pinned by filename
```

Pass the contents via the `inputData` parameter of **`evaluation_agent_batch_eval_create`**.

### Server-Side Version Discovery

Use `evaluation_dataset_versions_get` to list all versions of a dataset registered in Foundry:

```
evaluation_dataset_versions_get(projectEndpoint, datasetName: "<agent-name>-<source>")
```

Use `evaluation_dataset_get` without a name to list all datasets in the project:

```
evaluation_dataset_get(projectEndpoint)
```

> 💡 **Tip:** Server-side versions are available after syncing via [Trace-to-Dataset → Step 5](trace-to-dataset.md#step-5--sync-local-cache-with-foundry-optional). Local `manifest.json` remains useful for lineage metadata (source, harvestRule, reviewedBy) not stored server-side.

## Manifest File

Track all dataset versions, required dataset metadata, tags, and lineage in `.foundry/datasets/manifest.json`:

```json
{
  "datasets": [
    {
      "name": "support-bot-prod-traces",
      "file": "support-bot-prod-traces-v1.jsonl",
      "version": "v1",
      "agent": "support-bot-prod",
      "stage": "traces",
      "datasetUri": "<foundry-dataset-uri-v1>",
      "tag": "deprecated",
      "source": "trace-harvest",
      "harvestRule": "error",
      "timeRange": "2025-01-01 to 2025-01-07",
      "exampleCount": 32,
      "createdAt": "2025-01-08T10:00:00Z",
      "evalRunIds": ["run-abc-123"]
    },
    {
      "name": "support-bot-prod-traces",
      "file": "support-bot-prod-traces-v2.jsonl",
      "version": "v2",
      "agent": "support-bot-prod",
      "stage": "traces",
      "datasetUri": "<foundry-dataset-uri-v2>",
      "tag": "baseline",
      "source": "trace-harvest",
      "harvestRule": "error+latency",
      "timeRange": "2025-01-15 to 2025-01-21",
      "exampleCount": 47,
      "createdAt": "2025-01-22T10:00:00Z",
      "evalRunIds": ["run-def-456", "run-ghi-789"]
    },
    {
      "name": "support-bot-prod-traces",
      "file": "support-bot-prod-traces-v3.jsonl",
      "version": "v3",
      "agent": "support-bot-prod",
      "stage": "traces",
      "datasetUri": "<foundry-dataset-uri-v3>",
      "tag": "prod",
      "source": "trace-harvest",
      "harvestRule": "error+latency+low-eval",
      "timeRange": "2025-02-01 to 2025-02-07",
      "exampleCount": 63,
      "createdAt": "2025-02-08T10:00:00Z",
      "evalRunIds": []
    }
  ]
}
```

Keep `stage` stable for the dataset family (`seed`, `traces`, `curated`, or `prod`) and use `tag` for mutable lifecycle labels such as `baseline`, `prod`, or `deprecated`. Persist `datasetUri` as the Foundry-returned dataset reference so deploy and observe workflows can resolve the registered dataset directly.

## Creating a New Version

1. **Check existing versions**: Read `.foundry/datasets/manifest.json` to find the latest version number
2. **Increment version**: Use `v<N+1>` as the new version (see the sketch after this list)
3. **Create dataset**: Via [Trace-to-Dataset](trace-to-dataset.md) or manual JSONL creation
4. **Update manifest**: Add the new entry with metadata
5. **Tag appropriately**: Apply `baseline`, `prod`, or other tags as needed
6. **Deprecate old**: Optionally mark previous versions as `deprecated`
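Steps 1–2 can be scripted against the manifest. A minimal sketch, assuming `jq` is available and the manifest layout shown above; the `support-bot-prod-traces` family name is illustrative:

```bash
# Hypothetical helper: find the highest existing version for a dataset family
# and derive the next version number and filename.
NAME="support-bot-prod-traces"

# Highest existing version number for this family (0 if none registered yet).
LATEST=$(jq -r --arg name "$NAME" \
  '[.datasets[] | select(.name == $name) | .version | ltrimstr("v") | tonumber] | max // 0' \
  .foundry/datasets/manifest.json)

NEXT="v$((LATEST + 1))"
echo "Next dataset file: .foundry/datasets/${NAME}-${NEXT}.jsonl"
```

Deriving `v<N+1>` from the manifest rather than from the files on disk keeps the manifest as the single source of truth for version numbering.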
> ⚠️ **DO NOT stop here.** After creating a new dataset version, continue to the Dataset Update Loop below.

## Dataset Update Loop — Eval → Analyze → Optimize → Re-Eval

When a dataset is updated (new rows, better coverage, new failure modes), run this loop to validate the agent against the harder test suite:

```
[1] Eval with new dataset (v2) using same agent version
     │
     ▼
[2] Compare: eval on v1 vs eval on v2 (same agent, different datasets)
     │
     ▼
[3] Analyze score changes — expect some drops (harder tests ≠ worse agent)
     │
     ▼
[4] Optimize agent prompt based on NEW failure patterns only
     │
     ▼
[5] Re-eval optimized agent on v2 dataset → compare to pre-optimization
     │
     ▼
[6] If satisfied → tag v2 as `prod`, archive v1
```

### ⛔ Guardrails for This Loop

- **Never remove dataset rows to recover scores.** If eval scores drop after a dataset update, the dataset is likely exposing real gaps. Removing hard cases defeats the purpose.
- **Never weaken evaluators to recover scores.** Do not lower thresholds, remove evaluators, or switch to easier scoring when scores drop on an expanded dataset.
- **Distinguish dataset difficulty from agent regression.** A score drop on a harder dataset is expected and healthy — it means test coverage improved. Only flag as regression when the same dataset + same evaluators produce worse scores on a new agent version.
- **Optimize for NEW failure patterns only.** When optimizing the agent prompt after a dataset update, target the newly added test cases. Do not re-optimize for cases that were already passing.

## Comparing Versions

To understand how a dataset evolved between versions:

```bash
# Count examples per version
wc -l .foundry/datasets/support-bot-prod-traces-v*.jsonl

# Diff example queries between versions
jq -r '.query' .foundry/datasets/support-bot-prod-traces-v2.jsonl | sort > /tmp/v2-queries.txt
jq -r '.query' .foundry/datasets/support-bot-prod-traces-v3.jsonl | sort > /tmp/v3-queries.txt
diff /tmp/v2-queries.txt /tmp/v3-queries.txt
```

## Next Steps

- **Organize into splits** → [Dataset Organization](dataset-organization.md)
- **Run evaluation with pinned version** → [observe skill Step 2](../../observe/references/evaluate-step.md)
- **Track lineage** → [Eval Lineage](eval-lineage.md)