Dataset Versioning — Version Management & Tagging

Manage dataset versions with naming conventions, tagging, and version pinning for reproducible evaluations. This workflow formalizes dataset lifecycle management using existing MCP tools and local conventions.

Naming Convention

Use the pattern <agent-name>-<source>-v<N>:

Component	Values	Example
`<agent-name>`	Selected environment's `agentName` from the selected metadata file	`support-bot-prod`
`<source>`	`traces`, `synthetic`, `manual`, `combined`	`traces`
`v<N>`	Incremental version number	`v3`

<agent-name> already refers to the environment-specific deployed Foundry agent name. If that value includes the environment key, do not append the environment again.

Full examples:

support-bot-prod-traces-v1 — first production dataset from trace harvesting
support-bot-dev-synthetic-v2 — second synthetic dataset
support-bot-prod-combined-v5 — fifth production dataset combining traces + manual examples

Tagging Conventions

Tags are stored in .foundry/datasets/manifest.json alongside dataset metadata:

Tag	Meaning	When to Apply
`baseline`	Reference dataset for comparison	When establishing a new evaluation baseline
`prod`	Dataset used for current production evaluation	After successful deployment
`canary`	Dataset for canary/staging evaluation	During staged rollout
`regression-<date>`	Dataset that caught a regression	When a regression is detected
`deprecated`	Dataset no longer in active use	When replaced by a newer version

Version Pinning

Pin evaluations to a specific dataset version to ensure reproducible, comparable results:

Local Pinning (JSONL Datasets)

When using local JSONL files, reference the exact filename in evaluation runs:

.foundry/datasets/support-bot-prod-traces-v3.jsonl  ← pinned by filename

Pass the contents via inputData parameter in evaluation_agent_batch_eval_create.

Server-Side Version Discovery

Use evaluation_dataset_versions_get to list all versions of a dataset registered in Foundry:

evaluation_dataset_versions_get(projectEndpoint, datasetName: "<agent-name>-<source>")

Use evaluation_dataset_get without a name to list all datasets in the project:

evaluation_dataset_get(projectEndpoint)

💡 Tip: Server-side versions are available after syncing via Trace-to-Dataset → Step 5. Local manifest.json remains useful for lineage metadata (source, harvestRule, reviewedBy) not stored server-side.

Manifest File

Track all dataset versions, required dataset metadata, tags, and lineage in .foundry/datasets/manifest.json:

{
  "datasets": [
    {
      "name": "support-bot-prod-traces",
      "file": "support-bot-prod-traces-v1.jsonl",
      "version": "v1",
      "agent": "support-bot-prod",
      "stage": "traces",
      "datasetUri": "<foundry-dataset-uri-v1>",
      "tag": "deprecated",
      "source": "trace-harvest",
      "harvestRule": "error",
      "timeRange": "2025-01-01 to 2025-01-07",
      "exampleCount": 32,
      "createdAt": "2025-01-08T10:00:00Z",
      "evalRunIds": ["run-abc-123"]
    },
    {
      "name": "support-bot-prod-traces",
      "file": "support-bot-prod-traces-v2.jsonl",
      "version": "v2",
      "agent": "support-bot-prod",
      "stage": "traces",
      "datasetUri": "<foundry-dataset-uri-v2>",
      "tag": "baseline",
      "source": "trace-harvest",
      "harvestRule": "error+latency",
      "timeRange": "2025-01-15 to 2025-01-21",
      "exampleCount": 47,
      "createdAt": "2025-01-22T10:00:00Z",
      "evalRunIds": ["run-def-456", "run-ghi-789"]
    },
    {
      "name": "support-bot-prod-traces",
      "file": "support-bot-prod-traces-v3.jsonl",
      "version": "v3",
      "agent": "support-bot-prod",
      "stage": "traces",
      "datasetUri": "<foundry-dataset-uri-v3>",
      "tag": "prod",
      "source": "trace-harvest",
      "harvestRule": "error+latency+low-eval",
      "timeRange": "2025-02-01 to 2025-02-07",
      "exampleCount": 63,
      "createdAt": "2025-02-08T10:00:00Z",
      "evalRunIds": []
    }
  ]
}

Keep stage stable for the dataset family (seed, traces, curated, or prod) and use tag for mutable lifecycle labels such as baseline, prod, or deprecated. Persist datasetUri as the Foundry-returned dataset reference so deploy and observe workflows can resolve the registered dataset directly.

Creating a New Version

Check existing versions: Read .foundry/datasets/manifest.json to find the latest version number
Increment version: Use v<N+1> as the new version
Create dataset: Via Trace-to-Dataset or manual JSONL creation
Update manifest: Add the new entry with metadata
Tag appropriately: Apply baseline, prod, or other tags as needed
Deprecate old: Optionally mark previous versions as deprecated

⚠️ DO NOT stop here. After creating a new dataset version, continue to the Dataset Update Loop below.

Dataset Update Loop — Eval → Analyze → Optimize → Re-Eval

When a dataset is updated (new rows, better coverage, new failure modes), run this loop to validate the agent against the harder test suite:

[1] Eval with new dataset (v2) using same agent version
    │
    ▼
[2] Compare: eval on v1 vs eval on v2 (same agent, different datasets)
    │
    ▼
[3] Analyze score changes — expect some drops (harder tests ≠ worse agent)
    │
    ▼
[4] Optimize agent prompt based on NEW failure patterns only
    │
    ▼
[5] Re-eval optimized agent on v2 dataset → compare to pre-optimization
    │
    ▼
[6] If satisfied → tag v2 as `prod`, archive v1

⛔ Guardrails for This Loop

Never remove dataset rows to recover scores. If eval scores drop after a dataset update, the dataset is likely exposing real gaps. Removing hard cases defeats the purpose.
Never weaken evaluators to recover scores. Do not lower thresholds, remove evaluators, or switch to easier scoring when scores drop on an expanded dataset.
Distinguish dataset difficulty from agent regression. A score drop on a harder dataset is expected and healthy — it means test coverage improved. Only flag as regression when the same dataset + same evaluators produce worse scores on a new agent version.
Optimize for NEW failure patterns only. When optimizing the agent prompt after a dataset update, target the newly added test cases. Do not re-optimize for cases that were already passing.

Comparing Versions

To understand how a dataset evolved between versions:

# Count examples per version
wc -l .foundry/datasets/support-bot-prod-traces-v*.jsonl

# Diff example queries between versions
jq -r '.query' .foundry/datasets/support-bot-prod-traces-v2.jsonl | sort > /tmp/v2-queries.txt
jq -r '.query' .foundry/datasets/support-bot-prod-traces-v3.jsonl | sort > /tmp/v3-queries.txt
diff /tmp/v2-queries.txt /tmp/v3-queries.txt

Next Steps

Organize into splits → Dataset Organization
Run evaluation with pinned version → observe skill Step 2
Track lineage → Eval Lineage

Preparing the source view

Microsoft Foundry Skill

foundry-agent/eval-datasets/references/dataset-versioning.md