# Evaluation Datasets — Trace-to-Dataset Pipeline & Lifecycle Management

Manage the full lifecycle of evaluation datasets for a Foundry agent: harvesting production traces into the selected agent root's local `.foundry` cache, curating versioned test datasets, tracking evaluation quality over time, and syncing approved updates back to Foundry when needed.

## When to Use This Skill

USE FOR: create dataset from traces, harvest traces into dataset, build test dataset, dataset versioning, version my dataset, tag dataset, pin dataset version, organize datasets, dataset splits, curate test cases, review trace candidates, evaluation trending, metrics over time, eval regression, regression detection, compare evaluations over time, dataset comparison, evaluation lineage, trace to dataset pipeline, annotation review, production traces to test cases.

> ⚠️ **DO NOT manually run** KQL queries to extract datasets or call `evaluation_dataset_create` **without reading this skill first.** This skill defines the correct trace extraction patterns, schema transformation, cache rules, versioning conventions, and quality gates that raw tools do not enforce.

> 💡 **Tip:** This skill complements the [observe skill](../observe/observe.md) (eval-driven optimization loop) and the [trace skill](../trace/trace.md) (production trace analysis).
> Use this skill when you need to bridge traces and evaluations: turning production data into test cases and tracking evaluation quality over time.

## Quick Reference

| Property | Value |
|----------|-------|
| MCP server | `azure` |
| Key Foundry MCP tools | `evaluation_dataset_create`, `evaluation_dataset_get`, `evaluation_dataset_versions_get`, `evaluation_get`, `evaluation_comparison_create`, `evaluation_comparison_get` |
| Storage tools | `project_connection_list` (discover `AzureStorageAccount` connection), `project_connection_create` (add storage connection) |
| Azure services | Application Insights (via `monitor_resource_log_query`), Azure Blob Storage (dataset sync) |
| Prerequisites | Agent deployed, selected `.foundry/agent-metadata*.yaml` file available, App Insights connected |
| Local cache | `.foundry/datasets/`, `.foundry/results/`, `.foundry/evaluators/` |

## Entry Points

| User Intent | Start At |
|-------------|----------|
| "Create dataset from production traces" / "Harvest traces" | [Trace-to-Dataset Pipeline](references/trace-to-dataset.md) |
| "Version my dataset" / "Tag dataset" / "Pin dataset version" | [Dataset Versioning](references/dataset-versioning.md) |
| "Organize my datasets" / "Dataset splits" / "Filter datasets" | [Dataset Organization](references/dataset-organization.md) |
| "Review trace candidates" / "Curate test cases" | [Dataset Curation](references/dataset-curation.md) |
| "Show eval metrics over time" / "Evaluation trending" | [Eval Trending](references/eval-trending.md) |
| "Did my agent regress?" / "Regression detection" | [Eval Regression](references/eval-regression.md) |
| "Compare datasets" / "Experiment comparison" / "A/B test" | [Dataset Comparison](references/dataset-comparison.md) |
| "Sync dataset to Foundry" / "Refresh local dataset cache" | [Trace-to-Dataset Pipeline -> Step 5](references/trace-to-dataset.md#step-5--sync-local-cache-with-foundry-optional) |
| "Trace my evaluation lineage" / "Audit eval history" | [Eval Lineage](references/eval-lineage.md) |
| "Generate eval dataset" / "Create seed dataset" / "Generate test cases for my agent" | [Generate Seed Dataset](references/generate-seed-dataset.md) |

## Before Starting — Detect Current State

1. Resolve the target agent root, selected metadata file, and environment from `.foundry/agent-metadata*.yaml`.
2. Confirm the selected environment's `projectEndpoint`, `agentName`, and observability settings.
3. Check `.foundry/datasets/`, `.foundry/results/`, and `.foundry/datasets/manifest.json` in the selected agent root only.
4. Check whether `evaluation_dataset_get` returns server-side datasets for the same environment.
5. Route to the appropriate entry point based on user intent.

## The Foundry Flywheel

```text
Production Agent -> [1] Trace (App Insights + OTel)
-> [2] Harvest (KQL extraction)
-> [3] Curate (human review)
-> [4] Dataset Cache (.foundry/datasets, versioned)
-> [5] Sync to Foundry (optional refresh/push)
-> [6] Evaluate (batch eval)
-> [7] Analyze (trending + regression)
-> [8] Compare (agent versions OR dataset versions)
-> [9] Deploy -> back to [1]
```

Each cycle makes the test suite harder and more representative. Production failures from release N become regression tests for release N+1.

## Behavioral Rules

1. **Always show KQL queries.** Before executing any trace extraction query, display it in a code block. Never run queries silently.
2. **Scope to time ranges.** Always include a time range in KQL queries (default: last 7 days for trace harvesting). Ask the user for the range if not specified.
3. **Require human review.** Never auto-commit harvested traces to a dataset without showing candidates to the user first. The curation step is mandatory.
4. **Use dataset naming conventions.** Follow the naming conventions below and keep local filenames aligned with the registered Foundry dataset name/version.
5. **Treat local files as cache.** Reuse `.foundry/datasets/` and `.foundry/evaluators/` when they already match the selected environment in the selected agent root. Offer a refresh when the user asks or when remote state has changed.
6. **Stay inside the selected agent root.** After resolving the agent root, inspect only that folder's `.foundry/` cache and source context. Never merge sibling agent folders.
7. **Persist artifacts.** Save datasets to `.foundry/datasets/`, evaluation results to `.foundry/results/`, and track lineage in `.foundry/datasets/manifest.json`.
8. **Keep evaluation suites aligned.** Update the selected environment's `evaluationSuites[]` in the selected metadata file whenever a dataset version, evaluator set, or suite tags change. Local flows should default to `agent-metadata.yaml`; prod- or CI-targeted flows can use `agent-metadata.<env>.yaml`. If the environment still uses older `testSuites[]` or legacy `testCases[]`, treat that list as the current suite source for this session and rewrite it as `evaluationSuites[]` on the next metadata save.
9. **Confirm before overwriting.** If a dataset version or cache file already exists, warn the user and ask for confirmation before replacing or refreshing it.
10. **Sync to Foundry when requested or needed.** After saving datasets locally, refresh or register them in Foundry only when the user asks or the workflow needs shared/CI usage.
11. **Never remove dataset rows or weaken evaluators to recover scores.** Score drops after a dataset update are expected - harder tests expose real gaps. Optimize the agent for new failure patterns; do not shrink the test suite.
12. **Match eval parameter names exactly.** Use `evaluationId` when creating grouped runs, but use `evalId` for `evaluation_get` and comparison/trending lookups.

## Dataset Naming and Metadata Conventions

| Dataset type | Foundry dataset name | Foundry dataset version | Typical local file | Metadata stage |
|--------------|----------------------|-------------------------|--------------------|----------------|
| Seed dataset | `<agent-name>-eval-seed` | `v1` | `.foundry/datasets/<agent-name>-eval-seed-v1.jsonl` | `seed` |
| Trace-harvested dataset | `<agent-name>-traces` | `v<N>` | `.foundry/datasets/<agent-name>-traces-v<N>.jsonl` | `traces` |
| Curated/refined dataset | `<agent-name>-curated` | `v<N>` | `.foundry/datasets/<agent-name>-curated-v<N>.jsonl` | `curated` |
| Production-ready dataset | `<agent-name>-prod` | `v<N>` | `.foundry/datasets/<agent-name>-prod-v<N>.jsonl` | `prod` |

Here `<agent-name>` means the selected environment's `environments.<env>.agentName` from the selected metadata file. If that deployed agent name already includes the environment (for example, `support-agent-dev`), do **not** append the environment key a second time.

Local dataset filenames must start with the selected Foundry agent name (`environments.<env>.agentName` in the selected metadata file). Put stage and version suffixes **after** that prefix so cache files sort and group by agent first. Keep the Foundry dataset name stable across versions.
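The naming convention above is mechanical enough to sketch in code. Below is a minimal illustrative helper; the `dataset_names` function, the `STAGE_SUFFIX` map, and the returned entry shape are this sketch's own assumptions, not part of the Foundry MCP tool surface. It derives the stable Foundry dataset name, the `v<N>` version string, and the agent-prefixed local cache filename:

```python
# Illustrative only: derives names per the conventions table above.
# `dataset_names` and the entry shape are assumptions for this sketch,
# not part of the Foundry MCP tool surface.

STAGE_SUFFIX = {
    "seed": "eval-seed",   # seed dataset, typically v1
    "traces": "traces",    # trace-harvested dataset
    "curated": "curated",  # curated/refined dataset
    "prod": "prod",        # production-ready dataset
}

def dataset_names(agent_name: str, stage: str, version: int) -> dict:
    """Return a metadata entry for one registered dataset version.

    The Foundry dataset name stays stable across versions; the version
    lives in the `version` field (v<N>), and only the local filename
    carries the -v<N> suffix for cache readability.
    """
    name = f"{agent_name}-{STAGE_SUFFIX[stage]}"
    version_str = f"v{version}"
    return {
        "agent": agent_name,   # required metadata fields, per this skill
        "stage": stage,        # seed | traces | curated | prod
        "datasetName": name,
        "version": version_str,
        "datasetFile": f".foundry/datasets/{name}-{version_str}.jsonl",
        "datasetUri": None,    # filled in after registration in Foundry
    }

entry = dataset_names("hosted-agent-051-001", "traces", 2)
```

Note how `datasetFile` starts with the agent name so cache files sort and group by agent, while `datasetName` never embeds the version.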
Store the version only in `datasetVersion` (or the manifest's `version` field) using the `v<N>` format, while local filenames keep the `-v<N>` suffix for cache readability.

Required metadata to track with every registered dataset:

- `agent`: the agent name (for example, `hosted-agent-051-001`)
- `stage`: `seed`, `traces`, `curated`, or `prod`
- `version`: a version string such as `v1`, `v2`, or `v3`
- `datasetUri`: always persist the Foundry dataset URI in the selected metadata file alongside the local `datasetFile`, dataset name, and version

> 💡 **Tip:** `evaluation_dataset_create` does not expose a first-class `tags` parameter in the current MCP surface. Persist `agent`, `stage`, and `version` in local metadata (the selected metadata file plus `.foundry/datasets/manifest.json`) so Foundry-side references stay aligned with the cache.

## Related Skills

| User Intent | Skill |
|-------------|-------|
| "Run an evaluation" / "Optimize my agent" | [observe skill](../observe/observe.md) |
| "Search traces" / "Analyze failures" / "Latency analysis" | [trace skill](../trace/trace.md) |
| "Find eval scores for a response ID" / "Link eval results to traces" | [trace skill -> Eval Correlation](../trace/references/eval-correlation.md) |
| "Deploy my agent" | [deploy skill](../deploy/deploy.md) |
| "Debug container issues" | [troubleshoot skill](../troubleshoot/troubleshoot.md) |
| "Review metadata schema" | [Agent Metadata Contract](../../references/agent-metadata-contract.md) |