# Evaluation Datasets — Trace-to-Dataset Pipeline & Lifecycle Management

Manage the full lifecycle of evaluation datasets for a Foundry agent: harvesting production traces into the selected agent root's local `.foundry` cache, curating versioned test datasets, tracking evaluation quality over time, and syncing approved updates back to Foundry when needed.

## When to Use This Skill

USE FOR: create dataset from traces, harvest traces into dataset, build test dataset, dataset versioning, version my dataset, tag dataset, pin dataset version, organize datasets, dataset splits, curate test cases, review trace candidates, evaluation trending, metrics over time, eval regression, regression detection, compare evaluations over time, dataset comparison, evaluation lineage, trace to dataset pipeline, annotation review, production traces to test cases.

> ⚠️ **DO NOT manually run** KQL queries to extract datasets or call `evaluation_dataset_create` **without reading this skill first.** This skill defines the correct trace extraction patterns, schema transformation, cache rules, versioning conventions, and quality gates that raw tools do not enforce.

> 💡 **Tip:** This skill complements the [observe skill](../observe/observe.md) (eval-driven optimization loop) and the [trace skill](../trace/trace.md) (production trace analysis).
> Use this skill when you need to bridge traces and evaluations: turning production data into test cases and tracking evaluation quality over time.

## Quick Reference

| Property | Value |
|----------|-------|
| MCP server | `azure` |
| Key Foundry MCP tools | `evaluation_dataset_create`, `evaluation_dataset_get`, `evaluation_dataset_versions_get`, `evaluation_get`, `evaluation_comparison_create`, `evaluation_comparison_get` |
| Storage tools | `project_connection_list` (discover `AzureStorageAccount` connection), `project_connection_create` (add storage connection) |
| Azure services | Application Insights (via `monitor_resource_log_query`), Azure Blob Storage (dataset sync) |
| Prerequisites | Agent deployed, selected `.foundry/agent-metadata*.yaml` file available, App Insights connected |
| Local cache | `.foundry/datasets/`, `.foundry/results/`, `.foundry/evaluators/` |

## Entry Points

| User Intent | Start At |
|-------------|----------|
| "Create dataset from production traces" / "Harvest traces" | [Trace-to-Dataset Pipeline](references/trace-to-dataset.md) |
| "Version my dataset" / "Tag dataset" / "Pin dataset version" | [Dataset Versioning](references/dataset-versioning.md) |
| "Organize my datasets" / "Dataset splits" / "Filter datasets" | [Dataset Organization](references/dataset-organization.md) |
| "Review trace candidates" / "Curate test cases" | [Dataset Curation](references/dataset-curation.md) |
| "Show eval metrics over time" / "Evaluation trending" | [Eval Trending](references/eval-trending.md) |
| "Did my agent regress?" / "Regression detection" | [Eval Regression](references/eval-regression.md) |
| "Compare datasets" / "Experiment comparison" / "A/B test" | [Dataset Comparison](references/dataset-comparison.md) |
| "Sync dataset to Foundry" / "Refresh local dataset cache" | [Trace-to-Dataset Pipeline -> Step 5](references/trace-to-dataset.md#step-5--sync-local-cache-with-foundry-optional) |
| "Trace my evaluation lineage" / "Audit eval history" | [Eval Lineage](references/eval-lineage.md) |
| "Generate eval dataset" / "Create seed dataset" / "Generate test cases for my agent" | [Generate Seed Dataset](references/generate-seed-dataset.md) |

## Before Starting — Detect Current State

1. Resolve the target agent root, selected metadata file, and environment from `.foundry/agent-metadata*.yaml`.
2. Confirm the selected environment's `projectEndpoint`, `agentName`, and observability settings.
3. Check `.foundry/datasets/`, `.foundry/results/`, and `.foundry/datasets/manifest.json` in the selected agent root only.
4. Check whether `evaluation_dataset_get` returns server-side datasets for the same environment.
5. Route to the appropriate entry point based on user intent.

## The Foundry Flywheel

```text
Production Agent -> [1] Trace (App Insights + OTel)
-> [2] Harvest (KQL extraction)
-> [3] Curate (human review)
-> [4] Dataset Cache (.foundry/datasets, versioned)
-> [5] Sync to Foundry (optional refresh/push)
-> [6] Evaluate (batch eval)
-> [7] Analyze (trending + regression)
-> [8] Compare (agent versions OR dataset versions)
-> [9] Deploy -> back to [1]
```

Each cycle makes the test suite harder and more representative. Production failures from release N become regression tests for release N+1.

## Behavioral Rules

1. **Always show KQL queries.** Before executing any trace extraction query, display it in a code block. Never run queries silently.
2. **Scope to time ranges.** Always include a time range in KQL queries (default: last 7 days for trace harvesting). Ask the user for the range if not specified.
3. **Require human review.** Never auto-commit harvested traces to a dataset without showing candidates to the user first. The curation step is mandatory.
4. **Use dataset naming conventions.** Follow the naming conventions below and keep local filenames aligned with the registered Foundry dataset name/version.
5. **Treat local files as cache.** Reuse `.foundry/datasets/` and `.foundry/evaluators/` when they already match the selected environment in the selected agent root. Offer a refresh when the user asks or when remote state has changed.
6. **Stay inside the selected agent root.** After resolving the agent root, inspect only that folder's `.foundry/` cache and source context. Never merge sibling agent folders.
7. **Persist artifacts.** Save datasets to `.foundry/datasets/`, evaluation results to `.foundry/results/`, and track lineage in `.foundry/datasets/manifest.json`.
8. **Keep evaluation suites aligned.** Update the selected environment's `evaluationSuites[]` in the selected metadata file whenever a dataset version, evaluator set, or suite tags change. Local flows should default to `agent-metadata.yaml`; prod- or CI-targeted flows can use `agent-metadata.<env>.yaml`. If the environment still uses older `testSuites[]` or legacy `testCases[]`, treat that list as the current suite source for this session and rewrite it as `evaluationSuites[]` on the next metadata save.
9. **Confirm before overwriting.** If a dataset version or cache file already exists, warn the user and ask for confirmation before replacing or refreshing it.
10. **Sync to Foundry when requested or needed.** After saving datasets locally, refresh or register them in Foundry only when the user asks or the workflow needs shared/CI usage.
11. **Never remove dataset rows or weaken evaluators to recover scores.** Score drops after a dataset update are expected - harder tests expose real gaps. Optimize the agent for new failure patterns; do not shrink the test suite.
12. **Match eval parameter names exactly.** Use `evaluationId` when creating grouped runs, but use `evalId` for `evaluation_get` and comparison/trending lookups.

## Dataset Naming and Metadata Conventions

| Dataset type | Foundry dataset name | Foundry dataset version | Typical local file | Metadata stage |
|--------------|----------------------|-------------------------|--------------------|----------------|
| Seed dataset | `<agent-name>-eval-seed` | `v1` | `.foundry/datasets/<agent-name>-eval-seed-v1.jsonl` | `seed` |
| Trace-harvested dataset | `<agent-name>-traces` | `v<N>` | `.foundry/datasets/<agent-name>-traces-v<N>.jsonl` | `traces` |
| Curated/refined dataset | `<agent-name>-curated` | `v<N>` | `.foundry/datasets/<agent-name>-curated-v<N>.jsonl` | `curated` |
| Production-ready dataset | `<agent-name>-prod` | `v<N>` | `.foundry/datasets/<agent-name>-prod-v<N>.jsonl` | `prod` |

Here `<agent-name>` means the selected environment's `environments.<env>.agentName` from the selected metadata file. If that deployed agent name already includes the environment (for example, `support-agent-dev`), do **not** append the environment key a second time.

Local dataset filenames must start with the selected Foundry agent name (`environments.<env>.agentName` in the selected metadata file). Put stage and version suffixes **after** that prefix so cache files sort and group by agent first. Keep the Foundry dataset name stable across versions.
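The naming convention above is mechanical enough to sketch in code. Below is a minimal illustrative helper; the `dataset_names` function, the `STAGE_SUFFIX` map, and the returned entry shape are this sketch's own assumptions, not part of the Foundry MCP tool surface. It derives the stable Foundry dataset name, the `v<N>` version string, and the agent-prefixed local cache filename:

```python
# Illustrative only: derives names per the conventions table above.
# `dataset_names` and the entry shape are assumptions for this sketch,
# not part of the Foundry MCP tool surface.

STAGE_SUFFIX = {
    "seed": "eval-seed",   # seed dataset, typically v1
    "traces": "traces",    # trace-harvested dataset
    "curated": "curated",  # curated/refined dataset
    "prod": "prod",        # production-ready dataset
}

def dataset_names(agent_name: str, stage: str, version: int) -> dict:
    """Return a metadata entry for one registered dataset version.

    The Foundry dataset name stays stable across versions; the version
    lives in the `version` field (v<N>), and only the local filename
    carries the -v<N> suffix for cache readability.
    """
    name = f"{agent_name}-{STAGE_SUFFIX[stage]}"
    version_str = f"v{version}"
    return {
        "agent": agent_name,   # required metadata fields, per this skill
        "stage": stage,        # seed | traces | curated | prod
        "datasetName": name,
        "version": version_str,
        "datasetFile": f".foundry/datasets/{name}-{version_str}.jsonl",
        "datasetUri": None,    # filled in after registration in Foundry
    }

entry = dataset_names("hosted-agent-051-001", "traces", 2)
```

Note how `datasetFile` starts with the agent name so cache files sort and group by agent, while `datasetName` never embeds the version.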
Store the version only in `datasetVersion` (or the manifest's `version` field) using the `v<N>` format, while local filenames keep the `-v<N>` suffix for cache readability.

Required metadata to track with every registered dataset:

- `agent`: the agent name (for example, `hosted-agent-051-001`)
- `stage`: `seed`, `traces`, `curated`, or `prod`
- `version`: a version string such as `v1`, `v2`, or `v3`
- `datasetUri`: always persist the Foundry dataset URI in the selected metadata file alongside the local `datasetFile`, dataset name, and version

> 💡 **Tip:** `evaluation_dataset_create` does not expose a first-class `tags` parameter in the current MCP surface. Persist `agent`, `stage`, and `version` in local metadata (the selected metadata file plus `.foundry/datasets/manifest.json`) so Foundry-side references stay aligned with the cache.

## Related Skills

| User Intent | Skill |
|-------------|-------|
| "Run an evaluation" / "Optimize my agent" | [observe skill](../observe/observe.md) |
| "Search traces" / "Analyze failures" / "Latency analysis" | [trace skill](../trace/trace.md) |
| "Find eval scores for a response ID" / "Link eval results to traces" | [trace skill -> Eval Correlation](../trace/references/eval-correlation.md) |
| "Deploy my agent" | [deploy skill](../deploy/deploy.md) |
| "Debug container issues" | [troubleshoot skill](../troubleshoot/troubleshoot.md) |
| "Review metadata schema" | [Agent Metadata Contract](../../references/agent-metadata-contract.md) |