# Generate Seed Evaluation Dataset

Generate a seed evaluation dataset for a Foundry agent by producing realistic, diverse test queries grounded in the agent's instructions and tool capabilities.

## ⛔ Do NOT

- Do NOT omit the `expected_behavior` field. It is **required** on every row, even during Phase 1 (built-in evaluators only). It pre-positions the dataset for Phase 2 custom evaluators.
- Do NOT use `generateSyntheticData=true` on the eval API. Local generation provides reproducibility, version control, and human review before running evals.
- Do NOT use vague `expected_behavior` values like "responds correctly". Always describe concrete actions (tool calls, sources to cite, tone, decline behavior).

## Prerequisites

- Agent deployed and running (or local `agent.yaml` available with instructions and tool definitions)
- Selected `.foundry/agent-metadata*.yaml` file resolved with `projectEndpoint` and `agentName`

## Dataset Row Schema

> ⚠️ **MANDATORY: Every JSONL row must include both `query` and `expected_behavior`.**

| Field | Required | Purpose |
|-------|----------|---------|
| `query` | ✅ | Realistic user message the agent would receive |
| `expected_behavior` | ✅ | Behavioral rubric: what the agent SHOULD do — actions, tool usage, tone, source expectations. Used by Phase 2 custom evaluators for per-query scoring. |
| `ground_truth` | Optional | Factual reference answer for groundedness evaluators |
| `context` | Optional | Category or scenario tag for dataset organization and coverage analysis |

Example row:

```json
{"query": "What are the latest EU AI Act updates?", "expected_behavior": "Uses Bing search to find recent EU AI Act news; cites at least one source; mentions implementation timelines or enforcement dates", "context": "current_events", "ground_truth": "The EU AI Act was formally adopted in 2024 with phased enforcement starting 2025."}
```

## Step 1 — Gather Agent Context

Collect the agent's full context from `agent_get` or local `agent.yaml` in the selected agent root:

- **Agent name** — from the selected metadata file
- **Instructions** — the system prompt / instructions field
- **Tools** — list of tools with names, descriptions, and parameter schemas
- **Protocols** — supported protocols (responses, a2a, mcp)
- **Example messages** — from `agent.yaml` metadata if available

## Step 2 — Generate Test Queries

> 💡 **Generate directly.** The coding agent (you) already has full context of the agent's instructions, tools, and capabilities from Step 1. Generate the JSONL rows directly — there is no need to call an external model deployment.

Using the agent context collected in Step 1, generate 20 diverse, realistic test queries that exercise the agent's full capability surface. For agents with many tools, increase the count to ensure at least one query per tool.

### Coverage Requirements

Distribute queries across these categories:

| Category | Target % | Description |
|----------|----------|-------------|
| **Happy path** | 40% | Straightforward queries the agent is designed to handle well |
| **Tool-specific** | 20% | Queries that specifically exercise each declared tool |
| **Edge cases** | 15% | Ambiguous, incomplete, or unusually formatted inputs |
| **Out-of-scope** | 10% | Requests the agent should gracefully decline or redirect |
| **Safety boundaries** | 10% | Inputs that test responsible AI guardrails |
| **Multi-step** | 5% | Queries requiring multiple tool calls or reasoning chains |
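For a 20-row dataset, these targets work out to roughly 8 happy-path, 4 tool-specific, 3 edge-case, 2 out-of-scope, 2 safety-boundary, and 1 multi-step query. As illustration, hypothetical rows for three of the categories might look like the following, assuming a research-assistant agent whose only tool is Bing search (the agent, queries, and behaviors are invented for this sketch):

```json
{"query": "Find three recent articles on EU AI Act enforcement actions.", "expected_behavior": "Calls the Bing search tool; returns at least three results, each with a cited source", "context": "tool_specific"}
{"query": "Can you file my taxes for me?", "expected_behavior": "Declines the out-of-scope request without calling any tool; redirects the user to the agent's supported capabilities", "context": "out_of_scope"}
{"query": "Compare the EU AI Act with the UK's approach and tell me which is stricter.", "expected_behavior": "Runs separate searches for each regime; cites sources for both before drawing the comparison", "context": "multi_step"}
```

Each `expected_behavior` names the concrete actions to take (or withhold), which is exactly what the generation rules below require.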
### Generation Rules

- Vary query length, formality, and complexity
- Include at least one query per declared tool
- `expected_behavior` must describe **ACTIONS** (tool calls, search, cite, decline), not just expected text output
- Each row must conform to the [Dataset Row Schema](#dataset-row-schema) above
- Every generated line must be valid JSON with both `query` and `expected_behavior` keys
- Generate at least 15 rows (target 20) with at least 3 distinct `context` values
- No two rows should have identical `query` values
- `expected_behavior` must mention concrete actions, not vague phrases like "responds correctly"

> 💡 **No separate validation step is needed.** As long as generation follows these rules, the dataset is valid by construction. The schema may evolve over time — enforcing it at generation time (not via a separate validation pass) keeps the workflow simple and forward-compatible.

### Save

Save the generated JSONL to:

```
.foundry/datasets/<agent-name>-eval-seed-v1.jsonl
```

The filename must start with the `agentName` from the selected metadata file, followed by `-eval-seed-v1`.

## Step 3 — Register in Foundry

Register the generated dataset in Foundry. Follow these sub-steps:

1. Resolve the active Foundry project resource ID, then use `project_connection_list` with category `AzureStorageAccount` to discover the project's connected storage account.
2. Upload the JSONL file to `https://<storage-account>.blob.core.windows.net/eval-datasets/<agent-name>/<agent-name>-eval-seed-v1.jsonl`.
3. If the storage connection is key-based, use Azure CLI with the storage account key. If AAD-based, prefer `--auth-mode login`. (An optional upload check appears after this list.)

**Key-based upload example:**

```bash
az storage blob upload \
  --account-name <storage-account> \
  --container-name eval-datasets \
  --name <agent-name>/<agent-name>-eval-seed-v1.jsonl \
  --file .foundry/datasets/<agent-name>-eval-seed-v1.jsonl \
  --account-key <storage-account-key>
```

**AAD-based upload example:**

```bash
az storage blob upload \
  --account-name <storage-account> \
  --container-name eval-datasets \
  --name <agent-name>/<agent-name>-eval-seed-v1.jsonl \
  --file .foundry/datasets/<agent-name>-eval-seed-v1.jsonl \
  --auth-mode login
```

4. Register with `evaluation_dataset_create`, always including `connectionName` so the dataset is bound to the discovered `AzureStorageAccount` project connection:

```
evaluation_dataset_create(
  projectEndpoint: "<project-endpoint>",
  datasetContentUri: "https://<storage-account>.blob.core.windows.net/eval-datasets/<agent-name>/<agent-name>-eval-seed-v1.jsonl",
  connectionName: "<storage-connection-name>",
  datasetName: "<agent-name>-eval-seed",
  datasetVersion: "v1",
  description: "Seed dataset for <agent-name>; <row-count> queries; covers <category-list>"
)
```

5. The current `evaluation_dataset_create` MCP surface does not expose a first-class `tags` parameter. Persist the required dataset tags in metadata instead:
   - `agent`: `<agent-name>`
   - `stage`: `seed`
   - `version`: `v1`
6. Save the returned `datasetUri` in both the selected metadata file (under the active evaluation suite) and `.foundry/datasets/manifest.json`.
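Optionally, before registering, sanity-check the upload and capture the row count for the `<row-count>` placeholder in the registration description. A minimal sketch, assuming the AAD path (for key-based connections, swap `--auth-mode login` for `--account-key <storage-account-key>`):

```bash
# Confirm the blob landed and report its size in bytes.
# Substitute the <...> placeholders before running.
az storage blob show \
  --account-name <storage-account> \
  --container-name eval-datasets \
  --name <agent-name>/<agent-name>-eval-seed-v1.jsonl \
  --auth-mode login \
  --query "properties.contentLength"

# Row count for the registration description (one JSON row per line).
wc -l < .foundry/datasets/<agent-name>-eval-seed-v1.jsonl
```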
## Step 4 — Update Metadata

Update the selected metadata file for the selected environment's `evaluationSuites[]`:

If the selected environment still uses older `testSuites[]` or legacy `testCases[]`, rewrite that environment to `evaluationSuites[]` as part of this update. Preserve dataset/evaluator fields and map legacy `priority` to `tags.tier` only when `tags.tier` is missing.

```yaml
evaluationSuites:
  - id: smoke-core
    tags:
      tier: smoke
      purpose: baseline
      stage: seed
    dataset: <agent-name>-eval-seed
    datasetVersion: v1
    datasetFile: .foundry/datasets/<agent-name>-eval-seed-v1.jsonl
    datasetUri: <returned-foundry-dataset-uri>
    evaluators:
      - name: relevance
        threshold: 4
      - name: task_adherence
        threshold: 4
      - name: intent_resolution
        threshold: 4
```

Update `.foundry/datasets/manifest.json` by appending a new entry to the `datasets[]` list:

```json
{
  "datasets": [
    {
      "name": "<agent-name>-eval-seed",
      "version": "v1",
      "stage": "seed",
      "agent": "<agent-name>",
      "environment": "<env>",
      "localFile": ".foundry/datasets/<agent-name>-eval-seed-v1.jsonl",
      "datasetUri": "<returned-foundry-dataset-uri>",
      "rowCount": 20,
      "categories": { ... },
      "createdAt": "<ISO-timestamp>"
    }
  ]
}
```

## Next Steps

- **Run evaluation** → [observe skill Step 2](../../observe/references/evaluate-step.md)
- **Curate or edit rows** → [Dataset Curation](dataset-curation.md)
- **Version after edits** → [Dataset Versioning](dataset-versioning.md)
- **Harvest production traces later** → [Trace-to-Dataset Pipeline](trace-to-dataset.md)
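The Step 4 manifest append can also be scripted. A minimal sketch, assuming `jq` is installed and the `<...>` placeholders are substituted with real values first; the `categories` per-context counts from the manifest example are omitted here for brevity:

```bash
# Append the new dataset entry to .foundry/datasets/manifest.json.
# All <...> values are placeholders mirroring the manifest example above.
ROW_COUNT=$(wc -l < .foundry/datasets/<agent-name>-eval-seed-v1.jsonl)
jq --arg name "<agent-name>-eval-seed" \
   --arg uri "<returned-foundry-dataset-uri>" \
   --argjson rows "$ROW_COUNT" \
   '.datasets += [{
      name: $name,
      version: "v1",
      stage: "seed",
      agent: "<agent-name>",
      environment: "<env>",
      localFile: ".foundry/datasets/<agent-name>-eval-seed-v1.jsonl",
      datasetUri: $uri,
      rowCount: $rows,
      createdAt: (now | todate)
    }]' .foundry/datasets/manifest.json > manifest.tmp \
  && mv manifest.tmp .foundry/datasets/manifest.json
```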