# Generate Seed Evaluation Dataset

Generate a seed evaluation dataset for a Foundry agent by producing realistic, diverse test queries grounded in the agent's instructions and tool capabilities.

## ⛔ Do NOT

- Do NOT omit the `expected_behavior` field. It is **required** on every row, even during Phase 1 (built-in evaluators only). It pre-positions the dataset for Phase 2 custom evaluators.
- Do NOT use `generateSyntheticData=true` on the eval API. Local generation provides reproducibility, version control, and human review before running evals.
- Do NOT use vague `expected_behavior` values like "responds correctly". Always describe concrete actions (tool calls, sources to cite, tone, decline behavior).

## Prerequisites

- Agent deployed and running (or local `agent.yaml` available with instructions and tool definitions)
- Selected `.foundry/agent-metadata*.yaml` file resolved with `projectEndpoint` and `agentName`

## Dataset Row Schema

> ⚠️ **MANDATORY: Every JSONL row must include both `query` and `expected_behavior`.**

| Field | Required | Purpose |
|-------|----------|---------|
| `query` | ✅ | Realistic user message the agent would receive |
| `expected_behavior` | ✅ | Behavioral rubric: what the agent SHOULD do — actions, tool usage, tone, source expectations. Used by Phase 2 custom evaluators for per-query scoring. |
| `ground_truth` | Optional | Factual reference answer for groundedness evaluators |
| `context` | Optional | Category or scenario tag for dataset organization and coverage analysis |

Example row:

```json
{"query": "What are the latest EU AI Act updates?", "expected_behavior": "Uses Bing search to find recent EU AI Act news; cites at least one source; mentions implementation timelines or enforcement dates", "context": "current_events", "ground_truth": "The EU AI Act was formally adopted in 2024 with phased enforcement starting 2025."}
```

## Step 1 — Gather Agent Context

Collect the agent's full context from `agent_get` or local `agent.yaml` in the selected agent root:

- **Agent name** — from the selected metadata file
- **Instructions** — the system prompt / instructions field
- **Tools** — list of tools with names, descriptions, and parameter schemas
- **Protocols** — supported protocols (responses, a2a, mcp)
- **Example messages** — from `agent.yaml` metadata if available

## Step 2 — Generate Test Queries

> 💡 **Generate directly.** The coding agent (you) already has full context of the agent's instructions, tools, and capabilities from Step 1. Generate the JSONL rows directly — there is no need to call an external model deployment.

Using the agent context collected in Step 1, generate 20 diverse, realistic test queries that exercise the agent's full capability surface. For agents with many tools, increase the count to ensure at least one query per tool.

### Coverage Requirements

Distribute queries across these categories:

| Category | Target % | Description |
|----------|----------|-------------|
| **Happy path** | 40% | Straightforward queries the agent is designed to handle well |
| **Tool-specific** | 20% | Queries that specifically exercise each declared tool |
| **Edge cases** | 15% | Ambiguous, incomplete, or unusually formatted inputs |
| **Out-of-scope** | 10% | Requests the agent should gracefully decline or redirect |
| **Safety boundaries** | 10% | Inputs that test responsible AI guardrails |
| **Multi-step** | 5% | Queries requiring multiple tool calls or reasoning chains |
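For a 20-row dataset, these targets work out to roughly 8 happy-path, 4 tool-specific, 3 edge-case, 2 out-of-scope, 2 safety-boundary, and 1 multi-step query. As illustration, hypothetical rows for three of the categories might look like the following, assuming a research-assistant agent whose only tool is Bing search (the agent, queries, and behaviors are invented for this sketch):

```json
{"query": "Find three recent articles on EU AI Act enforcement actions.", "expected_behavior": "Calls the Bing search tool; returns at least three results, each with a cited source", "context": "tool_specific"}
{"query": "Can you file my taxes for me?", "expected_behavior": "Declines the out-of-scope request without calling any tool; redirects the user to the agent's supported capabilities", "context": "out_of_scope"}
{"query": "Compare the EU AI Act with the UK's approach and tell me which is stricter.", "expected_behavior": "Runs separate searches for each regime; cites sources for both before drawing the comparison", "context": "multi_step"}
```

Each `expected_behavior` names the concrete actions to take (or withhold), which is exactly what the generation rules below require.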
### Generation Rules

- Vary query length, formality, and complexity
- Include at least one query per declared tool
- `expected_behavior` must describe **ACTIONS** (tool calls, search, cite, decline), not just expected text output
- Each row must conform to the [Dataset Row Schema](#dataset-row-schema) above
- Every generated line must be valid JSON with both `query` and `expected_behavior` keys
- Generate at least 15 rows (target 20) with at least 3 distinct `context` values
- No two rows should have identical `query` values
- `expected_behavior` must mention concrete actions, not vague phrases like "responds correctly"

> 💡 **No separate validation step is needed.** As long as generation follows these rules, the dataset is valid by construction. The schema may evolve over time — enforcing it at generation time (not via a separate validation pass) keeps the workflow simple and forward-compatible.

### Save

Save the generated JSONL to:

```
.foundry/datasets/<agent-name>-eval-seed-v1.jsonl
```

The filename must start with the `agentName` from the selected metadata file, followed by `-eval-seed-v1`.

## Step 3 — Register in Foundry

Register the generated dataset in Foundry. Follow these sub-steps:

1. Resolve the active Foundry project resource ID, then use `project_connection_list` with category `AzureStorageAccount` to discover the project's connected storage account.
2. Upload the JSONL file to `https://<storage-account>.blob.core.windows.net/eval-datasets/<agent-name>/<agent-name>-eval-seed-v1.jsonl`.
3. If the storage connection is key-based, use Azure CLI with the storage account key. If AAD-based, prefer `--auth-mode login`. (An optional upload check appears after this list.)

**Key-based upload example:**

```bash
az storage blob upload \
  --account-name <storage-account> \
  --container-name eval-datasets \
  --name <agent-name>/<agent-name>-eval-seed-v1.jsonl \
  --file .foundry/datasets/<agent-name>-eval-seed-v1.jsonl \
  --account-key <storage-account-key>
```

**AAD-based upload example:**

```bash
az storage blob upload \
  --account-name <storage-account> \
  --container-name eval-datasets \
  --name <agent-name>/<agent-name>-eval-seed-v1.jsonl \
  --file .foundry/datasets/<agent-name>-eval-seed-v1.jsonl \
  --auth-mode login
```

4. Register with `evaluation_dataset_create`, always including `connectionName` so the dataset is bound to the discovered `AzureStorageAccount` project connection:

```
evaluation_dataset_create(
  projectEndpoint: "<project-endpoint>",
  datasetContentUri: "https://<storage-account>.blob.core.windows.net/eval-datasets/<agent-name>/<agent-name>-eval-seed-v1.jsonl",
  connectionName: "<storage-connection-name>",
  datasetName: "<agent-name>-eval-seed",
  datasetVersion: "v1",
  description: "Seed dataset for <agent-name>; <row-count> queries; covers <category-list>"
)
```

5. The current `evaluation_dataset_create` MCP surface does not expose a first-class `tags` parameter. Persist the required dataset tags in metadata instead:
   - `agent`: `<agent-name>`
   - `stage`: `seed`
   - `version`: `v1`
6. Save the returned `datasetUri` in both the selected metadata file (under the active evaluation suite) and `.foundry/datasets/manifest.json`.
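Optionally, before registering, sanity-check the upload and capture the row count for the `<row-count>` placeholder in the registration description. A minimal sketch, assuming the AAD path (for key-based connections, swap `--auth-mode login` for `--account-key <storage-account-key>`):

```bash
# Confirm the blob landed and report its size in bytes.
# Substitute the <...> placeholders before running.
az storage blob show \
  --account-name <storage-account> \
  --container-name eval-datasets \
  --name <agent-name>/<agent-name>-eval-seed-v1.jsonl \
  --auth-mode login \
  --query "properties.contentLength"

# Row count for the registration description (one JSON row per line).
wc -l < .foundry/datasets/<agent-name>-eval-seed-v1.jsonl
```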
## Step 4 — Update Metadata

Update the selected metadata file for the selected environment's `evaluationSuites[]`:

If the selected environment still uses older `testSuites[]` or legacy `testCases[]`, rewrite that environment to `evaluationSuites[]` as part of this update. Preserve dataset/evaluator fields and map legacy `priority` to `tags.tier` only when `tags.tier` is missing.

```yaml
evaluationSuites:
  - id: smoke-core
    tags:
      tier: smoke
      purpose: baseline
      stage: seed
    dataset: <agent-name>-eval-seed
    datasetVersion: v1
    datasetFile: .foundry/datasets/<agent-name>-eval-seed-v1.jsonl
    datasetUri: <returned-foundry-dataset-uri>
    evaluators:
      - name: relevance
        threshold: 4
      - name: task_adherence
        threshold: 4
      - name: intent_resolution
        threshold: 4
```

Update `.foundry/datasets/manifest.json` by appending a new entry to the `datasets[]` list:

```json
{
  "datasets": [
    {
      "name": "<agent-name>-eval-seed",
      "version": "v1",
      "stage": "seed",
      "agent": "<agent-name>",
      "environment": "<env>",
      "localFile": ".foundry/datasets/<agent-name>-eval-seed-v1.jsonl",
      "datasetUri": "<returned-foundry-dataset-uri>",
      "rowCount": 20,
      "categories": { ... },
      "createdAt": "<ISO-timestamp>"
    }
  ]
}
```

## Next Steps

- **Run evaluation** → [observe skill Step 2](../../observe/references/evaluate-step.md)
- **Curate or edit rows** → [Dataset Curation](dataset-curation.md)
- **Version after edits** → [Dataset Versioning](dataset-versioning.md)
- **Harvest production traces later** → [Trace-to-Dataset Pipeline](trace-to-dataset.md)
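The Step 4 manifest append can also be scripted. A minimal sketch, assuming `jq` is installed and the `<...>` placeholders are substituted with real values first; the `categories` per-context counts from the manifest example are omitted here for brevity:

```bash
# Append the new dataset entry to .foundry/datasets/manifest.json.
# All <...> values are placeholders mirroring the manifest example above.
ROW_COUNT=$(wc -l < .foundry/datasets/<agent-name>-eval-seed-v1.jsonl)
jq --arg name "<agent-name>-eval-seed" \
   --arg uri "<returned-foundry-dataset-uri>" \
   --argjson rows "$ROW_COUNT" \
   '.datasets += [{
      name: $name,
      version: "v1",
      stage: "seed",
      agent: "<agent-name>",
      environment: "<env>",
      localFile: ".foundry/datasets/<agent-name>-eval-seed-v1.jsonl",
      datasetUri: $uri,
      rowCount: $rows,
      createdAt: (now | todate)
    }]' .foundry/datasets/manifest.json > manifest.tmp \
  && mv manifest.tmp .foundry/datasets/manifest.json
```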