Generate Seed Evaluation Dataset
Generate a seed evaluation dataset for a Foundry agent by producing realistic, diverse test queries grounded in the agent's instructions and tool capabilities.
Preferred setup: For deployed agents, use the observe workflow's Evaluation Suite Generation first. This manual seed-dataset flow is the fallback when suite/data generation APIs are unavailable, fail, return incomplete artifacts, or the user explicitly wants hand-authored local data.
⛔ Do NOT
- Do NOT omit the expected_behavior field. It is required on every row, even during Phase 1 (built-in evaluators only). It pre-positions the dataset for Phase 2 custom evaluators.
- Do NOT use generateSyntheticData=true on the eval API. Local generation provides reproducibility, version control, and human review before running evals.
- Do NOT use vague expected_behavior values like "responds correctly". Always describe concrete actions (tool calls, sources to cite, tone, decline behavior).
Prerequisites
- Agent deployed and running (or local agent.yaml available with instructions and tool definitions)
- Selected .foundry/agent-metadata*.yaml file resolved with projectEndpoint and agentName
Dataset Row Schema
⚠️ MANDATORY: Every JSONL row must include both query and expected_behavior.
| Field | Required | Purpose |
|---|---|---|
| query | ✅ | Realistic user message the agent would receive |
| expected_behavior | ✅ | Behavioral rubric: what the agent SHOULD do (actions, tool usage, tone, source expectations). Used by Phase 2 custom evaluators for per-query scoring. |
| ground_truth | Optional | Factual reference answer for groundedness evaluators |
| context | Optional | Category or scenario tag for dataset organization and coverage analysis |
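If rows are assembled programmatically, the schema above can be enforced at the moment each row is built. The following is a minimal sketch under that assumption; `make_row` and `to_jsonl_line` are hypothetical helper names, not part of any Foundry tooling.

```python
import json

# Required keys taken from the schema table above.
REQUIRED_FIELDS = ("query", "expected_behavior")


def make_row(query, expected_behavior, ground_truth=None, context=None):
    """Build one dataset row, rejecting rows that miss a required field."""
    row = {"query": query, "expected_behavior": expected_behavior}
    for name in REQUIRED_FIELDS:
        value = row[name]
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"{name} is required and must be a non-empty string")
    # Optional fields are written only when the caller supplies them.
    if ground_truth is not None:
        row["ground_truth"] = ground_truth
    if context is not None:
        row["context"] = context
    return row


def to_jsonl_line(row):
    """Serialize a row as one JSONL line; json.dumps escapes embedded newlines."""
    return json.dumps(row, ensure_ascii=False)
```

Building rows through a single constructor keeps the required-field rule in one place, so a schema change only needs to be made once.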
Example row:
{"query": "What are the latest EU AI Act updates?", "expected_behavior": "Uses Bing search to find recent EU AI Act news; cites at least one source; mentions implementation timelines or enforcement dates", "context": "current_events", "ground_truth": "The EU AI Act was formally adopted in 2024 with phased enforcement starting 2025."}
Step 1 — Gather Agent Context
Collect the agent's full context from agent_get or local agent.yaml in the selected agent root:
- Agent name — from the selected metadata file
- Instructions — the system prompt / instructions field
- Tools — list of tools with names, descriptions, and parameter schemas
- Protocols — supported protocols (e.g. responses, invocations, invocations_ws, a2a, mcp)
- Example messages — from agent.yaml metadata if available
Step 2 — Generate Test Queries
💡 Generate directly. The coding agent (you) already has full context of the agent's instructions, tools, and capabilities from Step 1. Generate the JSONL rows directly — there is no need to call an external model deployment.
Using the agent context collected in Step 1, generate 20 diverse, realistic test queries that exercise the agent's full capability surface. For agents with many tools, increase the count so that every declared tool gets at least one query.
Coverage Requirements
Distribute queries across these categories:
| Category | Target % | Description |
|---|---|---|
| Happy path | 40% | Straightforward queries the agent is designed to handle well |
| Tool-specific | 20% | Queries that specifically exercise each declared tool |
| Edge cases | 15% | Ambiguous, incomplete, or unusually formatted inputs |
| Out-of-scope | 10% | Requests the agent should gracefully decline or redirect |
| Safety boundaries | 10% | Inputs that test responsible AI guardrails |
| Multi-step | 5% | Queries requiring multiple tool calls or reasoning chains |
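The target percentages translate into per-category row counts with simple arithmetic. The sketch below uses largest-remainder rounding so the counts always sum to the requested total. The dictionary keys are illustrative labels for the table's categories, and `allocate_counts` is a hypothetical helper.

```python
from math import floor

# Target shares copied from the coverage table above (percent).
TARGET_SHARES = {
    "happy_path": 40,
    "tool_specific": 20,
    "edge_cases": 15,
    "out_of_scope": 10,
    "safety_boundaries": 10,
    "multi_step": 5,
}


def allocate_counts(total, shares=TARGET_SHARES):
    """Split `total` rows across categories using largest-remainder rounding."""
    exact = {name: total * pct / 100 for name, pct in shares.items()}
    counts = {name: floor(value) for name, value in exact.items()}
    leftover = total - sum(counts.values())
    # Hand leftover rows to the categories with the largest fractional parts.
    by_remainder = sorted(shares, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts
```

For the default of 20 rows, `allocate_counts(20)` gives 8 happy-path, 4 tool-specific, 3 edge-case, 2 out-of-scope, 2 safety-boundary, and 1 multi-step query. If the agent declares more than 4 tools, the tool-specific share of 20 rows is too small to cover each tool, which is one reason to raise the total.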
Generation Rules
- Vary query length, formality, and complexity
- Include at least one query per declared tool
- expected_behavior must describe ACTIONS (tool calls, search, cite, decline), not just expected text output
- Each row must conform to the Dataset Row Schema above
- Every generated line must be valid JSON with both query and expected_behavior keys
- Generate at least 15 rows (target 20) with at least 3 distinct context values
- No two rows should have identical query values
- expected_behavior must mention concrete actions, not vague phrases like "responds correctly"
💡 No separate validation step is needed. As long as generation follows these rules, the dataset is valid by construction. The schema may evolve over time — enforcing it at generation time (not via a separate validation pass) keeps the workflow simple and forward-compatible.
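When rows are produced by a script rather than written by hand, the same rules can be applied inside the generation step itself, before anything is saved. This is a sketch of that idea, not a separate validation pass. `rule_violations` is a hypothetical helper, and the vague-phrase list is illustrative and only includes the one phrase the rules name plus two assumed variants.

```python
# Illustrative list; only "responds correctly" comes from the rules above.
VAGUE_PHRASES = ("responds correctly", "answers correctly", "works as expected")


def rule_violations(rows, min_rows=15, min_contexts=3):
    """Return human-readable rule violations; an empty list means none were found."""
    problems = []
    if len(rows) < min_rows:
        problems.append(f"only {len(rows)} rows; need at least {min_rows}")
    seen_queries = set()
    for index, row in enumerate(rows, start=1):
        for key in ("query", "expected_behavior"):
            if not str(row.get(key, "")).strip():
                problems.append(f"row {index}: missing {key}")
        query = row.get("query")
        if query is not None:
            if query in seen_queries:
                problems.append(f"row {index}: duplicate query")
            seen_queries.add(query)
        behavior = str(row.get("expected_behavior", "")).lower()
        if any(phrase in behavior for phrase in VAGUE_PHRASES):
            problems.append(f"row {index}: vague expected_behavior")
    contexts = {row["context"] for row in rows if row.get("context")}
    if len(contexts) < min_contexts:
        problems.append(
            f"only {len(contexts)} distinct context values; need at least {min_contexts}"
        )
    return problems
```

A phrase list cannot judge whether an expected_behavior really describes an action, so human review of each row still matters.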
Save
Save the generated JSONL to:
.foundry/datasets/<agent-name>-eval-seed-v1.jsonl
The filename must start with agentName from the selected metadata file, followed by -eval-seed-v1.
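The naming convention is easy to get wrong by hand, so a small helper can derive the path from agentName. The sketch below assumes the rows are plain dictionaries; `seed_dataset_path` and `save_rows` are hypothetical names.

```python
import json
from pathlib import Path


def seed_dataset_path(agent_name, root="."):
    """Return the v1 seed dataset path for an agent, relative to `root`."""
    return Path(root) / ".foundry" / "datasets" / f"{agent_name}-eval-seed-v1.jsonl"


def save_rows(rows, agent_name, root="."):
    """Write rows as JSONL (one JSON object per line) and return the path."""
    path = seed_dataset_path(agent_name, root)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")
    return path
```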
Step 3 — Register in Foundry
Register the generated dataset in Foundry. Follow these sub-steps:
- Resolve the active Foundry project resource ID, then use project_connection_list with category AzureStorageAccount to discover the project's connected storage account.
- Upload the JSONL file to https://<storage-account>.blob.core.windows.net/eval-datasets/<agent-name>/<agent-name>-eval-seed-v1.jsonl.
- If the storage connection is key-based, use Azure CLI with the storage account key. If AAD-based, prefer --auth-mode login.
Key-based upload example:
az storage blob upload \
--account-name <storage-account> \
--container-name eval-datasets \
--name <agent-name>/<agent-name>-eval-seed-v1.jsonl \
--file .foundry/datasets/<agent-name>-eval-seed-v1.jsonl \
--account-key <storage-account-key>
AAD-based upload example:
az storage blob upload \
--account-name <storage-account> \
--container-name eval-datasets \
--name <agent-name>/<agent-name>-eval-seed-v1.jsonl \
--file .foundry/datasets/<agent-name>-eval-seed-v1.jsonl \
--auth-mode login
- Register with evaluation_dataset_create, always including connectionName so the dataset is bound to the discovered AzureStorageAccount project connection:
evaluation_dataset_create(
projectEndpoint: "<project-endpoint>",
datasetContentUri: "https://<storage-account>.blob.core.windows.net/eval-datasets/<agent-name>/<agent-name>-eval-seed-v1.jsonl",
connectionName: "<storage-connection-name>",
datasetName: "<agent-name>-eval-seed",
datasetVersion: "v1",
description: "Seed dataset for <agent-name>; <row-count> queries; covers <category-list>"
)
- The current evaluation_dataset_create MCP surface does not expose a first-class tags parameter. Persist the required dataset tags in metadata instead:
  - agent:<agent-name>
  - stage:seed
  - version:v1
- Save the returned datasetUri in both the selected metadata file (under the active evaluation suite) and .foundry/datasets/manifest.json.
Step 4 — Update Metadata
Update the selected metadata file for the selected environment's evaluationSuites[]:
If the selected environment still uses older testSuites[] or legacy testCases[], rewrite that environment to evaluationSuites[] as part of this update. Preserve dataset/evaluator fields and map legacy priority to tags.tier only when tags.tier is missing.
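The legacy rewrite can be sketched as a pure function over the environment's parsed metadata. Several details here are assumptions, because the text above does not specify them: that the environment is available as a dictionary, that priority sits on each legacy suite entry, and that priority is dropped once it has been mapped. `migrate_environment` is a hypothetical helper.

```python
def migrate_environment(env):
    """Return a copy of an environment dict that uses evaluationSuites[]."""
    if "evaluationSuites" in env:
        # Already in the current shape; leave it untouched.
        return dict(env)
    legacy = env.get("testSuites") or env.get("testCases") or []
    suites = []
    for item in legacy:
        # Keep dataset/evaluator fields as-is; priority is handled separately.
        suite = {key: value for key, value in item.items() if key != "priority"}
        tags = dict(suite.get("tags") or {})
        # Map priority to tags.tier only when tags.tier is missing.
        if "priority" in item and "tier" not in tags:
            tags["tier"] = item["priority"]
        if tags:
            suite["tags"] = tags
        suites.append(suite)
    migrated = {k: v for k, v in env.items() if k not in ("testSuites", "testCases")}
    migrated["evaluationSuites"] = suites
    return migrated
```

The function returns a new dictionary and leaves its input unchanged, so the result can be reviewed before the metadata file is rewritten. The target shape of a suite entry for this workflow follows.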
evaluationSuites:
  - id: smoke-core
    tags:
      tier: smoke
      purpose: baseline
      stage: seed
    generationSource: manual-fallback
    dataset: <agent-name>-eval-seed
    datasetVersion: v1
    datasetFile: .foundry/datasets/<agent-name>-eval-seed-v1.jsonl
    datasetUri: <returned-foundry-dataset-uri>
    evaluators:
      - name: relevance
        threshold: 4
      - name: task_adherence
        threshold: 4
      - name: intent_resolution
        threshold: 4
Update .foundry/datasets/manifest.json by appending a new entry to the datasets[] list:
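Appending, rather than overwriting, matters because the manifest may already track other datasets. A minimal sketch of the append, assuming the manifest is plain JSON with a top-level datasets list; `append_manifest_entry` and `seed_entry` are hypothetical helpers, and the shape of categories is not specified in this document, so it is passed through as supplied.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def seed_entry(agent_name, environment, dataset_uri, row_count, categories):
    """Build a manifest entry for the v1 seed dataset."""
    return {
        "name": f"{agent_name}-eval-seed",
        "version": "v1",
        "stage": "seed",
        "agent": agent_name,
        "environment": environment,
        "localFile": f".foundry/datasets/{agent_name}-eval-seed-v1.jsonl",
        "datasetUri": dataset_uri,
        "rowCount": row_count,
        "categories": categories,  # shape assumed; passed through unchanged
        "createdAt": datetime.now(timezone.utc).isoformat(),
    }


def append_manifest_entry(manifest_path, entry):
    """Append `entry` to datasets[], creating the manifest if it is absent."""
    path = Path(manifest_path)
    manifest = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    manifest.setdefault("datasets", []).append(entry)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(manifest, indent=2) + "\n", encoding="utf-8")
    return manifest
```

The resulting file has this shape: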
{
  "datasets": [
    {
      "name": "<agent-name>-eval-seed",
      "version": "v1",
      "stage": "seed",
      "agent": "<agent-name>",
      "environment": "<env>",
      "localFile": ".foundry/datasets/<agent-name>-eval-seed-v1.jsonl",
      "datasetUri": "<returned-foundry-dataset-uri>",
      "rowCount": 20,
      "categories": { ... },
      "createdAt": "<ISO-timestamp>"
    }
  ]
}
Next Steps
- Run evaluation → observe skill Step 2
- Curate or edit rows → Dataset Curation
- Version after edits → Dataset Versioning
- Harvest production traces later → Trace-to-Dataset Pipeline