Step 2 - Run Evaluation
Prerequisites
- Agent deployed and running in the selected environment
- Selected `.foundry/agent-metadata*.yaml` file loaded for the active agent root
- Evaluation suite selected from the environment's `evaluationSuites[]`
- For generated suites: `suiteName` present and verified with `evaluation_suite_get`
- For legacy suites: local dataset and evaluator metadata available in `.foundry/`
Definition of Done — Evaluation Run
A Step 2 evaluation run is complete only when every box below is checked. Do not produce a final "evaluation complete" summary, score table, or report link until all items are done. "Status reached completed" is not a stopping condition — evaluation_get returns metadata only.
- [ ] `evaluation_agent_batch_eval_create` returned an `evalRunId`
- [ ] `evalId` and `evalRunId` mirrored into the selected `.foundry/agent-metadata*.yaml` (`environments.<env>.lastEval.{evalId, evalRunId, runName, suiteName, suiteVersion, agentVersion, startedAt}`) so a later turn can resume
- [ ] Polling reached terminal state (`completed`, `failed`, or `cancelled`)
- [ ] Per-item `output_items` downloaded via the `azure-ai-projects` Python SDK (see Step 3 → Download Results), NOT via `evaluation_get`, NOT via `evaluation_dataset_sas_url_get`
- [ ] Results persisted under `.foundry/results/<env>/<eval-id>/<run-id>.json`
- [ ] Per-item failures and any `passed: null` / `reason: null` items have been clustered (Step 4) before summarizing
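The last two checklist items can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: the helper names are hypothetical, and the flat item shape (a dict with top-level `passed` and `reason` keys) is an assumption, since real `output_items` may nest these fields.

```python
import json
from pathlib import Path


def results_path(root: Path, env: str, eval_id: str, run_id: str) -> Path:
    """Build .foundry/results/<env>/<eval-id>/<run-id>.json under `root`."""
    return root / ".foundry" / "results" / env / eval_id / f"{run_id}.json"


def persist_results(
    root: Path, env: str, eval_id: str, run_id: str, items: list[dict]
) -> Path:
    """Write the downloaded per-item results to the documented location."""
    path = results_path(root, env, eval_id, run_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(items, indent=2), encoding="utf-8")
    return path


def items_needing_clustering(items: list[dict]) -> list[dict]:
    """Return failed items plus items whose `passed` or `reason` is null.

    Assumes each item is a flat dict; adapt the lookups to the real shape.
    """
    return [
        item
        for item in items
        if item.get("passed") is not True or item.get("reason") is None
    ]
```

Anything returned by `items_needing_clustering` would go to Step 4 before a summary is written.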
Run Agent-Target Batch Eval
Use evaluation_agent_batch_eval_create for batch evaluation, even when the selected metadata entry was produced by evaluation-suite generation. Treat the generated suite as the reviewed source of dataset/evaluator metadata, not as the execution API.
| Parameter | Description |
|---|---|
| `projectEndpoint` | Azure AI Project endpoint from the selected metadata file |
| `agentName` | Agent name for the selected environment |
| `agentVersion` | Agent version (string, for example `"1"`) |
| `evaluatorNames` | Array of evaluator names from the selected evaluation suite |
| `evaluationName` | Include environment and evaluation-suite ID |
| `runName` | Include environment, suite ID, and agent version |
| `deploymentName` | Required for LLM-judge evaluators |
| `inputData` | Array of inline test items, each an object with a `query` string (and optional `expected_behavior`). Required for agent-target runs unless `generateSyntheticData=true` is set. The parameter name is `inputData` (not `data`, `inputItems`, or `inputDataItems`). |
| `generateSyntheticData` | Set `true` to skip `inputData` and let the service generate test queries. Requires `generationModelDeploymentName` and `samplesCount`. The service rejects requests with only `datasetName`/`datasetVersion`; it does not auto-resolve generated suite datasets into input rows. |
| `generationModelDeploymentName` | Model deployment used to generate synthetic queries when `generateSyntheticData=true`. |
| `samplesCount` | Number of synthetic queries to generate (15–1000). |
| `evaluationId` | Existing eval group ID, only when evaluator set and thresholds are unchanged |
Before the run, if the selected suite has suiteName, call evaluation_suite_get(projectEndpoint, suiteName, version) and confirm it references the expected dataset/evaluators. Use the suite to select evaluator names, thresholds, and local review artifacts, then run evaluation_agent_batch_eval_create. Run suites tagged tier=smoke first unless the user chooses a broader suite tag or a specific suite.
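The parameter rules above can be checked before the tool call is made. The sketch below assembles the argument object and enforces the two constraints the table states: either inline `inputData` rows or the synthetic-generation trio, and `samplesCount` within 15–1000. The function name and its keyword arguments are hypothetical; only the emitted keys come from the table.

```python
from typing import Any, Optional


def build_batch_eval_args(
    *,
    project_endpoint: str,
    agent_name: str,
    agent_version: str,
    evaluator_names: list[str],
    evaluation_name: str,
    run_name: str,
    deployment_name: Optional[str] = None,
    input_data: Optional[list[dict[str, str]]] = None,
    generation_model_deployment_name: Optional[str] = None,
    samples_count: Optional[int] = None,
    evaluation_id: Optional[str] = None,
) -> dict[str, Any]:
    """Assemble arguments for evaluation_agent_batch_eval_create (hypothetical helper)."""
    args: dict[str, Any] = {
        "projectEndpoint": project_endpoint,
        "agentName": agent_name,
        "agentVersion": str(agent_version),  # the table specifies a string
        "evaluatorNames": list(evaluator_names),
        "evaluationName": evaluation_name,
        "runName": run_name,
    }
    if deployment_name:
        args["deploymentName"] = deployment_name
    if evaluation_id:
        args["evaluationId"] = evaluation_id

    if input_data:
        # Inline rows: every row needs a non-empty `query` string.
        for row in input_data:
            if not isinstance(row.get("query"), str) or not row["query"]:
                raise ValueError("every inputData row needs a non-empty 'query'")
        args["inputData"] = input_data
    else:
        # No inline rows: synthetic generation is the only remaining option.
        if not generation_model_deployment_name or samples_count is None:
            raise ValueError(
                "provide inputData, or generationModelDeploymentName and samplesCount"
            )
        if not 15 <= samples_count <= 1000:
            raise ValueError("samplesCount must be between 15 and 1000")
        args["generateSyntheticData"] = True
        args["generationModelDeploymentName"] = generation_model_deployment_name
        args["samplesCount"] = samples_count
    return args
```

Failing locally with a specific message is cheaper than decoding a service-side rejection after the request has been sent.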
Test Data
Use generated suite datasets for user review and lineage. For the agent-target batch eval tool:
- Pass test rows inline via the `inputData` parameter (array of `{query: "...", expected_behavior?: "..."}` objects). The service does not accept `datasetName`/`datasetVersion` references for agent-target runs; a generated suite dataset must be materialized into `inputData` rows by the caller.
- Reviewed local rows should include `expected_behavior` so rubric-based evaluators and failure analysis can preserve the user's rubric.
- Alternatively, set `generateSyntheticData=true` with `generationModelDeploymentName`, `samplesCount` (15–1000), and optional `outputDatasetName` when the user wants the agent-target run to generate a fresh test set instead of supplying `inputData`.
- Do not call `evaluation_suite_run` for batch eval.
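Materializing a local dataset into `inputData` rows might look like the sketch below. It assumes the reviewed rows live in a JSONL file with a `query` field per line; the file format and the function name are assumptions, not something this document specifies.

```python
import json
from pathlib import Path


def load_input_data(jsonl_path: Path) -> list[dict[str, str]]:
    """Turn a local JSONL dataset into inputData rows (hypothetical helper).

    Keeps only `query` and, when present, `expected_behavior`.
    """
    rows: list[dict[str, str]] = []
    lines = jsonl_path.read_text(encoding="utf-8").splitlines()
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        query = record.get("query")
        if not isinstance(query, str) or not query.strip():
            raise ValueError(f"{jsonl_path}:{line_no}: missing 'query' string")
        row = {"query": query}
        if record.get("expected_behavior"):
            row["expected_behavior"] = record["expected_behavior"]
        rows.append(row)
    return rows
```

Dropping unrelated fields keeps the request limited to the row shape described above.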
⚠️ Parameter-name guardrail: The inline-rows parameter is `inputData`. The service rejects `data`, `inputItems`, and `inputDataItems` with the misleading error `"At least one input data item must be provided ... Set generateSyntheticData=true to auto-generate test queries instead."`; that error means the rows were sent under the wrong key, not that synthetic generation is required.
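A small pre-flight check can catch the misnamed key before the service returns the misleading error. This is an illustrative sketch; the function name is hypothetical and the key list is taken from the guardrail above.

```python
# Keys the guardrail above lists as common mistakes for inline rows.
WRONG_INPUT_KEYS = ("data", "inputItems", "inputDataItems")


def misnamed_input_keys(args: dict) -> list[str]:
    """Return any known-wrong inline-row keys present in the tool arguments."""
    return [key for key in WRONG_INPUT_KEYS if key in args]


def assert_input_key(args: dict) -> None:
    """Raise with a direct message instead of the service's misleading one."""
    wrong = misnamed_input_keys(args)
    if wrong:
        raise ValueError(
            f"rows sent under {wrong}; the parameter name is 'inputData'"
        )
```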
Before setting deploymentName, use model_deployment_get to list actual project deployments and choose one that supports chat completions; do not assume gpt-4o exists.
Parameter Naming Guardrail
| Tool | Correct Group Parameter | Notes |
|---|---|---|
| `evaluation_agent_batch_eval_create` | `evaluationId` | Agent-target batch eval run grouping |
| `evaluation_get` | `evalId` | Use with `isRequestForRuns=true` to list runs in one group |
| `evaluation_comparison_create` | `insightRequest.request.evalId` | Comparison requests take `evalId`, not `evaluationId` |
evaluation_get does not accept evaluationId; switch to evalId after run creation.
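One way to avoid mixing the names up is to keep the mapping in a single place and derive the argument fragment from it. The sketch below encodes the table above; the helper name is hypothetical.

```python
# Where each tool expects the eval group ID, as a path of nested keys.
GROUP_ID_PATH = {
    "evaluation_agent_batch_eval_create": ("evaluationId",),
    "evaluation_get": ("evalId",),
    "evaluation_comparison_create": ("insightRequest", "request", "evalId"),
}


def group_id_args(tool: str, group_id: str) -> dict:
    """Build the argument fragment that carries the group ID for `tool`."""
    nested = group_id
    # Wrap from the innermost key outward to build the nested object.
    for key in reversed(GROUP_ID_PATH[tool]):
        nested = {key: nested}
    return nested
```

For example, `group_id_args("evaluation_get", some_id)` yields `{"evalId": some_id}`, while the comparison tool gets the ID nested under `insightRequest.request`.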
⚠️ Eval-group immutability: Reuse an existing eval group only when dataset, evaluator list, and thresholds are unchanged. If evaluator definitions or thresholds change, create a new evaluation group or suite version.
Auto-Poll for Completion
Immediately after creating the run, poll evaluation_get in a background terminal until completion. Use evalId + isRequestForRuns=true for run lists. The run ID parameter is evalRunId (not runId).
Only surface the final result when status reaches completed, failed, or cancelled.
⚠️ `evaluation_get` returns run metadata only: it does NOT return per-item scores, agent responses, or judge reasons. Once the run reaches terminal state, you MUST immediately follow Step 3 → Download Results and pull `output_items` via the `azure-ai-projects` Python SDK (`client.get_openai_client().evals.runs.output_items.list(...)`). Do not attempt to use `evaluation_dataset_sas_url_get` on the result artifact (`eval-result-<runId>-*`); that endpoint is for evaluation input datasets and returns 500 for result artifacts.
💡 Mirror IDs to metadata immediately. Right after `evaluation_agent_batch_eval_create` returns, write `evalId`, `evalRunId`, `runName`, `suiteName`/`suiteVersion`, `agentVersion`, and `startedAt` to the selected environment's `lastEval` block in `.foundry/agent-metadata*.yaml`. This lets a later turn resume polling or downloading without re-reading chat history. The azd `.env` (`LAST_EVAL_ID`, etc.) is azd-internal and should not be relied on by skill flows.
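The mirroring step can be sketched as a pure update on the parsed metadata. This assumes the YAML file has already been loaded into a plain dict by whatever YAML library the project uses; loading and saving are left to the caller, and the function name is hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def record_last_eval(
    metadata: dict,
    env: str,
    *,
    eval_id: str,
    eval_run_id: str,
    run_name: str,
    suite_name: str,
    suite_version: str,
    agent_version: str,
    started_at: Optional[str] = None,
) -> dict:
    """Write the lastEval block under environments.<env> and return metadata."""
    env_block = metadata.setdefault("environments", {}).setdefault(env, {})
    env_block["lastEval"] = {
        "evalId": eval_id,
        "evalRunId": eval_run_id,
        "runName": run_name,
        "suiteName": suite_name,
        "suiteVersion": suite_version,
        "agentVersion": agent_version,
        # Default to the current UTC time when the caller has no timestamp.
        "startedAt": started_at or datetime.now(timezone.utc).isoformat(),
    }
    return metadata
```

Other keys already present under the environment are left untouched; only `lastEval` is replaced.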
Background Polling Pattern
MCP tools live in the agent's process, so a true detached poller cannot call MCP tools directly. Use one of these concrete patterns instead of saying "ping me later":
- Sentinel-file poller (preferred for long-running jobs). Spawn a sync terminal Python job that polls the Foundry REST API with the user's Azure credential (via `azure-identity` + `requests`) every 60–120s and writes status to `.foundry/.poll/<evalRunId>.json` when terminal. The next turn reads the sentinel file before doing anything else.
- Batched in-turn polling. If a sentinel poller is unavailable, batch 2–4 poll calls per turn (60–120s apart, via short `sleep` between MCP calls in the same response) before yielding back to the user. Always explain that polling will continue on the next turn and update the metadata's `lastEval.lastPolledAt` so resumption is obvious.
- Never silently stop. Returning "ping me later" without updating metadata or spawning a sentinel is a workflow violation: the user has to remember state for you.
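The sentinel-file pattern can be sketched as a loop that writes the sentinel only once a terminal state is seen. The REST call itself is not shown, because this document does not specify the endpoint: `fetch_status` is a hypothetical callable that the caller would implement with `azure-identity` and `requests`. The sentinel's field names are likewise assumptions.

```python
import json
import time
from pathlib import Path
from typing import Callable, Optional

TERMINAL_STATES = {"completed", "failed", "cancelled"}


def poll_until_terminal(
    eval_run_id: str,
    fetch_status: Callable[[], str],  # hypothetical: returns the run status
    root: Path,
    *,
    interval_s: float = 90.0,  # within the 60-120s window described above
    max_polls: int = 240,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Path]:
    """Poll until the run is terminal, then write .foundry/.poll/<evalRunId>.json.

    Returns the sentinel path, or None if max_polls is exhausted first.
    """
    for attempt in range(1, max_polls + 1):
        status = fetch_status()
        if status in TERMINAL_STATES:
            sentinel = root / ".foundry" / ".poll" / f"{eval_run_id}.json"
            sentinel.parent.mkdir(parents=True, exist_ok=True)
            payload = {"evalRunId": eval_run_id, "status": status, "polls": attempt}
            sentinel.write_text(json.dumps(payload), encoding="utf-8")
            return sentinel
        sleep(interval_s)
    return None
```

Injecting `fetch_status` and `sleep` keeps the loop testable without network access; the next turn only needs to check whether the sentinel file exists.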
Next Steps
When evaluation completes -> immediately proceed to Step 3: Analyze Results and download output_items. Do not produce a summary first.