Deploy, evaluate, and manage AI agents end-to-end on Microsoft Azure AI Foundry
foundry-agent/observe/references/evaluate-step.md
# Step 2 — Create Batch Evaluation

## Prerequisites

- Agent deployed and running in the selected environment
- Selected `.foundry/agent-metadata*.yaml` file loaded for the active agent root
- Evaluators configured (from [Step 1](deploy-and-setup.md) or `.foundry/evaluators/`)
- Local test dataset available (from the selected agent root's `.foundry/datasets/`)
- Evaluation suite selected from the environment's `evaluationSuites[]`

## Run Evaluation

Use **`evaluation_agent_batch_eval_create`** to run the selected evaluation suite's evaluators against the selected environment's agent.

### Required Parameters

| Parameter | Description |
|-----------|-------------|
| `projectEndpoint` | Azure AI Project endpoint from the selected metadata file |
| `agentName` | Agent name for the selected environment |
| `agentVersion` | Agent version (string, for example `"1"`) |
| `evaluatorNames` | Array of evaluator names from the selected evaluation suite |

### Test Data Options

**Preferred — local dataset:** Read JSONL from `.foundry/datasets/` and pass via `inputData` (array of objects with `query` and `expected_behavior`, optionally `context`, `ground_truth`). Always use this when the referenced cache file exists.

**Fallback only — server-side synthetic data:** Set `generateSyntheticData=true` and provide `generationModelDeploymentName`. Only use this when the local cache is missing and the user explicitly requests a refresh-free synthetic run.

## Resolve Judge Deployment

Before setting `deploymentName`, use **`model_deployment_get`** to list the selected project's actual model deployments. Choose a deployment that supports chat completions and use that deployment name for quality evaluators. Do **not** assume `gpt-4o` exists. If the project has no chat-completions-capable deployment, stop and tell the user quality evaluators cannot run until one is available.

### Additional Parameters

| Parameter | When Needed |
|-----------|-------------|
| `deploymentName` | Required for quality evaluators (the LLM-judge model) |
| `evaluationId` | Pass existing eval group ID to group runs for comparison |
| `evaluationName` | Name for a new evaluation group; include environment and evaluation-suite ID |

> **Important:** Use `evaluationId` on `evaluation_agent_batch_eval_create` (not `evalId`) to group runs. Run suites tagged `tier=smoke` first unless the user chooses a broader suite tag or a specific suite.

> ⚠️ **Eval-group immutability:** Reuse an existing `evaluationId` only when the dataset comparison setup is unchanged for that group: same evaluator list and same thresholds. If evaluator definitions or thresholds change, create a **new** evaluation group instead of adding another run to the old one.

## Parameter Naming Guardrail

These eval tools use similar names for the same evaluation-group identifier. Match the parameter name to the tool exactly:

| Tool | Correct Group Parameter | Notes |
|------|-------------------------|-------|
| `evaluation_agent_batch_eval_create` | `evaluationId` | Reuse the existing group when creating a new run |
| `evaluation_get` | `evalId` | Use with `isRequestForRuns=true` to list runs in one group |
| `evaluation_comparison_create` | `insightRequest.request.evalId` | Comparison requests take `evalId`, not `evaluationId` |

> ⚠️ **Common mistake:** `evaluation_get` does **not** accept `evaluationId`. Always switch from `evaluationId` to `evalId` after the run is created.

## Auto-Poll for Completion

Immediately after creating the run, poll **`evaluation_get`** in a background terminal until completion. Use `evalId` + `isRequestForRuns=true`. The run ID parameter is `evalRunId` (not `runId`).

Only surface the final result when status reaches `completed`, `failed`, or `cancelled`.

## Next Steps

When evaluation completes, proceed to [Step 3: Analyze Results](analyze-results.md).

## Reference

- [Azure AI Foundry Cloud Evaluation](https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/develop/cloud-evaluation)
- [Built-in Evaluators](https://learn.microsoft.com/en-us/azure/foundry/concepts/built-in-evaluators)
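
## Example: Create and Poll a Run (Illustrative Sketch)

The parameter-name switch between creating a run (`evaluationId`) and polling it (`evalId`, `evalRunId`) is the easiest part of this flow to get wrong, so the sketch below strings the steps together in Python. It is a minimal illustration, not a published SDK: `call_tool` is a hypothetical stand-in for whatever client invokes these tools, and the endpoint, agent name, evaluator names, dataset file name, and response field names are placeholders. Only the tool names and parameter names (`projectEndpoint`, `agentName`, `agentVersion`, `evaluatorNames`, `inputData`, `deploymentName`, `evaluationName`, `evalId`, `evalRunId`, `isRequestForRuns`) come from this document.

```python
import json
import time
from pathlib import Path
from typing import Any


def call_tool(name: str, **params: Any) -> dict:
    """Hypothetical stand-in for however your client invokes the Foundry eval tools."""
    raise NotImplementedError("wire this up to your tool-calling client")


# Endpoint comes from the selected .foundry/agent-metadata*.yaml file.
PROJECT_ENDPOINT = "https://example-project-endpoint"  # placeholder

# 1. Preferred path: load the local JSONL dataset instead of generating synthetic data.
dataset_path = Path(".foundry/datasets/smoke.jsonl")  # illustrative file name
input_data = [
    json.loads(line)
    for line in dataset_path.read_text().splitlines()
    if line.strip()
]

# 2. Resolve a judge deployment instead of assuming `gpt-4o` exists.
#    The response shape here is illustrative; pick any chat-completions-capable deployment.
deployments = call_tool("model_deployment_get", projectEndpoint=PROJECT_ENDPOINT)
judge_deployment = deployments["deployments"][0]["name"]

# 3. Create the batch evaluation run. This tool takes `evaluationId`, not `evalId`.
run = call_tool(
    "evaluation_agent_batch_eval_create",
    projectEndpoint=PROJECT_ENDPOINT,
    agentName="support-agent",                        # illustrative agent name
    agentVersion="1",
    evaluatorNames=["relevance", "task_adherence"],   # from the selected evaluation suite
    inputData=input_data,
    deploymentName=judge_deployment,                  # LLM judge for quality evaluators
    evaluationName="dev smoke-suite",                 # new group: environment + suite ID
)

# 4. Poll for completion. Note the name switch: `evalId` and `evalRunId` on `evaluation_get`.
#    The response field names below are assumptions, not confirmed shapes.
while True:
    status = call_tool(
        "evaluation_get",
        evalId=run["evaluationId"],
        evalRunId=run["evalRunId"],
        isRequestForRuns=True,
    )
    if status.get("status") in ("completed", "failed", "cancelled"):
        break
    time.sleep(30)
```

If a run should join an existing group for comparison, pass `evaluationId` instead of `evaluationName` in step 3, and only when that group's evaluator list and thresholds are unchanged.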