foundry-agent/observe/references/evaluate-step.md
# Step 2 — Create Batch Evaluation

## Prerequisites

- Agent deployed and running in the selected environment
- Selected `.foundry/agent-metadata*.yaml` file loaded for the active agent root
- Evaluators configured (from [Step 1](deploy-and-setup.md) or `.foundry/evaluators/`)
- Local test dataset available (from the selected agent root's `.foundry/datasets/`)
- Evaluation suite selected from the environment's `evaluationSuites[]`

## Run Evaluation

Use **`evaluation_agent_batch_eval_create`** to run the selected evaluation suite's evaluators against the selected environment's agent.

### Required Parameters

| Parameter | Description |
|-----------|-------------|
| `projectEndpoint` | Azure AI Project endpoint from the selected metadata file |
| `agentName` | Agent name for the selected environment |
| `agentVersion` | Agent version (string, for example `"1"`) |
| `evaluatorNames` | Array of evaluator names from the selected evaluation suite |

### Test Data Options

**Preferred — local dataset:** Read JSONL from `.foundry/datasets/` and pass it via `inputData` (array of objects with `query` and `expected_behavior`, optionally `context` and `ground_truth`). Always use this when the referenced cache file exists.

**Fallback only — server-side synthetic data:** Set `generateSyntheticData=true` and provide `generationModelDeploymentName`. Only use this when the local cache is missing and the user explicitly requests a refresh-free synthetic run.
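For illustration, here is a minimal Python sketch of assembling `inputData` from a local JSONL dataset and building the create-run request. The `call_tool` helper, the dataset filename, and the placeholder values are hypothetical stand-ins; the parameter and field names come from the tables above.

```python
import json
from pathlib import Path

def call_tool(name: str, arguments: dict) -> dict:
    """Hypothetical stand-in for whatever client invokes the evaluation tools."""
    raise NotImplementedError("wire this to your tool-calling client")

# Preferred path: load the local JSONL dataset into the inputData shape
# (one object per line: query, expected_behavior, optional context/ground_truth).
dataset_path = Path(".foundry/datasets/smoke.jsonl")  # hypothetical filename
input_data = [
    json.loads(line)
    for line in dataset_path.read_text(encoding="utf-8").splitlines()
    if line.strip()
]

# Create the batch evaluation run with the required parameters plus inputData.
# deploymentName and evaluationId/evaluationName are added per the sections below.
run = call_tool("evaluation_agent_batch_eval_create", {
    "projectEndpoint": "<project endpoint from the metadata file>",
    "agentName": "<agent name for the selected environment>",
    "agentVersion": "1",
    "evaluatorNames": ["<evaluator names from the selected suite>"],
    "inputData": input_data,
})
```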
## Resolve Judge Deployment

Before setting `deploymentName`, use **`model_deployment_get`** to list the selected project's actual model deployments. Choose a deployment that supports chat completions and use that deployment name for quality evaluators. Do **not** assume `gpt-4o` exists. If the project has no chat-completions-capable deployment, stop and tell the user that quality evaluators cannot run until one is available.

### Additional Parameters

| Parameter | When Needed |
|-----------|-------------|
| `deploymentName` | Required for quality evaluators (the LLM-judge model) |
| `evaluationId` | Pass an existing eval group ID to group runs for comparison |
| `evaluationName` | Name for a new evaluation group; include the environment and evaluation-suite ID |

> **Important:** Use `evaluationId` on `evaluation_agent_batch_eval_create` (not `evalId`) to group runs. Run suites tagged `tier=smoke` first unless the user chooses a broader suite tag or a specific suite.

> ⚠️ **Eval-group immutability:** Reuse an existing `evaluationId` only when the dataset and comparison setup are unchanged for that group: same evaluator list and same thresholds. If evaluator definitions or thresholds change, create a **new** evaluation group instead of adding another run to the old one.

## Parameter Naming Guardrail

These eval tools use similar names for the same evaluation-group identifier. Match the parameter name to the tool exactly:

| Tool | Correct Group Parameter | Notes |
|------|-------------------------|-------|
| `evaluation_agent_batch_eval_create` | `evaluationId` | Reuse the existing group when creating a new run |
| `evaluation_get` | `evalId` | Use with `isRequestForRuns=true` to list runs in one group |
| `evaluation_comparison_create` | `insightRequest.request.evalId` | Comparison requests take `evalId`, not `evaluationId` |

> ⚠️ **Common mistake:** `evaluation_get` does **not** accept `evaluationId`. Always switch from `evaluationId` to `evalId` after the run is created.

## Auto-Poll for Completion

Immediately after creating the run, poll **`evaluation_get`** in a background terminal until completion. Use `evalId` with `isRequestForRuns=true`; the run ID parameter is `evalRunId` (not `runId`). A minimal polling sketch appears at the end of this page.

Only surface the final result when the status reaches `completed`, `failed`, or `cancelled`.

## Next Steps

When evaluation completes, proceed to [Step 3: Analyze Results](analyze-results.md).

## Reference

- [Azure AI Foundry Cloud Evaluation](https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/develop/cloud-evaluation)
- [Built-in Evaluators](https://learn.microsoft.com/en-us/azure/foundry/concepts/built-in-evaluators)
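## Appendix: Polling Sketch (Illustrative)

As referenced in the Auto-Poll section above, this is a minimal Python sketch of the completion poll, not a definitive implementation. The `call_tool` helper is a hypothetical stand-in; the parameter names (`evalId`, `isRequestForRuns`, `evalRunId`) and terminal statuses are the ones documented on this page.

```python
import time

def call_tool(name: str, arguments: dict) -> dict:
    """Hypothetical stand-in for whatever client invokes the evaluation tools."""
    raise NotImplementedError("wire this to your tool-calling client")

def wait_for_run(eval_id: str, eval_run_id: str, interval_s: float = 30.0) -> dict:
    """Poll evaluation_get until the run reaches a terminal status."""
    terminal = {"completed", "failed", "cancelled"}
    while True:
        result = call_tool("evaluation_get", {
            "evalId": eval_id,          # evalId here, not evaluationId
            "isRequestForRuns": True,
            "evalRunId": eval_run_id,   # run ID parameter is evalRunId, not runId
        })
        if result.get("status") in terminal:
            return result
        time.sleep(interval_s)
```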