Step 2 — Create Batch Evaluation
Prerequisites
- Agent deployed and running in the selected environment
- Selected .foundry/agent-metadata*.yaml file loaded for the active agent root
- Evaluators configured (from Step 1 or .foundry/evaluators/)
- Local test dataset available (from the selected agent root's .foundry/datasets/)
- Evaluation suite selected from the environment's evaluationSuites[]
Run Evaluation
Use evaluation_agent_batch_eval_create to run the selected evaluation suite's evaluators against the selected environment's agent.
Required Parameters
| Parameter | Description |
|---|---|
| projectEndpoint | Azure AI Project endpoint from the selected metadata file |
| agentName | Agent name for the selected environment |
| agentVersion | Agent version (string, for example "1") |
| evaluatorNames | Array of evaluator names from the selected evaluation suite |
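A minimal sketch of the create-call payload with only the required parameters; every value below is a placeholder and should come from the selected metadata file and evaluation suite:

```json
{
  "projectEndpoint": "https://example-resource.services.ai.azure.com/api/projects/example-project",
  "agentName": "example-agent",
  "agentVersion": "1",
  "evaluatorNames": ["example-evaluator-1", "example-evaluator-2"]
}
```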
Test Data Options
Preferred — local dataset: Read JSONL from .foundry/datasets/ and pass via inputData (array of objects with query and expected_behavior, optionally context, ground_truth). Always use this when the referenced cache file exists.
Fallback only — server-side synthetic data: Set generateSyntheticData=true and provide generationModelDeploymentName. Only use this when the local cache is missing and the user explicitly requests a refresh-free synthetic run.
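For the preferred path, a sketch of how one dataset row maps to inputData; the values are hypothetical and the field names follow the description above:

```json
{"query": "Where is order 1234?", "expected_behavior": "Looks up the order and reports its shipping status", "context": "Customer has one open order", "ground_truth": "Order 1234 shipped on 2024-05-01"}
```

Each parsed line of the JSONL file becomes one object in the inputData array passed to evaluation_agent_batch_eval_create. For the fallback path, omit inputData and pass generateSyntheticData=true together with generationModelDeploymentName instead.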
Resolve Judge Deployment
Before setting deploymentName, use model_deployment_get to list the selected project's actual model deployments. Choose a deployment that supports chat completions and use that deployment name for quality evaluators. Do not assume gpt-4o exists. If the project has no chat-completions-capable deployment, stop and tell the user quality evaluators cannot run until one is available.
Additional Parameters
| Parameter | When Needed |
|---|---|
| deploymentName | Required for quality evaluators (the LLM-judge model) |
| evaluationId | Pass an existing eval group ID to group runs for comparison |
| evaluationName | Name for a new evaluation group; include environment and evaluation-suite ID |
Important: Use evaluationId on evaluation_agent_batch_eval_create (not evalId) to group runs. Run suites tagged tier=smoke first unless the user chooses a broader suite tag or a specific suite.
⚠️ Eval-group immutability: Reuse an existing evaluationId only when the dataset-comparison setup for that group is unchanged: same evaluator list and same thresholds. If evaluator definitions or thresholds change, create a new evaluation group instead of adding another run to the old one.
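A sketch of how grouping plays out across two runs; all values are placeholders, deploymentName must be a chat-completions-capable deployment confirmed via model_deployment_get, and the exact response field that returns the new group's ID should be checked on the first run. First run, which creates a new group:

```json
{
  "projectEndpoint": "https://example-resource.services.ai.azure.com/api/projects/example-project",
  "agentName": "example-agent",
  "agentVersion": "1",
  "evaluatorNames": ["example-quality-evaluator"],
  "deploymentName": "example-chat-deployment",
  "evaluationName": "dev-smoke-baseline"
}
```

Second run, added to the same group for comparison (unchanged evaluators and thresholds, per the immutability note above):

```json
{
  "projectEndpoint": "https://example-resource.services.ai.azure.com/api/projects/example-project",
  "agentName": "example-agent",
  "agentVersion": "2",
  "evaluatorNames": ["example-quality-evaluator"],
  "deploymentName": "example-chat-deployment",
  "evaluationId": "<group ID returned by the first run>"
}
```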
Parameter Naming Guardrail
These eval tools use similar names for the same evaluation-group identifier. Match the parameter name to the tool exactly:
| Tool | Correct Group Parameter | Notes |
|---|---|---|
| evaluation_agent_batch_eval_create | evaluationId | Reuse the existing group when creating a new run |
| evaluation_get | evalId | Use with isRequestForRuns=true to list runs in one group |
| evaluation_comparison_create | insightRequest.request.evalId | Comparison requests take evalId, not evaluationId |
⚠️ Common mistake: evaluation_get does not accept evaluationId. Always switch from evaluationId to evalId after the run is created.
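A sketch of the comparison-request nesting, since that is the shape most easily gotten wrong; the ID is a placeholder, and any other fields the comparison request may require are omitted here:

```json
{
  "insightRequest": {
    "request": {
      "evalId": "<group ID>"
    }
  }
}
```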
Auto-Poll for Completion
Immediately after creating the run, poll evaluation_get in a background terminal until completion. Use evalId + isRequestForRuns=true. The run ID parameter is evalRunId (not runId).
Only surface the final result when status reaches completed, failed, or cancelled.
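A sketch of the polling payload for evaluation_get, repeated until a terminal status appears; the ID is a placeholder:

```json
{
  "evalId": "<group ID returned by the create call>",
  "isRequestForRuns": true
}
```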
Next Steps
When evaluation completes -> proceed to Step 3: Analyze Results.