Deploy, evaluate, and manage AI agents end-to-end on Microsoft Azure AI Foundry
foundry-agent/observe/references/evaluate-step.md
# Step 2 — Create Batch Evaluation

## Prerequisites

- Agent deployed and running in the selected environment
- Selected `.foundry/agent-metadata*.yaml` file loaded for the active agent root
- Evaluators configured (from [Step 1](deploy-and-setup.md) or `.foundry/evaluators/`)
- Local test dataset available (from the selected agent root's `.foundry/datasets/`)
- Evaluation suite selected from the environment's `evaluationSuites[]`

## Run Evaluation

Use **`evaluation_agent_batch_eval_create`** to run the selected evaluation suite's evaluators against the selected environment's agent.

### Required Parameters

| Parameter | Description |
|-----------|-------------|
| `projectEndpoint` | Azure AI Project endpoint from the selected metadata file |
| `agentName` | Agent name for the selected environment |
| `agentVersion` | Agent version (string, for example `"1"`) |
| `evaluatorNames` | Array of evaluator names from the selected evaluation suite |

### Test Data Options

**Preferred — local dataset:** Read JSONL from `.foundry/datasets/` and pass via `inputData` (array of objects with `query` and `expected_behavior`, optionally `context`, `ground_truth`). Always use this when the referenced cache file exists.

**Fallback only — server-side synthetic data:** Set `generateSyntheticData=true` and provide `generationModelDeploymentName`. Only use this when the local cache is missing and the user explicitly requests a refresh-free synthetic run.

## Resolve Judge Deployment

Before setting `deploymentName`, use **`model_deployment_get`** to list the selected project's actual model deployments. Choose a deployment that supports chat completions and use that deployment name for quality evaluators. Do **not** assume `gpt-4o` exists. If the project has no chat-completions-capable deployment, stop and tell the user quality evaluators cannot run until one is available.

### Additional Parameters

| Parameter | When Needed |
|-----------|-------------|
| `deploymentName` | Required for quality evaluators (the LLM-judge model) |
| `evaluationId` | Pass existing eval group ID to group runs for comparison |
| `evaluationName` | Name for a new evaluation group; include environment and evaluation-suite ID |

> **Important:** Use `evaluationId` on `evaluation_agent_batch_eval_create` (not `evalId`) to group runs. Run suites tagged `tier=smoke` first unless the user chooses a broader suite tag or a specific suite.

> ⚠️ **Eval-group immutability:** Reuse an existing `evaluationId` only when the dataset comparison setup is unchanged for that group: same evaluator list and same thresholds. If evaluator definitions or thresholds change, create a **new** evaluation group instead of adding another run to the old one.

## Parameter Naming Guardrail

These eval tools use similar names for the same evaluation-group identifier. Match the parameter name to the tool exactly:

| Tool | Correct Group Parameter | Notes |
|------|-------------------------|-------|
| `evaluation_agent_batch_eval_create` | `evaluationId` | Reuse the existing group when creating a new run |
| `evaluation_get` | `evalId` | Use with `isRequestForRuns=true` to list runs in one group |
| `evaluation_comparison_create` | `insightRequest.request.evalId` | Comparison requests take `evalId`, not `evaluationId` |

> ⚠️ **Common mistake:** `evaluation_get` does **not** accept `evaluationId`. Always switch from `evaluationId` to `evalId` after the run is created.

## Auto-Poll for Completion

Immediately after creating the run, poll **`evaluation_get`** in a background terminal until completion. Use `evalId` + `isRequestForRuns=true`. The run ID parameter is `evalRunId` (not `runId`).

Only surface the final result when status reaches `completed`, `failed`, or `cancelled`.

## Next Steps

When evaluation completes, proceed to [Step 3: Analyze Results](analyze-results.md).

## Reference

- [Azure AI Foundry Cloud Evaluation](https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/develop/cloud-evaluation)
- [Built-in Evaluators](https://learn.microsoft.com/en-us/azure/foundry/concepts/built-in-evaluators)
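
## Example: Create and Poll a Run (Illustrative Sketch)

The parameter-name switch between creating a run (`evaluationId`) and polling it (`evalId`, `evalRunId`) is the easiest part of this flow to get wrong, so the sketch below strings the steps together in Python. It is a minimal illustration, not a published SDK: `call_tool` is a hypothetical stand-in for whatever client invokes these tools, and the endpoint, agent name, evaluator names, dataset file name, and response field names are placeholders. Only the tool names and parameter names (`projectEndpoint`, `agentName`, `agentVersion`, `evaluatorNames`, `inputData`, `deploymentName`, `evaluationName`, `evalId`, `evalRunId`, `isRequestForRuns`) come from this document.

```python
import json
import time
from pathlib import Path
from typing import Any


def call_tool(name: str, **params: Any) -> dict:
    """Hypothetical stand-in for however your client invokes the Foundry eval tools."""
    raise NotImplementedError("wire this up to your tool-calling client")


# Endpoint comes from the selected .foundry/agent-metadata*.yaml file.
PROJECT_ENDPOINT = "https://example-project-endpoint"  # placeholder

# 1. Preferred path: load the local JSONL dataset instead of generating synthetic data.
dataset_path = Path(".foundry/datasets/smoke.jsonl")  # illustrative file name
input_data = [
    json.loads(line)
    for line in dataset_path.read_text().splitlines()
    if line.strip()
]

# 2. Resolve a judge deployment instead of assuming `gpt-4o` exists.
#    The response shape here is illustrative; pick any chat-completions-capable deployment.
deployments = call_tool("model_deployment_get", projectEndpoint=PROJECT_ENDPOINT)
judge_deployment = deployments["deployments"][0]["name"]

# 3. Create the batch evaluation run. This tool takes `evaluationId`, not `evalId`.
run = call_tool(
    "evaluation_agent_batch_eval_create",
    projectEndpoint=PROJECT_ENDPOINT,
    agentName="support-agent",                        # illustrative agent name
    agentVersion="1",
    evaluatorNames=["relevance", "task_adherence"],   # from the selected evaluation suite
    inputData=input_data,
    deploymentName=judge_deployment,                  # LLM judge for quality evaluators
    evaluationName="dev smoke-suite",                 # new group: environment + suite ID
)

# 4. Poll for completion. Note the name switch: `evalId` and `evalRunId` on `evaluation_get`.
#    The response field names below are assumptions, not confirmed shapes.
while True:
    status = call_tool(
        "evaluation_get",
        evalId=run["evaluationId"],
        evalRunId=run["evalRunId"],
        isRequestForRuns=True,
    )
    if status.get("status") in ("completed", "failed", "cancelled"):
        break
    time.sleep(30)
```

If a run should join an existing group for comparison, pass `evaluationId` instead of `evaluationName` in step 3, and only when that group's evaluator list and thresholds are unchanged.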