Microsoft Foundry Skill

Build and deploy AI applications on Azure AI Foundry using Microsoft's model catalog and AI services

foundry-agent/observe/references/evaluate-step.md

# Step 2: Create Batch Evaluation

## Prerequisites

- Agent deployed and running in the selected environment
- Selected `.foundry/agent-metadata*.yaml` file loaded for the active agent root
- Evaluators configured (from [Step 1](deploy-and-setup.md) or `.foundry/evaluators/`)
- Local test dataset available (from the selected agent root's `.foundry/datasets/`)
- Evaluation suite selected from the environment's `evaluationSuites[]`

## Run Evaluation

Use **`evaluation_agent_batch_eval_create`** to run the selected evaluation suite's evaluators against the selected environment's agent.

### Required Parameters

| Parameter | Description |
|-----------|-------------|
| `projectEndpoint` | Azure AI Project endpoint from the selected metadata file |
| `agentName` | Agent name for the selected environment |
| `agentVersion` | Agent version (string, for example `"1"`) |
| `evaluatorNames` | Array of evaluator names from the selected evaluation suite |

### Test Data Options

**Preferred (local dataset):** Read JSONL from `.foundry/datasets/` and pass it via `inputData` (an array of objects with `query` and `expected_behavior`, and optionally `context` and `ground_truth`). Always use this option when the referenced cache file exists.

**Fallback only (server-side synthetic data):** Set `generateSyntheticData=true` and provide `generationModelDeploymentName`. Use this only when the local cache is missing and the user explicitly requests a refresh-free synthetic run.
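
The parameter and test-data rules above can be sketched as a small argument builder. This is a minimal sketch, not the tool's own client: the function names `load_jsonl` and `build_batch_eval_args` are hypothetical, and only the dictionary keys come from this section. It assumes the caller passes `generation_model_deployment_name` only after the user has explicitly asked for a synthetic run.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read one JSON object per non-empty line and check the fields named above."""
    rows = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue
            row = json.loads(line)
            missing = {"query", "expected_behavior"} - set(row)
            if missing:
                raise ValueError(f"{path}:{line_number} is missing {sorted(missing)}")
            rows.append(row)
    return rows


def build_batch_eval_args(project_endpoint, agent_name, agent_version,
                          evaluator_names, dataset_path=None,
                          generation_model_deployment_name=None):
    """Build the argument dictionary (hypothetical helper, not part of the tool)."""
    args = {
        "projectEndpoint": project_endpoint,
        "agentName": agent_name,
        "agentVersion": str(agent_version),  # the tool expects a string such as "1"
        "evaluatorNames": list(evaluator_names),
    }
    if dataset_path is not None and Path(dataset_path).exists():
        # Preferred path: the local dataset wins whenever the file exists.
        args["inputData"] = load_jsonl(dataset_path)
    elif generation_model_deployment_name:
        # Fallback only: no local file, and the caller opted in to synthetic data.
        args["generateSyntheticData"] = True
        args["generationModelDeploymentName"] = generation_model_deployment_name
    else:
        raise FileNotFoundError("no local dataset and no synthetic-data deployment given")
    return args
```

Checking for the local file first keeps the fallback from being taken by accident when both a dataset path and a generation deployment are supplied.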

## Resolve Judge Deployment

Before setting `deploymentName`, use **`model_deployment_get`** to list the selected project's actual model deployments. Choose a deployment that supports chat completions and use that deployment name for quality evaluators. Do **not** assume `gpt-4o` exists. If the project has no chat-completions-capable deployment, stop and tell the user quality evaluators cannot run until one is available.
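
The selection rule can be sketched as follows. This file does not document the response shape of `model_deployment_get`, so the sketch takes the capability check as a caller-supplied predicate and assumes only that each record has a `name` key. The helper `pick_judge_deployment` and the sample records are hypothetical.

```python
def pick_judge_deployment(deployments, supports_chat):
    """Return the name of the first deployment that supports chat completions.

    `deployments` stands in for the list returned by `model_deployment_get`;
    `supports_chat` is a predicate the caller writes against the real response.
    """
    for deployment in deployments:
        if supports_chat(deployment):
            return deployment["name"]  # assumption: records expose a `name` key
    # Stop here rather than guessing a default such as gpt-4o.
    raise LookupError(
        "No chat-completions-capable deployment found; "
        "quality evaluators cannot run until one is available."
    )


# Hypothetical records, for illustration only.
sample = [
    {"name": "embed-small", "kind": "embeddings"},
    {"name": "judge-chat", "kind": "chat"},
]
deployment_name = pick_judge_deployment(sample, lambda d: d["kind"] == "chat")
```

Raising instead of returning a default mirrors the instruction to stop and tell the user when no suitable deployment exists.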

### Additional Parameters

| Parameter | When Needed |
|-----------|-------------|
| `deploymentName` | Required for quality evaluators (the LLM-judge model) |
| `evaluationId` | Pass an existing evaluation group ID to group runs for comparison |
| `evaluationName` | Name for a new evaluation group; include the environment and evaluation-suite ID |

> **Important:** Use `evaluationId` on `evaluation_agent_batch_eval_create` (not `evalId`) to group runs. Run suites tagged `tier=smoke` first unless the user chooses a broader suite tag or a specific suite.

> ⚠️ **Eval-group immutability:** Reuse an existing `evaluationId` only when the dataset comparison setup is unchanged for that group: same evaluator list and same thresholds. If evaluator definitions or thresholds change, create a **new** evaluation group instead of adding another run to the old one.
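
One way to apply the immutability rule is to fingerprint the comparison setup and reuse a group only on an exact match. This is a sketch under stated assumptions: `group_fingerprint`, `resolve_evaluation_id`, and the local `known_groups` mapping are hypothetical bookkeeping, not part of the eval tools, and the fingerprint covers evaluator names and thresholds only, so a changed evaluator definition under an unchanged name would need its own check.

```python
import hashlib
import json


def group_fingerprint(evaluator_names, thresholds):
    """Hash the evaluator list and thresholds, ignoring evaluator order."""
    payload = json.dumps(
        {"evaluators": sorted(evaluator_names), "thresholds": thresholds},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def resolve_evaluation_id(known_groups, evaluator_names, thresholds):
    """Return a reusable evaluationId, or None when a new group is required.

    `known_groups` maps fingerprints to evaluationId values recorded locally.
    """
    return known_groups.get(group_fingerprint(evaluator_names, thresholds))
```

When the result is `None`, omit `evaluationId` and pass `evaluationName` so the run starts a new group.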

## Parameter Naming Guardrail

These eval tools use similar but not identical parameter names for the same evaluation-group identifier. Match the parameter name to the tool exactly:

| Tool | Correct Group Parameter | Notes |
|------|-------------------------|-------|
| `evaluation_agent_batch_eval_create` | `evaluationId` | Reuse the existing group when creating a new run |
| `evaluation_get` | `evalId` | Use with `isRequestForRuns=true` to list runs in one group |
| `evaluation_comparison_create` | `insightRequest.request.evalId` | Comparison requests take `evalId`, not `evaluationId` |

> ⚠️ **Common mistake:** `evaluation_get` does **not** accept `evaluationId`. Always switch from `evaluationId` to `evalId` after the run is created.
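
The table above can be encoded once so the right key is chosen mechanically. This is a minimal sketch: the tool names and parameter paths are copied from the table, while `GROUP_PARAM_PATH` and `with_group_id` are hypothetical names introduced for illustration.

```python
import copy

# Path to the group-identifier parameter for each tool, taken from the table above.
GROUP_PARAM_PATH = {
    "evaluation_agent_batch_eval_create": ("evaluationId",),
    "evaluation_get": ("evalId",),
    "evaluation_comparison_create": ("insightRequest", "request", "evalId"),
}


def with_group_id(tool_name, group_id, args=None):
    """Return a copy of `args` with the group ID stored under the tool's own key."""
    path = GROUP_PARAM_PATH[tool_name]  # KeyError for tools not in the table
    result = copy.deepcopy(args or {})
    node = result
    for key in path[:-1]:
        node = node.setdefault(key, {})  # create nested objects as needed
    node[path[-1]] = group_id
    return result
```

For example, `with_group_id("evaluation_get", "group-1", {"isRequestForRuns": True})` yields arguments keyed by `evalId`, while the same call for `evaluation_agent_batch_eval_create` uses `evaluationId`.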

## Auto-Poll for Completion

Immediately after creating the run, poll **`evaluation_get`** in a background terminal until completion. Use `evalId` together with `isRequestForRuns=true`. The run ID parameter is `evalRunId` (not `runId`).

Only surface the final result when status reaches `completed`, `failed`, or `cancelled`.
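
The polling behavior can be sketched as a loop that returns only on a terminal status. The terminal statuses come from this section; the 15-second interval, the 30-minute timeout, and the `get_run_status` callable are assumptions. The callable is expected to invoke `evaluation_get` with `evalId`, `isRequestForRuns=true`, and `evalRunId`, then return the run's status string.

```python
import time

TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def poll_until_done(get_run_status, interval_seconds=15.0, timeout_seconds=1800.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call `get_run_status` until it reports a terminal status, then return it.

    `sleep` and `clock` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_seconds
    while True:
        status = get_run_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"run still {status!r} after {timeout_seconds} seconds")
        sleep(interval_seconds)
```

Intermediate statuses are never returned, which matches the rule to surface only the final result.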

## Next Steps

When the evaluation completes, proceed to [Step 3: Analyze Results](analyze-results.md).

## Reference

- [Azure AI Foundry Cloud Evaluation](https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/develop/cloud-evaluation)
- [Built-in Evaluators](https://learn.microsoft.com/en-us/azure/foundry/concepts/built-in-evaluators)