foundry-agent/observe/references/compare-iterate.md
# Steps 8–10 — Re-Evaluate, Compare Versions, Iterate

## Step 8 — Re-Evaluate

Use **`evaluation_agent_batch_eval_create`** with the **same `evaluationId`** as the baseline run. This places both runs in the same eval group for comparison. Use the same local test dataset (from the selected agent root's `.foundry/datasets/`) and evaluator bundle from the selected environment/evaluation suite. Update `agentVersion` to the new version. (See the example request sketch at the end of this page.)

> ⚠️ **Parameter switch reminder:** Re-evaluation creation uses `evaluationId`, but follow-up calls to `evaluation_get` and `evaluation_comparison_create` must use `evalId`.

> ⚠️ **Eval-group immutability:** Reuse the same `evaluationId` only when `evaluatorNames` and thresholds are unchanged. If you add/remove evaluators or change thresholds, create a new evaluation group first, then compare runs within that new group.

Auto-poll for completion in a background terminal (same as [Step 2](evaluate-step.md)).

## Step 9 — Compare Versions

> **Critical:** `displayName` is **required** in the `insightRequest`. Despite the MCP tool schema showing `displayName` as optional (`type: ["string", "null"]`), the API will reject requests without it with a BadRequest error. `state` must be `"NotStarted"`.

### Required Parameters for `evaluation_comparison_create`

| Parameter | Required | Description |
|-----------|----------|-------------|
| `insightRequest.displayName` | ✅ | Human-readable name. **Omitting causes BadRequest.** |
| `insightRequest.state` | ✅ | Must be `"NotStarted"` |
| `insightRequest.request.evalId` | ✅ | Eval group ID containing both runs |
| `insightRequest.request.baselineRunId` | ✅ | Run ID of the baseline |
| `insightRequest.request.treatmentRunIds` | ✅ | Array of treatment run IDs |

Use **`evaluation_comparison_create`** with a nested `insightRequest`:

```json
{
  "insightRequest": {
    "displayName": "V1 vs V2 Comparison",
    "state": "NotStarted",
    "request": {
      "type": "EvaluationComparison",
      "evalId": "<eval-group-id>",
      "baselineRunId": "<baseline-run-id>",
      "treatmentRunIds": ["<new-run-id>"]
    }
  }
}
```

> **Important:** Both runs must be in the **same eval group** (same `evaluationId` in Steps 2 and 8), but comparison requests and lookups use `evalId` for that same group identifier. That shared group assumes the evaluator bundle is fixed for all runs in the group.

Then use **`evaluation_comparison_get`** (with the returned `insightId`) to retrieve comparison results. Present a summary showing which version performed better per evaluator, and recommend which version to keep.

## Step 10 — Iterate or Finish

If more categories remain in the prioritized action table (from [Step 4](analyze-results.md)), loop back to **Step 5** (dive into next category) → **Step 6** (optimize) → **Step 7** (deploy) → **Step 8** (re-evaluate) → **Step 9** (compare).

Otherwise, confirm the final agent version with the user, then prompt for [CI/CD evals & monitoring](cicd-monitoring.md).
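
## Appendix: Example Re-Evaluation Request (Step 8)

For reference, a minimal sketch of the Step 8 re-evaluation request body. Only `evaluationId` and `agentVersion` are parameter names confirmed above; `dataset` and `evaluators` are hypothetical placeholders for "same local test dataset" and "same evaluator bundle" and may not match the actual `evaluation_agent_batch_eval_create` schema, so check the tool's schema for the exact field names.

```json
{
  "evaluationId": "<same-eval-group-id-as-baseline>",
  "agentVersion": "<new-agent-version>",
  "dataset": "<path-under-.foundry/datasets/>",
  "evaluators": ["<same-evaluator-bundle-as-baseline>"]
}
```

Whatever the exact field names, the intent is the same: reusing the baseline's `evaluationId` places the new run in the same eval group, `agentVersion` points at the version deployed in Step 7, and the dataset and evaluator bundle stay identical to the baseline so the Step 9 comparison is apples-to-apples.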