foundry-agent/observe/references/compare-iterate.md
# Steps 8–10 — Re-Evaluate, Compare Versions, Iterate

## Step 8 — Re-Evaluate

Use **`evaluation_agent_batch_eval_create`** with the **same `evaluationId`** as the baseline run. This places both runs in the same eval group for comparison. Use the same local test dataset (from the selected agent root's `.foundry/datasets/`) and evaluator bundle from the selected environment/evaluation suite. Update `agentVersion` to the new version.

> ⚠️ **Parameter switch reminder:** Re-evaluation creation uses `evaluationId`, but follow-up calls to `evaluation_get` and `evaluation_comparison_create` must use `evalId`.

> ⚠️ **Eval-group immutability:** Reuse the same `evaluationId` only when `evaluatorNames` and thresholds are unchanged. If you add/remove evaluators or change thresholds, create a new evaluation group first, then compare runs within that new group.
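A minimal sketch of the re-evaluation request, assuming the tool takes flat arguments. Only `evaluationId` and `agentVersion` are named above; `dataset` and `evaluators` are hypothetical field names standing in for however your environment wires in the test dataset and evaluator bundle:

```json
{
  "evaluationId": "<same-eval-group-id-as-baseline>",
  "agentVersion": "<new-agent-version>",
  "dataset": ".foundry/datasets/<same-test-dataset>.jsonl",
  "evaluators": "<same-evaluator-bundle>"
}
```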
Auto-poll for completion in a background terminal (same as [Step 2](evaluate-step.md)).
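A hedged sketch of the polling lookup — note the switch from `evaluationId` to `evalId`, per the reminder above. Only `evalId` is confirmed by this guide; `runId` is an assumed field for scoping the lookup to the new run:

```json
{
  "evalId": "<same-eval-group-id>",
  "runId": "<new-run-id>"
}
```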
## Step 9 — Compare Versions

> **Critical:** `displayName` is **required** in the `insightRequest`. Despite the MCP tool schema showing `displayName` as optional (`type: ["string", "null"]`), the API will reject requests without it with a BadRequest error. `state` must be `"NotStarted"`.

### Required Parameters for `evaluation_comparison_create`

| Parameter | Required | Description |
|-----------|----------|-------------|
| `insightRequest.displayName` | ✅ | Human-readable name. **Omitting causes BadRequest.** |
| `insightRequest.state` | ✅ | Must be `"NotStarted"` |
| `insightRequest.request.evalId` | ✅ | Eval group ID containing both runs |
| `insightRequest.request.baselineRunId` | ✅ | Run ID of the baseline |
| `insightRequest.request.treatmentRunIds` | ✅ | Array of treatment run IDs |

Use **`evaluation_comparison_create`** with a nested `insightRequest`:

```json
{
  "insightRequest": {
    "displayName": "V1 vs V2 Comparison",
    "state": "NotStarted",
    "request": {
      "type": "EvaluationComparison",
      "evalId": "<eval-group-id>",
      "baselineRunId": "<baseline-run-id>",
      "treatmentRunIds": ["<new-run-id>"]
    }
  }
}
```

> **Important:** Both runs must be in the **same eval group** (same `evaluationId` in Steps 2 and 8), but comparison requests and lookups use `evalId` for that same group identifier. That shared group assumes the evaluator bundle is fixed for all runs in the group.

Then use **`evaluation_comparison_get`** (with the returned `insightId`) to retrieve comparison results. Present a summary showing which version performed better per evaluator, and recommend which version to keep.
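A sketch of that retrieval call. Only `insightId` is confirmed above; if the tool also requires a group identifier, remember it would be `evalId`, not `evaluationId` — treat any such extra field as an assumption:

```json
{
  "insightId": "<insight-id-from-comparison-create-response>"
}
```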
## Step 10 — Iterate or Finish

If more categories remain in the prioritized action table (from [Step 4](analyze-results.md)), loop back to **Step 5** (dive into next category) → **Step 6** (optimize) → **Step 7** (deploy) → **Step 8** (re-evaluate) → **Step 9** (compare).

Otherwise, confirm the final agent version with the user, then prompt for [CI/CD evals & monitoring](cicd-monitoring.md).