Source from repo
Build and deploy AI applications on Azure AI Foundry using Microsoft's model catalog and AI services
foundry-agent/observe/references/compare-iterate.md
# Steps 8–10: Re-Evaluate, Compare Versions, Iterate

## Step 8: Re-Evaluate

Use **`evaluation_agent_batch_eval_create`** for re-evaluation, even when the selected evaluation suite has `suiteName`. The generated suite preserves the reviewed dataset/evaluator bundle for selection and lineage, but the run should target the agent directly. Reuse the **same `evaluationId`** as the baseline run when the evaluator set and thresholds are unchanged. Use the same local or registered test dataset (from the selected agent root's `.foundry/datasets/` and suite metadata) and evaluator bundle from the selected environment/evaluation suite. Update `agentVersion` to the new version.

> ⚠️ **Parameter switch reminder:** Agent-target batch re-evaluation creation uses `evaluationId`, but follow-up calls to `evaluation_get` and `evaluation_comparison_create` must use `evalId`. Do not call `evaluation_suite_run` for batch eval.

> ⚠️ **Eval-group immutability:** Reuse the same `evaluationId` only when `evaluatorNames` and thresholds are unchanged. If you add/remove evaluators or change thresholds, create a new evaluation group first, then compare runs within that new group.

Auto-poll for completion in a background terminal (same as [Step 2](evaluate-step.md)).

## Step 9: Compare Versions

> **Critical:** `displayName` is **required** in the `insightRequest`. Despite the MCP tool schema showing `displayName` as optional (`type: ["string", "null"]`), the API will reject requests without it with a BadRequest error. `state` must be `"NotStarted"`.

### Required Parameters for `evaluation_comparison_create`

| Parameter | Required | Description |
|-----------|----------|-------------|
| `insightRequest.displayName` | ✅ | Human-readable name. **Omitting causes BadRequest.** |
| `insightRequest.state` | ✅ | Must be `"NotStarted"` |
| `insightRequest.request.evalId` | ✅ | Eval group ID containing both runs |
| `insightRequest.request.baselineRunId` | ✅ | Run ID of the baseline |
| `insightRequest.request.treatmentRunIds` | ✅ | Array of treatment run IDs |

Use **`evaluation_comparison_create`** with a nested `insightRequest`:

```json
{
  "insightRequest": {
    "displayName": "V1 vs V2 Comparison",
    "state": "NotStarted",
    "request": {
      "type": "EvaluationComparison",
      "evalId": "<eval-group-id>",
      "baselineRunId": "<baseline-run-id>",
      "treatmentRunIds": ["<new-run-id>"]
    }
  }
}
```

> **Important:** Both runs must be in the **same eval group** (same `evaluationId` in Steps 2 and 8), but comparison requests and lookups use `evalId` for that same group identifier. That shared group assumes the evaluator bundle is fixed for all runs in the group.

Then use **`evaluation_comparison_get`** (with the returned `insightId`) to retrieve comparison results. Present a summary showing which version performed better per evaluator, and recommend which version to keep.

## Step 10: Iterate or Finish

If more categories remain in the prioritized action table (from [Step 4](analyze-results.md)), loop back to **Step 5** (dive into next category) → **Step 6** (optimize) → **Step 7** (deploy) → **Step 8** (re-evaluate) → **Step 9** (compare).

Otherwise, confirm the final agent version with the user, then prompt for [CI/CD evals & monitoring](cicd-monitoring.md).
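
The Step 8 reuse rule and the Step 9 request shape can be sketched in Python as a pre-flight check before calling the tools. This is a minimal illustration, not part of the skill: `build_insight_request` and `can_reuse_eval_group` are hypothetical helper names, the evaluator names and thresholds in the usage lines are made up, and representing an evaluator bundle as a name-to-threshold dict is an assumption. The sketch only assembles and validates the payload locally; it does not call any MCP tool.

```python
import json


def build_insight_request(display_name, eval_id, baseline_run_id, treatment_run_ids):
    """Assemble the nested payload for evaluation_comparison_create.

    Hypothetical helper: it mirrors the required-parameter table above and
    fails early on a missing displayName instead of waiting for a BadRequest.
    """
    if not display_name or not display_name.strip():
        raise ValueError("displayName is required; the API rejects requests without it")
    return {
        "insightRequest": {
            "displayName": display_name,
            "state": "NotStarted",  # the only accepted value per Step 9
            "request": {
                "type": "EvaluationComparison",
                "evalId": eval_id,  # same group ID that Step 8 passed as evaluationId
                "baselineRunId": baseline_run_id,
                "treatmentRunIds": list(treatment_run_ids),
            },
        }
    }


def can_reuse_eval_group(baseline_bundle, candidate_bundle):
    """Return True when evaluator names and thresholds are unchanged.

    Assumption: each bundle is a dict mapping evaluator name -> threshold.
    Any added, removed, or re-thresholded evaluator means a new group is needed.
    """
    return baseline_bundle == candidate_bundle


# Illustrative usage with placeholder IDs and made-up evaluator names.
payload = build_insight_request(
    "V1 vs V2 Comparison",
    "<eval-group-id>",
    "<baseline-run-id>",
    ["<new-run-id>"],
)
print(json.dumps(payload, indent=2))
print(can_reuse_eval_group({"coherence": 3}, {"coherence": 3, "relevance": 3}))
```

Validating `displayName` locally is a design choice that compensates for the tool schema marking the field as optional; the second check makes the eval-group immutability rule explicit before an `evaluationId` is reused.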