foundry-agent/eval-datasets/references/eval-regression.md
# Eval Regression — Automated Regression Detection

Automatically detect when evaluation metrics degrade between agent versions. Compare each evaluation run against the baseline and generate pass/fail verdicts with actionable recommendations.

## Prerequisites

- At least 2 evaluation runs in the same evaluation group
- Baseline run identified (either the first run or the one tagged as `baseline`)

## Step 1 — Identify Baseline and Treatment

### Automatic Baseline Selection

1. Read `.foundry/datasets/manifest.json` and find the dataset tagged `baseline`.
2. If the baseline dataset entry includes a stored `baselineRunId` (or a mapping to one or more `evalRunIds`), use that `baselineRunId` as the baseline run.
3. If no explicit `baselineRunId` is recorded, select the first (oldest) run in the evaluation group as the baseline.

### Treatment Selection

The latest (most recent) run in the evaluation group is the treatment.
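The selection rules above can be sketched as a small helper. This is an illustrative sketch only: the manifest shape (a `datasets` array with `tags` and an optional `baselineRunId`) follows the description above, and the run-list shape (dicts with `id` and a sortable `createdAt`) is an assumption, not a documented API contract.

```python
import json
from pathlib import Path


def select_baseline_run(manifest_path: str, runs: list[dict]) -> str:
    """Pick the baseline run ID per the steps above.

    `runs` is assumed to be the evaluation group's runs, each a dict
    with an 'id' and a sortable 'createdAt' timestamp.
    """
    manifest = json.loads(Path(manifest_path).read_text())
    for dataset in manifest.get("datasets", []):
        if "baseline" in dataset.get("tags", []):
            # Prefer an explicitly recorded baseline run.
            run_id = dataset.get("baselineRunId")
            if run_id:
                return run_id
    # Fallback: the first (oldest) run in the evaluation group.
    return min(runs, key=lambda r: r["createdAt"])["id"]


def select_treatment_run(runs: list[dict]) -> str:
    """The latest (most recent) run is the treatment."""
    return max(runs, key=lambda r: r["createdAt"])["id"]
```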
## Step 2 — Run Comparison

Use **`evaluation_comparison_create`** to compare baseline vs treatment:

> **Critical:** `displayName` is **required** in the `insightRequest`. Despite the MCP tool schema showing it as optional, the API rejects requests without it.

```json
{
  "insightRequest": {
    "displayName": "Regression Check - v1 vs v4",
    "state": "NotStarted",
    "request": {
      "type": "EvaluationComparison",
      "evalId": "<eval-group-id>",
      "baselineRunId": "<baseline-run-id>",
      "treatmentRunIds": ["<latest-run-id>"]
    }
  }
}
```

Retrieve results with **`evaluation_comparison_get`** using the returned `insightId`.

## Step 3 — Regression Verdicts

For each evaluator in the comparison results, apply regression thresholds:

| Treatment Effect | Delta | Verdict | Action |
|------------------|-------|---------|--------|
| `Improved` | > +2% | ✅ PASS | No action needed |
| `Changed` | ±2% | ⚠️ NEUTRAL | Monitor, no immediate action |
| `Degraded` | > -2% | 🔴 REGRESSION | Investigate and remediate |
| `Inconclusive` | — | ❓ INCONCLUSIVE | Increase sample size and re-run |
| `TooFewSamples` | — | ❓ INSUFFICIENT DATA | Need more test cases (≥30 recommended) |

### Example Regression Report

```
╔══════════════════════════════════════════════════════════════╗
║ REGRESSION REPORT: v1 (baseline) → v4                        ║
╠══════════════════════════════════════════════════════════════╣
║ Evaluator          │ Baseline │ Treatment │ Delta  │ Verdict ║
╠════════════════════╪══════════╪═══════════╪════════╪═════════╣
║ Coherence          │ 3.2      │ 4.0       │ +0.8   │ ✅ PASS ║
║ Fluency            │ 4.1      │ 4.5       │ +0.4   │ ✅ PASS ║
║ Relevance          │ 2.8      │ 3.6       │ +0.8   │ ✅ PASS ║
║ Intent Resolution  │ 3.0      │ 4.1       │ +1.1   │ ✅ PASS ║
║ Task Adherence     │ 2.5      │ 3.9       │ +1.4   │ ✅ PASS ║
║ Safety             │ 0.95     │ 0.98      │ +0.03  │ ✅ PASS ║
╠══════════════════════════════════════════════════════════════╣
║ OVERALL: ✅ ALL EVALUATORS PASSED — Safe to deploy           ║
╚══════════════════════════════════════════════════════════════╝
```
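The thresholds in the verdict table above can be encoded as a small helper for automated checks. A minimal sketch, assuming the treatment-effect strings match the table and that `delta_pct` is the percentage change vs the baseline (e.g. `-10.0` for a 10% drop); the function itself is illustrative, not part of the API.

```python
def verdict(treatment_effect: str, delta_pct: float) -> tuple[str, str]:
    """Map a comparison result to a (verdict, action) pair per the table above."""
    if treatment_effect == "Inconclusive":
        return ("INCONCLUSIVE", "Increase sample size and re-run")
    if treatment_effect == "TooFewSamples":
        return ("INSUFFICIENT DATA", "Need more test cases (>=30 recommended)")
    if treatment_effect == "Degraded" and delta_pct < -2:
        # Drop of more than 2% vs baseline: flag a regression.
        return ("REGRESSION", "Investigate and remediate")
    if treatment_effect == "Improved" and delta_pct > 2:
        return ("PASS", "No action needed")
    # Within the +/-2% band: no immediate action.
    return ("NEUTRAL", "Monitor, no immediate action")
```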
### Example with Regression

```
╔══════════════════════════════════════════════════════════════╗
║ REGRESSION REPORT: v3 → v4                                   ║
╠══════════════════════════════════════════════════════════════╣
║ Evaluator          │ v3       │ v4        │ Delta  │ Verdict ║
╠════════════════════╪══════════╪═══════════╪════════╪═════════╣
║ Coherence          │ 4.1      │ 4.0       │ -0.1   │ ⚠️ NEUT ║
║ Fluency            │ 4.4      │ 4.5       │ +0.1   │ ✅ PASS ║
║ Relevance          │ 4.0      │ 3.6       │ -0.4   │ 🔴 REGR ║
║ Intent Resolution  │ 4.2      │ 4.1       │ -0.1   │ ⚠️ NEUT ║
║ Task Adherence     │ 3.8      │ 3.9       │ +0.1   │ ✅ PASS ║
║ Safety             │ 0.96     │ 0.98      │ +0.02  │ ✅ PASS ║
╠══════════════════════════════════════════════════════════════╣
║ OVERALL: 🔴 REGRESSION DETECTED on Relevance (-10%)          ║
║ RECOMMENDATION: Do NOT deploy v4. Investigate relevance drop.║
╚══════════════════════════════════════════════════════════════╝
```

## Step 4 — Remediation Recommendations

When regression is detected, provide actionable guidance:

| Regression Type | Likely Cause | Recommended Action |
|-----------------|--------------|--------------------|
| Relevance drop | Prompt changes reduced focus on user query | Review prompt diff, restore relevance instructions |
| Coherence drop | Added conflicting instructions | Simplify prompt, use `prompt_optimize` |
| Safety regression | Removed safety guardrails | Restore safety instructions, add safety test cases |
| Task adherence drop | Tool configuration changed | Verify tool definitions, check for missing tools |
| Across-the-board drop | Dataset drift or model change | Check if evaluation dataset changed, verify model deployment |

## CI/CD Integration

Include regression checks in automated pipelines. See [observe skill CI/CD](../../observe/references/cicd-monitoring.md) for GitHub Actions workflow templates that:

1. Run batch evaluation after every deployment
2. Compare against baseline
3. Block deployment if any evaluator shows > 5% regression
4. Alert team via GitHub issue or Slack webhook

## Next Steps

- **View full trend history** → [Eval Trending](eval-trending.md)
- **Optimize to fix regression** → [observe skill Step 4](../../observe/references/optimize-deploy.md)
- **Roll back if critical** → [deploy skill](../../deploy/deploy.md)
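The deployment gate in the CI/CD pipeline above (block when any evaluator regresses by more than 5%) can be sketched as follows. This assumes the per-evaluator percentage deltas have already been extracted from the `evaluation_comparison_get` result; the result's exact shape is not specified here, so adapt the parsing to what the API actually returns.

```python
REGRESSION_GATE_PCT = -5.0  # block deployment beyond a 5% drop


def gate(evaluator_deltas: dict[str, float]) -> int:
    """Return a process exit code: 0 to allow deployment, 1 to block it.

    evaluator_deltas maps evaluator name -> percentage change vs baseline
    (e.g. {"Relevance": -10.0, "Fluency": 2.4}).
    """
    regressions = {
        name: d for name, d in evaluator_deltas.items() if d < REGRESSION_GATE_PCT
    }
    for name, d in sorted(regressions.items()):
        # Printed lines surface in the CI log; a non-zero exit fails the job.
        print(f"REGRESSION: {name} changed by {d:.1f}% vs baseline")
    return 1 if regressions else 0
```

In a GitHub Actions step, call `sys.exit(gate(...))` so a detected regression fails the job and blocks the deploy.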