foundry-agent/eval-datasets/references/eval-regression.md
# Eval Regression — Automated Regression Detection

Automatically detect when evaluation metrics degrade between agent versions. Compare each evaluation run against the baseline and generate pass/fail verdicts with actionable recommendations.

## Prerequisites

- At least 2 evaluation runs in the same evaluation group
- Baseline run identified (either the first run or the one tagged as `baseline`)

## Step 1 — Identify Baseline and Treatment

### Automatic Baseline Selection

1. Read `.foundry/datasets/manifest.json` and find the dataset tagged `baseline`.
2. If the baseline dataset entry includes a stored `baselineRunId` (or a mapping to one or more `evalRunIds`), use that `baselineRunId` as the baseline run.
3. If no explicit `baselineRunId` is recorded, select the first (oldest) run in the evaluation group as the baseline.

### Treatment Selection

The latest (most recent) run in the evaluation group is the treatment.

## Step 2 — Run Comparison

Use **`evaluation_comparison_create`** to compare baseline vs treatment:

> **Critical:** `displayName` is **required** in the `insightRequest`. Despite the MCP tool schema showing it as optional, the API rejects requests without it.

```json
{
  "insightRequest": {
    "displayName": "Regression Check - v1 vs v4",
    "state": "NotStarted",
    "request": {
      "type": "EvaluationComparison",
      "evalId": "<eval-group-id>",
      "baselineRunId": "<baseline-run-id>",
      "treatmentRunIds": ["<latest-run-id>"]
    }
  }
}
```

Retrieve results with **`evaluation_comparison_get`** using the returned `insightId`.

## Step 3 — Regression Verdicts

For each evaluator in the comparison results, apply regression thresholds:

| Treatment Effect | Delta | Verdict | Action |
|-----------------|-------|---------|--------|
| `Improved` | > +2% | ✅ PASS | No action needed |
| `Changed` | within ±2% | ⚠️ NEUTRAL | Monitor, no immediate action |
| `Degraded` | < -2% | 🔴 REGRESSION | Investigate and remediate |
| `Inconclusive` | — | ❓ INCONCLUSIVE | Increase sample size and re-run |
| `TooFewSamples` | — | ❓ INSUFFICIENT DATA | Need more test cases (≥30 recommended) |

### Example Regression Report

```
╔══════════════════════════════════════════════════════════════╗
║ REGRESSION REPORT: v1 (baseline) → v4                        ║
╠══════════════════════════════════════════════════════════════╣
║ Evaluator          │ Baseline │ Treatment │  Delta │ Verdict ║
╠════════════════════╪══════════╪═══════════╪════════╪═════════╣
║ Coherence          │   3.2    │    4.0    │  +0.8  │ ✅ PASS ║
║ Fluency            │   4.1    │    4.5    │  +0.4  │ ✅ PASS ║
║ Relevance          │   2.8    │    3.6    │  +0.8  │ ✅ PASS ║
║ Intent Resolution  │   3.0    │    4.1    │  +1.1  │ ✅ PASS ║
║ Task Adherence     │   2.5    │    3.9    │  +1.4  │ ✅ PASS ║
║ Safety             │   0.95   │   0.98    │ +0.03  │ ✅ PASS ║
╠══════════════════════════════════════════════════════════════╣
║ OVERALL: ✅ ALL EVALUATORS PASSED — Safe to deploy           ║
╚══════════════════════════════════════════════════════════════╝
```

### Example with Regression

```
╔══════════════════════════════════════════════════════════════╗
║ REGRESSION REPORT: v3 → v4                                   ║
╠══════════════════════════════════════════════════════════════╣
║ Evaluator          │    v3    │    v4     │  Delta │ Verdict ║
╠════════════════════╪══════════╪═══════════╪════════╪═════════╣
║ Coherence          │   4.1    │    4.0    │  -0.1  │ ⚠️ NEUT ║
║ Fluency            │   4.4    │    4.5    │  +0.1  │ ✅ PASS ║
║ Relevance          │   4.0    │    3.6    │  -0.4  │ 🔴 REGR ║
║ Intent Resolution  │   4.2    │    4.1    │  -0.1  │ ⚠️ NEUT ║
║ Task Adherence     │   3.8    │    3.9    │  +0.1  │ ✅ PASS ║
║ Safety             │   0.96   │   0.98    │ +0.02  │ ✅ PASS ║
╠══════════════════════════════════════════════════════════════╣
║ OVERALL: 🔴 REGRESSION DETECTED on Relevance (-10%)          ║
║ RECOMMENDATION: Do NOT deploy v4. Investigate relevance drop.║
╚══════════════════════════════════════════════════════════════╝
```

## Step 4 — Remediation Recommendations

When regression is detected, provide actionable guidance:

| Regression Type | Likely Cause | Recommended Action |
|----------------|-------------|-------------------|
| Relevance drop | Prompt changes reduced focus on user query | Review prompt diff, restore relevance instructions |
| Coherence drop | Added conflicting instructions | Simplify prompt, use `prompt_optimize` |
| Safety regression | Removed safety guardrails | Restore safety instructions, add safety test cases |
| Task adherence drop | Tool configuration changed | Verify tool definitions, check for missing tools |
| Across-the-board drop | Dataset drift or model change | Check if evaluation dataset changed, verify model deployment |

## CI/CD Integration

Include regression checks in automated pipelines. See [observe skill CI/CD](../../observe/references/cicd-monitoring.md) for GitHub Actions workflow templates that:

1. Run batch evaluation after every deployment
2. Compare against baseline
3. Block deployment if any evaluator shows > 5% regression
4. Alert team via GitHub issue or Slack webhook

## Next Steps

- **View full trend history** → [Eval Trending](eval-trending.md)
- **Optimize to fix regression** → [observe skill Step 4](../../observe/references/optimize-deploy.md)
- **Roll back if critical** → [deploy skill](../../deploy/deploy.md)
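
## Appendix — Verdict Logic Sketch

The Step 3 thresholds can be applied mechanically once per-evaluator scores are extracted from the comparison results. The sketch below is a minimal illustration, not part of the Foundry API: the function names (`verdict`, `regression_report`) and the flat `{evaluator: score}` / `{evaluator: effect}` input shapes are assumptions for the example; in practice you would populate them from the `evaluation_comparison_get` response.

```python
from typing import Dict, Tuple


def verdict(effect: str, delta_pct: float) -> str:
    """Map one evaluator's treatment effect and delta to a Step 3 verdict."""
    if effect == "Inconclusive":
        return "INCONCLUSIVE"
    if effect == "TooFewSamples":
        return "INSUFFICIENT DATA"
    if effect == "Degraded" or delta_pct < -2.0:
        return "REGRESSION"
    if effect == "Improved" and delta_pct > 2.0:
        return "PASS"
    return "NEUTRAL"  # within the +/-2% band: monitor, no immediate action


def regression_report(
    baseline: Dict[str, float],
    treatment: Dict[str, float],
    effects: Dict[str, str],
) -> Dict[str, Tuple[float, str]]:
    """Per-evaluator delta (as a percentage of the baseline score) and verdict."""
    report = {}
    for name, base in baseline.items():
        delta_pct = (treatment[name] - base) / base * 100
        effect = effects.get(name, "Changed")  # default if no effect is reported
        report[name] = (round(delta_pct, 1), verdict(effect, delta_pct))
    return report
```

Using the "Example with Regression" numbers above, `regression_report({"Relevance": 4.0}, {"Relevance": 3.6}, {"Relevance": "Degraded"})` yields `{"Relevance": (-10.0, "REGRESSION")}`, matching the report's -10% verdict. A CI gate would then fail the deployment if any verdict is `REGRESSION`.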