foundry-agent/eval-datasets/references/eval-regression.md
# Eval Regression — Automated Regression Detection

Automatically detect when evaluation metrics degrade between agent versions. Compare each evaluation run against the baseline and generate pass/fail verdicts with actionable recommendations.

## Prerequisites

- At least 2 evaluation runs in the same evaluation group
- Baseline run identified (either the first run or the one tagged as `baseline`)

## Step 1 — Identify Baseline and Treatment

### Automatic Baseline Selection

1. Read `.foundry/datasets/manifest.json` and find the dataset tagged `baseline`.
2. If the baseline dataset entry includes a stored `baselineRunId` (or a mapping to one or more `evalRunIds`), use that `baselineRunId` as the baseline run.
3. If no explicit `baselineRunId` is recorded, select the first (oldest) run in the evaluation group as the baseline.

### Treatment Selection

The latest (most recent) run in the evaluation group is the treatment.

## Step 2 — Run Comparison

Use **`evaluation_comparison_create`** to compare baseline vs treatment:

> **Critical:** `displayName` is **required** in the `insightRequest`. Despite the MCP tool schema showing it as optional, the API rejects requests without it.

```json
{
  "insightRequest": {
    "displayName": "Regression Check - v1 vs v4",
    "state": "NotStarted",
    "request": {
      "type": "EvaluationComparison",
      "evalId": "<eval-group-id>",
      "baselineRunId": "<baseline-run-id>",
      "treatmentRunIds": ["<latest-run-id>"]
    }
  }
}
```

Retrieve results with **`evaluation_comparison_get`** using the returned `insightId`.

## Step 3 — Regression Verdicts

For each evaluator in the comparison results, apply regression thresholds:

| Treatment Effect | Delta | Verdict | Action |
|-----------------|-------|---------|--------|
| `Improved` | > +2% | ✅ PASS | No action needed |
| `Changed` | within ±2% | ⚠️ NEUTRAL | Monitor, no immediate action |
| `Degraded` | < -2% | 🔴 REGRESSION | Investigate and remediate |
| `Inconclusive` | — | ❓ INCONCLUSIVE | Increase sample size and re-run |
| `TooFewSamples` | — | ❓ INSUFFICIENT DATA | Need more test cases (≥30 recommended) |

### Example Regression Report

```
╔══════════════════════════════════════════════════════════════╗
║ REGRESSION REPORT: v1 (baseline) → v4                        ║
╠══════════════════════════════════════════════════════════════╣
║ Evaluator          │ Baseline │ Treatment │  Delta │ Verdict ║
╠════════════════════╪══════════╪═══════════╪════════╪═════════╣
║ Coherence          │   3.2    │    4.0    │  +0.8  │ ✅ PASS ║
║ Fluency            │   4.1    │    4.5    │  +0.4  │ ✅ PASS ║
║ Relevance          │   2.8    │    3.6    │  +0.8  │ ✅ PASS ║
║ Intent Resolution  │   3.0    │    4.1    │  +1.1  │ ✅ PASS ║
║ Task Adherence     │   2.5    │    3.9    │  +1.4  │ ✅ PASS ║
║ Safety             │   0.95   │   0.98    │ +0.03  │ ✅ PASS ║
╠══════════════════════════════════════════════════════════════╣
║ OVERALL: ✅ ALL EVALUATORS PASSED — Safe to deploy           ║
╚══════════════════════════════════════════════════════════════╝
```

### Example with Regression

```
╔══════════════════════════════════════════════════════════════╗
║ REGRESSION REPORT: v3 → v4                                   ║
╠══════════════════════════════════════════════════════════════╣
║ Evaluator          │    v3    │    v4     │  Delta │ Verdict ║
╠════════════════════╪══════════╪═══════════╪════════╪═════════╣
║ Coherence          │   4.1    │    4.0    │  -0.1  │ ⚠️ NEUT ║
║ Fluency            │   4.4    │    4.5    │  +0.1  │ ✅ PASS ║
║ Relevance          │   4.0    │    3.6    │  -0.4  │ 🔴 REGR ║
║ Intent Resolution  │   4.2    │    4.1    │  -0.1  │ ⚠️ NEUT ║
║ Task Adherence     │   3.8    │    3.9    │  +0.1  │ ✅ PASS ║
║ Safety             │   0.96   │   0.98    │ +0.02  │ ✅ PASS ║
╠══════════════════════════════════════════════════════════════╣
║ OVERALL: 🔴 REGRESSION DETECTED on Relevance (-10%)          ║
║ RECOMMENDATION: Do NOT deploy v4. Investigate relevance drop.║
╚══════════════════════════════════════════════════════════════╝
```

## Step 4 — Remediation Recommendations

When regression is detected, provide actionable guidance:

| Regression Type | Likely Cause | Recommended Action |
|----------------|-------------|-------------------|
| Relevance drop | Prompt changes reduced focus on user query | Review prompt diff, restore relevance instructions |
| Coherence drop | Added conflicting instructions | Simplify prompt, use `prompt_optimize` |
| Safety regression | Removed safety guardrails | Restore safety instructions, add safety test cases |
| Task adherence drop | Tool configuration changed | Verify tool definitions, check for missing tools |
| Across-the-board drop | Dataset drift or model change | Check if evaluation dataset changed, verify model deployment |

## CI/CD Integration

Include regression checks in automated pipelines. See [observe skill CI/CD](../../observe/references/cicd-monitoring.md) for GitHub Actions workflow templates that:

1. Run batch evaluation after every deployment
2. Compare against baseline
3. Block deployment if any evaluator shows > 5% regression
4. Alert team via GitHub issue or Slack webhook

## Next Steps

- **View full trend history** → [Eval Trending](eval-trending.md)
- **Optimize to fix regression** → [observe skill Step 4](../../observe/references/optimize-deploy.md)
- **Roll back if critical** → [deploy skill](../../deploy/deploy.md)
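
## Appendix — Verdict Logic Sketch

The Step 3 thresholds can be applied mechanically once per-evaluator scores are extracted from the comparison results. The sketch below is a minimal illustration, not part of the Foundry API: the function names (`verdict`, `regression_report`) and the flat `{evaluator: score}` / `{evaluator: effect}` input shapes are assumptions for the example; in practice you would populate them from the `evaluation_comparison_get` response.

```python
from typing import Dict, Tuple


def verdict(effect: str, delta_pct: float) -> str:
    """Map one evaluator's treatment effect and delta to a Step 3 verdict."""
    if effect == "Inconclusive":
        return "INCONCLUSIVE"
    if effect == "TooFewSamples":
        return "INSUFFICIENT DATA"
    if effect == "Degraded" or delta_pct < -2.0:
        return "REGRESSION"
    if effect == "Improved" and delta_pct > 2.0:
        return "PASS"
    return "NEUTRAL"  # within the +/-2% band: monitor, no immediate action


def regression_report(
    baseline: Dict[str, float],
    treatment: Dict[str, float],
    effects: Dict[str, str],
) -> Dict[str, Tuple[float, str]]:
    """Per-evaluator delta (as a percentage of the baseline score) and verdict."""
    report = {}
    for name, base in baseline.items():
        delta_pct = (treatment[name] - base) / base * 100
        effect = effects.get(name, "Changed")  # default if no effect is reported
        report[name] = (round(delta_pct, 1), verdict(effect, delta_pct))
    return report
```

Using the "Example with Regression" numbers above, `regression_report({"Relevance": 4.0}, {"Relevance": 3.6}, {"Relevance": "Degraded"})` yields `{"Relevance": (-10.0, "REGRESSION")}`, matching the report's -10% verdict. A CI gate would then fail the deployment if any verdict is `REGRESSION`.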