foundry-agent/eval-datasets/references/eval-regression.md
# Eval Regression — Automated Regression Detection

Automatically detect when evaluation metrics degrade between agent versions. Compare each evaluation run against the baseline and generate pass/fail verdicts with actionable recommendations.

## Prerequisites

- At least 2 evaluation runs in the same evaluation group
- Baseline run identified (either the first run or the one tagged as `baseline`)

## Step 1 — Identify Baseline and Treatment

### Automatic Baseline Selection

1. Read `.foundry/datasets/manifest.json` and find the dataset tagged `baseline`.
2. If the baseline dataset entry includes a stored `baselineRunId` (or a mapping to one or more `evalRunIds`), use that `baselineRunId` as the baseline run.
3. If no explicit `baselineRunId` is recorded, select the first (oldest) run in the evaluation group as the baseline.

### Treatment Selection

The latest (most recent) run in the evaluation group is the treatment.
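The selection rules above can be sketched as a small helper. This is an illustrative sketch only: the manifest shape (a `datasets` array with `tags` and an optional `baselineRunId`) follows the description above, and the run-list shape (dicts with `id` and a sortable `createdAt`) is an assumption, not a documented API contract.

```python
import json
from pathlib import Path


def select_baseline_run(manifest_path: str, runs: list[dict]) -> str:
    """Pick the baseline run ID per the steps above.

    `runs` is assumed to be the evaluation group's runs, each a dict
    with an 'id' and a sortable 'createdAt' timestamp.
    """
    manifest = json.loads(Path(manifest_path).read_text())
    for dataset in manifest.get("datasets", []):
        if "baseline" in dataset.get("tags", []):
            # Prefer an explicitly recorded baseline run.
            run_id = dataset.get("baselineRunId")
            if run_id:
                return run_id
    # Fallback: the first (oldest) run in the evaluation group.
    return min(runs, key=lambda r: r["createdAt"])["id"]


def select_treatment_run(runs: list[dict]) -> str:
    """The latest (most recent) run is the treatment."""
    return max(runs, key=lambda r: r["createdAt"])["id"]
```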
## Step 2 — Run Comparison

Use **`evaluation_comparison_create`** to compare baseline vs treatment:

> **Critical:** `displayName` is **required** in the `insightRequest`. Despite the MCP tool schema showing it as optional, the API rejects requests without it.

```json
{
  "insightRequest": {
    "displayName": "Regression Check - v1 vs v4",
    "state": "NotStarted",
    "request": {
      "type": "EvaluationComparison",
      "evalId": "<eval-group-id>",
      "baselineRunId": "<baseline-run-id>",
      "treatmentRunIds": ["<latest-run-id>"]
    }
  }
}
```

Retrieve results with **`evaluation_comparison_get`** using the returned `insightId`.

## Step 3 — Regression Verdicts

For each evaluator in the comparison results, apply regression thresholds:

| Treatment Effect | Delta | Verdict | Action |
|------------------|-------|---------|--------|
| `Improved` | > +2% | ✅ PASS | No action needed |
| `Changed` | ±2% | ⚠️ NEUTRAL | Monitor, no immediate action |
| `Degraded` | > -2% | 🔴 REGRESSION | Investigate and remediate |
| `Inconclusive` | — | ❓ INCONCLUSIVE | Increase sample size and re-run |
| `TooFewSamples` | — | ❓ INSUFFICIENT DATA | Need more test cases (≥30 recommended) |

### Example Regression Report

```
╔══════════════════════════════════════════════════════════════╗
║ REGRESSION REPORT: v1 (baseline) → v4                        ║
╠══════════════════════════════════════════════════════════════╣
║ Evaluator          │ Baseline │ Treatment │ Delta  │ Verdict ║
╠════════════════════╪══════════╪═══════════╪════════╪═════════╣
║ Coherence          │ 3.2      │ 4.0       │ +0.8   │ ✅ PASS ║
║ Fluency            │ 4.1      │ 4.5       │ +0.4   │ ✅ PASS ║
║ Relevance          │ 2.8      │ 3.6       │ +0.8   │ ✅ PASS ║
║ Intent Resolution  │ 3.0      │ 4.1       │ +1.1   │ ✅ PASS ║
║ Task Adherence     │ 2.5      │ 3.9       │ +1.4   │ ✅ PASS ║
║ Safety             │ 0.95     │ 0.98      │ +0.03  │ ✅ PASS ║
╠══════════════════════════════════════════════════════════════╣
║ OVERALL: ✅ ALL EVALUATORS PASSED — Safe to deploy           ║
╚══════════════════════════════════════════════════════════════╝
```
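The thresholds in the verdict table above can be encoded as a small helper for automated checks. A minimal sketch, assuming the treatment-effect strings match the table and that `delta_pct` is the percentage change vs the baseline (e.g. `-10.0` for a 10% drop); the function itself is illustrative, not part of the API.

```python
def verdict(treatment_effect: str, delta_pct: float) -> tuple[str, str]:
    """Map a comparison result to a (verdict, action) pair per the table above."""
    if treatment_effect == "Inconclusive":
        return ("INCONCLUSIVE", "Increase sample size and re-run")
    if treatment_effect == "TooFewSamples":
        return ("INSUFFICIENT DATA", "Need more test cases (>=30 recommended)")
    if treatment_effect == "Degraded" and delta_pct < -2:
        # Drop of more than 2% vs baseline: flag a regression.
        return ("REGRESSION", "Investigate and remediate")
    if treatment_effect == "Improved" and delta_pct > 2:
        return ("PASS", "No action needed")
    # Within the +/-2% band: no immediate action.
    return ("NEUTRAL", "Monitor, no immediate action")
```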
### Example with Regression

```
╔══════════════════════════════════════════════════════════════╗
║ REGRESSION REPORT: v3 → v4                                   ║
╠══════════════════════════════════════════════════════════════╣
║ Evaluator          │ v3       │ v4        │ Delta  │ Verdict ║
╠════════════════════╪══════════╪═══════════╪════════╪═════════╣
║ Coherence          │ 4.1      │ 4.0       │ -0.1   │ ⚠️ NEUT ║
║ Fluency            │ 4.4      │ 4.5       │ +0.1   │ ✅ PASS ║
║ Relevance          │ 4.0      │ 3.6       │ -0.4   │ 🔴 REGR ║
║ Intent Resolution  │ 4.2      │ 4.1       │ -0.1   │ ⚠️ NEUT ║
║ Task Adherence     │ 3.8      │ 3.9       │ +0.1   │ ✅ PASS ║
║ Safety             │ 0.96     │ 0.98      │ +0.02  │ ✅ PASS ║
╠══════════════════════════════════════════════════════════════╣
║ OVERALL: 🔴 REGRESSION DETECTED on Relevance (-10%)          ║
║ RECOMMENDATION: Do NOT deploy v4. Investigate relevance drop.║
╚══════════════════════════════════════════════════════════════╝
```

## Step 4 — Remediation Recommendations

When regression is detected, provide actionable guidance:

| Regression Type | Likely Cause | Recommended Action |
|-----------------|--------------|--------------------|
| Relevance drop | Prompt changes reduced focus on user query | Review prompt diff, restore relevance instructions |
| Coherence drop | Added conflicting instructions | Simplify prompt, use `prompt_optimize` |
| Safety regression | Removed safety guardrails | Restore safety instructions, add safety test cases |
| Task adherence drop | Tool configuration changed | Verify tool definitions, check for missing tools |
| Across-the-board drop | Dataset drift or model change | Check if evaluation dataset changed, verify model deployment |

## CI/CD Integration

Include regression checks in automated pipelines. See [observe skill CI/CD](../../observe/references/cicd-monitoring.md) for GitHub Actions workflow templates that:

1. Run batch evaluation after every deployment
2. Compare against baseline
3. Block deployment if any evaluator shows > 5% regression
4. Alert team via GitHub issue or Slack webhook

## Next Steps

- **View full trend history** → [Eval Trending](eval-trending.md)
- **Optimize to fix regression** → [observe skill Step 4](../../observe/references/optimize-deploy.md)
- **Roll back if critical** → [deploy skill](../../deploy/deploy.md)
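The deployment gate in the CI/CD pipeline above (block when any evaluator regresses by more than 5%) can be sketched as follows. This assumes the per-evaluator percentage deltas have already been extracted from the `evaluation_comparison_get` result; the result's exact shape is not specified here, so adapt the parsing to what the API actually returns.

```python
REGRESSION_GATE_PCT = -5.0  # block deployment beyond a 5% drop


def gate(evaluator_deltas: dict[str, float]) -> int:
    """Return a process exit code: 0 to allow deployment, 1 to block it.

    evaluator_deltas maps evaluator name -> percentage change vs baseline
    (e.g. {"Relevance": -10.0, "Fluency": 2.4}).
    """
    regressions = {
        name: d for name, d in evaluator_deltas.items() if d < REGRESSION_GATE_PCT
    }
    for name, d in sorted(regressions.items()):
        # Printed lines surface in the CI log; a non-zero exit fails the job.
        print(f"REGRESSION: {name} changed by {d:.1f}% vs baseline")
    return 1 if regressions else 0
```

In a GitHub Actions step, call `sys.exit(gate(...))` so a detected regression fails the job and blocks the deploy.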