# Eval Trending — Metrics Over Time

Track evaluation metrics across multiple runs and versions to visualize improvement trends and detect regressions. This fills a common gap: knowing how agent quality changes over time.

## Prerequisites

- At least 2 evaluation runs in the same evaluation group (same `evaluationId` when created)
- Project endpoint and environment available in the selected `.foundry/agent-metadata*.yaml` file

> ⚠️ **Eval-group immutability:** Trend a group only when its evaluator set and thresholds stayed fixed across runs. If either changed, start a new evaluation group and track that history separately.

## Step 1 — Retrieve Evaluation History

Use **`evaluation_get`** to list all evaluation groups:

| Parameter | Required | Description |
|-----------|----------|-------------|
| `projectEndpoint` | ✅ | Azure AI Project endpoint |
| `isRequestForRuns` | | `false` (default) to list evaluation groups |

Then retrieve all runs within the target evaluation group:

| Parameter | Required | Description |
|-----------|----------|-------------|
| `projectEndpoint` | ✅ | Azure AI Project endpoint |
| `evalId` | ✅ | Evaluation group ID |
| `isRequestForRuns` | ✅ | `true` to list runs |

> ⚠️ **Parameter guardrail:** `evaluation_get` expects `evalId`, not `evaluationId`, even if the runs were grouped earlier with `evaluationId`.
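As a concrete reference, here is a minimal Python sketch of the two calls. The `call_tool(name, args)` helper and the endpoint placeholder are assumptions for illustration; how you actually invoke the tool depends on your client. Only the parameter names come from the tables above.

```python
# Minimal sketch of the Step 1 calls. `call_tool(name, args)` is a
# hypothetical helper standing in for however your client invokes skill
# tools; only the parameter names below come from the tables above.

PROJECT_ENDPOINT = "https://<resource>.services.ai.azure.com/api/projects/<project>"  # illustrative placeholder

def list_evaluation_groups(call_tool):
    # isRequestForRuns defaults to false, so this returns evaluation groups.
    return call_tool("evaluation_get", {"projectEndpoint": PROJECT_ENDPOINT})

def list_runs_in_group(call_tool, eval_id: str):
    # Per the guardrail above: the parameter is `evalId`, not `evaluationId`.
    return call_tool("evaluation_get", {
        "projectEndpoint": PROJECT_ENDPOINT,
        "evalId": eval_id,
        "isRequestForRuns": True,
    })
```

Call `list_evaluation_groups` first to find the target group's ID, then pass that ID to `list_runs_in_group` to get the run history used for trending.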
## Step 2 — Build Metrics Timeline

For each run, extract per-evaluator scores and build a timeline:

| Run | Agent Version | Date | Coherence | Fluency | Relevance | Intent Resolution | Task Adherence | Safety |
|-----|---------------|------|-----------|---------|-----------|-------------------|----------------|--------|
| run-001 | v1 | 2025-01-15 | 3.2 | 4.1 | 2.8 | 3.0 | 2.5 | 0.95 |
| run-002 | v2 | 2025-01-22 | 3.8 | 4.3 | 3.5 | 3.7 | 3.2 | 0.97 |
| run-003 | v3 | 2025-02-01 | 4.1 | 4.4 | 4.0 | 4.2 | 3.8 | 0.96 |
| run-004 | v4 | 2025-02-08 | 4.0 | 4.5 | 3.6 | 4.1 | 3.9 | 0.98 |

## Step 3 — Trend Analysis

Calculate trends for each evaluator:

| Evaluator | v1 → v4 Change | Trend | Status |
|-----------|----------------|-------|--------|
| Coherence | +0.8 (+25%) | ↑ Improving | ✅ |
| Fluency | +0.4 (+10%) | ↑ Improving | ✅ |
| Relevance | +0.8 (+29%) | ↑ Improving (dip at v4) | ⚠️ |
| Intent Resolution | +1.1 (+37%) | ↑ Improving | ✅ |
| Task Adherence | +1.4 (+56%) | ↑ Improving | ✅ |
| Safety | +0.03 (+3%) | → Stable | ✅ |

### Detecting Regressions

Flag any evaluator where the latest run scored **lower** than the previous run:

| Evaluator | Previous (v3) | Latest (v4) | Delta | Alert |
|-----------|---------------|-------------|-------|-------|
| Relevance | 4.0 | 3.6 | -0.4 (-10%) | ⚠️ **REGRESSION** |

> ⚠️ **Regression detected:** Relevance dropped 10% from v3 to v4. Investigate prompt changes or dataset drift. See [Eval Regression](eval-regression.md) for automated analysis.
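To make the regression check concrete, the sketch below assumes each run has already been reduced to a `{evaluator: mean score}` mapping (extracting those means from raw run results depends on your result schema) and applies the severity bands from the Recommended Thresholds section further down.

```python
# Sketch: flag regressions between the two most recent runs (Step 3).
# Assumes each run is already reduced to {evaluator: mean score}; how you
# extract those means from raw run results depends on your result schema.

from typing import Dict, List

# Illustrative data, taken from the v3/v4 rows of the timeline above.
timeline: List[Dict] = [
    {"run": "run-003", "version": "v3",
     "scores": {"Coherence": 4.1, "Relevance": 4.0, "Task Adherence": 3.8}},
    {"run": "run-004", "version": "v4",
     "scores": {"Coherence": 4.0, "Relevance": 3.6, "Task Adherence": 3.9}},
]

def classify(delta_pct: float) -> str:
    """Map a score change (percent) to a severity band."""
    if delta_pct >= -2.0:
        return "healthy"      # <= 2% drop: no action needed
    if delta_pct >= -5.0:
        return "warning"      # 2-5% drop: review recent changes
    return "regression"       # > 5% drop: block deployment, investigate

previous, latest = timeline[-2], timeline[-1]
for evaluator, prev in previous["scores"].items():
    curr = latest["scores"].get(evaluator)
    if curr is None:
        continue  # evaluator not present in the latest run
    delta_pct = (curr - prev) / prev * 100
    print(f"{evaluator}: {prev} -> {curr} ({delta_pct:+.1f}%) [{classify(delta_pct)}]")
```

With the v3/v4 numbers above, this prints Relevance as a regression (-10.0%), Coherence as a warning (-2.4%), and Task Adherence as healthy (+2.6%), matching the alert table.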
### Trend Visualization (Text-based)

```
Coherence    ████████████████████████████████░░░░░░░░  4.0/5.0  ↑ +25%
Fluency      ████████████████████████████████████░░░░  4.5/5.0  ↑ +10%
Relevance    █████████████████████████████░░░░░░░░░░░  3.6/5.0  ↑ +29% ⚠️ dip at v4
Intent Res.  █████████████████████████████████░░░░░░░  4.1/5.0  ↑ +37%
Task Adh.    ███████████████████████████████░░░░░░░░░  3.9/5.0  ↑ +56%
Safety       ███████████████████████████████████████░  0.98     → Stable
```

## Step 4 — Cross-Version Summary

Present an executive summary:

*"Over 4 agent versions (v1→v4), your agent has improved significantly across all quality metrics. The biggest gain is Task Adherence (+56%). However, Relevance showed a 10% regression from v3 to v4 — recommend investigating recent prompt changes. Safety remains stable at 98%."*

## Recommended Thresholds

| Severity | Threshold | Action |
|----------|-----------|--------|
| ✅ Healthy | ≤ 2% drop from previous run | No action needed |
| ⚠️ Warning | 2–5% drop from previous run | Review recent changes |
| 🔴 Regression | > 5% drop from previous run | Block deployment, investigate |
| 🔴 Critical | Below baseline (v1) on any metric | Roll back to last known good version |

## Next Steps

- **Investigate regression** → [Eval Regression](eval-regression.md)
- **Compare specific versions** → [Dataset Comparison](dataset-comparison.md)
- **Set up automated monitoring** → [observe skill CI/CD](../../observe/references/cicd-monitoring.md)