# Eval Trending — Metrics Over Time

Track evaluation metrics across multiple runs and versions to visualize improvement trends and detect regressions. This addresses a common gap: understanding how agent quality changes over time.

## Prerequisites

- At least 2 evaluation runs in the same evaluation group (same `evaluationId` when created)
- Project endpoint and selected environment available in the selected `.foundry/agent-metadata*.yaml` file

> ⚠️ **Eval-group immutability:** Trend a group only when its evaluator set and thresholds stayed fixed across runs. If either changed, start a new evaluation group and track that history separately.

## Step 1 — Retrieve Evaluation History

Use **`evaluation_get`** to list all evaluation groups:

| Parameter | Required | Description |
|-----------|----------|-------------|
| `projectEndpoint` | ✅ | Azure AI Project endpoint |
| `isRequestForRuns` | | `false` (default) to list evaluation groups |

Then retrieve all runs within the target evaluation group:

| Parameter | Required | Description |
|-----------|----------|-------------|
| `projectEndpoint` | ✅ | Azure AI Project endpoint |
| `evalId` | ✅ | Evaluation group ID |
| `isRequestForRuns` | ✅ | `true` to list runs |

> ⚠️ **Parameter guardrail:** `evaluation_get` expects `evalId`, not `evaluationId`, even if the runs were grouped earlier with `evaluationId`.

## Step 2 — Build Metrics Timeline

For each run, extract per-evaluator scores and build a timeline:

| Run | Agent Version | Date | Coherence | Fluency | Relevance | Intent Resolution | Task Adherence | Safety |
|-----|---------------|------|-----------|---------|-----------|-------------------|----------------|--------|
| run-001 | v1 | 2025-01-15 | 3.2 | 4.1 | 2.8 | 3.0 | 2.5 | 0.95 |
| run-002 | v2 | 2025-01-22 | 3.8 | 4.3 | 3.5 | 3.7 | 3.2 | 0.97 |
| run-003 | v3 | 2025-02-01 | 4.1 | 4.4 | 4.0 | 4.2 | 3.8 | 0.96 |
| run-004 | v4 | 2025-02-08 | 4.0 | 4.5 | 3.6 | 4.1 | 3.9 | 0.98 |

## Step 3 — Trend Analysis

Calculate trends for each evaluator:

| Evaluator | v1 → v4 Change | Trend | Status |
|-----------|----------------|-------|--------|
| Coherence | +0.8 (+25%) | ↑ Improving | ✅ |
| Fluency | +0.4 (+10%) | ↑ Improving | ✅ |
| Relevance | +0.8 (+29%) | ↑ Improving (dip at v4) | ⚠️ |
| Intent Resolution | +1.1 (+37%) | ↑ Improving | ✅ |
| Task Adherence | +1.4 (+56%) | ↑ Improving | ✅ |
| Safety | +0.03 (+3%) | → Stable | ✅ |

### Detecting Regressions

Flag any evaluator where the latest run scored **lower** than the previous run:

| Evaluator | Previous (v3) | Latest (v4) | Delta | Alert |
|-----------|---------------|-------------|-------|-------|
| Relevance | 4.0 | 3.6 | -0.4 (-10%) | ⚠️ **REGRESSION** |

> ⚠️ **Regression detected:** Relevance dropped 10% from v3 to v4. Investigate prompt changes or dataset drift. See [Eval Regression](eval-regression.md) for automated analysis.

### Trend Visualization (Text-based)

```
Coherence    ████████████████████████████████░░░░░░░░ 4.0/5.0  ↑ +25%
Fluency      ████████████████████████████████████░░░░ 4.5/5.0  ↑ +10%
Relevance    █████████████████████████████░░░░░░░░░░░ 3.6/5.0  ↑ +29% ⚠️ dip
Intent Res.  █████████████████████████████████░░░░░░░ 4.1/5.0  ↑ +37%
Task Adh.    ███████████████████████████████░░░░░░░░░ 3.9/5.0  ↑ +56%
Safety       ████████████████████████████████████████ 0.98     → Stable
```

## Step 4 — Cross-Version Summary

Present an executive summary:

*"Over 4 agent versions (v1→v4), your agent has improved significantly across all quality metrics. The biggest gain is Task Adherence (+56%). However, Relevance showed a 10% regression from v3 to v4 — recommend investigating recent prompt changes. Safety remains stable at 98%."*

## Recommended Thresholds

| Severity | Threshold | Action |
|----------|-----------|--------|
| ✅ Healthy | ≤ 2% drop from previous run | No action needed |
| ⚠️ Warning | 2–5% drop from previous run | Review recent changes |
| 🔴 Regression | > 5% drop from previous run | Block deployment, investigate |
| 🔴 Critical | Below baseline (v1) on any metric | Rollback to last known good version |
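
As a rough sketch of how these thresholds could be applied automatically, the snippet below classifies the latest run against the previous run and the v1 baseline. The `timeline` layout, evaluator keys, and `classify` helper are illustrative assumptions rather than part of the skill; the scores are taken from the Step 2 example, and nothing here calls `evaluation_get` (it assumes the runs were already retrieved).

```python
# Minimal trend/regression sketch (illustrative only).
# timeline: oldest → newest, (run id, per-evaluator scores) from Step 2.
timeline = [
    ("run-001", {"fluency": 4.1, "relevance": 2.8}),
    ("run-002", {"fluency": 4.3, "relevance": 3.5}),
    ("run-003", {"fluency": 4.4, "relevance": 4.0}),
    ("run-004", {"fluency": 4.5, "relevance": 3.6}),
]

def classify(drop_pct: float, below_baseline: bool) -> str:
    """Map a drop (percent of the previous run's score) to a severity label."""
    if below_baseline:
        return "CRITICAL"      # below the v1 baseline on this metric
    if drop_pct > 5:
        return "REGRESSION"    # block deployment, investigate
    if drop_pct > 2:
        return "WARNING"       # review recent changes
    return "HEALTHY"           # no action needed

baseline = timeline[0][1]                                 # first run (v1) is the baseline
(prev_run, prev), (latest_run, latest) = timeline[-2], timeline[-1]

for evaluator, score in latest.items():
    drop_pct = max(0.0, (prev[evaluator] - score) / prev[evaluator] * 100)
    status = classify(drop_pct, score < baseline[evaluator])
    print(f"{evaluator}: {prev[evaluator]} → {score} "
          f"({prev_run} → {latest_run}, drop {drop_pct:.1f}%) = {status}")
```

Run against the full Step 2 timeline, this reproduces the Relevance regression flagged above while leaving the improving evaluators marked healthy.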

## Next Steps

- **Investigate regression** → [Eval Regression](eval-regression.md)
- **Compare specific versions** → [Dataset Comparison](dataset-comparison.md)
- **Set up automated monitoring** → [observe skill CI/CD](../../observe/references/cicd-monitoring.md) (a minimal gating sketch follows below)
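
For the automated-monitoring path, a hypothetical CI gate could simply exit nonzero whenever any evaluator classifies as REGRESSION or CRITICAL. This assumes `statuses` holds the per-evaluator labels produced by the sketch in the thresholds section; adapt the blocking set to your pipeline's policy.

```python
import sys

# Hypothetical CI gate: statuses maps evaluator → severity label from the trend sketch.
statuses = {"fluency": "HEALTHY", "relevance": "REGRESSION"}

blocked = [name for name, label in statuses.items() if label in {"REGRESSION", "CRITICAL"}]
if blocked:
    print(f"Blocking deployment, regressions detected: {', '.join(blocked)}")
    sys.exit(1)  # nonzero exit fails the CI step
```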