# Eval Trending — Metrics Over Time

Track evaluation metrics across multiple runs and versions to visualize improvement trends and detect regressions. This addresses a common gap: understanding how agent quality changes over time.

## Prerequisites

- At least 2 evaluation runs in the same evaluation group (same `evaluationId` when created)
- Project endpoint and selected environment available in the selected `.foundry/agent-metadata*.yaml` file

> ⚠️ **Eval-group immutability:** Trend a group only when its evaluator set and thresholds stayed fixed across runs. If either changed, start a new evaluation group and track that history separately.

## Step 1 — Retrieve Evaluation History

Use **`evaluation_get`** to list all evaluation groups:

| Parameter | Required | Description |
|-----------|----------|-------------|
| `projectEndpoint` | ✅ | Azure AI Project endpoint |
| `isRequestForRuns` | | `false` (default) to list evaluation groups |

Then retrieve all runs within the target evaluation group:

| Parameter | Required | Description |
|-----------|----------|-------------|
| `projectEndpoint` | ✅ | Azure AI Project endpoint |
| `evalId` | ✅ | Evaluation group ID |
| `isRequestForRuns` | ✅ | `true` to list runs |

> ⚠️ **Parameter guardrail:** `evaluation_get` expects `evalId`, not `evaluationId`, even if the runs were grouped earlier with `evaluationId`.

## Step 2 — Build Metrics Timeline

For each run, extract per-evaluator scores and build a timeline:

| Run | Agent Version | Date | Coherence | Fluency | Relevance | Intent Resolution | Task Adherence | Safety |
|-----|---------------|------|-----------|---------|-----------|-------------------|----------------|--------|
| run-001 | v1 | 2025-01-15 | 3.2 | 4.1 | 2.8 | 3.0 | 2.5 | 0.95 |
| run-002 | v2 | 2025-01-22 | 3.8 | 4.3 | 3.5 | 3.7 | 3.2 | 0.97 |
| run-003 | v3 | 2025-02-01 | 4.1 | 4.4 | 4.0 | 4.2 | 3.8 | 0.96 |
| run-004 | v4 | 2025-02-08 | 4.0 | 4.5 | 3.6 | 4.1 | 3.9 | 0.98 |

## Step 3 — Trend Analysis

Calculate trends for each evaluator:

| Evaluator | v1 → v4 Change | Trend | Status |
|-----------|----------------|-------|--------|
| Coherence | +0.8 (+25%) | ↑ Improving | ✅ |
| Fluency | +0.4 (+10%) | ↑ Improving | ✅ |
| Relevance | +0.8 (+29%) | ↑ Improving (dip at v4) | ⚠️ |
| Intent Resolution | +1.1 (+37%) | ↑ Improving | ✅ |
| Task Adherence | +1.4 (+56%) | ↑ Improving | ✅ |
| Safety | +0.03 (+3%) | → Stable | ✅ |

### Detecting Regressions

Flag any evaluator where the latest run scored **lower** than the previous run:

| Evaluator | Previous (v3) | Latest (v4) | Delta | Alert |
|-----------|---------------|-------------|-------|-------|
| Relevance | 4.0 | 3.6 | -0.4 (-10%) | ⚠️ **REGRESSION** |

> ⚠️ **Regression detected:** Relevance dropped 10% from v3 to v4. Investigate prompt changes or dataset drift. See [Eval Regression](eval-regression.md) for automated analysis.

### Trend Visualization (Text-based)

```
Coherence    ████████████████████████████████░░░░░░░░ 4.0/5.0  ↑ +25%
Fluency      ████████████████████████████████████░░░░ 4.5/5.0  ↑ +10%
Relevance    █████████████████████████████░░░░░░░░░░░ 3.6/5.0  ↑ +29% ⚠️ dip
Intent Res.  █████████████████████████████████░░░░░░░ 4.1/5.0  ↑ +37%
Task Adh.    ███████████████████████████████░░░░░░░░░ 3.9/5.0  ↑ +56%
Safety       ████████████████████████████████████████ 0.98     → Stable
```

## Step 4 — Cross-Version Summary

Present an executive summary:

*"Over 4 agent versions (v1→v4), your agent has improved significantly across all quality metrics. The biggest gain is Task Adherence (+56%). However, Relevance showed a 10% regression from v3 to v4 — recommend investigating recent prompt changes. Safety remains stable at 98%."*

## Recommended Thresholds

| Severity | Threshold | Action |
|----------|-----------|--------|
| ✅ Healthy | ≤ 2% drop from previous run | No action needed |
| ⚠️ Warning | 2–5% drop from previous run | Review recent changes |
| 🔴 Regression | > 5% drop from previous run | Block deployment, investigate |
| 🔴 Critical | Below baseline (v1) on any metric | Rollback to last known good version |
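
As a rough sketch of how these thresholds could be applied automatically, the snippet below classifies the latest run against the previous run and the v1 baseline. The `timeline` layout, evaluator keys, and `classify` helper are illustrative assumptions rather than part of the skill; the scores are taken from the Step 2 example, and nothing here calls `evaluation_get` (it assumes the runs were already retrieved).

```python
# Minimal trend/regression sketch (illustrative only).
# timeline: oldest → newest, (run id, per-evaluator scores) from Step 2.
timeline = [
    ("run-001", {"fluency": 4.1, "relevance": 2.8}),
    ("run-002", {"fluency": 4.3, "relevance": 3.5}),
    ("run-003", {"fluency": 4.4, "relevance": 4.0}),
    ("run-004", {"fluency": 4.5, "relevance": 3.6}),
]

def classify(drop_pct: float, below_baseline: bool) -> str:
    """Map a drop (percent of the previous run's score) to a severity label."""
    if below_baseline:
        return "CRITICAL"      # below the v1 baseline on this metric
    if drop_pct > 5:
        return "REGRESSION"    # block deployment, investigate
    if drop_pct > 2:
        return "WARNING"       # review recent changes
    return "HEALTHY"           # no action needed

baseline = timeline[0][1]                                 # first run (v1) is the baseline
(prev_run, prev), (latest_run, latest) = timeline[-2], timeline[-1]

for evaluator, score in latest.items():
    drop_pct = max(0.0, (prev[evaluator] - score) / prev[evaluator] * 100)
    status = classify(drop_pct, score < baseline[evaluator])
    print(f"{evaluator}: {prev[evaluator]} → {score} "
          f"({prev_run} → {latest_run}, drop {drop_pct:.1f}%) = {status}")
```

Run against the full Step 2 timeline, this reproduces the Relevance regression flagged above while leaving the improving evaluators marked healthy.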

## Next Steps

- **Investigate regression** → [Eval Regression](eval-regression.md)
- **Compare specific versions** → [Dataset Comparison](dataset-comparison.md)
- **Set up automated monitoring** → [observe skill CI/CD](../../observe/references/cicd-monitoring.md) (a minimal gating sketch follows below)
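
For the automated-monitoring path, a hypothetical CI gate could simply exit nonzero whenever any evaluator classifies as REGRESSION or CRITICAL. This assumes `statuses` holds the per-evaluator labels produced by the sketch in the thresholds section; adapt the blocking set to your pipeline's policy.

```python
import sys

# Hypothetical CI gate: statuses maps evaluator → severity label from the trend sketch.
statuses = {"fluency": "HEALTHY", "relevance": "REGRESSION"}

blocked = [name for name, label in statuses.items() if label in {"REGRESSION", "CRITICAL"}]
if blocked:
    print(f"Blocking deployment, regressions detected: {', '.join(blocked)}")
    sys.exit(1)  # nonzero exit fails the CI step
```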