# Eval Trending — Metrics Over Time

Track evaluation metrics across multiple runs and versions to visualize improvement trends and detect regressions. This fills a common gap: knowing how agent quality changes over time.

## Prerequisites

- At least 2 evaluation runs in the same evaluation group (same `evaluationId` when created)
- Project endpoint and environment available in the selected `.foundry/agent-metadata*.yaml` file

> ⚠️ **Eval-group immutability:** Trend a group only when its evaluator set and thresholds stayed fixed across runs. If either changed, start a new evaluation group and track that history separately.

## Step 1 — Retrieve Evaluation History

Use **`evaluation_get`** to list all evaluation groups:

| Parameter | Required | Description |
|-----------|----------|-------------|
| `projectEndpoint` | ✅ | Azure AI Project endpoint |
| `isRequestForRuns` | | `false` (default) to list evaluation groups |

Then retrieve all runs within the target evaluation group:

| Parameter | Required | Description |
|-----------|----------|-------------|
| `projectEndpoint` | ✅ | Azure AI Project endpoint |
| `evalId` | ✅ | Evaluation group ID |
| `isRequestForRuns` | ✅ | `true` to list runs |

> ⚠️ **Parameter guardrail:** `evaluation_get` expects `evalId`, not `evaluationId`, even if the runs were grouped earlier with `evaluationId`.
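As a concrete reference, here is a minimal Python sketch of the two calls. The `call_tool(name, args)` helper and the endpoint placeholder are assumptions for illustration; how you actually invoke the tool depends on your client. Only the parameter names come from the tables above.

```python
# Minimal sketch of the Step 1 calls. `call_tool(name, args)` is a
# hypothetical helper standing in for however your client invokes skill
# tools; only the parameter names below come from the tables above.

PROJECT_ENDPOINT = "https://<resource>.services.ai.azure.com/api/projects/<project>"  # illustrative placeholder

def list_evaluation_groups(call_tool):
    # isRequestForRuns defaults to false, so this returns evaluation groups.
    return call_tool("evaluation_get", {"projectEndpoint": PROJECT_ENDPOINT})

def list_runs_in_group(call_tool, eval_id: str):
    # Per the guardrail above: the parameter is `evalId`, not `evaluationId`.
    return call_tool("evaluation_get", {
        "projectEndpoint": PROJECT_ENDPOINT,
        "evalId": eval_id,
        "isRequestForRuns": True,
    })
```

Call `list_evaluation_groups` first to find the target group's ID, then pass that ID to `list_runs_in_group` to get the run history used for trending.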
## Step 2 — Build Metrics Timeline

For each run, extract per-evaluator scores and build a timeline:

| Run | Agent Version | Date | Coherence | Fluency | Relevance | Intent Resolution | Task Adherence | Safety |
|-----|---------------|------|-----------|---------|-----------|-------------------|----------------|--------|
| run-001 | v1 | 2025-01-15 | 3.2 | 4.1 | 2.8 | 3.0 | 2.5 | 0.95 |
| run-002 | v2 | 2025-01-22 | 3.8 | 4.3 | 3.5 | 3.7 | 3.2 | 0.97 |
| run-003 | v3 | 2025-02-01 | 4.1 | 4.4 | 4.0 | 4.2 | 3.8 | 0.96 |
| run-004 | v4 | 2025-02-08 | 4.0 | 4.5 | 3.6 | 4.1 | 3.9 | 0.98 |

## Step 3 — Trend Analysis

Calculate trends for each evaluator:

| Evaluator | v1 → v4 Change | Trend | Status |
|-----------|----------------|-------|--------|
| Coherence | +0.8 (+25%) | ↑ Improving | ✅ |
| Fluency | +0.4 (+10%) | ↑ Improving | ✅ |
| Relevance | +0.8 (+29%) | ↑ Improving (dip at v4) | ⚠️ |
| Intent Resolution | +1.1 (+37%) | ↑ Improving | ✅ |
| Task Adherence | +1.4 (+56%) | ↑ Improving | ✅ |
| Safety | +0.03 (+3%) | → Stable | ✅ |

### Detecting Regressions

Flag any evaluator where the latest run scored **lower** than the previous run:

| Evaluator | Previous (v3) | Latest (v4) | Delta | Alert |
|-----------|---------------|-------------|-------|-------|
| Relevance | 4.0 | 3.6 | -0.4 (-10%) | ⚠️ **REGRESSION** |

> ⚠️ **Regression detected:** Relevance dropped 10% from v3 to v4. Investigate prompt changes or dataset drift. See [Eval Regression](eval-regression.md) for automated analysis.
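To make the regression check concrete, the sketch below assumes each run has already been reduced to a `{evaluator: mean score}` mapping (extracting those means from raw run results depends on your result schema) and applies the severity bands from the Recommended Thresholds section further down.

```python
# Sketch: flag regressions between the two most recent runs (Step 3).
# Assumes each run is already reduced to {evaluator: mean score}; how you
# extract those means from raw run results depends on your result schema.

from typing import Dict, List

# Illustrative data, taken from the v3/v4 rows of the timeline above.
timeline: List[Dict] = [
    {"run": "run-003", "version": "v3",
     "scores": {"Coherence": 4.1, "Relevance": 4.0, "Task Adherence": 3.8}},
    {"run": "run-004", "version": "v4",
     "scores": {"Coherence": 4.0, "Relevance": 3.6, "Task Adherence": 3.9}},
]

def classify(delta_pct: float) -> str:
    """Map a score change (percent) to a severity band."""
    if delta_pct >= -2.0:
        return "healthy"      # <= 2% drop: no action needed
    if delta_pct >= -5.0:
        return "warning"      # 2-5% drop: review recent changes
    return "regression"       # > 5% drop: block deployment, investigate

previous, latest = timeline[-2], timeline[-1]
for evaluator, prev in previous["scores"].items():
    curr = latest["scores"].get(evaluator)
    if curr is None:
        continue  # evaluator not present in the latest run
    delta_pct = (curr - prev) / prev * 100
    print(f"{evaluator}: {prev} -> {curr} ({delta_pct:+.1f}%) [{classify(delta_pct)}]")
```

With the v3/v4 numbers above, this prints Relevance as a regression (-10.0%), Coherence as a warning (-2.4%), and Task Adherence as healthy (+2.6%), matching the alert table.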
### Trend Visualization (Text-based)

```
Coherence    ████████████████████████████████░░░░░░░░  4.0/5.0  ↑ +25%
Fluency      ████████████████████████████████████░░░░  4.5/5.0  ↑ +10%
Relevance    █████████████████████████████░░░░░░░░░░░  3.6/5.0  ↑ +29% ⚠️ dip at v4
Intent Res.  █████████████████████████████████░░░░░░░  4.1/5.0  ↑ +37%
Task Adh.    ███████████████████████████████░░░░░░░░░  3.9/5.0  ↑ +56%
Safety       ███████████████████████████████████████░  0.98     → Stable
```

## Step 4 — Cross-Version Summary

Present an executive summary:

*"Over 4 agent versions (v1→v4), your agent has improved significantly across all quality metrics. The biggest gain is Task Adherence (+56%). However, Relevance showed a 10% regression from v3 to v4 — recommend investigating recent prompt changes. Safety remains stable at 98%."*

## Recommended Thresholds

| Severity | Threshold | Action |
|----------|-----------|--------|
| ✅ Healthy | ≤ 2% drop from previous run | No action needed |
| ⚠️ Warning | 2–5% drop from previous run | Review recent changes |
| 🔴 Regression | > 5% drop from previous run | Block deployment, investigate |
| 🔴 Critical | Below baseline (v1) on any metric | Roll back to last known good version |

## Next Steps

- **Investigate regression** → [Eval Regression](eval-regression.md)
- **Compare specific versions** → [Dataset Comparison](dataset-comparison.md)
- **Set up automated monitoring** → [observe skill CI/CD](../../observe/references/cicd-monitoring.md)