foundry-agent/eval-datasets/references/dataset-comparison.md
# Dataset Comparison — A/B Testing Across Dataset Versions

Run structured experiments that compare how an agent performs across different dataset versions, and present results as leaderboards with per-evaluator breakdowns. Use this to answer: "Did scores drop because of harder tests or agent regression?"

## Experiment Structure

An experiment consists of:

1. **Pinned agent version** — the same agent evaluated on each dataset
2. **Varied dataset versions** — the versions being compared
3. **Same evaluators** — applied consistently across all runs
4. **Comparison results** — which dataset version the agent performs better on

## Step 1 — Define the Experiment

| Parameter | Value | Example |
|-----------|-------|---------|
| Agent | Pinned agent version | `v3` |
| Baseline dataset | Previous dataset version | `support-bot-prod-traces-v2` |
| Treatment dataset(s) | New dataset version(s) | `support-bot-prod-traces-v3` |
| Evaluators | Same set for all runs | coherence, fluency, relevance, intent_resolution, task_adherence |

## Step 2 — Run Evaluations

For each dataset version, run **`evaluation_agent_batch_eval_create`** with:

- Same `evaluationId` (groups all runs for comparison)
- Same `agentVersion`
- Same `evaluatorNames`
- Different `inputData` (from each dataset version)

> **Important:** Use `evaluationId` on `evaluation_agent_batch_eval_create` to group runs. After the runs exist, switch to `evalId` for `evaluation_get` and `evaluation_comparison_create`.

> ⚠️ **Eval-group immutability:** Keep the evaluator set and thresholds fixed within one evaluation group. If you need to change evaluators or thresholds, create a new evaluation group instead of reusing the previous `evaluationId`.

> ⚠️ **Score drops are expected.** When comparing v1→v2 datasets, lower scores on the new dataset likely mean the new test cases are harder (better coverage), not that the agent regressed. **Do NOT remove dataset rows or weaken evaluators to recover scores.** Instead, optimize the agent for the new failure patterns, then re-evaluate.
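As a minimal sketch, the request body for one run against the treatment dataset might look something like the following. Only the four parameter names listed above come from this guide; the overall payload shape, the dataset reference format, and the placeholder values are assumptions and may differ from the actual `evaluation_agent_batch_eval_create` contract:

```json
{
  "evaluationId": "<eval-group-id>",
  "agentVersion": "v3",
  "evaluatorNames": ["coherence", "fluency", "relevance", "intent_resolution", "task_adherence"],
  "inputData": "<reference-to-support-bot-prod-traces-v3>"
}
```

Submitting the same payload with `inputData` pointing at `support-bot-prod-traces-v2` produces the baseline run; keeping every other field identical is what makes the two runs directly comparable.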
## Step 3 — Compare Results

Use **`evaluation_comparison_create`** with the baseline and treatment runs:

```json
{
  "insightRequest": {
    "displayName": "Dataset comparison: traces-v2 vs traces-v3 on agent-v3",
    "state": "NotStarted",
    "request": {
      "type": "EvaluationComparison",
      "evalId": "<eval-group-id>",
      "baselineRunId": "<traces-v2-run-id>",
      "treatmentRunIds": ["<traces-v3-run-id>"]
    }
  }
}
```

> ⚠️ **Common mistake:** `evaluation_comparison_create` uses `insightRequest.request.evalId`, not `evaluationId`, even when the runs were originally grouped with `evaluationId`.

## Step 4 — Leaderboard

Present results as a leaderboard table:

| Evaluator | traces-v2 (baseline) | traces-v3 | Effect |
|-----------|:---:|:---:|:---:|
| Coherence | 4.0 | 3.6 | ⚠️ Lower |
| Fluency | 4.5 | 4.3 | ⚠️ Lower |
| Relevance | 3.6 | 3.2 | ⚠️ Lower |
| Intent Resolution | 4.1 | 3.7 | ⚠️ Lower |
| Task Adherence | 3.9 | 3.4 | ⚠️ Lower |

### Recommendation

If scores drop uniformly across all evaluators, the new dataset is likely harder:

*"Agent v3 scores dropped on traces-v3 across all evaluators. traces-v3 added 15 edge-case queries from production failures. This is expected — optimize the agent for the new failure patterns rather than reverting the dataset."*

## Pairwise A/B Comparison

For detailed pairwise analysis between exactly two dataset versions:

| Evaluator | Baseline (traces-v2) | Treatment (traces-v3) | Delta | p-value | Effect |
|-----------|:---:|:---:|:---:|:---:|:---:|
| Coherence | 4.0 ± 0.6 | 3.6 ± 0.9 | −0.4 | 0.03 | Degraded |
| Fluency | 4.5 ± 0.4 | 4.3 ± 0.5 | −0.2 | 0.12 | Inconclusive |
| Relevance | 3.6 ± 0.9 | 3.2 ± 1.1 | −0.4 | 0.04 | Degraded |

> 💡 **Tip:** The `evaluation_comparison_create` result includes `pValue` and `treatmentEffect` fields. Use `pValue < 0.05` as the threshold for statistical significance.

## Multi-Dataset Comparison

Compare how the same agent version performs across different datasets:

| Dataset | Coherence | Fluency | Relevance | Notes |
|---------|:---------:|:-------:|:---------:|-------|
| traces-v3 (prod) | 4.0 | 4.5 | 3.6 | Production-derived |
| synthetic-v2 | 4.3 | 4.6 | 4.1 | May overestimate quality |
| manual-v1 (curated) | 3.8 | 4.4 | 3.2 | Hardest test cases |

> ⚠️ **Warning:** Be cautious comparing scores across datasets with different structures (e.g., production traces vs synthetic). Differences may reflect dataset difficulty, not agent quality.

## Next Steps

- **Track trends over time** → [Eval Trending](eval-trending.md)
- **Check for regressions** → [Eval Regression](eval-regression.md)
- **Audit full lineage** → [Eval Lineage](eval-lineage.md)