foundry-agent/eval-datasets/references/dataset-comparison.md
# Dataset Comparison — A/B Testing Across Dataset Versions

Run structured experiments that compare how an agent performs across different dataset versions, and present results as leaderboards with per-evaluator breakdowns. Use this to answer: "Did scores drop because of harder tests or agent regression?"

## Experiment Structure

An experiment consists of:

1. **Pinned agent version** — the same agent evaluated on each dataset
2. **Varied dataset versions** — the versions being compared
3. **Same evaluators** — applied consistently across all runs
4. **Comparison results** — which dataset version the agent performs better on

## Step 1 — Define the Experiment

| Parameter | Value | Example |
|-----------|-------|---------|
| Agent | Pinned agent version | `v3` |
| Baseline dataset | Previous dataset version | `support-bot-prod-traces-v2` |
| Treatment dataset(s) | New dataset version(s) | `support-bot-prod-traces-v3` |
| Evaluators | Same set for all runs | coherence, fluency, relevance, intent_resolution, task_adherence |

## Step 2 — Run Evaluations

For each dataset version, run **`evaluation_agent_batch_eval_create`** with the following (a request sketch appears after Step 3):

- Same `evaluationId` (groups all runs for comparison)
- Same `agentVersion`
- Same `evaluatorNames`
- Different `inputData` (from each dataset version)

> **Important:** Use `evaluationId` on `evaluation_agent_batch_eval_create` to group runs. After the runs exist, switch to `evalId` for `evaluation_get` and `evaluation_comparison_create`.

> ⚠️ **Eval-group immutability:** Keep the evaluator set and thresholds fixed within one evaluation group. If you need to change evaluators or thresholds, create a new evaluation group instead of reusing the previous `evaluationId`.

> ⚠️ **Score drops are expected.** When moving from one dataset version to the next, lower scores on the new dataset likely mean the new test cases are harder (better coverage), not that the agent regressed. **Do NOT remove dataset rows or weaken evaluators to recover scores.** Instead, optimize the agent for the new failure patterns, then re-evaluate.

## Step 3 — Compare Results

Use **`evaluation_comparison_create`** with the baseline and treatment runs:

```json
{
  "insightRequest": {
    "displayName": "Dataset comparison: traces-v2 vs traces-v3 on agent-v3",
    "state": "NotStarted",
    "request": {
      "type": "EvaluationComparison",
      "evalId": "<eval-group-id>",
      "baselineRunId": "<traces-v2-run-id>",
      "treatmentRunIds": ["<traces-v3-run-id>"]
    }
  }
}
```

> ⚠️ **Common mistake:** `evaluation_comparison_create` uses `insightRequest.request.evalId`, not `evaluationId`, even when the runs were originally grouped with `evaluationId`.
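
Tying Steps 2 and 3 together: the baseline and treatment runs referenced in the comparison request above would each have been created with a batch-eval request along the lines sketched below. The four parameter names come from Step 2; the flat request shape and the dataset-reference value are assumptions, so adapt them to the actual `evaluation_agent_batch_eval_create` schema.

```json
{
  "evaluationId": "<eval-group-id>",
  "agentVersion": "v3",
  "evaluatorNames": ["coherence", "fluency", "relevance", "intent_resolution", "task_adherence"],
  "inputData": "<support-bot-prod-traces-v2-dataset-id>"
}
```

The treatment run uses an otherwise identical request whose `inputData` points at `support-bot-prod-traces-v3`. Keeping `evaluationId`, `agentVersion`, and `evaluatorNames` fixed is what makes the runs comparable, and the shared `evaluationId` is what later serves as the `evalId` in the comparison request.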

## Step 4 — Leaderboard

Present results as a leaderboard table:

| Evaluator | traces-v2 (baseline) | traces-v3 | Effect |
|-----------|:---:|:---:|:---:|
| Coherence | 4.0 | 3.6 | ⚠️ Lower |
| Fluency | 4.5 | 4.3 | ⚠️ Lower |
| Relevance | 3.6 | 3.2 | ⚠️ Lower |
| Intent Resolution | 4.1 | 3.7 | ⚠️ Lower |
| Task Adherence | 3.9 | 3.4 | ⚠️ Lower |

### Recommendation

If scores drop uniformly across all evaluators, the new dataset is likely harder:

*"Agent v3 scores dropped on traces-v3 across all evaluators. traces-v3 added 15 edge-case queries from production failures. This is expected — optimize the agent for the new failure patterns rather than reverting the dataset."*

## Pairwise A/B Comparison

For detailed pairwise analysis between exactly two dataset versions:

| Evaluator | Baseline (traces-v2) | Treatment (traces-v3) | Delta | p-value | Effect |
|-----------|:---:|:---:|:---:|:---:|:---:|
| Coherence | 4.0 ± 0.6 | 3.6 ± 0.9 | −0.4 | 0.03 | Degraded |
| Fluency | 4.5 ± 0.4 | 4.3 ± 0.5 | −0.2 | 0.12 | Inconclusive |
| Relevance | 3.6 ± 0.9 | 3.2 ± 1.1 | −0.4 | 0.04 | Degraded |

> 💡 **Tip:** The `evaluation_comparison_create` result includes `pValue` and `treatmentEffect` fields. Use `pValue < 0.05` as the threshold for statistical significance.

## Multi-Dataset Comparison

Compare how the same agent version performs across different datasets:

| Dataset | Coherence | Fluency | Relevance | Notes |
|---------|:---------:|:-------:|:---------:|-------|
| traces-v3 (prod) | 4.0 | 4.5 | 3.6 | Production-derived |
| synthetic-v2 | 4.3 | 4.6 | 4.1 | May overestimate quality |
| manual-v1 (curated) | 3.8 | 4.4 | 3.2 | Hardest test cases |

> ⚠️ **Warning:** Be cautious comparing scores across datasets with different structures (e.g., production traces vs synthetic). Differences may reflect dataset difficulty, not agent quality.

## Next Steps

- **Track trends over time** → [Eval Trending](eval-trending.md)
- **Check for regressions** → [Eval Regression](eval-regression.md)
- **Audit full lineage** → [Eval Lineage](eval-lineage.md)
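
As a closing reference for the tip in the Pairwise A/B Comparison section, here is an illustrative fragment of a per-evaluator entry in an `evaluation_comparison_create` result. Only the `pValue` and `treatmentEffect` field names come from this guide; the other field name and the overall shape are assumptions, included purely to connect the output to the pairwise table above.

```json
{
  "evaluator": "coherence",
  "treatmentEffect": "Degraded",
  "pValue": 0.03
}
```

With `pValue` at 0.03, below the 0.05 threshold, the `Degraded` effect is treated as statistically significant, matching the Coherence row in the pairwise table.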