foundry-agent/eval-datasets/references/dataset-comparison.md
# Dataset Comparison — A/B Testing Across Dataset Versions

Run structured experiments that compare how an agent performs across different dataset versions, and present results as leaderboards with per-evaluator breakdowns. Use this to answer: "Did scores drop because of harder tests or agent regression?"

## Experiment Structure

An experiment consists of:

1. **Pinned agent version** — the same agent evaluated on each dataset
2. **Varied dataset versions** — the versions being compared
3. **Same evaluators** — applied consistently across all runs
4. **Comparison results** — which dataset version the agent performs better on

## Step 1 — Define the Experiment

| Parameter | Value | Example |
|-----------|-------|---------|
| Agent | Pinned agent version | `v3` |
| Baseline dataset | Previous dataset version | `support-bot-prod-traces-v2` |
| Treatment dataset(s) | New dataset version(s) | `support-bot-prod-traces-v3` |
| Evaluators | Same set for all runs | coherence, fluency, relevance, intent_resolution, task_adherence |

## Step 2 — Run Evaluations

For each dataset version, run **`evaluation_agent_batch_eval_create`** with:

- Same `evaluationId` (groups all runs for comparison)
- Same `agentVersion`
- Same `evaluatorNames`
- Different `inputData` (from each dataset version)

> **Important:** Use `evaluationId` on `evaluation_agent_batch_eval_create` to group runs. After the runs exist, switch to `evalId` for `evaluation_get` and `evaluation_comparison_create`.

> ⚠️ **Eval-group immutability:** Keep the evaluator set and thresholds fixed within one evaluation group. If you need to change evaluators or thresholds, create a new evaluation group instead of reusing the previous `evaluationId`.

> ⚠️ **Score drops are expected.** When comparing v1→v2 datasets, lower scores on the new dataset likely mean the new test cases are harder (better coverage), not that the agent regressed. **Do NOT remove dataset rows or weaken evaluators to recover scores.** Instead, optimize the agent for the new failure patterns, then re-evaluate.
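As a minimal sketch, the request body for one run against the treatment dataset might look something like the following. Only the four parameter names listed above come from this guide; the overall payload shape, the dataset reference format, and the placeholder values are assumptions and may differ from the actual `evaluation_agent_batch_eval_create` contract:

```json
{
  "evaluationId": "<eval-group-id>",
  "agentVersion": "v3",
  "evaluatorNames": ["coherence", "fluency", "relevance", "intent_resolution", "task_adherence"],
  "inputData": "<reference-to-support-bot-prod-traces-v3>"
}
```

Submitting the same payload with `inputData` pointing at `support-bot-prod-traces-v2` produces the baseline run; keeping every other field identical is what makes the two runs directly comparable.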
## Step 3 — Compare Results

Use **`evaluation_comparison_create`** with the baseline and treatment runs:

```json
{
  "insightRequest": {
    "displayName": "Dataset comparison: traces-v2 vs traces-v3 on agent-v3",
    "state": "NotStarted",
    "request": {
      "type": "EvaluationComparison",
      "evalId": "<eval-group-id>",
      "baselineRunId": "<traces-v2-run-id>",
      "treatmentRunIds": ["<traces-v3-run-id>"]
    }
  }
}
```

> ⚠️ **Common mistake:** `evaluation_comparison_create` uses `insightRequest.request.evalId`, not `evaluationId`, even when the runs were originally grouped with `evaluationId`.

## Step 4 — Leaderboard

Present results as a leaderboard table:

| Evaluator | traces-v2 (baseline) | traces-v3 | Effect |
|-----------|:---:|:---:|:---:|
| Coherence | 4.0 | 3.6 | ⚠️ Lower |
| Fluency | 4.5 | 4.3 | ⚠️ Lower |
| Relevance | 3.6 | 3.2 | ⚠️ Lower |
| Intent Resolution | 4.1 | 3.7 | ⚠️ Lower |
| Task Adherence | 3.9 | 3.4 | ⚠️ Lower |

### Recommendation

If scores drop uniformly across all evaluators, the new dataset is likely harder:

*"Agent v3 scores dropped on traces-v3 across all evaluators. traces-v3 added 15 edge-case queries from production failures. This is expected — optimize the agent for the new failure patterns rather than reverting the dataset."*

## Pairwise A/B Comparison

For detailed pairwise analysis between exactly two dataset versions:

| Evaluator | Baseline (traces-v2) | Treatment (traces-v3) | Delta | p-value | Effect |
|-----------|:---:|:---:|:---:|:---:|:---:|
| Coherence | 4.0 ± 0.6 | 3.6 ± 0.9 | −0.4 | 0.03 | Degraded |
| Fluency | 4.5 ± 0.4 | 4.3 ± 0.5 | −0.2 | 0.12 | Inconclusive |
| Relevance | 3.6 ± 0.9 | 3.2 ± 1.1 | −0.4 | 0.04 | Degraded |

> 💡 **Tip:** The `evaluation_comparison_create` result includes `pValue` and `treatmentEffect` fields. Use `pValue < 0.05` as the threshold for statistical significance.

## Multi-Dataset Comparison

Compare how the same agent version performs across different datasets:

| Dataset | Coherence | Fluency | Relevance | Notes |
|---------|:---------:|:-------:|:---------:|-------|
| traces-v3 (prod) | 4.0 | 4.5 | 3.6 | Production-derived |
| synthetic-v2 | 4.3 | 4.6 | 4.1 | May overestimate quality |
| manual-v1 (curated) | 3.8 | 4.4 | 3.2 | Hardest test cases |

> ⚠️ **Warning:** Be cautious comparing scores across datasets with different structures (e.g., production traces vs synthetic). Differences may reflect dataset difficulty, not agent quality.

## Next Steps

- **Track trends over time** → [Eval Trending](eval-trending.md)
- **Check for regressions** → [Eval Regression](eval-regression.md)
- **Audit full lineage** → [Eval Lineage](eval-lineage.md)