foundry-agent/eval-datasets/references/dataset-comparison.md
# Dataset Comparison — A/B Testing Across Dataset Versions

Run structured experiments that compare how an agent performs across different dataset versions, and present results as leaderboards with per-evaluator breakdowns. Use this to answer: "Did scores drop because of harder tests or agent regression?"

## Experiment Structure

An experiment consists of:

1. **Pinned agent version** — the same agent evaluated on each dataset
2. **Varied dataset versions** — the versions being compared
3. **Same evaluators** — applied consistently across all runs
4. **Comparison results** — which dataset version the agent performs better on

## Step 1 — Define the Experiment

| Parameter | Value | Example |
|-----------|-------|---------|
| Agent | Pinned agent version | `v3` |
| Baseline dataset | Previous dataset version | `support-bot-prod-traces-v2` |
| Treatment dataset(s) | New dataset version(s) | `support-bot-prod-traces-v3` |
| Evaluators | Same set for all runs | coherence, fluency, relevance, intent_resolution, task_adherence |

## Step 2 — Run Evaluations

For each dataset version, run **`evaluation_agent_batch_eval_create`** with the following (a request sketch appears after Step 3):

- Same `evaluationId` (groups all runs for comparison)
- Same `agentVersion`
- Same `evaluatorNames`
- Different `inputData` (from each dataset version)

> **Important:** Use `evaluationId` on `evaluation_agent_batch_eval_create` to group runs. After the runs exist, switch to `evalId` for `evaluation_get` and `evaluation_comparison_create`.

> ⚠️ **Eval-group immutability:** Keep the evaluator set and thresholds fixed within one evaluation group. If you need to change evaluators or thresholds, create a new evaluation group instead of reusing the previous `evaluationId`.

> ⚠️ **Score drops are expected.** When moving from one dataset version to the next, lower scores on the new dataset likely mean the new test cases are harder (better coverage), not that the agent regressed. **Do NOT remove dataset rows or weaken evaluators to recover scores.** Instead, optimize the agent for the new failure patterns, then re-evaluate.

## Step 3 — Compare Results

Use **`evaluation_comparison_create`** with the baseline and treatment runs:

```json
{
  "insightRequest": {
    "displayName": "Dataset comparison: traces-v2 vs traces-v3 on agent-v3",
    "state": "NotStarted",
    "request": {
      "type": "EvaluationComparison",
      "evalId": "<eval-group-id>",
      "baselineRunId": "<traces-v2-run-id>",
      "treatmentRunIds": ["<traces-v3-run-id>"]
    }
  }
}
```

> ⚠️ **Common mistake:** `evaluation_comparison_create` uses `insightRequest.request.evalId`, not `evaluationId`, even when the runs were originally grouped with `evaluationId`.
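
Tying Steps 2 and 3 together: the baseline and treatment runs referenced in the comparison request above would each have been created with a batch-eval request along the lines sketched below. The four parameter names come from Step 2; the flat request shape and the dataset-reference value are assumptions, so adapt them to the actual `evaluation_agent_batch_eval_create` schema.

```json
{
  "evaluationId": "<eval-group-id>",
  "agentVersion": "v3",
  "evaluatorNames": ["coherence", "fluency", "relevance", "intent_resolution", "task_adherence"],
  "inputData": "<support-bot-prod-traces-v2-dataset-id>"
}
```

The treatment run uses an otherwise identical request whose `inputData` points at `support-bot-prod-traces-v3`. Keeping `evaluationId`, `agentVersion`, and `evaluatorNames` fixed is what makes the runs comparable, and the shared `evaluationId` is what later serves as the `evalId` in the comparison request.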

## Step 4 — Leaderboard

Present results as a leaderboard table:

| Evaluator | traces-v2 (baseline) | traces-v3 | Effect |
|-----------|:---:|:---:|:---:|
| Coherence | 4.0 | 3.6 | ⚠️ Lower |
| Fluency | 4.5 | 4.3 | ⚠️ Lower |
| Relevance | 3.6 | 3.2 | ⚠️ Lower |
| Intent Resolution | 4.1 | 3.7 | ⚠️ Lower |
| Task Adherence | 3.9 | 3.4 | ⚠️ Lower |

### Recommendation

If scores drop uniformly across all evaluators, the new dataset is likely harder:

*"Agent v3 scores dropped on traces-v3 across all evaluators. traces-v3 added 15 edge-case queries from production failures. This is expected — optimize the agent for the new failure patterns rather than reverting the dataset."*

## Pairwise A/B Comparison

For detailed pairwise analysis between exactly two dataset versions:

| Evaluator | Baseline (traces-v2) | Treatment (traces-v3) | Delta | p-value | Effect |
|-----------|:---:|:---:|:---:|:---:|:---:|
| Coherence | 4.0 ± 0.6 | 3.6 ± 0.9 | −0.4 | 0.03 | Degraded |
| Fluency | 4.5 ± 0.4 | 4.3 ± 0.5 | −0.2 | 0.12 | Inconclusive |
| Relevance | 3.6 ± 0.9 | 3.2 ± 1.1 | −0.4 | 0.04 | Degraded |

> 💡 **Tip:** The `evaluation_comparison_create` result includes `pValue` and `treatmentEffect` fields. Use `pValue < 0.05` as the threshold for statistical significance.

## Multi-Dataset Comparison

Compare how the same agent version performs across different datasets:

| Dataset | Coherence | Fluency | Relevance | Notes |
|---------|:---------:|:-------:|:---------:|-------|
| traces-v3 (prod) | 4.0 | 4.5 | 3.6 | Production-derived |
| synthetic-v2 | 4.3 | 4.6 | 4.1 | May overestimate quality |
| manual-v1 (curated) | 3.8 | 4.4 | 3.2 | Hardest test cases |

> ⚠️ **Warning:** Be cautious comparing scores across datasets with different structures (e.g., production traces vs synthetic). Differences may reflect dataset difficulty, not agent quality.

## Next Steps

- **Track trends over time** → [Eval Trending](eval-trending.md)
- **Check for regressions** → [Eval Regression](eval-regression.md)
- **Audit full lineage** → [Eval Lineage](eval-lineage.md)
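
As a closing reference for the tip in the Pairwise A/B Comparison section, here is an illustrative fragment of a per-evaluator entry in an `evaluation_comparison_create` result. Only the `pValue` and `treatmentEffect` field names come from this guide; the other field name and the overall shape are assumptions, included purely to connect the output to the pairwise table above.

```json
{
  "evaluator": "coherence",
  "treatmentEffect": "Degraded",
  "pValue": 0.03
}
```

With `pValue` at 0.03, below the 0.05 threshold, the `Degraded` effect is treated as statistically significant, matching the Coherence row in the pairwise table.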