Source from repo
Deploy, evaluate, and manage AI agents end-to-end on Microsoft Azure AI Foundry
foundry-agent/eval-datasets/references/dataset-curation.md
# Dataset Curation — Human-in-the-Loop Review

Review, annotate, and approve harvested trace candidates before including them in evaluation datasets. This ensures dataset quality by adding a human review gate between raw trace extraction and finalized test cases.

## Workflow Overview

```
Raw Traces (from KQL harvest)
 │
 ▼
[1] Candidate File (unreviewed)
 │
 ▼
[2] Human Review (approve/edit/reject each)
 │
 ▼
[3] Approved Dataset (versioned, ready for eval)
```

## Step 1 — Generate Candidate File

After running a [trace harvest](trace-to-dataset.md), save candidates with a `status` field:

```
.foundry/datasets/<agent-name>-traces-candidates-<date>.jsonl
```

Each line includes a review status:

```json
{"query": "How do I reset my password?", "response": "...", "status": "pending", "metadata": {"source": "trace", "conversationId": "conv-abc-123", "harvestRule": "error", "errorType": "TimeoutError", "duration": 12300}}
{"query": "What's the refund policy?", "response": "...", "status": "pending", "metadata": {"source": "trace", "conversationId": "conv-def-456", "harvestRule": "latency", "duration": 8700}}
```

## Step 2 — Present for Review

Show candidates in a review table:

| # | Status | Query (preview) | Source | Error | Duration | Eval Score |
|---|--------|-----------------|--------|-------|----------|------------|
| 1 | ⏳ pending | "How do I reset my..." | error harvest | TimeoutError | 12.3s | — |
| 2 | ⏳ pending | "What's the refund..." | latency harvest | — | 8.7s | — |
| 3 | ⏳ pending | "Can you help me..." | low-eval harvest | — | 0.4s | 2.0 |

### Review Actions

For each candidate, the user can:

| Action | Result |
|--------|--------|
| **Approve** | Include in dataset as-is |
| **Approve + Edit** | Include with modified query/response/ground_truth |
| **Add Ground Truth** | Approve and add the expected correct answer |
| **Reject** | Exclude from dataset |
| **Flag** | Mark for later review |

### Batch Operations

- *"Approve all"* — include all pending candidates
- *"Approve all errors"* — include all candidates from the error harvest
- *"Reject duplicates"* — exclude candidates whose queries closely match existing dataset entries
- *"Approve #1, #3, #5; reject #2, #4"* — selective approval by number

## Step 3 — Finalize Dataset

After review, filter approved candidates and save them to a versioned dataset:

1. Read `.foundry/datasets/manifest.json` to find the latest version number
2. Filter candidates where `status == "approved"`
3. Remove the `status` field from the output
4. Save to `.foundry/datasets/<agent-name>-<source>-v<N>.jsonl`
5. Update `.foundry/datasets/manifest.json` with metadata

### Update Candidate Status

Mark the candidate file with final statuses:

```json
{"query": "How do I reset my password?", "status": "approved", "ground_truth": "Navigate to Settings > Security > Reset Password", "metadata": {...}}
{"query": "What's the refund policy?", "status": "rejected", "rejectReason": "duplicate of existing test case", "metadata": {...}}
{"query": "Can you help me...", "status": "approved", "metadata": {...}}
```

> 💡 **Tip:** Keep candidate files as an audit trail.
> They document what was reviewed, when, and why items were accepted or rejected.

## Quality Checks

Before finalizing, verify dataset quality:

| Check | Criteria |
|-------|----------|
| **No duplicates** | Ensure no query appears in both the new dataset and existing datasets |
| **Balanced categories** | Verify a reasonable distribution across categories (not all edge cases) |
| **Ground truth coverage** | Flag examples without `ground_truth` that may benefit from one |
| **Minimum size** | Warn if the dataset has fewer than 20 examples (may not be statistically meaningful) |
| **Safety coverage** | Ensure safety-related test cases are included if the agent handles sensitive topics |

## Next Steps

- **Version the approved dataset** → [Dataset Versioning](dataset-versioning.md)
- **Organize into splits** → [Dataset Organization](dataset-organization.md)
- **Run evaluation** → [observe skill Step 2](../../observe/references/evaluate-step.md)
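The five finalize steps in Step 3 can be sketched as a small script. This is a minimal illustration, not part of the skill: the `finalize_dataset` helper and the manifest layout it assumes (a top-level `datasets` list whose entries carry `agent`, `source`, and `version` fields) are assumptions for the example, not a documented schema.

```python
import json
from pathlib import Path

def finalize_dataset(candidates_path, manifest_path, agent, source):
    """Filter approved candidates into the next versioned dataset file.

    Assumes a manifest shaped like {"datasets": [{"agent": ..., "source": ...,
    "version": N, ...}, ...]} — adjust to the real manifest schema.
    """
    manifest = json.loads(Path(manifest_path).read_text(encoding="utf-8"))

    # Step 1: find the latest version for this agent/source (0 if none yet).
    versions = [d["version"] for d in manifest.get("datasets", [])
                if d.get("agent") == agent and d.get("source") == source]
    next_version = max(versions, default=0) + 1

    # Steps 2-3: keep approved records, dropping the review-only status field.
    approved = []
    with open(candidates_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            record = json.loads(line)
            if record.get("status") == "approved":
                record.pop("status", None)
                approved.append(record)

    # Step 4: write the versioned JSONL next to the candidate file.
    out_path = Path(candidates_path).parent / f"{agent}-{source}-v{next_version}.jsonl"
    with open(out_path, "w", encoding="utf-8") as f:
        for record in approved:
            f.write(json.dumps(record) + "\n")

    # Step 5: record the new dataset in the manifest.
    manifest.setdefault("datasets", []).append(
        {"agent": agent, "source": source, "version": next_version,
         "file": out_path.name, "count": len(approved)})
    Path(manifest_path).write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return out_path, len(approved)
```

Writing the approved records to a new file rather than rewriting the candidate file preserves the candidate file as the audit trail described above.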
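The "No duplicates" quality check (and the *"Reject duplicates"* batch operation) can be approximated by comparing normalized query text against existing datasets. A minimal sketch, in which the `normalize` and `find_duplicates` helpers are illustrative assumptions; a production check might use embedding similarity instead of exact normalized matching:

```python
import re

def normalize(query: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace for comparison."""
    q = re.sub(r"[^a-z0-9\s]", "", query.lower())
    return re.sub(r"\s+", " ", q).strip()

def find_duplicates(candidates, existing_queries):
    """Return indices of candidate records whose query already exists,
    either in existing datasets or earlier in the candidate batch."""
    seen = {normalize(q) for q in existing_queries}
    dupes = []
    for i, record in enumerate(candidates):
        key = normalize(record["query"])
        if key in seen:
            dupes.append(i)
        else:
            seen.add(key)
    return dupes
```

Flagged indices can then be surfaced in the review table rather than auto-rejected, keeping the final decision with the human reviewer.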