Source from repo
Deploy, evaluate, and manage AI agents end-to-end on Microsoft Azure AI Foundry
foundry-agent/eval-datasets/references/dataset-curation.md
# Dataset Curation — Human-in-the-Loop Review

Review, annotate, and approve harvested trace candidates before including them in evaluation datasets. This ensures dataset quality by adding a human review gate between raw trace extraction and finalized test cases.

## Workflow Overview

```
Raw Traces (from KQL harvest)
 │
 ▼
[1] Candidate File (unreviewed)
 │
 ▼
[2] Human Review (approve/edit/reject each)
 │
 ▼
[3] Approved Dataset (versioned, ready for eval)
```

## Step 1 — Generate Candidate File

After running a [trace harvest](trace-to-dataset.md), save candidates with a `status` field:

```
.foundry/datasets/<agent-name>-traces-candidates-<date>.jsonl
```

Each line includes a review status:

```json
{"query": "How do I reset my password?", "response": "...", "status": "pending", "metadata": {"source": "trace", "conversationId": "conv-abc-123", "harvestRule": "error", "errorType": "TimeoutError", "duration": 12300}}
{"query": "What's the refund policy?", "response": "...", "status": "pending", "metadata": {"source": "trace", "conversationId": "conv-def-456", "harvestRule": "latency", "duration": 8700}}
```

## Step 2 — Present for Review

Show candidates in a review table:

| # | Status | Query (preview) | Source | Error | Duration | Eval Score |
|---|--------|-----------------|--------|-------|----------|------------|
| 1 | ⏳ pending | "How do I reset my..." | error harvest | TimeoutError | 12.3s | — |
| 2 | ⏳ pending | "What's the refund..." | latency harvest | — | 8.7s | — |
| 3 | ⏳ pending | "Can you help me..." | low-eval harvest | — | 0.4s | 2.0 |

### Review Actions

For each candidate, the user can:

| Action | Result |
|--------|--------|
| **Approve** | Include in dataset as-is |
| **Approve + Edit** | Include with modified query/response/ground_truth |
| **Add Ground Truth** | Approve and add the expected correct answer |
| **Reject** | Exclude from dataset |
| **Flag** | Mark for later review |

### Batch Operations

- *"Approve all"* — include all pending candidates
- *"Approve all errors"* — include all candidates from the error harvest
- *"Reject duplicates"* — exclude candidates whose queries closely match existing dataset entries
- *"Approve #1, #3, #5; reject #2, #4"* — selective approval by number

## Step 3 — Finalize Dataset

After review, filter approved candidates and save them to a versioned dataset:

1. Read `.foundry/datasets/manifest.json` to find the latest version number
2. Filter candidates where `status == "approved"`
3. Remove the `status` field from the output
4. Save to `.foundry/datasets/<agent-name>-<source>-v<N>.jsonl`
5. Update `.foundry/datasets/manifest.json` with metadata

### Update Candidate Status

Mark the candidate file with final statuses:

```json
{"query": "How do I reset my password?", "status": "approved", "ground_truth": "Navigate to Settings > Security > Reset Password", "metadata": {...}}
{"query": "What's the refund policy?", "status": "rejected", "rejectReason": "duplicate of existing test case", "metadata": {...}}
{"query": "Can you help me...", "status": "approved", "metadata": {...}}
```

> 💡 **Tip:** Keep candidate files as an audit trail.
> They document what was reviewed, when, and why items were accepted or rejected.

## Quality Checks

Before finalizing, verify dataset quality:

| Check | Criteria |
|-------|----------|
| **No duplicates** | Ensure no query appears in both the new dataset and existing datasets |
| **Balanced categories** | Verify a reasonable distribution across categories (not all edge cases) |
| **Ground truth coverage** | Flag examples without `ground_truth` that may benefit from one |
| **Minimum size** | Warn if the dataset has fewer than 20 examples (may not be statistically meaningful) |
| **Safety coverage** | Ensure safety-related test cases are included if the agent handles sensitive topics |

## Next Steps

- **Version the approved dataset** → [Dataset Versioning](dataset-versioning.md)
- **Organize into splits** → [Dataset Organization](dataset-organization.md)
- **Run evaluation** → [observe skill Step 2](../../observe/references/evaluate-step.md)
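The five finalize steps in Step 3 can be sketched as a small script. This is a minimal illustration, not part of the skill: the `finalize_dataset` helper and the manifest layout it assumes (a top-level `datasets` list whose entries carry `agent`, `source`, and `version` fields) are assumptions for the example, not a documented schema.

```python
import json
from pathlib import Path

def finalize_dataset(candidates_path, manifest_path, agent, source):
    """Filter approved candidates into the next versioned dataset file.

    Assumes a manifest shaped like {"datasets": [{"agent": ..., "source": ...,
    "version": N, ...}, ...]} — adjust to the real manifest schema.
    """
    manifest = json.loads(Path(manifest_path).read_text(encoding="utf-8"))

    # Step 1: find the latest version for this agent/source (0 if none yet).
    versions = [d["version"] for d in manifest.get("datasets", [])
                if d.get("agent") == agent and d.get("source") == source]
    next_version = max(versions, default=0) + 1

    # Steps 2-3: keep approved records, dropping the review-only status field.
    approved = []
    with open(candidates_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            record = json.loads(line)
            if record.get("status") == "approved":
                record.pop("status", None)
                approved.append(record)

    # Step 4: write the versioned JSONL next to the candidate file.
    out_path = Path(candidates_path).parent / f"{agent}-{source}-v{next_version}.jsonl"
    with open(out_path, "w", encoding="utf-8") as f:
        for record in approved:
            f.write(json.dumps(record) + "\n")

    # Step 5: record the new dataset in the manifest.
    manifest.setdefault("datasets", []).append(
        {"agent": agent, "source": source, "version": next_version,
         "file": out_path.name, "count": len(approved)})
    Path(manifest_path).write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return out_path, len(approved)
```

Writing the approved records to a new file rather than rewriting the candidate file preserves the candidate file as the audit trail described above.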
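The "No duplicates" quality check (and the *"Reject duplicates"* batch operation) can be approximated by comparing normalized query text against existing datasets. A minimal sketch, in which the `normalize` and `find_duplicates` helpers are illustrative assumptions; a production check might use embedding similarity instead of exact normalized matching:

```python
import re

def normalize(query: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace for comparison."""
    q = re.sub(r"[^a-z0-9\s]", "", query.lower())
    return re.sub(r"\s+", " ", q).strip()

def find_duplicates(candidates, existing_queries):
    """Return indices of candidate records whose query already exists,
    either in existing datasets or earlier in the candidate batch."""
    seen = {normalize(q) for q in existing_queries}
    dupes = []
    for i, record in enumerate(candidates):
        key = normalize(record["query"])
        if key in seen:
            dupes.append(i)
        else:
            seen.add(key)
    return dupes
```

Flagged indices can then be surfaced in the review table rather than auto-rejected, keeping the final decision with the human reviewer.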