foundry-agent/eval-datasets/references/dataset-curation.md
# Dataset Curation — Human-in-the-Loop Review

Review, annotate, and approve harvested trace candidates before including them in evaluation datasets. This adds a human review gate between raw trace extraction and finalized test cases, ensuring dataset quality.

## Workflow Overview

```
Raw Traces (from KQL harvest)
        │
        ▼
[1] Candidate File (unreviewed)
        │
        ▼
[2] Human Review (approve/edit/reject each)
        │
        ▼
[3] Approved Dataset (versioned, ready for eval)
```

## Step 1 — Generate Candidate File

After running a [trace harvest](trace-to-dataset.md), save candidates with a `status` field:

```
.foundry/datasets/<agent-name>-traces-candidates-<date>.jsonl
```

Each line includes a review status:

```json
{"query": "How do I reset my password?", "response": "...", "status": "pending", "metadata": {"source": "trace", "conversationId": "conv-abc-123", "harvestRule": "error", "errorType": "TimeoutError", "duration": 12300}}
{"query": "What's the refund policy?", "response": "...", "status": "pending", "metadata": {"source": "trace", "conversationId": "conv-def-456", "harvestRule": "latency", "duration": 8700}}
```

## Step 2 — Present for Review

Show candidates in a review table:

| # | Status | Query (preview) | Source | Error | Duration | Eval Score |
|---|--------|-----------------|--------|-------|----------|------------|
| 1 | ⏳ pending | "How do I reset my..." | error harvest | TimeoutError | 12.3s | — |
| 2 | ⏳ pending | "What's the refund..." | latency harvest | — | 8.7s | — |
| 3 | ⏳ pending | "Can you help me..." | low-eval harvest | — | 0.4s | 2.0 |

### Review Actions

For each candidate, the user can:

| Action | Result |
|--------|--------|
| **Approve** | Include in the dataset as-is |
| **Approve + Edit** | Include with a modified query, response, or ground_truth |
| **Add Ground Truth** | Approve and add the expected correct answer |
| **Reject** | Exclude from the dataset |
| **Flag** | Mark for later review |

### Batch Operations

- *"Approve all"* — include all pending candidates
- *"Approve all errors"* — include all candidates from the error harvest
- *"Reject duplicates"* — exclude candidates whose queries are similar to existing dataset entries
- *"Approve #1, #3, #5; reject #2, #4"* — selective approval by number

## Step 3 — Finalize Dataset

After review, filter the approved candidates and save them to a versioned dataset:

1. Read `.foundry/datasets/manifest.json` to find the latest version number
2. Filter candidates where `status == "approved"`
3. Remove the `status` field from the output
4. Save to `.foundry/datasets/<agent-name>-<source>-v<N>.jsonl`
5. Update `.foundry/datasets/manifest.json` with metadata

### Update Candidate Status

Mark the candidate file with final statuses:

```json
{"query": "How do I reset my password?", "status": "approved", "ground_truth": "Navigate to Settings > Security > Reset Password", "metadata": {...}}
{"query": "What's the refund policy?", "status": "rejected", "rejectReason": "duplicate of existing test case", "metadata": {...}}
{"query": "Can you help me...", "status": "approved", "metadata": {...}}
```

> 💡 **Tip:** Keep candidate files as an audit trail. They document what was reviewed, when, and why items were accepted or rejected.

## Quality Checks

Before finalizing, verify dataset quality:

| Check | Criteria |
|-------|----------|
| **No duplicates** | Ensure no query appears in both the new dataset and existing datasets |
| **Balanced categories** | Verify a reasonable distribution across categories (not all edge cases) |
| **Ground truth coverage** | Flag examples without `ground_truth` that may benefit from one |
| **Minimum size** | Warn if the dataset has fewer than 20 examples (may not be statistically meaningful) |
| **Safety coverage** | Ensure safety-related test cases are included if the agent handles sensitive topics |

## Next Steps

- **Version the approved dataset** → [Dataset Versioning](dataset-versioning.md)
- **Organize into splits** → [Dataset Organization](dataset-organization.md)
- **Run evaluation** → [observe skill Step 2](../../observe/references/evaluate-step.md)
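The five finalization steps in Step 3 can be sketched as a small script. This is a minimal sketch, not the skill's actual implementation: the `finalize_dataset` helper name and the manifest shape (`{"datasets": [{"name": ..., "version": ...}]}`) are assumptions; only the file paths and the approve/strip/version flow come from the steps above.

```python
import json
from pathlib import Path

DATASETS = Path(".foundry/datasets")

def finalize_dataset(candidate_file: str, agent: str, source: str) -> Path:
    """Filter approved candidates into the next versioned dataset file.

    Assumes a manifest of the form {"datasets": [{"name": ..., "version": N}]};
    the real manifest schema may differ.
    """
    manifest_path = DATASETS / "manifest.json"
    manifest = (json.loads(manifest_path.read_text())
                if manifest_path.exists() else {"datasets": []})

    # 1. Find the latest version number for this agent/source pair
    prefix = f"{agent}-{source}"
    versions = [d["version"] for d in manifest["datasets"] if d.get("name") == prefix]
    next_version = max(versions, default=0) + 1

    # 2–3. Keep approved candidates and drop the review-only "status" field
    approved = []
    with open(candidate_file, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if record.get("status") == "approved":
                record.pop("status", None)
                approved.append(record)

    # 4. Write the versioned dataset as JSONL
    out_path = DATASETS / f"{prefix}-v{next_version}.jsonl"
    out_path.write_text(
        "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in approved),
        encoding="utf-8",
    )

    # 5. Record the new dataset in the manifest
    manifest["datasets"].append({"name": prefix, "version": next_version,
                                 "file": out_path.name, "size": len(approved)})
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return out_path
```

Because the candidate file is only read, it survives untouched as the audit trail described above.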
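Three of the quality checks above (duplicates, ground-truth coverage, minimum size) lend themselves to simple automation; category balance and safety coverage need domain judgment and stay with the reviewer. A hedged sketch — the `check_quality` helper is illustrative, and the duplicate check is whitespace-and-case normalization where real curation might use fuzzy or embedding similarity:

```python
def check_quality(new_rows: list[dict], existing_queries: set[str],
                  min_size: int = 20) -> list[str]:
    """Return human-readable warnings for a curated dataset."""
    warnings = []

    def normalize(q: str) -> str:
        # Collapse case and whitespace so trivial variants match
        return " ".join(q.lower().split())

    seen = {normalize(q) for q in existing_queries}
    for i, row in enumerate(new_rows, start=1):
        q = normalize(row["query"])
        if q in seen:
            warnings.append(f"#{i}: duplicate query {row['query']!r}")
        seen.add(q)
        if "ground_truth" not in row:
            warnings.append(f"#{i}: no ground_truth — consider adding one")

    if len(new_rows) < min_size:
        warnings.append(f"only {len(new_rows)} examples (< {min_size}); "
                        "results may not be statistically meaningful")
    return warnings
```

Running it over the approved rows before finalization surfaces problems while the candidate file is still easy to re-review.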