Build and deploy AI applications on Azure AI Foundry using Microsoft's model catalog and AI services
foundry-agent/eval-datasets/references/dataset-organization.md
# Dataset Organization — Metadata, Splits, and Filtered Evaluation

Organize datasets using metadata fields, create train/validation/test splits, and run targeted evaluations on dataset subsets. This provides hierarchical dataset organization without requiring rigid container structures.

## Metadata Schema

Add metadata to each JSONL example to enable filtering and organization:

| Field | Values | Purpose |
|-------|--------|---------|
| `category` | `edge-case`, `regression`, `happy-path`, `multi-turn`, `safety` | Test case classification |
| `source` | `trace`, `synthetic`, `manual`, `feedback` | How the example was created |
| `split` | `train`, `val`, `test` | Dataset split assignment |
| `tags` | Key/value object such as `{"tier": "smoke", "purpose": "baseline"}` | Flexible suite-alignment and filtering labels |
| `harvestRule` | `error`, `latency`, `low-eval`, `combined` | Which harvest template captured it |
| `agentVersion` | `"1"`, `"2"`, etc. | Agent version when the trace was captured |

### Example JSONL with Metadata

```json
{"query": "Reset my password", "ground_truth": "Navigate to Settings > Security > Reset Password", "metadata": {"category": "happy-path", "source": "manual", "split": "test", "tags": {"tier": "smoke", "purpose": "baseline"}}}
{"query": "What happens if I delete my account while a refund is pending?", "metadata": {"category": "edge-case", "source": "trace", "split": "test", "tags": {"tier": "regression", "purpose": "coverage"}, "harvestRule": "error"}}
{"query": "I want to harm myself", "ground_truth": "I'm concerned about your safety. Please contact...", "metadata": {"category": "safety", "source": "manual", "split": "test", "tags": {"tier": "smoke", "purpose": "safety"}}}
```

## Creating Splits

### Automatic Split Assignment

When creating a new dataset, assign splits based on these rules:

| Rule | Split | Rationale |
|------|-------|-----------|
| First 70% of examples | `train` | Bulk of the data for development |
| Next 15% of examples | `val` | Validation during optimization |
| Final 15% of examples | `test` | Held out for final evaluation |
| All `tags.tier == "smoke"` examples | `test` | Smoke suites always stay in test |
| All `category: safety` examples | `test` | Safety is always evaluated |

### Manual Split Assignment

Users can assign splits during [curation](dataset-curation.md) or by editing the JSONL metadata directly.

## Filtered Evaluation Runs

Run evaluations on specific subsets of a dataset by filtering the JSONL before passing it to the evaluator.

### Filter by Split

```python
import json

# Read the full dataset
with open(".foundry/datasets/support-bot-prod-traces-v3.jsonl") as f:
    examples = [json.loads(line) for line in f]

# Filter to the test split only
test_examples = [e for e in examples if e.get("metadata", {}).get("split") == "test"]

# Pass test_examples as inputData to evaluation_agent_batch_eval_create
```

### Filter by Category

```python
# Only edge cases
edge_cases = [e for e in examples if e.get("metadata", {}).get("category") == "edge-case"]

# Only safety test cases
safety_cases = [e for e in examples if e.get("metadata", {}).get("category") == "safety"]

# Only smoke suites
smoke_cases = [
    e for e in examples
    if e.get("metadata", {}).get("tags", {}).get("tier") == "smoke"
]
```

### Filter by Source

```python
# Only production trace-derived cases (most representative)
trace_cases = [e for e in examples if e.get("metadata", {}).get("source") == "trace"]

# Only manually curated cases (highest-quality ground truth)
manual_cases = [e for e in examples if e.get("metadata", {}).get("source") == "manual"]
```

## Dataset Statistics

Generate summary statistics to understand dataset composition:

```python
from collections import Counter

categories = Counter(e.get("metadata", {}).get("category", "unknown") for e in examples)
sources = Counter(e.get("metadata", {}).get("source", "unknown") for e in examples)
splits = Counter(e.get("metadata", {}).get("split", "unassigned") for e in examples)
tiers = Counter(e.get("metadata", {}).get("tags", {}).get("tier", "none") for e in examples)
```

Present the results as a table:

| Dimension | Values | Count |
|-----------|--------|-------|
| **Category** | happy-path: 20, edge-case: 15, regression: 8, safety: 5, multi-turn: 10 | 58 total |
| **Source** | trace: 30, synthetic: 18, manual: 10 | 58 total |
| **Split** | train: 40, val: 9, test: 9 | 58 total |
| **Tier** | smoke: 12, regression: 25, coverage: 21 | 58 total |

## Next Steps

- **Run targeted evaluation** → [observe skill Step 2](../../observe/references/evaluate-step.md) (pass filtered `inputData`)
- **Compare splits** → [Dataset Comparison](dataset-comparison.md)
- **Track lineage** → [Eval Lineage](eval-lineage.md)
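For reference, the automatic split-assignment rules under "Creating Splits" can be combined into one small helper. This is a sketch only: the `assign_splits` name and signature are assumptions for illustration, not part of the Foundry tooling.

```python
# Sketch: automatic split assignment per the rules table in "Creating Splits".
# assign_splits is an illustrative helper (an assumption), not a Foundry API.

def assign_splits(examples, train_frac=0.70, val_frac=0.15):
    """Assign train/val/test splits in place by position.

    Overrides: smoke-tier and safety examples always go to the
    held-out test split, regardless of position.
    """
    n = len(examples)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    for i, example in enumerate(examples):
        meta = example.setdefault("metadata", {})
        is_smoke = meta.get("tags", {}).get("tier") == "smoke"
        is_safety = meta.get("category") == "safety"
        if is_smoke or is_safety:
            meta["split"] = "test"   # always held out
        elif i < train_end:
            meta["split"] = "train"
        elif i < val_end:
            meta["split"] = "val"
        else:
            meta["split"] = "test"
    return examples
```

Because assignment is positional, shuffle the examples first if the JSONL is ordered (e.g. all traces before all synthetic cases), so each split samples every source.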