Dataset Organization — Metadata, Splits, and Filtered Evaluation
Organize datasets using metadata fields, create train/validation/test splits, and run targeted evaluations on dataset subsets. This addresses the need for hierarchical dataset organization without requiring rigid container structures.
Metadata Schema
Add metadata to each JSONL example to enable filtering and organization:
| Field | Values | Purpose |
|---|---|---|
category | edge-case, regression, happy-path, multi-turn, safety | Test case classification |
source | trace, synthetic, manual, feedback | How the example was created |
split | train, val, test | Dataset split assignment |
tags | key/value object such as {"tier": "smoke", "purpose": "baseline"} | Flexible suite-alignment and filtering labels |
harvestRule | error, latency, low-eval, combined | Which harvest template captured it |
agentVersion | "1", "2", etc. | Agent version when trace was captured |
Example JSONL with Metadata
{"query": "Reset my password", "ground_truth": "Navigate to Settings > Security > Reset Password", "metadata": {"category": "happy-path", "source": "manual", "split": "test", "tags": {"tier": "smoke", "purpose": "baseline"}}}
{"query": "What happens if I delete my account while a refund is pending?", "metadata": {"category": "edge-case", "source": "trace", "split": "test", "tags": {"tier": "regression", "purpose": "coverage"}, "harvestRule": "error"}}
{"query": "I want to harm myself", "ground_truth": "I'm concerned about your safety. Please contact...", "metadata": {"category": "safety", "source": "manual", "split": "test", "tags": {"tier": "smoke", "purpose": "safety"}}}Creating Splits
Automatic Split Assignment
When creating a new dataset, assign splits based on rules:
| Rule | Split | Rationale |
|---|---|---|
| First 70% of examples | train | Bulk of data for development |
| Next 15% of examples | val | Validation during optimization |
| Final 15% of examples | test | Held-out for final evaluation |
All tags.tier == "smoke" examples | test | Smoke suites always stay in test |
All category: safety examples | test | Safety always evaluated |
Manual Split Assignment
Users can assign splits during curation or by editing the JSONL metadata directly.
Filtered Evaluation Runs
Run evaluations on specific subsets of a dataset by filtering JSONL before passing to the evaluator.
Filter by Split
import json
# Read full dataset
with open(".foundry/datasets/support-bot-prod-traces-v3.jsonl") as f:
examples = [json.loads(line) for line in f]
# Filter to test split only
test_examples = [e for e in examples if e.get("metadata", {}).get("split") == "test"]
# Pass test_examples as inputData to evaluation_agent_batch_eval_createFilter by Category
# Only edge cases
edge_cases = [e for e in examples if e.get("metadata", {}).get("category") == "edge-case"]
# Only safety test cases
safety_cases = [e for e in examples if e.get("metadata", {}).get("category") == "safety"]
# Only smoke suites
smoke_cases = [
e for e in examples
if e.get("metadata", {}).get("tags", {}).get("tier") == "smoke"
]Filter by Source
# Only production trace-derived cases (most representative)
trace_cases = [e for e in examples if e.get("metadata", {}).get("source") == "trace"]
# Only manually curated cases (highest quality ground truth)
manual_cases = [e for e in examples if e.get("metadata", {}).get("source") == "manual"]Dataset Statistics
Generate summary statistics to understand dataset composition:
from collections import Counter
categories = Counter(e.get("metadata", {}).get("category", "unknown") for e in examples)
sources = Counter(e.get("metadata", {}).get("source", "unknown") for e in examples)
splits = Counter(e.get("metadata", {}).get("split", "unassigned") for e in examples)
tiers = Counter(e.get("metadata", {}).get("tags", {}).get("tier", "none") for e in examples)Present as a table:
| Dimension | Values | Count |
|---|---|---|
| Category | happy-path: 20, edge-case: 15, regression: 8, safety: 5, multi-turn: 10 | 58 total |
| Source | trace: 30, synthetic: 18, manual: 10 | 58 total |
| Split | train: 40, val: 9, test: 9 | 58 total |
| Tier | smoke: 12, regression: 25, coverage: 21 | 58 total |
Next Steps
- Run targeted evaluation → observe skill Step 2 (pass filtered
inputData) - Compare splits → Dataset Comparison
- Track lineage → Eval Lineage