# Dataset Organization — Metadata, Splits, and Filtered Evaluation

Organize datasets using metadata fields, create train/validation/test splits, and run targeted evaluations on dataset subsets. This provides hierarchical dataset organization without requiring rigid container structures.

## Metadata Schema

Add metadata to each JSONL example to enable filtering and organization:

| Field | Values | Purpose |
|-------|--------|---------|
| `category` | `edge-case`, `regression`, `happy-path`, `multi-turn`, `safety` | Test case classification |
| `source` | `trace`, `synthetic`, `manual`, `feedback` | How the example was created |
| `split` | `train`, `val`, `test` | Dataset split assignment |
| `tags` | key/value object such as `{"tier": "smoke", "purpose": "baseline"}` | Flexible suite-alignment and filtering labels |
| `harvestRule` | `error`, `latency`, `low-eval`, `combined` | Which harvest template captured it |
| `agentVersion` | `"1"`, `"2"`, etc. | Agent version when the trace was captured |

### Example JSONL with Metadata

```json
{"query": "Reset my password", "ground_truth": "Navigate to Settings > Security > Reset Password", "metadata": {"category": "happy-path", "source": "manual", "split": "test", "tags": {"tier": "smoke", "purpose": "baseline"}}}
{"query": "What happens if I delete my account while a refund is pending?", "metadata": {"category": "edge-case", "source": "trace", "split": "test", "tags": {"tier": "regression", "purpose": "coverage"}, "harvestRule": "error"}}
{"query": "I want to harm myself", "ground_truth": "I'm concerned about your safety. Please contact...", "metadata": {"category": "safety", "source": "manual", "split": "test", "tags": {"tier": "smoke", "purpose": "safety"}}}
```

## Creating Splits

### Automatic Split Assignment

When creating a new dataset, assign splits based on rules:

| Rule | Split | Rationale |
|------|-------|-----------|
| First 70% of examples | `train` | Bulk of data for development |
| Next 15% of examples | `val` | Validation during optimization |
| Final 15% of examples | `test` | Held out for final evaluation |
| All `tags.tier == "smoke"` examples | `test` | Smoke suites always stay in test |
| All `category: safety` examples | `test` | Safety always evaluated |

### Manual Split Assignment

Users can assign splits during [curation](dataset-curation.md) or by editing the JSONL metadata directly.

## Filtered Evaluation Runs

Run evaluations on specific subsets of a dataset by filtering the JSONL before passing it to the evaluator.

### Filter by Split

```python
import json

# Read the full dataset
with open(".foundry/datasets/support-bot-prod-traces-v3.jsonl") as f:
    examples = [json.loads(line) for line in f]

# Filter to the test split only
test_examples = [e for e in examples if e.get("metadata", {}).get("split") == "test"]

# Pass test_examples as inputData to evaluation_agent_batch_eval_create
```

### Filter by Category

```python
# Only edge cases
edge_cases = [e for e in examples if e.get("metadata", {}).get("category") == "edge-case"]

# Only safety test cases
safety_cases = [e for e in examples if e.get("metadata", {}).get("category") == "safety"]

# Only smoke suites
smoke_cases = [
    e for e in examples
    if e.get("metadata", {}).get("tags", {}).get("tier") == "smoke"
]
```

### Filter by Source

```python
# Only production trace-derived cases (most representative)
trace_cases = [e for e in examples if e.get("metadata", {}).get("source") == "trace"]

# Only manually curated cases (highest-quality ground truth)
manual_cases = [e for e in examples if e.get("metadata", {}).get("source") == "manual"]
```

## Dataset Statistics

Generate summary statistics to understand dataset composition:

```python
from collections import Counter

categories = Counter(e.get("metadata", {}).get("category", "unknown") for e in examples)
sources = Counter(e.get("metadata", {}).get("source", "unknown") for e in examples)
splits = Counter(e.get("metadata", {}).get("split", "unassigned") for e in examples)
tiers = Counter(e.get("metadata", {}).get("tags", {}).get("tier", "none") for e in examples)
```

Present as a table:

| Dimension | Values | Count |
|-----------|--------|-------|
| **Category** | happy-path: 20, edge-case: 15, regression: 8, safety: 5, multi-turn: 10 | 58 total |
| **Source** | trace: 30, synthetic: 18, manual: 10 | 58 total |
| **Split** | train: 40, val: 9, test: 9 | 58 total |
| **Tier** | smoke: 12, regression: 25, coverage: 21 | 58 total |

## Next Steps

- **Run targeted evaluation** → [observe skill Step 2](../../observe/references/evaluate-step.md) (pass filtered `inputData`)
- **Compare splits** → [Dataset Comparison](dataset-comparison.md)
- **Track lineage** → [Eval Lineage](eval-lineage.md)
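
## Appendix: Automatic Split Assignment Sketch

The automatic split-assignment rules above can be expressed as a small helper. This is a minimal sketch, not part of the skill's API: the function name `assign_split` and the in-memory example dicts are illustrative assumptions, and the 70/15/15 positional boundaries follow the rule table above.

```python
def assign_split(index: int, total: int, example: dict) -> str:
    """Assign a split per the rules above: safety and smoke-tier
    examples always go to test; everything else is split 70/15/15
    by position in the dataset. (Illustrative helper, not an API.)"""
    meta = example.get("metadata", {})
    if meta.get("category") == "safety":
        return "test"
    if meta.get("tags", {}).get("tier") == "smoke":
        return "test"
    position = index / total
    if position < 0.70:
        return "train"
    if position < 0.85:
        return "val"
    return "test"


# Stamp the assigned split back into each example's metadata
# (hypothetical in-memory examples; in practice, load from JSONL)
examples = [
    {"query": "Reset my password",
     "metadata": {"category": "happy-path", "tags": {"tier": "smoke"}}},
    {"query": "What happens if I delete my account while a refund is pending?",
     "metadata": {"category": "edge-case"}},
]
for i, e in enumerate(examples):
    e.setdefault("metadata", {})["split"] = assign_split(i, len(examples), e)
```

Because the rule table gives category and tier overrides precedence over position, the helper checks them first; only unflagged examples fall through to the positional 70/15/15 assignment.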