Microsoft Foundry Skill

Deploy, evaluate, and manage AI agents end-to-end on Microsoft Azure AI Foundry

finetuning/workflows/dataset-creation.md

# Dataset Creation Workflow

There are three paths to training data, and they combine well: curate seeds → augment → generate at scale.

> If you already have data, skip to validation: `python scripts/validate/validate_sft.py your_data.jsonl`

## Approach 1: Manual Curation

Write examples by hand, collect from production logs, or adapt existing datasets.

**When to use:**
- You have real-world examples (production logs, support tickets, labeled data)
- Your task requires domain expertise an LLM can't reliably generate
- You need a gold-standard evaluation set (always curate manually)

**Tips:**
- Start with 10-20 examples to establish quality standards and format consistency
- These seed examples also serve as the foundation of your evaluation test set
- For RFT, you only need prompts + expected answers; no model responses are needed
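
As a rough illustration of the last tip, seed examples for RFT can be kept as prompt/expected-answer pairs and serialized as JSON Lines. This is a minimal sketch: the field names `prompt` and `expected_answer` and the sample records are hypothetical, and the schema the training job actually expects is the one described in `references/dataset-formats.md`.

```python
import json

# Hypothetical seed records. The keys are illustrative only; check
# references/dataset-formats.md for the schema the training job expects.
seeds = [
    {"prompt": "Where is order 1042?", "expected_answer": "lookup_order(1042)"},
    {"prompt": "Cancel my subscription.", "expected_answer": "cancel_subscription()"},
]


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def from_jsonl(text):
    """Parse JSON Lines text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


jsonl_text = to_jsonl(seeds)
```

Keeping seeds in this shape makes it straightforward to reuse the same records later as the starting point for an evaluation set.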

## Approach 2: LLM Augmentation

Expand a small curated dataset through **rephrasing**: generate diverse variations of each request while keeping the same expected answer. Especially useful for RFT.

**When to use:**
- Well-defined task with clear correct answers
- You can write quality examples but need more volume
- Diversity of phrasing matters more than diversity of scenarios

**Workflow:**
1. Write base examples with correct expected answers
2. For each, use an LLM to generate rephrasings varying tone, detail, and wording
3. Each rephrasing gets the same expected answer; only the phrasing changes
4. Validate the augmented dataset

**Rephrasing prompt:**
```
Generate N different phrasings of this request. Each should:
- Use different wording, tone, or level of detail
- Include the same key identifiers (order IDs, item names)
- Vary between formal, casual, frustrated, brief, and detailed styles
Return a JSON array of N strings.

Original: [your example]
```

A cheap model (gpt-4.1-mini) works well here: no new ground truth is needed, only phrasing diversity.
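
The workflow above could be wired up roughly as follows. This is a sketch under assumptions, not the skill's own script: `call_llm` is a hypothetical callable (prompt string in, reply text out) standing in for whatever client you use, and the `prompt` / `expected_answer` keys are illustrative.

```python
import json

PROMPT_TEMPLATE = """Generate {n} different phrasings of this request. Each should:
- Use different wording, tone, or level of detail
- Include the same key identifiers (order IDs, item names)
- Vary between formal, casual, frustrated, brief, and detailed styles
Return a JSON array of {n} strings.

Original: {original}"""


def augment(example, n, call_llm):
    """Return rephrased copies of `example` that all share its expected answer.

    `call_llm` is a hypothetical callable: it takes the prompt text and
    returns the model's reply as a string.
    """
    prompt = PROMPT_TEMPLATE.format(n=n, original=example["prompt"])
    phrasings = json.loads(call_llm(prompt))
    if not isinstance(phrasings, list):
        raise ValueError("expected a JSON array of strings")

    seen = {example["prompt"]}  # drop echoes of the original and repeats
    augmented = []
    for text in phrasings:
        if isinstance(text, str) and text.strip() and text not in seen:
            seen.add(text)
            augmented.append(
                {"prompt": text, "expected_answer": example["expected_answer"]}
            )
    return augmented
```

Only the prompt text varies between the returned rows; the expected answer is copied unchanged from the base example, which is why no new ground truth has to be produced.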

## Approach 3: Synthetic Generation

Generate training data from scratch using LLM prompts.

**Workflow:**
1. Define topic/scenario categories for diversity
2. Generate prompts with an LLM
3. Generate responses (or preferred/non-preferred pairs for DPO)
4. Grade quality with an LLM judge
5. Filter to a quality threshold
6. Split into train/validation/test sets
7. Write JSONL in the correct format (see `references/dataset-formats.md`)
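
Steps 5 to 7 are mechanical and might look something like the sketch below. It assumes each graded example is a dict carrying a numeric `score` from the judge; that key, the 80/10/10 split, and the function names are assumptions for illustration, and the real output schema is whatever `references/dataset-formats.md` specifies.

```python
import json
import random


def filter_by_score(examples, threshold):
    """Step 5: keep examples whose judge score meets the threshold."""
    return [ex for ex in examples if ex["score"] >= threshold]


def split(examples, train=0.8, validation=0.1, seed=0):
    """Step 6: shuffle deterministically, then cut into train/validation/test.

    Whatever is left after the train and validation slices becomes the test set.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


def write_jsonl(path, examples):
    """Step 7: write one JSON object per line, dropping the judge score."""
    with open(path, "w", encoding="utf-8") as f:
        for ex in examples:
            row = {k: v for k, v in ex.items() if k != "score"}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
```

Fixing the shuffle seed keeps the split reproducible, so a later regeneration does not silently move test examples into the training set.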

## Quality Checklist

Before training, verify:

- [ ] **No duplicates**: Exact or near-duplicate examples waste budget
- [ ] **Balanced distribution**: Topics, difficulty, output lengths well-distributed
- [ ] **Consistent formatting**: All examples follow the same structure
- [ ] **Correct outputs**: Spot-check 20 random examples manually
- [ ] **Reasonable lengths**: No extremely short or extremely long outputs
- [ ] **Clean text**: No encoding errors, garbled text, or template artifacts
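
Two of these checks are easy to automate. The sketch below flags duplicates and length outliers; the `prompt` / `response` keys and the character limits are illustrative assumptions. Note that the duplicate check only catches examples that match after case, whitespace, and punctuation are normalized; genuinely fuzzy near-duplicates would need a similarity measure on top.

```python
import re


def normalize(text):
    """Lowercase and collapse punctuation/whitespace so trivial variants collide."""
    return re.sub(r"[\W_]+", " ", text.lower()).strip()


def find_duplicates(examples, key="prompt"):
    """Return (first_index, duplicate_index) pairs for matching normalized text."""
    first_seen = {}
    duplicates = []
    for i, ex in enumerate(examples):
        norm = normalize(ex[key])
        if norm in first_seen:
            duplicates.append((first_seen[norm], i))
        else:
            first_seen[norm] = i
    return duplicates


def length_outliers(examples, key="response", min_chars=1, max_chars=4000):
    """Return indices of examples whose output is shorter or longer than the limits."""
    return [
        i
        for i, ex in enumerate(examples)
        if not (min_chars <= len(ex[key].strip()) <= max_chars)
    ]
```

The manual spot-check of 20 random examples still applies; these helpers only narrow down what a reviewer has to look at.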

## Dataset Size vs. Quality

From experiments:
- **335 high-quality examples** (carefully curated) → best combined eval score (9.15)
- **1,576 examples** (broader but noisier) → higher correctness but lower conciseness (8.53)

**Takeaway**: A small, pristine dataset usually beats a large, noisy one. Filter for quality aggressively.
