Microsoft Foundry Skill

Build and deploy AI applications on Azure AI Foundry using Microsoft's model catalog and AI services

Publisher: microsoft (official)

finetuning/workflows/dataset-creation.md

# Dataset Creation Workflow

There are three paths to training data, and they combine well: curate seeds → augment → generate at scale.

> If you already have data, skip to validation: `python scripts/validate/validate_sft.py your_data.jsonl`

## Approach 1: Manual Curation

Write examples by hand, collect from production logs, or adapt existing datasets.

**When to use:**
- You have real-world examples (production logs, support tickets, labeled data)
- Your task requires domain expertise an LLM can't reliably generate
- You need a gold-standard evaluation set (always curate manually)

**Tips:**
- Start with 10-20 examples to establish quality standards and format consistency
- These seed examples also serve as the foundation of your evaluation test set
- For RFT (reinforcement fine-tuning), you only need prompts and expected answers; no model responses are required

## Approach 2: LLM Augmentation

Expand a small curated dataset through **rephrasing**: generate diverse variations of each prompt while keeping the same expected answer. This is especially useful for RFT.

**When to use:**
- The task is well-defined, with clear correct answers
- You can write quality examples but need more volume
- Diversity of phrasing matters more than diversity of scenarios

**Workflow:**
1. Write base examples with correct expected answers
2. For each, use an LLM to generate rephrasings varying tone, detail, and wording
3. Each rephrasing gets the same expected answer; only the phrasing changes
4. Validate the augmented dataset

**Rephrasing prompt:**
```
Generate N different phrasings of this request. Each should:
- Use different wording, tone, or level of detail
- Include the same key identifiers (order IDs, item names)
- Vary between formal, casual, frustrated, brief, and detailed styles
Return a JSON array of N strings.

Original: [your example]
```

A cheap model (e.g., gpt-4.1-mini) works well here: the task needs no new ground truth, only phrasing diversity.

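The augmentation workflow above can be sketched roughly as follows. This is a minimal illustration, not the skill's own script: the `prompt` and `expected_answer` field names are hypothetical (the real schema is in `references/dataset-formats.md`), and `call_llm` is an assumed callable that sends a prompt to whatever model you use and returns its raw text. A stub stands in for it here so the sketch runs offline.

```python
import json
from typing import Callable

# The rephrasing prompt from this section, with N and the example filled in.
PROMPT_TEMPLATE = """Generate {n} different phrasings of this request. Each should:
- Use different wording, tone, or level of detail
- Include the same key identifiers (order IDs, item names)
- Vary between formal, casual, frustrated, brief, and detailed styles
Return a JSON array of {n} strings.

Original: {original}"""


def augment(example: dict, n: int, call_llm: Callable[[str], str]) -> list[dict]:
    """Return the seed example plus its unique rephrasings, all sharing one answer.

    Field names ("prompt", "expected_answer") are hypothetical placeholders.
    """
    prompt = PROMPT_TEMPLATE.format(n=n, original=example["prompt"])
    rephrasings = json.loads(call_llm(prompt))  # expects a JSON array of strings
    if not isinstance(rephrasings, list):
        raise ValueError("model did not return a JSON array")

    augmented = [example]
    seen = {example["prompt"].strip().lower()}
    for text in rephrasings:
        if not isinstance(text, str):
            continue
        key = text.strip().lower()
        if not key or key in seen:
            continue  # skip empty strings and repeated phrasings
        seen.add(key)
        # Only the phrasing changes; the expected answer is copied unchanged.
        augmented.append(
            {"prompt": text.strip(), "expected_answer": example["expected_answer"]}
        )
    return augmented


def fake_llm(prompt: str) -> str:
    """Stub standing in for a real model call (assumption for this sketch)."""
    return json.dumps(
        [
            "Where is order 1234?",
            "hey, any update on order 1234??",
            "Where is order 1234?",  # duplicate, dropped by augment()
        ]
    )


seed = {
    "prompt": "Can you tell me the status of order 1234?",
    "expected_answer": "order_status:1234",
}
rows = augment(seed, 3, fake_llm)
```

Passing the model call in as a function keeps the sketch independent of any particular SDK; swap `fake_llm` for your own client wrapper, then run the validation step on the result.
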
## Approach 3: Synthetic Generation

Generate training data from scratch using LLM prompts.

**Workflow:**
1. Define topic/scenario categories for diversity
2. Generate prompts with an LLM
3. Generate responses (or preferred/non-preferred pairs for DPO)
4. Grade quality with an LLM judge
5. Filter to a quality threshold
6. Split into train/validation/test sets
7. Write JSONL in the correct format (see `references/dataset-formats.md`)

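Steps 5 to 7 of the list above (filter, split, write) might look something like the sketch below. It assumes each generated row already carries a numeric `judge_score` from the grading step. That field, the `prompt`/`response` keys, the 10%/10% split fractions, and the threshold are all illustrative assumptions; the actual record schema is defined in `references/dataset-formats.md`.

```python
import json
import random
import tempfile
from pathlib import Path


def filter_by_score(rows, threshold):
    """Keep rows whose (hypothetical) judge_score meets the threshold."""
    return [r for r in rows if r["judge_score"] >= threshold]


def split(rows, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle deterministically, then carve off validation and test slices."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    return {
        "validation": shuffled[:n_val],
        "test": shuffled[n_val:n_val + n_test],
        "train": shuffled[n_val + n_test:],
    }


def write_jsonl(rows, path):
    """Write one JSON object per line, dropping the internal judge_score."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            record = {k: v for k, v in row.items() if k != "judge_score"}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Toy data standing in for generated and graded examples.
rows = [
    {"prompt": f"q{i}", "response": f"a{i}", "judge_score": i % 10}
    for i in range(20)
]
kept = filter_by_score(rows, threshold=5)
parts = split(kept)

out_dir = Path(tempfile.mkdtemp())
for name, part in parts.items():
    write_jsonl(part, out_dir / f"{name}.jsonl")
```

Splitting after filtering keeps low-quality rows out of all three sets, and the fixed seed makes the split reproducible across runs.
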
## Quality Checklist

Before training, verify:

- [ ] **No duplicates**: Exact or near-duplicate examples waste budget
- [ ] **Balanced distribution**: Topics, difficulty, output lengths well-distributed
- [ ] **Consistent formatting**: All examples follow the same structure
- [ ] **Correct outputs**: Spot-check 20 random examples manually
- [ ] **Reasonable lengths**: No extremely short or extremely long outputs
- [ ] **Clean text**: No encoding errors, garbled text, or template artifacts

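A few of these checks are easy to automate. The sketch below is one possible set of heuristics, not the skill's validator: duplicates are detected only after simple normalization (case, punctuation, whitespace), so true fuzzy near-duplicates would need a similarity measure on top, and the length bounds and artifact patterns are assumed values to tune for your data.

```python
import random
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())


def find_duplicates(texts):
    """Return (first_index, duplicate_index) pairs with matching normalized text."""
    first_seen = {}
    pairs = []
    for i, text in enumerate(texts):
        key = normalize(text)
        if key in first_seen:
            pairs.append((first_seen[key], i))
        else:
            first_seen[key] = i
    return pairs


def length_outliers(texts, min_chars=5, max_chars=4000):
    """Return indices of texts outside the (assumed) acceptable length range."""
    return [
        i for i, t in enumerate(texts)
        if not (min_chars <= len(t.strip()) <= max_chars)
    ]


def has_artifacts(text: str) -> bool:
    """Flag the Unicode replacement character or unfilled {{placeholders}}."""
    return "\ufffd" in text or bool(re.search(r"\{\{.*?\}\}", text))


def spot_check_sample(rows, k=20, seed=0):
    """Pick up to k random rows for manual review."""
    return random.Random(seed).sample(rows, min(k, len(rows)))


outputs = [
    "Your order has shipped.",
    "your order has shipped",
    "ok",
    "Refund issued for order 1234.",
    "Hello {{customer_name}}, thanks for reaching out.",
]
dupes = find_duplicates(outputs)
too_short_or_long = length_outliers(outputs)
dirty = [i for i, t in enumerate(outputs) if has_artifacts(t)]
sample = spot_check_sample(outputs)
```

Automated checks only narrow the search; the manual spot-check of correctness still has to be done by a person.
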
## Dataset Size vs. Quality

From experiments:
- **335 high-quality examples** (carefully curated) → best combined eval score (9.15)
- **1,576 examples** (broader but noisier) → higher correctness but lower conciseness (8.53)

**Takeaway**: A small, pristine dataset usually beats a large, noisy one. Filter aggressively for quality.