finetuning/workflows/dataset-creation.md
# Dataset Creation Workflow

Three paths to training data (these combine well: curate seeds → augment → generate at scale):

> If you already have data, skip to validation: `python scripts/validate/validate_sft.py your_data.jsonl`

## Approach 1: Manual Curation

Write examples by hand, collect from production logs, or adapt existing datasets.

**When to use:**
- You have real-world examples (production logs, support tickets, labeled data)
- Your task requires domain expertise an LLM can't reliably generate
- You need a gold-standard evaluation set (always curate manually)

**Tips:**
- Start with 10-20 examples to establish quality standards and format consistency
- These seed examples also serve as the foundation of your evaluation test set
- For RFT, you only need prompts + expected answers; no model responses needed

## Approach 2: LLM Augmentation

Expand a small curated dataset through **rephrasing**: generating diverse variations while keeping the same expected answer. Especially useful for RFT.

**When to use:**
- Well-defined task with clear correct answers
- You can write quality examples but need more volume
- Diversity of phrasing matters more than diversity of scenarios

**Workflow:**
1. Write base examples with correct expected answers
2. For each, use an LLM to generate rephrasings varying tone, detail, and wording
3. Each rephrasing gets the same expected answer; only the phrasing changes
4. Validate the augmented dataset

**Rephrasing prompt:**
```
Generate N different phrasings of this request. Each should:
- Use different wording, tone, or level of detail
- Include the same key identifiers (order IDs, item names)
- Vary between formal, casual, frustrated, brief, and detailed styles
Return a JSON array of N strings.

Original: [your example]
```

A cheap model (gpt-4.1-mini) works well: no new ground truth is needed, just phrasing diversity.

## Approach 3: Synthetic Generation

Generate training data from scratch using LLM prompts.

1. Define topic/scenario categories for diversity
2. Generate prompts from an LLM
3. Generate responses (or preferred/non-preferred pairs for DPO)
4. Grade quality with an LLM judge
5. Filter to a quality threshold
6. Split into train/validation/test sets
7. Write JSONL in the correct format (see `references/dataset-formats.md`)

## Quality Checklist

Before training, verify:

- [ ] **No duplicates**: Exact or near-duplicate examples waste budget
- [ ] **Balanced distribution**: Topics, difficulty, output lengths well-distributed
- [ ] **Consistent formatting**: All examples follow the same structure
- [ ] **Correct outputs**: Spot-check 20 random examples manually
- [ ] **Reasonable lengths**: No extremely short or extremely long outputs
- [ ] **Clean text**: No encoding errors, garbled text, or template artifacts

## Dataset Size vs. Quality

From experiments:
- **335 high-quality examples** (carefully curated) → best combined eval score (9.15)
- **1,576 examples** (broader but noisier) → higher correctness but lower conciseness (8.53)

**Takeaway**: A small, pristine dataset usually beats a large, noisy one. Quality filter aggressively.
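
Two of the steps above, the "No duplicates" check and the train/validation/test split, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the skill's own tooling: the helper names (`normalize`, `dedupe`, `split_dataset`), the `prompt` field, the 80/10/10 ratio, and the fixed seed are all hypothetical. It only catches duplicates that differ in case, whitespace, or ASCII punctuation; paraphrases would need a fuzzier comparison.

```python
import random
import re
import string

# Hypothetical helpers for illustration; not part of the skill's scripts.

# Translation table that deletes ASCII punctuation characters.
_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower().translate(_PUNCT)
    return re.sub(r"\s+", " ", text).strip()


def dedupe(examples: list[dict], key: str = "prompt") -> list[dict]:
    """Keep the first example seen for each normalized value of `key`."""
    seen: set[str] = set()
    unique = []
    for ex in examples:
        fingerprint = normalize(ex[key])
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(ex)
    return unique


def split_dataset(examples, train=0.8, val=0.1, seed=0):
    """Shuffle a copy deterministically, then cut into train/val/test.

    Whatever is left after the train and validation slices becomes the
    test set, so no example is dropped or repeated.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )
```

With these helpers, `dedupe` would treat `"Where is my order #123?"` and `"where is my  order #123"` as the same prompt and keep only the first. Deduplicating before splitting matters: otherwise a near-copy of a training example can leak into the validation or test set and inflate the eval score.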