Source from repo
Deploy, evaluate, and manage AI agents end-to-end on Microsoft Azure AI Foundry
finetuning/workflows/dataset-creation.md
# Dataset Creation Workflow

Three paths to training data (these combine well: curate seeds → augment → generate at scale):

> If you already have data, skip to validation: `python scripts/validate/validate_sft.py your_data.jsonl`

## Approach 1: Manual Curation

Write examples by hand, collect from production logs, or adapt existing datasets.

**When to use:**
- You have real-world examples (production logs, support tickets, labeled data)
- Your task requires domain expertise an LLM can't reliably generate
- You need a gold-standard evaluation set (always curate manually)

**Tips:**
- Start with 10-20 examples to establish quality standards and format consistency
- These seed examples also serve as the foundation of your evaluation test set
- For RFT, you only need prompts + expected answers; no model responses needed

## Approach 2: LLM Augmentation

Expand a small curated dataset through **rephrasing**: generating diverse variations while keeping the same expected answer. Especially useful for RFT.

**When to use:**
- Well-defined task with clear correct answers
- You can write quality examples but need more volume
- Diversity of phrasing matters more than diversity of scenarios

**Workflow:**
1. Write base examples with correct expected answers
2. For each, use an LLM to generate rephrasings varying tone, detail, and wording
3. Each rephrasing gets the same expected answer; only the phrasing changes
4. Validate the augmented dataset

**Rephrasing prompt:**
```
Generate N different phrasings of this request. Each should:
- Use different wording, tone, or level of detail
- Include the same key identifiers (order IDs, item names)
- Vary between formal, casual, frustrated, brief, and detailed styles
Return a JSON array of N strings.

Original: [your example]
```

A cheap model (gpt-4.1-mini) works well: no new ground truth is needed, just phrasing diversity.

## Approach 3: Synthetic Generation

Generate training data from scratch using LLM prompts.

1. Define topic/scenario categories for diversity
2. Generate prompts from an LLM
3. Generate responses (or preferred/non-preferred pairs for DPO)
4. Grade quality with an LLM judge
5. Filter to a quality threshold
6. Split into train/validation/test sets
7. Write JSONL in the correct format (see `references/dataset-formats.md`)

## Quality Checklist

Before training, verify:

- [ ] **No duplicates**: Exact or near-duplicate examples waste budget
- [ ] **Balanced distribution**: Topics, difficulty, output lengths well-distributed
- [ ] **Consistent formatting**: All examples follow the same structure
- [ ] **Correct outputs**: Spot-check 20 random examples manually
- [ ] **Reasonable lengths**: No extremely short or extremely long outputs
- [ ] **Clean text**: No encoding errors, garbled text, or template artifacts

## Dataset Size vs. Quality

From experiments:
- **335 high-quality examples** (carefully curated) → best combined eval score (9.15)
- **1,576 examples** (broader but noisier) → higher correctness but lower conciseness (8.53)

**Takeaway**: A small, pristine dataset usually beats a large, noisy one. Quality filter aggressively.
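
Some of the quality checklist items (duplicates, lengths, clean text) can be partly automated before training. The following is a minimal sketch, not the package's own validator. It assumes each record is a dict with hypothetical `prompt` and `response` keys (the real schema is described in `references/dataset-formats.md`), and the function name `quality_filter` and the length bounds are illustrative choices. It only catches duplicates that differ in case or whitespace; true near-duplicate detection would need a fuzzier comparison.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text.strip().lower())


def quality_filter(records, min_chars=20, max_chars=4000):
    """Drop duplicate prompts, out-of-range outputs, and garbled text.

    Assumes hypothetical "prompt" and "response" keys on every record;
    the length bounds are placeholders to tune for your task.
    """
    seen = set()
    kept = []
    for rec in records:
        key = hashlib.sha256(normalize(rec["prompt"]).encode("utf-8")).hexdigest()
        if key in seen:
            continue  # duplicate prompt (ignoring case and whitespace)
        if not (min_chars <= len(rec["response"]) <= max_chars):
            continue  # extremely short or extremely long output
        if "\ufffd" in rec["response"]:
            continue  # Unicode replacement character signals an encoding error
        seen.add(key)
        kept.append(rec)
    return kept


# Illustrative records only; not taken from any real dataset.
sample = [
    {"prompt": "Where is my order 123?",
     "response": "Order 123 shipped on Monday and arrives Friday."},
    {"prompt": "where is  my order 123?",
     "response": "Order 123 shipped on Monday and arrives Friday."},
    {"prompt": "Cancel order 456", "response": "ok"},
]
kept = quality_filter(sample)
```

Here the second record is dropped as a duplicate of the first and the third is dropped for its two-character response. A filter like this complements the manual spot-check of 20 random examples rather than replacing it, since it cannot judge whether an output is correct.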