finetuning/workflows/iterative-training.md
# Iterative Training Workflow

Systematically improve a fine-tuned model through successive experiments.

## The Core Loop

```
1. Train with current config
2. Analyze training curves
3. Evaluate on held-out set
4. Diagnose what to change
5. Plan next experiment
→ Better than baseline? → Good enough? → Ship it (or loop back to 4)
```

**Rule**: Change ONE variable per experiment.

## Experiment Tracking

| Run | Base model | Dataset | Epochs | LR | Batch | Best val_loss | Combined eval |
|-----|-----------|---------|--------|-----|-------|--------------|---------------|
| R1 | gpt-4.1-mini | v1 (335 ex) | 2 | 1.0 | default | 0.320 | 8.05 |
| R2 | gpt-4.1-mini | v1 (335 ex) | 2 | 0.5 | default | 0.310 | 9.15 |
| ... | ... | ... | ... | ... | ... | ... | ... |

## What to Try (Priority Order)

### Priority 1: Data Quality (highest leverage)
- **Fix inconsistencies**: Contradicting examples confuse the model
- **Add diversity**: Add examples for input types the model fails on
- **Reduce noise**: Remove "correct but not ideal" outputs

### Priority 2: Hyperparameters

See `references/hyperparameters.md` for full guide.

**Quick sweep strategy:**
1. Baseline: epochs=2, lr=1.0
2. Overfitting → lr=0.5 or epochs=1
3. Underfitting → lr=1.5 or epochs=3
4. Good LR found → try batch_size=16 or 32

### Priority 3: Base Model

| Model | Best for |
|-------|----------|
| gpt-4.1-mini | Best quality-per-dollar, most tasks |
| gpt-4.1-nano | Fastest inference, simple tasks |
| gpt-oss-20b | Large datasets, lowest absolute loss |
| Ministral-3B | Lightweight, fast inference |
| Qwen-3-32B, Llama-3.3-70B | Multilingual or specialized tasks |

### Priority 4: Training Type
- SFT plateaued + need better reasoning → RFT (if model supports it)
- Need style alignment → DPO
- See `references/training-types.md` before switching

## Diagnostic Decision Tree

```
Training curves healthy (no overfitting)?
├─ Yes
│  ├─ Eval improved? → Refine further
│  └─ Eval same/worse? → Data quality issue: filter or augment
└─ No (overfitting)
   ├─ Earlier checkpoint evals well? → Deploy that checkpoint
   ├─ Not severe → Reduce epochs or lower LR
   └─ Severe (ratio > 2.0)
      ├─ Dataset too small → Add more data
      └─ Dataset large → Lower LR dramatically (0.1-0.3)
```

## When to Stop

1. Beaten baseline by meaningful margin (>5%) and last 3 experiments didn't improve
2. Diminishing returns: each experiment improves < 0.1 points
3. Model is "good enough" for production
4. Budget exhausted (time or money)

## Multi-Model Strategy

Run the same dataset through 2-3 base models:
1. **gpt-4.1-mini**: primary candidate
2. **gpt-oss-20b**: large-dataset specialist (500+ examples)
3. **gpt-4.1-nano**: fast inference option

## Common Mistakes

1. Not establishing a baseline first
2. Changing multiple variables at once
3. Overfitting to the eval set (keep a separate final test set)
4. Ignoring training curves (they tell you what to change next)
5. More data without quality check (lower-quality data often makes things worse)
6. Not cleaning up old deployments (wastes quota and money)
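
## Example: Decision Tree as Code

The diagnostic decision tree above can be sketched as a small function, which helps when a run log should record why the next experiment was chosen. This is a minimal illustration, not part of any Azure tooling. `RunSummary`, `next_action`, and `SEVERE_RATIO` are hypothetical names. The "ratio" is assumed here to mean validation loss divided by training loss, which the guide does not define. Whether a dataset counts as "too small" is left to the caller because the guide gives no threshold.

```python
from dataclasses import dataclass

# Assumption: "ratio" in the decision tree means val_loss / train_loss.
SEVERE_RATIO = 2.0


@dataclass
class RunSummary:
    """Hypothetical summary of one fine-tuning run, filled in by hand."""
    overfitting: bool                    # training curves show overfitting
    eval_improved: bool                  # eval score beat the previous run
    earlier_checkpoint_evals_well: bool  # an earlier checkpoint scores well
    val_train_ratio: float               # assumed: val_loss / train_loss
    dataset_is_small: bool               # caller's judgment, no fixed threshold


def next_action(run: RunSummary) -> str:
    """Walk the diagnostic decision tree and return the suggested next step."""
    if not run.overfitting:
        if run.eval_improved:
            return "Refine further"
        return "Data quality issue: filter or augment"

    # Overfitting branch: try the cheapest fix first.
    if run.earlier_checkpoint_evals_well:
        return "Deploy that checkpoint"
    if run.val_train_ratio <= SEVERE_RATIO:
        return "Reduce epochs or lower LR"
    if run.dataset_is_small:
        return "Add more data"
    return "Lower LR dramatically (0.1-0.3)"


if __name__ == "__main__":
    run = RunSummary(
        overfitting=True,
        eval_improved=False,
        earlier_checkpoint_evals_well=False,
        val_train_ratio=2.4,
        dataset_is_small=True,
    )
    print(next_action(run))
```

Keeping the inputs as explicit booleans leaves the judgment calls (is the curve healthy, is the dataset small) with the person reading the training curves; the function only fixes the order in which the branches are checked.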