Deploy, evaluate, and manage AI agents end-to-end on Microsoft Azure AI Foundry
finetuning/workflows/iterative-training.md
# Iterative Training Workflow

Systematically improve a fine-tuned model through successive experiments.

## The Core Loop

```
1. Train with current config
2. Analyze training curves
3. Evaluate on held-out set
4. Diagnose what to change
5. Plan next experiment
   → Better than baseline? → Good enough? → Ship it (or loop back to 4)
```

**Rule**: Change ONE variable per experiment.

## Experiment Tracking

| Run | Base model | Dataset | Epochs | LR | Batch | Best val_loss | Combined eval |
|-----|-----------|---------|--------|-----|-------|--------------|---------------|
| R1 | gpt-4.1-mini | v1 (335 ex) | 2 | 1.0 | default | 0.320 | 8.05 |
| R2 | gpt-4.1-mini | v1 (335 ex) | 2 | 0.5 | default | 0.310 | 9.15 |
| ... | ... | ... | ... | ... | ... | ... | ... |

## What to Try (Priority Order)

### Priority 1: Data Quality (highest leverage)
- **Fix inconsistencies**: Contradicting examples confuse the model
- **Add diversity**: Add examples for input types the model fails on
- **Reduce noise**: Remove "correct but not ideal" outputs

### Priority 2: Hyperparameters

See `references/hyperparameters.md` for full guide.

**Quick sweep strategy:**
1. Baseline: epochs=2, lr=1.0
2. Overfitting → lr=0.5 or epochs=1
3. Underfitting → lr=1.5 or epochs=3
4. Good LR found → try batch_size=16 or 32

### Priority 3: Base Model

| Model | Best for |
|-------|----------|
| gpt-4.1-mini | Best quality-per-dollar, most tasks |
| gpt-4.1-nano | Fastest inference, simple tasks |
| gpt-oss-20b | Large datasets, lowest absolute loss |
| Ministral-3B | Lightweight, fast inference |
| Qwen-3-32B, Llama-3.3-70B | Multilingual or specialized tasks |

### Priority 4: Training Type
- SFT plateaued + need better reasoning → RFT (if model supports it)
- Need style alignment → DPO
- See `references/training-types.md` before switching

## Diagnostic Decision Tree

```
Training curves healthy (no overfitting)?
├─ Yes
│  ├─ Eval improved? → Refine further
│  └─ Eval same/worse? → Data quality issue: filter or augment
└─ No (overfitting)
   ├─ Earlier checkpoint evals well? → Deploy that checkpoint
   ├─ Not severe → Reduce epochs or lower LR
   └─ Severe (ratio > 2.0)
      ├─ Dataset too small → Add more data
      └─ Dataset large → Lower LR dramatically (0.1-0.3)
```

## When to Stop

1. Beaten baseline by meaningful margin (>5%) and last 3 experiments didn't improve
2. Diminishing returns: each experiment improves < 0.1 points
3. Model is "good enough" for production
4. Budget exhausted (time or money)

## Multi-Model Strategy

Run the same dataset through 2-3 base models:
1. **gpt-4.1-mini**: primary candidate
2. **gpt-oss-20b**: large-dataset specialist (500+ examples)
3. **gpt-4.1-nano**: fast inference option

## Common Mistakes

1. Not establishing a baseline first
2. Changing multiple variables at once
3. Overfitting to the eval set (keep a separate final test set)
4. Ignoring training curves (they tell you what to change next)
5. More data without quality check (lower-quality data often makes things worse)
6. Not cleaning up old deployments (wastes quota and money)
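
## Sketch: Decision Tree and Stopping Rule in Code

The diagnostic decision tree and the first "When to Stop" criterion can be written down as two small functions, which may help when triaging many runs. This is a minimal sketch, not part of the skill itself. The function names `diagnose` and `should_stop` and all parameter names are hypothetical. The sketch assumes the caller has already judged whether the curves are healthy and computed the overfitting ratio (the workflow does not define that ratio here), that the combined eval score is positive and higher is better, and that ">5%" means a relative margin over the baseline score.

```python
from typing import List


def diagnose(
    curves_healthy: bool,
    eval_improved: bool = False,
    earlier_checkpoint_evals_well: bool = False,
    overfit_ratio: float = 1.0,   # assumption: supplied by the caller
    dataset_small: bool = False,
) -> str:
    """Return the next action suggested by the diagnostic decision tree."""
    if curves_healthy:
        if eval_improved:
            return "Refine further"
        return "Data quality issue: filter or augment"

    # Overfitting branch.
    if earlier_checkpoint_evals_well:
        return "Deploy that checkpoint"
    if overfit_ratio > 2.0:  # "severe" threshold taken from the tree
        if dataset_small:
            return "Add more data"
        return "Lower LR dramatically (0.1-0.3)"
    return "Reduce epochs or lower LR"


def should_stop(
    baseline: float,
    scores: List[float],
    margin: float = 0.05,  # ">5%" read as a relative margin (assumption)
    patience: int = 3,     # "last 3 experiments didn't improve"
) -> bool:
    """Stop criterion 1: baseline beaten by the margin, then no recent gain.

    `scores` holds the combined eval score of each experiment in
    chronological order.
    """
    if len(scores) <= patience:
        return False  # not enough history to call a plateau
    beaten = max(scores) > baseline * (1 + margin)
    best_before = max(scores[:-patience])
    stalled = max(scores[-patience:]) <= best_before
    return beaten and stalled


# Example with made-up numbers: severe overfitting on a small dataset.
print(diagnose(curves_healthy=False, overfit_ratio=2.4, dataset_small=True))
# Example with made-up scores: a plateau after an early gain.
print(should_stop(baseline=8.0, scores=[8.05, 9.15, 9.10, 9.00, 9.15]))
```

Passing the health judgment and the ratio in as arguments keeps the sketch from inventing thresholds the workflow does not state. The remaining stop criteria (diminishing returns, "good enough", budget) depend on project-specific judgment and are left out.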