Build and deploy AI applications on Azure AI Foundry using Microsoft's model catalog and AI services
finetuning/references/hyperparameters.md
# Hyperparameter Guide

## SFT / DPO Core Parameters

| Parameter | What it controls | Default | Typical range |
|-----------|-----------------|---------|---------------|
| **Epochs** | Passes through data | 2 | 1–5 |
| **Learning rate multiplier** | Weight change aggressiveness | 1.0 | 0.1–2.0 |
| **Batch size** | Examples per gradient step | Model-dependent | 4–32 |

### Dataset Size vs Epochs

| Dataset size | Recommended epochs |
|-------------|-------------------|
| < 100 examples | 3–5 |
| 100–500 examples | 2–3 |
| 500–2,000 examples | 1–2 |
| > 2,000 examples | 1 |

### Learning Rate Guidelines
- **Higher LR** (1.5–2.0): Large/diverse datasets, task very different from pre-training
- **Lower LR** (0.1–0.5): Small datasets (<200), refining not overwriting base behavior
- For 1,000+ examples, LR 0.2–0.5 often beats default 1.0

### DPO-Specific Parameters
- `beta` (default 0.1): Alignment strength. Lower = more conservative.
- `l2_multiplier` (default 0.1): Regularization to prevent drift from base model.

## HP Sweep Strategy

| Run | Epochs | LR | Why |
|-----|--------|----|-----|
| 1 | 2 | 1.0 | Baseline |
| 2 | 2 | 0.5 | Conservative |
| 3 | 2 | 1.5 | Aggressive |
| 4 | 3 | 1.0 | More training |
| 5 | 1 | 1.0 | Minimal intervention |

## Checkpoint Trick

When overfitting (val loss rises after epoch 2): deploy the epoch-2 checkpoint directly instead of retraining. Azure saves checkpoints at each epoch boundary.

```python
checkpoints = client.fine_tuning.jobs.checkpoints.list(job_id)
for cp in checkpoints.data:
    print(f"Step {cp.step_number}: val_loss={cp.metrics.valid_loss}")
```

## Model-Specific Recommendations

| Model | Recommended Start | Notes |
|-------|------------------|-------|
| gpt-4.1-mini | 2ep, lr=0.5–1.0 | Very capable base; small nudges work |
| gpt-4.1-nano | 2–3ep, lr=1.0–1.5 | Smaller capacity, needs more epochs |
| gpt-oss-20b | 2ep, lr=0.2–0.5 | Lower LR critical; deployment may need capacity=100 |
| o4-mini (RFT) | Grader quality > HPs | Focus on grader, not HP sweep |

## OSS Model Parameters

All OSS models require `trainingType: "globalStandard"` in the API request.

| Model | Recommended Start | Best Found | Notes |
|-------|------------------|------------|-------|
| Ministral-3B | 5ep, lr=1.0 | 10ep, lr=0.5 | Small model, slow convergence |
| gpt-oss-20b | 2ep, lr=0.3 | 2ep, lr=0.3 | lr=1.0 overfits quickly |
| Llama-3.3-70B | 3ep, lr=0.3 | 5ep, lr=0.5 | lr=2.0 causes catastrophic degradation |
| Qwen-3-32B | 3ep, lr=0.3 | 3ep, lr=0.3 | Most fragile; more data can hurt |

**Key patterns**: OSS models need 2–5× more epochs than nano. Lower LR (0.3–0.5) is safer. More data doesn't always help.

## RFT Hyperparameters

| Parameter | Description | Recommended Start |
|-----------|-------------|-------------------|
| `reasoning_effort` | `"low"`, `"medium"`, `"high"` | `"medium"` |
| `compute_multiplier` | Scales rollouts per step | `1.5` |
| `learning_rate_multiplier` | Scales LR | `1.0` |
| `n_epochs` | Data passes | `2–3` |
| `eval_interval` | Eval every N steps | `5` |
| `eval_samples` | Validation examples per eval | `10` |
| `max_episode_steps` | Max tool calls + reasoning steps | `5–10` |
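
## Appendix: Sweep Planning Sketch

The Dataset Size vs Epochs table, the HP Sweep Strategy table, and the Checkpoint Trick can be expressed as a few small helpers. This is a minimal sketch, not part of the skill itself: the names `SweepRun`, `recommended_epochs`, `sweep_plan`, and `best_checkpoint_step` are hypothetical, and because the table ranges overlap at 500 and 2,000 examples, assigning those boundary values to the smaller-dataset row is an assumption. It makes no API calls; the validation losses are passed in as a plain mapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SweepRun:
    """One row of the HP Sweep Strategy table (hypothetical helper type)."""
    why: str
    n_epochs: int
    learning_rate_multiplier: float

    def as_hyperparameters(self) -> dict:
        # Key names follow the `n_epochs` / `learning_rate_multiplier`
        # parameters listed in the RFT Hyperparameters table.
        return {
            "n_epochs": self.n_epochs,
            "learning_rate_multiplier": self.learning_rate_multiplier,
        }


def recommended_epochs(n_examples: int) -> tuple[int, int]:
    """Return the (low, high) epoch range from the Dataset Size vs Epochs table.

    Assumption: exactly 500 and exactly 2,000 examples fall in the
    smaller-dataset row, since the table ranges overlap at those values.
    """
    if n_examples < 100:
        return (3, 5)
    if n_examples <= 500:
        return (2, 3)
    if n_examples <= 2000:
        return (1, 2)
    return (1, 1)


def sweep_plan() -> list[SweepRun]:
    """The five runs from the HP Sweep Strategy table, in order."""
    return [
        SweepRun("Baseline", 2, 1.0),
        SweepRun("Conservative", 2, 0.5),
        SweepRun("Aggressive", 2, 1.5),
        SweepRun("More training", 3, 1.0),
        SweepRun("Minimal intervention", 1, 1.0),
    ]


def best_checkpoint_step(val_loss_by_step: dict[int, float]) -> int:
    """Return the step whose validation loss is lowest (the Checkpoint Trick)."""
    if not val_loss_by_step:
        raise ValueError("no checkpoints to choose from")
    return min(val_loss_by_step, key=val_loss_by_step.__getitem__)
```

For example, `recommended_epochs(1500)` returns `(1, 2)`, and `best_checkpoint_step({100: 0.9, 200: 0.7, 300: 0.8})` returns `200`; the losses here are illustrative values, not measured results. In practice the mapping could be built from the `step_number` and `metrics.valid_loss` fields printed in the Checkpoint Trick snippet.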