Deploy, evaluate, and manage AI agents end-to-end on Microsoft Azure AI Foundry
finetuning/references/hyperparameters.md
# Hyperparameter Guide

## SFT / DPO Core Parameters

| Parameter | What it controls | Default | Typical range |
|-----------|-----------------|---------|---------------|
| **Epochs** | Passes through data | 2 | 1–5 |
| **Learning rate multiplier** | Weight change aggressiveness | 1.0 | 0.1–2.0 |
| **Batch size** | Examples per gradient step | Model-dependent | 4–32 |

### Dataset Size vs Epochs

| Dataset size | Recommended epochs |
|-------------|-------------------|
| < 100 examples | 3–5 |
| 100–500 examples | 2–3 |
| 500–2,000 examples | 1–2 |
| > 2,000 examples | 1 |

### Learning Rate Guidelines
- **Higher LR** (1.5–2.0): large or diverse datasets, or a task very different from pre-training
- **Lower LR** (0.1–0.5): small datasets (<200 examples), or refining rather than overwriting base behavior
- For 1,000+ examples, LR 0.2–0.5 often beats the default of 1.0

### DPO-Specific Parameters
- `beta` (default 0.1): Alignment strength. Lower = more conservative.
- `l2_multiplier` (default 0.1): Regularization to prevent drift from base model.

## HP Sweep Strategy

| Run | Epochs | LR | Why |
|-----|--------|----|-----|
| 1 | 2 | 1.0 | Baseline |
| 2 | 2 | 0.5 | Conservative |
| 3 | 2 | 1.5 | Aggressive |
| 4 | 3 | 1.0 | More training |
| 5 | 1 | 1.0 | Minimal intervention |

## Checkpoint Trick

If the model overfits (for example, validation loss rises after epoch 2), deploy the epoch-2 checkpoint directly instead of retraining. Azure saves a checkpoint at each epoch boundary.

```python
# Assumes `client` is an already-configured OpenAI/Azure OpenAI client
# and `job_id` is the ID of the fine-tuning job.
checkpoints = client.fine_tuning.jobs.checkpoints.list(job_id)
for cp in checkpoints.data:
    print(f"Step {cp.step_number}: val_loss={cp.metrics.valid_loss}")
```

## Model-Specific Recommendations

| Model | Recommended Start | Notes |
|-------|------------------|-------|
| gpt-4.1-mini | 2ep, lr=0.5–1.0 | Very capable base; small nudges work |
| gpt-4.1-nano | 2–3ep, lr=1.0–1.5 | Smaller capacity, needs more epochs |
| gpt-oss-20b | 2ep, lr=0.2–0.5 | Lower LR critical; deployment may need capacity=100 |
| o4-mini (RFT) | Grader quality > HPs | Focus on grader, not HP sweep |

## OSS Model Parameters

All OSS models require `trainingType: "globalStandard"` in the API request.

| Model | Recommended Start | Best Found | Notes |
|-------|------------------|------------|-------|
| Ministral-3B | 5ep, lr=1.0 | 10ep, lr=0.5 | Small model, slow convergence |
| gpt-oss-20b | 2ep, lr=0.3 | 2ep, lr=0.3 | lr=1.0 overfits quickly |
| Llama-3.3-70B | 3ep, lr=0.3 | 5ep, lr=0.5 | lr=2.0 causes catastrophic degradation |
| Qwen-3-32B | 3ep, lr=0.3 | 3ep, lr=0.3 | Most fragile; more data can hurt |

**Key patterns**: OSS models need 2–5× more epochs than nano. Lower LR (0.3–0.5) is safer. More data doesn't always help.

## RFT Hyperparameters

| Parameter | Description | Recommended Start |
|-----------|-------------|-------------------|
| `reasoning_effort` | `"low"`, `"medium"`, `"high"` | `"medium"` |
| `compute_multiplier` | Scales rollouts per step | `1.5` |
| `learning_rate_multiplier` | Scales LR | `1.0` |
| `n_epochs` | Data passes | `2–3` |
| `eval_interval` | Eval every N steps | `5` |
| `eval_samples` | Validation examples per eval | `10` |
| `max_episode_steps` | Max tool calls + reasoning steps | `5–10` |
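
## Example: Picking Epochs and a Checkpoint

As a rough illustration of the Dataset Size vs Epochs table and the Checkpoint Trick above, the following sketch turns both into small helpers. The function names, the plain-dict checkpoint records, and the sample loss values are hypothetical and not part of any SDK. The table lists 500 and 2,000 in two rows each, so the boundary handling below is an assumption.

```python
from typing import Dict, List, Tuple


def recommended_epochs(n_examples: int) -> Tuple[int, int]:
    """Return the (min, max) epoch range from the Dataset Size vs Epochs table."""
    if n_examples < 0:
        raise ValueError("n_examples must be non-negative")
    if n_examples < 100:
        return (3, 5)
    # Assumption: the shared boundaries (500 and 2,000) are assigned to the
    # earlier table row; the guide does not say which row owns them.
    if n_examples <= 500:
        return (2, 3)
    if n_examples <= 2000:
        return (1, 2)
    return (1, 1)


def best_checkpoint(checkpoints: List[Dict[str, float]]) -> Dict[str, float]:
    """Pick the checkpoint record with the lowest validation loss."""
    if not checkpoints:
        raise ValueError("no checkpoints to choose from")
    return min(checkpoints, key=lambda cp: cp["valid_loss"])


if __name__ == "__main__":
    print(recommended_epochs(150))

    # Hypothetical per-epoch metrics, copied by hand from a job's checkpoints.
    history = [
        {"step": 100, "valid_loss": 0.92},
        {"step": 200, "valid_loss": 0.71},
        {"step": 300, "valid_loss": 0.78},
    ]
    print(best_checkpoint(history)["step"])
```

In practice the `history` list would be filled from the checkpoint listing shown under Checkpoint Trick, and the selected checkpoint is the one to deploy when later epochs overfit.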