Microsoft Foundry Skill

Build and deploy AI applications on Azure AI Foundry using Microsoft's model catalog and AI services

finetuning/references/hyperparameters.md

# Hyperparameter Guide

## SFT / DPO Core Parameters

| Parameter | What it controls | Default | Typical range |
|-----------|-----------------|---------|---------------|
| **Epochs** | Passes through data | 2 | 1–5 |
| **Learning rate multiplier** | Weight change aggressiveness | 1.0 | 0.1–2.0 |
| **Batch size** | Examples per gradient step | Model-dependent | 4–32 |

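The core parameters can be collected into a request payload before a job is submitted. The following is a minimal sketch, assuming the OpenAI-style `method` payload and the field names `n_epochs`, `learning_rate_multiplier`, and `batch_size`; the helper names `sft_method` and `outside_typical_range` are hypothetical, and the ranges are copied from the table above. Check the field names against the API version you target.

```python
# Typical ranges copied from the core-parameters table.
TYPICAL_RANGES = {
    "n_epochs": (1, 5),
    "learning_rate_multiplier": (0.1, 2.0),
    "batch_size": (4, 32),
}


def sft_method(n_epochs=2, learning_rate_multiplier=1.0, batch_size=None):
    """Build a supervised fine-tuning `method` payload (assumed OpenAI-style shape)."""
    hyperparameters = {
        "n_epochs": n_epochs,
        "learning_rate_multiplier": learning_rate_multiplier,
    }
    # Batch size default is model-dependent, so it is only sent when set explicitly.
    if batch_size is not None:
        hyperparameters["batch_size"] = batch_size
    return {"type": "supervised", "supervised": {"hyperparameters": hyperparameters}}


def outside_typical_range(hyperparameters):
    """Return a note for each value outside the typical range; never raises."""
    notes = []
    for name, value in hyperparameters.items():
        low, high = TYPICAL_RANGES[name]
        if not low <= value <= high:
            notes.append(f"{name}={value} is outside the typical range {low}-{high}")
    return notes
```

The range check only reports; it does not block a job, because values outside the typical range can still be the right choice for a given model.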
### Dataset Size vs Epochs

| Dataset size | Recommended epochs |
|-------------|-------------------|
| < 100 examples | 3–5 |
| 100–500 examples | 2–3 |
| 500–2,000 examples | 1–2 |
| > 2,000 examples | 1 |

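The table can be expressed as a small lookup. This is a sketch only: the function name `recommended_epochs` is hypothetical, and because the table's rows share the boundary values 500 and 2,000, the code assumes each boundary belongs to the smaller-dataset row.

```python
def recommended_epochs(n_examples: int) -> tuple[int, int]:
    """Return the (min, max) recommended epochs for a dataset of n_examples."""
    if n_examples < 0:
        raise ValueError("n_examples must be non-negative")
    if n_examples < 100:
        return (3, 5)
    if n_examples <= 500:    # assumption: 500 falls in the 100-500 row
        return (2, 3)
    if n_examples <= 2000:   # assumption: 2,000 falls in the 500-2,000 row
        return (1, 2)
    return (1, 1)
```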
### Learning Rate Guidelines
- **Higher LR** (1.5–2.0): large or diverse datasets, or a task very different from pre-training
- **Lower LR** (0.1–0.5): small datasets (<200 examples), or refining rather than overwriting base behavior
- For 1,000+ examples, an LR of 0.2–0.5 often beats the default of 1.0

### DPO-Specific Parameters
- `beta` (default 0.1): Alignment strength. Higher = more conservative (stays closer to the base model); lower = follows the preference data more aggressively.
- `l2_multiplier` (default 0.1): Regularization to prevent drift from the base model.

## HP Sweep Strategy

| Run | Epochs | LR | Why |
|-----|--------|----|-----|
| 1 | 2 | 1.0 | Baseline |
| 2 | 2 | 0.5 | Conservative |
| 3 | 2 | 1.5 | Aggressive |
| 4 | 3 | 1.0 | More training |
| 5 | 1 | 1.0 | Minimal intervention |

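The sweep above can be kept as data so that runs are launched and compared consistently. This is a sketch under assumptions: the names `SWEEP` and `best_run` are hypothetical, and the selection criterion (lowest final validation loss) is one reasonable choice, not something the table prescribes.

```python
# The five sweep runs from the table, as job-ready hyperparameter sets.
SWEEP = [
    {"run": 1, "n_epochs": 2, "learning_rate_multiplier": 1.0, "why": "Baseline"},
    {"run": 2, "n_epochs": 2, "learning_rate_multiplier": 0.5, "why": "Conservative"},
    {"run": 3, "n_epochs": 2, "learning_rate_multiplier": 1.5, "why": "Aggressive"},
    {"run": 4, "n_epochs": 3, "learning_rate_multiplier": 1.0, "why": "More training"},
    {"run": 5, "n_epochs": 1, "learning_rate_multiplier": 1.0, "why": "Minimal intervention"},
]


def best_run(val_losses):
    """Given {run number: final validation loss}, return the matching sweep entry
    with the lowest loss. Runs missing from val_losses are ignored."""
    if not val_losses:
        raise ValueError("no validation losses provided")
    run_id = min(val_losses, key=val_losses.get)
    return next(cfg for cfg in SWEEP if cfg["run"] == run_id)
```

Validation loss is only a proxy; if a task-specific evaluation is available, ranking runs by that metric instead is usually the better comparison.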
## Checkpoint Trick

If the model overfits (for example, validation loss rises after epoch 2), deploy the epoch-2 checkpoint directly instead of retraining. Azure saves a checkpoint at each epoch boundary.

```python
# Assumes `client` is an authenticated OpenAI-compatible client (e.g. openai.AzureOpenAI)
# and `job_id` is the ID of the fine-tuning job.
checkpoints = client.fine_tuning.jobs.checkpoints.list(job_id)
for cp in checkpoints.data:
    print(f"Step {cp.step_number}: val_loss={cp.metrics.valid_loss}")
```

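Choosing which checkpoint to deploy can be automated by taking the one with the lowest validation loss. The helper below is a sketch: `best_checkpoint` is a hypothetical name, and it assumes each checkpoint object exposes `metrics.valid_loss` as in the listing above, which may be `None` when no validation data was supplied.

```python
def best_checkpoint(checkpoints):
    """Return the checkpoint with the lowest validation loss.

    Checkpoints whose valid_loss is None are skipped.
    """
    scored = [cp for cp in checkpoints if cp.metrics.valid_loss is not None]
    if not scored:
        raise ValueError("no checkpoint reports a validation loss")
    return min(scored, key=lambda cp: cp.metrics.valid_loss)


# Usage, assuming `checkpoints` comes from client.fine_tuning.jobs.checkpoints.list(job_id):
#   best = best_checkpoint(checkpoints.data)
#   print(best.step_number, best.metrics.valid_loss)
```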
## Model-Specific Recommendations

| Model | Recommended Start | Notes |
|-------|------------------|-------|
| gpt-4.1-mini | 2ep, lr=0.5–1.0 | Very capable base; small nudges work |
| gpt-4.1-nano | 2–3ep, lr=1.0–1.5 | Smaller capacity, needs more epochs |
| gpt-oss-20b | 2ep, lr=0.2–0.5 | Lower LR critical; deployment may need capacity=100 |
| o4-mini (RFT) | Grader quality > HPs | Focus on grader, not HP sweep |

## OSS Model Parameters

All OSS models require `trainingType: "globalStandard"` in the API request.

| Model | Recommended Start | Best Found | Notes |
|-------|------------------|------------|-------|
| Ministral-3B | 5ep, lr=1.0 | 10ep, lr=0.5 | Small model, slow convergence |
| gpt-oss-20b | 2ep, lr=0.3 | 2ep, lr=0.3 | lr=1.0 overfits quickly |
| Llama-3.3-70B | 3ep, lr=0.3 | 5ep, lr=0.5 | lr=2.0 causes catastrophic degradation |
| Qwen-3-32B | 3ep, lr=0.3 | 3ep, lr=0.3 | Most fragile; more data can hurt |

**Key patterns**: OSS models need 2–5× more epochs than nano. A lower LR (0.3–0.5) is safer. More data doesn't always help.

## RFT Hyperparameters

| Parameter | Description | Recommended Start |
|-----------|-------------|-------------------|
| `reasoning_effort` | `"low"`, `"medium"`, `"high"` | `"medium"` |
| `compute_multiplier` | Scales rollouts per step | `1.5` |
| `learning_rate_multiplier` | Scales LR | `1.0` |
| `n_epochs` | Data passes | `2–3` |
| `eval_interval` | Eval every N steps | `5` |
| `eval_samples` | Validation examples per eval | `10` |
| `max_episode_steps` | Max tool calls + reasoning steps | `5–10` |

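The recommended starting values can be kept as a single dictionary with per-run overrides. This is a sketch, not a definitive request format: `RFT_START` and `rft_hyperparameters` are hypothetical names, the lower end of each range in the table (`2` for `n_epochs`, `5` for `max_episode_steps`) is an arbitrary pick, and how the resulting dictionary is passed to the job-creation call depends on the API version and is not shown.

```python
# Recommended starting values from the RFT table; ranges use their lower end (assumption).
RFT_START = {
    "reasoning_effort": "medium",
    "compute_multiplier": 1.5,
    "learning_rate_multiplier": 1.0,
    "n_epochs": 2,
    "eval_interval": 5,
    "eval_samples": 10,
    "max_episode_steps": 5,
}

REASONING_EFFORTS = {"low", "medium", "high"}


def rft_hyperparameters(**overrides):
    """Return the starting values with overrides applied, rejecting unknown names."""
    unknown = set(overrides) - set(RFT_START)
    if unknown:
        raise KeyError(f"unknown RFT hyperparameters: {sorted(unknown)}")
    hp = {**RFT_START, **overrides}
    if hp["reasoning_effort"] not in REASONING_EFFORTS:
        raise ValueError("reasoning_effort must be 'low', 'medium', or 'high'")
    return hp
```

Rejecting unknown names catches typos such as `n_epoch` before a job is submitted.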
