Microsoft Foundry Skill

Build and deploy AI applications on Azure AI Foundry using Microsoft's model catalog and AI services

finetuning/references/training-curves.md

# Training Curve Analysis

## SFT Metrics

| Column | What it means |
|--------|---------------|
| `train_loss` | Loss on training batch (should decrease) |
| `train_mean_token_accuracy` | Token-level accuracy on training data |
| `valid_loss` | Loss on validation set (**primary metric**) |
| `valid_mean_token_accuracy` | Token-level accuracy on validation data |
| `full_valid_loss` | Full-pass validation loss (more accurate, less frequent) |
| `full_valid_mean_token_accuracy` | Full-pass token accuracy |

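A minimal sketch for loading these columns, assuming the metrics have been exported as CSV text whose header uses the column names from the table. The `load_metrics` helper and the sample values are hypothetical, not part of any SDK.

```python
import csv
import io


def load_metrics(csv_text):
    """Parse metrics CSV text into a list of dicts with float-or-None values."""
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        row = {}
        for name, value in raw.items():
            # Empty cells (e.g. full-pass metrics between evaluations) become None.
            row[name] = float(value) if value not in ("", None) else None
        rows.append(row)
    return rows


# Hypothetical sample: full_valid_loss is only populated on some steps.
sample = """step,train_loss,valid_loss,full_valid_loss
1,2.10,2.20,
2,1.60,1.80,1.75
"""
rows = load_metrics(sample)
```

Treating empty cells as `None` rather than `0.0` keeps sparse columns such as `full_valid_loss` from being mistaken for a perfect score.
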
## Overfitting Detection

**Overfitting ratio** at each checkpoint: `valid_loss / train_loss`

| Ratio | Interpretation |
|-------|---------------|
| < 1.2 | Healthy: generalizes well |
| 1.2–1.5 | Mild overfitting: acceptable for small datasets |
| 1.5–2.0 | Moderate: consider reducing epochs |
| > 2.0 | Severe: deploy an earlier checkpoint |

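The ratio bands above can be sketched as a small classifier. The function names are hypothetical, and the handling of exact boundary values (1.5 counted as moderate, 2.0 as moderate) is an assumption, since the table leaves them ambiguous.

```python
def overfitting_ratio(train_loss, valid_loss):
    """Return valid_loss / train_loss, or None when it cannot be computed."""
    if train_loss is None or valid_loss is None or train_loss <= 0:
        return None
    return valid_loss / train_loss


def interpret_ratio(ratio):
    """Map a ratio onto the bands from the table (boundary handling is assumed)."""
    if ratio is None:
        return "unknown"
    if ratio < 1.2:
        return "healthy"
    if ratio < 1.5:
        return "mild"
    if ratio <= 2.0:
        return "moderate"
    return "severe"


# Example with made-up losses: 1.8 / 1.0 falls in the 1.5-2.0 band.
label = interpret_ratio(overfitting_ratio(train_loss=1.0, valid_loss=1.8))
```
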
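```python
# `checkpoints`: checkpoint objects for the job; sorted by step because the list call may return newest first.
ordered = sorted(checkpoints, key=lambda cp: cp.step_number)
val_losses = [cp.metrics.valid_loss for cp in ordered if cp.metrics.valid_loss is not None]
if val_losses:
    best_val = min(val_losses)
    final_val = val_losses[-1]
    # Flag runs whose final validation loss is more than 20% above the best one seen.
    if final_val > best_val * 1.2:
        print(f"⚠️ OVERFIT: Best={best_val:.4f}, final={final_val:.4f}")
```
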
## Best Checkpoint Selection (SFT)

```python
# Assumes `client` is an initialized OpenAI-compatible client and `job_id` is the fine-tuning job ID.
checkpoints = client.fine_tuning.jobs.checkpoints.list(job_id)
# Checkpoints with no valid_loss are ranked last.
best_cp = min(checkpoints.data, key=lambda cp: cp.metrics.valid_loss or float('inf'))
print(f"Best: step {best_cp.step_number}, valid_loss={best_cp.metrics.valid_loss:.4f}, "
      f"model={best_cp.fine_tuned_model_checkpoint}")
```

## Diagnosis Table

| Observation | Diagnosis | Action |
|-------------|-----------|--------|
| Train loss barely decreases | LR too low or noisy data | Increase LR or clean data |
| Train loss crashes to ~0 | LR too high or easy data | Decrease LR or add harder examples |
| Valid loss rises after epoch 2 | Overfitting | Deploy epoch-2 checkpoint |
| Valid loss plateaus after epoch 1 | Learned quickly | Try epoch=1 or lower LR |
| Valid loss oscillates | Small batch or inconsistent data | Increase batch size or audit data |
| Both losses stay high | Task too hard | Larger model or simplify task |
| Large train-valid gap from start | Insufficient/mismatched data | Add diverse training data |

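One row of the table ("valid loss rises after epoch N") can be checked mechanically. The sketch below finds the point where validation loss bottoms out and reports it only if the curve ends meaningfully higher. The helper name and the `min_rise` tolerance of 0.02 are illustrative assumptions, not recommended values.

```python
def valid_loss_turning_point(valid_losses, min_rise=0.02):
    """Return the index of the lowest validation loss if the curve rises after it, else None."""
    if len(valid_losses) < 2:
        return None
    best_idx = min(range(len(valid_losses)), key=valid_losses.__getitem__)
    # Compare the final value with the minimum; small wobbles below min_rise are ignored.
    rise = valid_losses[-1] - valid_losses[best_idx]
    return best_idx if rise > min_rise else None


# Made-up per-epoch validation losses: lowest at index 2, higher afterwards.
turning_point = valid_loss_turning_point([1.0, 0.8, 0.7, 0.9])
```

A non-`None` result points at the checkpoint to consider deploying instead of the final one.
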
## RFT Metrics

| Column | What it means |
|--------|---------------|
| `train_mean_reward` | Average reward across rollouts (**primary**; should increase) |
| `full_valid_mean_reward` | Validation reward (overfitting check) |
| `completion_tokens_mean` | Average response length per rollout |
| `reasoning_tokens_mean` | Average reasoning tokens (o-series models) |
| `mean_unresponsive_rewards` | Rollouts with no scoreable output |
| `train_sample_parse_error_count` | Grader couldn't parse output |
| `train_other_error_count` | Grader logic bugs; should be 0 |

## RFT Reward Curve Patterns

- **Reward flat at ~0**: Grader broken or threshold too strict
- **Reward always negative**: `pass_threshold` too high
- **Reward immediately high + flat**: Threshold too lenient
- **Train-valid reward gap > 0.10**: Possible reward hacking

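Three of the patterns above can be flagged from the reward series alone. This is a rough sketch: the function name and the `flat_eps` tolerance are hypothetical, the gap is measured at the last recorded step, and the "immediately high + flat" pattern is left out because it needs a task-specific notion of "high".

```python
def reward_curve_flags(train_rewards, valid_rewards, gap_threshold=0.10, flat_eps=0.02):
    """Return a list of warning labels for an RFT reward curve."""
    flags = []
    if train_rewards and max(abs(r) for r in train_rewards) <= flat_eps:
        flags.append("flat_near_zero")      # grader broken or threshold too strict
    if train_rewards and all(r < 0 for r in train_rewards):
        flags.append("always_negative")     # pass_threshold too high
    if train_rewards and valid_rewards:
        gap = train_rewards[-1] - valid_rewards[-1]
        if gap > gap_threshold:
            flags.append("train_valid_gap")  # possible reward hacking
    return flags


# Made-up curves: training reward pulls 0.2 ahead of validation reward.
flags = reward_curve_flags([0.3, 0.6, 0.8], [0.3, 0.5, 0.6])
```
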
### Token Growth
- **Moderate** (tokens double): Normal; the model is becoming more thorough
- **Excessive** (3x+): Grader may incentivize verbosity; check scoring dimensions
- When comparing checkpoints, equal accuracy at fewer tokens is strictly better

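A sketch of both checks, assuming you have `completion_tokens_mean` per step and an accuracy figure per candidate from your own evaluation. The helper names and the `(name, accuracy, mean_tokens)` tuple layout are hypothetical.

```python
def token_growth_ratio(tokens_by_step):
    """Ratio of the last mean completion length to the first."""
    first, last = tokens_by_step[0], tokens_by_step[-1]
    if first <= 0:
        raise ValueError("first token count must be positive")
    return last / first


def is_excessive_growth(tokens_by_step, limit=3.0):
    """True when completions grew by the 3x+ factor described above."""
    return token_growth_ratio(tokens_by_step) >= limit


def most_efficient(candidates, accuracy_tolerance=0.0):
    """Among candidates within tolerance of the best accuracy, pick the one using fewest tokens."""
    best_accuracy = max(acc for _, acc, _ in candidates)
    near_best = [c for c in candidates if best_accuracy - c[1] <= accuracy_tolerance]
    return min(near_best, key=lambda c: c[2])


# Made-up numbers: two checkpoints tie on accuracy, the shorter one wins.
choice = most_efficient([("ckpt-a", 0.9, 800), ("ckpt-b", 0.9, 500), ("ckpt-c", 0.8, 300)])
```
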
### Parse Errors vs Logic Errors
- `train_sample_parse_error_count`: Often high in agentic RFT (mid-reasoning captures). Training still works if reward is climbing.
- `train_other_error_count`: Bugs in grader logic. Fix before continuing.

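The triage above, written as a tiny decision helper. The function name and return strings are hypothetical, and the "inspect parse failures" branch (parse errors while reward is not climbing) is an extrapolation from the guidance rather than something it states.

```python
def grader_error_action(parse_errors, other_errors, reward_is_climbing):
    """Suggest a next step from the two grader error counts."""
    if other_errors > 0:
        # Logic errors point at grader bugs and should be fixed first.
        return "fix grader logic before continuing"
    if parse_errors > 0 and not reward_is_climbing:
        # Assumption: parse errors only matter once reward has stopped improving.
        return "inspect parse failures"
    return "continue training"


# Made-up counts: many parse errors, no logic errors, reward still rising.
action = grader_error_action(parse_errors=120, other_errors=0, reward_is_climbing=True)
```
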
## RFT Checkpoint Selection

```python
# Assumes `client` is an initialized OpenAI-compatible client and `job_id` is the RFT job ID.
checkpoints = client.fine_tuning.jobs.checkpoints.list(job_id)
for cp in checkpoints:
    m = cp.metrics
    # Metrics that were not recorded at this step are shown as "n/a".
    tr = f"{m.train_mean_reward:.3f}" if m.train_mean_reward is not None else "n/a"
    vr = f"{m.full_valid_mean_reward:.3f}" if m.full_valid_mean_reward is not None else "n/a"
    ct = f"{m.completion_tokens_mean:.0f}" if m.completion_tokens_mean is not None else "n/a"
    print(f"Step {cp.step_number}: train_reward={tr}, valid_reward={vr}, tokens={ct}")
```

Don't rely solely on validation reward (`full_valid_mean_reward`) for RFT. Deploy 2–3 candidate checkpoints (peak reward, final, mid-training) and evaluate each with your real task harness, including tool execution.

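A sketch of picking those candidates, assuming each checkpoint has been reduced to a dict with `step` and `valid_reward` keys. That layout, the helper name, and the choice of validation reward for "peak" are assumptions.

```python
def rft_candidates(checkpoints):
    """Return up to three distinct checkpoints: peak reward, final, and mid-training."""
    if not checkpoints:
        return []
    ordered = sorted(checkpoints, key=lambda c: c["step"])
    peak = max(ordered, key=lambda c: c["valid_reward"])
    final = ordered[-1]
    mid = ordered[len(ordered) // 2]
    picked = []
    for candidate in (peak, final, mid):
        # Skip duplicates, e.g. when the peak is also the final checkpoint.
        if candidate not in picked:
            picked.append(candidate)
    return picked


# Made-up run: reward peaks at step 20, then dips slightly.
run = [
    {"step": 10, "valid_reward": 0.20},
    {"step": 20, "valid_reward": 0.60},
    {"step": 30, "valid_reward": 0.50},
    {"step": 40, "valid_reward": 0.55},
]
candidate_steps = [c["step"] for c in rft_candidates(run)]
```

The resulting list is only a shortlist; the final choice comes from running each candidate through the task harness.
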
