Microsoft Foundry Skill

Deploy, evaluate, and manage AI agents end-to-end on Microsoft Azure AI Foundry

finetuning/references/training-curves.md

# Training Curve Analysis

## SFT Metrics

| Column | What it means |
|--------|---------------|
| `train_loss` | Loss on training batch (should decrease) |
| `train_mean_token_accuracy` | Token-level accuracy on training data |
| `valid_loss` | Loss on validation set (**primary metric**) |
| `valid_mean_token_accuracy` | Token-level accuracy on validation data |
| `full_valid_loss` | Full-pass validation loss (more accurate, less frequent) |
| `full_valid_mean_token_accuracy` | Full-pass token accuracy |
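If these columns are exported as CSV text, they can be loaded with the standard library before any analysis. The sketch below is illustrative only: the CSV layout, the `step` column, the sample values, and the helper name `load_metrics` are assumptions, and empty cells are treated as "not evaluated at this step".

```python
import csv
import io
from typing import Dict, List, Optional

# Made-up sample; full_valid_loss is only filled in on the last row.
SAMPLE_CSV = """step,train_loss,valid_loss,full_valid_loss
1,1.20,1.25,
2,0.90,1.00,
3,0.70,0.95,0.97
"""


def load_metrics(csv_text: str) -> List[Dict[str, Optional[float]]]:
    """Parse metric rows, turning empty cells into None."""
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        rows.append({k: (float(v) if v not in ("", None) else None) for k, v in raw.items()})
    return rows


rows = load_metrics(SAMPLE_CSV)
# valid_loss is the primary metric, so pick the row where it is lowest.
best = min((r for r in rows if r["valid_loss"] is not None), key=lambda r: r["valid_loss"])
print(f"lowest valid_loss at step {int(best['step'])}: {best['valid_loss']:.2f}")
```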
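
## Overfitting Detection

**Overfitting ratio** at each checkpoint: `valid_loss / train_loss`

| Ratio | Interpretation |
|-------|---------------|
| < 1.2 | Healthy: generalizes well |
| 1.2–1.5 | Mild overfitting: acceptable for small datasets |
| 1.5–2.0 | Moderate: consider reducing epochs |
| > 2.0 | Severe: deploy an earlier checkpoint |

```python
# Assumes `client` is an initialized OpenAI or AzureOpenAI client and `job_id` names the job.
checkpoints = client.fine_tuning.jobs.checkpoints.list(job_id)
# Sort by step so the last entry is the final checkpoint regardless of API ordering.
ordered = sorted(checkpoints, key=lambda cp: cp.step_number)
val_losses = [cp.metrics.valid_loss for cp in ordered if cp.metrics.valid_loss is not None]
best_val = min(val_losses)
final_val = val_losses[-1]
# Flag the run when the final validation loss is more than 20% above the best one.
if final_val > best_val * 1.2:
    print(f"⚠️ OVERFIT: Best={best_val:.4f}, final={final_val:.4f}")
```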
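The ratio bands in the table can also be applied directly to each checkpoint. This is a minimal sketch rather than part of the skill: `overfitting_ratio` and `classify_ratio` are hypothetical helper names, the loss values are made up, and placing a ratio of exactly 1.5 in the "moderate" band is an assumption because the table leaves that edge open.

```python
def overfitting_ratio(train_loss: float, valid_loss: float) -> float:
    """Return valid_loss / train_loss for one checkpoint."""
    if train_loss <= 0:
        raise ValueError("train_loss must be positive to form a ratio")
    return valid_loss / train_loss


def classify_ratio(ratio: float) -> str:
    """Map a ratio onto the bands from the table above."""
    if ratio < 1.2:
        return "healthy"
    if ratio < 1.5:
        return "mild"
    if ratio <= 2.0:
        return "moderate"
    return "severe"


# Illustrative (step, train_loss, valid_loss) values, not taken from a real job.
for step, train, valid in [(100, 0.80, 0.88), (200, 0.50, 0.70), (300, 0.30, 0.75)]:
    r = overfitting_ratio(train, valid)
    print(f"step {step}: ratio={r:.2f} -> {classify_ratio(r)}")
```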
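
## Best Checkpoint Selection (SFT)

```python
# Assumes `client` is an initialized OpenAI or AzureOpenAI client and `job_id` names the job.
checkpoints = client.fine_tuning.jobs.checkpoints.list(job_id)
# Treat a missing valid_loss as infinity so that checkpoint is never selected as best.
best_cp = min(checkpoints.data, key=lambda cp: cp.metrics.valid_loss if cp.metrics.valid_loss is not None else float("inf"))
print(f"Best: step {best_cp.step_number}, valid_loss={best_cp.metrics.valid_loss:.4f}, "
      f"model={best_cp.fine_tuned_model_checkpoint}")
```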

## Diagnosis Table

| Observation | Diagnosis | Action |
|-------------|-----------|--------|
| Train loss barely decreases | LR too low or noisy data | Increase LR or clean data |
| Train loss crashes to ~0 | LR too high or easy data | Decrease LR or add harder examples |
| Valid loss rises after epoch 2 | Overfitting | Deploy epoch-2 checkpoint |
| Valid loss plateaus after epoch 1 | Learned quickly | Try epoch=1 or lower LR |
| Valid loss oscillates | Small batch or inconsistent data | Increase batch size or audit data |
| Both losses stay high | Task too hard | Larger model or simplify task |
| Large train-valid gap from start | Insufficient/mismatched data | Add diverse training data |
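Two rows of the table, "valid loss rises" and "valid loss oscillates", lend themselves to a simple automated check on a per-epoch validation loss series. The sketch below is one possible heuristic, not the skill's own logic: the function names, the `tolerance` and `max_changes` defaults, and the sample series are all assumptions to tune for your data.

```python
from typing import List


def rises_after_minimum(valid_losses: List[float], tolerance: float = 0.0) -> bool:
    """True if validation loss ends above its minimum by more than `tolerance`."""
    if len(valid_losses) < 2:
        return False
    return valid_losses[-1] > min(valid_losses) + tolerance


def count_direction_changes(valid_losses: List[float]) -> int:
    """Count how often the loss switches between rising and falling."""
    deltas = [b - a for a, b in zip(valid_losses, valid_losses[1:]) if b != a]
    return sum(1 for d1, d2 in zip(deltas, deltas[1:]) if (d1 > 0) != (d2 > 0))


def diagnose(valid_losses: List[float], max_changes: int = 2) -> str:
    """Return a coarse label for a per-epoch validation loss series."""
    if count_direction_changes(valid_losses) > max_changes:
        return "oscillating"
    if rises_after_minimum(valid_losses):
        best_epoch = valid_losses.index(min(valid_losses)) + 1
        return f"overfitting after epoch {best_epoch}"
    return "no issue detected"


# Illustrative series only.
print(diagnose([0.90, 0.70, 0.75, 0.82]))
print(diagnose([0.90, 0.60, 0.85, 0.55, 0.80, 0.50]))
```

The remaining rows depend on learning rate, data quality, or model size, which a loss series alone cannot distinguish, so they are left to manual review.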

## RFT Metrics

| Column | What it means |
|--------|---------------|
| `train_mean_reward` | Average reward across rollouts (**primary**; should increase) |
| `full_valid_mean_reward` | Validation reward (overfitting check) |
| `completion_tokens_mean` | Average response length per rollout |
| `reasoning_tokens_mean` | Average reasoning tokens (o-series models) |
| `mean_unresponsive_rewards` | Rollouts with no scoreable output |
| `train_sample_parse_error_count` | Grader couldn't parse output |
| `train_other_error_count` | Grader logic bugs; should be 0 |

## RFT Reward Curve Patterns

- **Reward flat at ~0**: Grader broken or threshold too strict
- **Reward always negative**: `pass_threshold` too high
- **Reward immediately high + flat**: Threshold too lenient
- **Train-valid reward gap > 0.10**: Possible reward hacking
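Three of these patterns can be flagged mechanically from the reward series. The following is a rough sketch under stated assumptions: `reward_flags` is a hypothetical helper, `flat_eps` is an arbitrary cutoff for "about zero", and the gap is measured as train reward minus validation reward at the latest step that has both. The "immediately high + flat" pattern is omitted because "high" depends on the grader's scale.

```python
from typing import List, Optional


def reward_flags(
    train_rewards: List[float],
    valid_rewards: List[Optional[float]],
    gap_limit: float = 0.10,
    flat_eps: float = 0.01,
) -> List[str]:
    """Return warnings for the reward patterns listed above."""
    flags = []
    if train_rewards and all(abs(r) <= flat_eps for r in train_rewards):
        flags.append("reward flat at ~0: check grader and threshold")
    if train_rewards and all(r < 0 for r in train_rewards):
        flags.append("reward always negative: pass_threshold may be too high")
    # Compare the latest step that has both a train and a validation reward.
    pairs = [(t, v) for t, v in zip(train_rewards, valid_rewards) if v is not None]
    if pairs:
        train_last, valid_last = pairs[-1]
        if train_last - valid_last > gap_limit:
            flags.append("train-valid gap above limit: possible reward hacking")
    return flags


# Illustrative values only.
print(reward_flags([0.2, 0.5, 0.9], [0.2, 0.45, 0.6]))
```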

### Token Growth
- **Moderate** (tokens double): Normal; the model is becoming more thorough
- **Excessive** (3x+): Grader may incentivize verbosity; check scoring dimensions
- When comparing checkpoints, equal accuracy at fewer tokens is strictly better
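The growth bands and the tie-break rule above can be sketched as two small helpers. These are illustrative only: `token_growth` and `prefer_checkpoint` are hypothetical names, the dictionary keys are invented for the example, and treating everything below 3x as acceptable is an assumption since the list does not classify growth between 2x and 3x.

```python
from typing import Dict, List


def token_growth(baseline_tokens: float, current_tokens: float) -> str:
    """Classify growth in mean completion tokens relative to a baseline step."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    factor = current_tokens / baseline_tokens
    if factor >= 3.0:
        return f"excessive ({factor:.1f}x): check grader for verbosity incentives"
    return f"within normal range ({factor:.1f}x)"


def prefer_checkpoint(candidates: List[Dict[str, float]]) -> Dict[str, float]:
    """Pick the highest accuracy; break ties in favor of fewer tokens."""
    return min(candidates, key=lambda c: (-c["accuracy"], c["tokens"]))


# Illustrative numbers only.
print(token_growth(400, 800))
print(token_growth(400, 1300))
print(prefer_checkpoint([
    {"step": 30, "accuracy": 0.80, "tokens": 900},
    {"step": 40, "accuracy": 0.80, "tokens": 600},
]))
```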

### Parse Errors vs Logic Errors
- `train_sample_parse_error_count`: Often high in agentic RFT (mid-reasoning captures). Training still works if reward is climbing.
- `train_other_error_count`: Bugs in grader logic. Fix before continuing.
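The distinction between the two error counters can be expressed as a simple gate. This is a sketch, not the skill's own check: `grader_health` is a hypothetical helper, and "reward is climbing" is approximated as the last reward being higher than the first, which is a crude assumption.

```python
from typing import List


def grader_health(other_errors: int, parse_errors: int, rewards: List[float]) -> str:
    """Decide whether error counts call for stopping, investigating, or continuing."""
    if other_errors > 0:
        # Logic errors indicate grader bugs and should be fixed first.
        return "stop: fix grader logic"
    climbing = len(rewards) >= 2 and rewards[-1] > rewards[0]
    if parse_errors > 0 and not climbing:
        return "investigate: parse errors and reward not climbing"
    return "ok"


# Illustrative values only.
print(grader_health(other_errors=0, parse_errors=12, rewards=[0.2, 0.35, 0.5]))
```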

## RFT Checkpoint Selection

```python
checkpoints = client.fine_tuning.jobs.checkpoints.list(job_id)
for cp in checkpoints:
    m = cp.metrics
    tr = f"{m.train_mean_reward:.3f}" if m.train_mean_reward is not None else "n/a"
    vr = f"{m.full_valid_mean_reward:.3f}" if m.full_valid_mean_reward is not None else "n/a"
    ct = f"{m.completion_tokens_mean:.0f}" if m.completion_tokens_mean is not None else "n/a"
    print(f"Step {cp.step_number}: train_reward={tr}, valid_reward={vr}, tokens={ct}")
```

Don't rely solely on validation reward (`full_valid_mean_reward`) for RFT. Deploy 2–3 candidates (peak reward, final, mid-training) and evaluate them with your real task harness, including tool execution.
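Choosing the three candidates can be scripted once rewards are collected per step. The sketch below assumes a list of `(step, reward)` pairs, uses validation reward to define "peak", and takes the middle checkpoint by position as "mid-training"; `pick_candidates` and the sample history are hypothetical.

```python
from typing import Dict, List, Tuple


def pick_candidates(rewards_by_step: List[Tuple[int, float]]) -> Dict[str, int]:
    """Return step numbers for the peak-reward, final, and mid-training checkpoints."""
    if not rewards_by_step:
        raise ValueError("need at least one checkpoint")
    ordered = sorted(rewards_by_step)  # ascending by step number
    peak_step = max(ordered, key=lambda sr: sr[1])[0]
    final_step = ordered[-1][0]
    mid_step = ordered[len(ordered) // 2][0]
    return {"peak": peak_step, "final": final_step, "mid": mid_step}


# Illustrative (step, validation reward) pairs, not from a real job.
history = [(10, 0.31), (20, 0.52), (30, 0.58), (40, 0.66), (50, 0.63)]
print(pick_candidates(history))
```

The roles can coincide (for example, when the final checkpoint is also the peak), so deduplicate the step numbers before deploying.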