Build and deploy AI applications on Azure AI Foundry using Microsoft's model catalog and AI services
finetuning/references/training-curves.md
# Training Curve Analysis

## SFT Metrics

| Column | What it means |
|--------|---------------|
| `train_loss` | Loss on training batch (should decrease) |
| `train_mean_token_accuracy` | Token-level accuracy on training data |
| `valid_loss` | Loss on validation set (**primary metric**) |
| `valid_mean_token_accuracy` | Token-level accuracy on validation data |
| `full_valid_loss` | Full-pass validation loss (more accurate, less frequent) |
| `full_valid_mean_token_accuracy` | Full-pass token accuracy |

## Overfitting Detection

**Overfitting ratio** at each checkpoint: `valid_loss / train_loss`

| Ratio | Interpretation |
|-------|---------------|
| < 1.2 | Healthy: generalizes well |
| 1.2–1.5 | Mild overfitting: acceptable for small datasets |
| 1.5–2.0 | Moderate: consider reducing epochs |
| > 2.0 | Severe: deploy an earlier checkpoint |

```python
# `checkpoints` is the result of client.fine_tuning.jobs.checkpoints.list(job_id).
# Sort by step so the last entry is the final checkpoint regardless of API order.
ordered = sorted(checkpoints, key=lambda cp: cp.step_number)
val_losses = [cp.metrics.valid_loss for cp in ordered if cp.metrics.valid_loss is not None]
best_val = min(val_losses)
final_val = val_losses[-1]
# Flag the run if the final validation loss is more than 20% above the best one.
if final_val > best_val * 1.2:
    print(f"⚠️ OVERFIT: Best={best_val:.4f}, final={final_val:.4f}")
```

## Best Checkpoint Selection (SFT)

```python
checkpoints = client.fine_tuning.jobs.checkpoints.list(job_id)
# Pick the checkpoint with the lowest validation loss; missing values sort last.
best_cp = min(
    checkpoints.data,
    key=lambda cp: cp.metrics.valid_loss if cp.metrics.valid_loss is not None else float("inf"),
)
print(f"Best: step {best_cp.step_number}, valid_loss={best_cp.metrics.valid_loss:.4f}, "
      f"model={best_cp.fine_tuned_model_checkpoint}")
```

## Diagnosis Table

| Observation | Diagnosis | Action |
|-------------|-----------|--------|
| Train loss barely decreases | LR too low or noisy data | Increase LR or clean data |
| Train loss crashes to ~0 | LR too high or easy data | Decrease LR or add harder examples |
| Valid loss rises after epoch 2 | Overfitting | Deploy epoch-2 checkpoint |
| Valid loss plateaus after epoch 1 | Learned quickly | Try epoch=1 or lower LR |
| Valid loss oscillates | Small batch or inconsistent data | Increase batch size or audit data |
| Both losses stay high | Task too hard | Larger model or simplify task |
| Large train-valid gap from start | Insufficient/mismatched data | Add diverse training data |

## RFT Metrics

| Column | What it means |
|--------|---------------|
| `train_mean_reward` | Average reward across rollouts (**primary**, should increase) |
| `full_valid_mean_reward` | Validation reward (overfitting check) |
| `completion_tokens_mean` | Average response length per rollout |
| `reasoning_tokens_mean` | Average reasoning tokens (o-series models) |
| `mean_unresponsive_rewards` | Rollouts with no scoreable output |
| `train_sample_parse_error_count` | Grader couldn't parse output |
| `train_other_error_count` | Grader logic bugs (should be 0) |

## RFT Reward Curve Patterns

- **Reward flat at ~0**: Grader broken or threshold too strict
- **Reward always negative**: `pass_threshold` too high
- **Reward immediately high + flat**: Threshold too lenient
- **Train-valid reward gap > 0.10**: Possible reward hacking

### Token Growth
- **Moderate** (tokens double): Normal; the model is becoming more thorough
- **Excessive** (3x+): Grader may incentivize verbosity; check scoring dimensions
- When comparing checkpoints, equal accuracy at fewer tokens is strictly better

### Parse Errors vs Logic Errors
- `sample_parse_error_count`: Often high in agentic RFT (mid-reasoning captures). Training still works if reward is climbing.
- `other_error_count`: Bugs in grader logic. Fix before continuing.

## RFT Checkpoint Selection

```python
checkpoints = client.fine_tuning.jobs.checkpoints.list(job_id)
for cp in checkpoints:
    m = cp.metrics
    # Metrics can be missing on some checkpoints, so format defensively.
    tr = f"{m.train_mean_reward:.3f}" if m.train_mean_reward is not None else "n/a"
    vr = f"{m.full_valid_mean_reward:.3f}" if m.full_valid_mean_reward is not None else "n/a"
    ct = f"{m.completion_tokens_mean:.0f}" if m.completion_tokens_mean is not None else "n/a"
    print(f"Step {cp.step_number}: train_reward={tr}, valid_reward={vr}, tokens={ct}")
```

Don't rely solely on the validation reward (`full_valid_mean_reward`, printed as `valid_reward` above) for RFT. Deploy 2–3 candidates (peak reward, final, mid-training) and evaluate them with your real task harness, including tool execution.
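
The candidate shortlist (peak reward, final, mid-training) can be sketched as below. This is a minimal illustration under stated assumptions, not part of the SDK: `Checkpoint` and `pick_candidates` are hypothetical names, and the records are made-up values standing in for each checkpoint's `step_number` and `full_valid_mean_reward`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Checkpoint:
    """Hypothetical stand-in for one checkpoint's step and validation reward."""
    step: int
    valid_reward: Optional[float]


def pick_candidates(checkpoints: List[Checkpoint]) -> List[Checkpoint]:
    """Return up to three distinct checkpoints: peak reward, final, mid-training."""
    ordered = sorted(checkpoints, key=lambda cp: cp.step)
    if not ordered:
        return []
    # Only checkpoints that actually report a validation reward can be the peak.
    scored = [cp for cp in ordered if cp.valid_reward is not None]
    picks = []
    if scored:
        picks.append(max(scored, key=lambda cp: cp.valid_reward))  # peak reward
    picks.append(ordered[-1])                 # final checkpoint
    picks.append(ordered[len(ordered) // 2])  # roughly mid-training
    # Drop duplicates while keeping the priority order above.
    unique: List[Checkpoint] = []
    for cp in picks:
        if cp not in unique:
            unique.append(cp)
    return unique


# Illustrative values only, not real training output.
history = [
    Checkpoint(step=10, valid_reward=0.42),
    Checkpoint(step=20, valid_reward=0.58),
    Checkpoint(step=30, valid_reward=0.66),
    Checkpoint(step=40, valid_reward=0.71),
    Checkpoint(step=50, valid_reward=None),
]
candidates = pick_candidates(history)
print([cp.step for cp in candidates])  # prints [40, 50, 30]
```

The function only narrows the field. Each returned candidate still needs to be deployed and scored with the real task harness, and the shortlist can be shorter than three when the peak coincides with the final or middle checkpoint.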