Deploy, evaluate, and manage AI agents end-to-end on Microsoft Azure AI Foundry
finetuning/references/training-curves.md
# Training Curve Analysis

## SFT Metrics

| Column | What it means |
|--------|---------------|
| `train_loss` | Loss on training batch (should decrease) |
| `train_mean_token_accuracy` | Token-level accuracy on training data |
| `valid_loss` | Loss on validation set (**primary metric**) |
| `valid_mean_token_accuracy` | Token-level accuracy on validation data |
| `full_valid_loss` | Full-pass validation loss (more accurate, less frequent) |
| `full_valid_mean_token_accuracy` | Full-pass token accuracy |

## Overfitting Detection

**Overfitting ratio** at each checkpoint: `valid_loss / train_loss`

| Ratio | Interpretation |
|-------|---------------|
| < 1.2 | Healthy: generalizes well |
| 1.2–1.5 | Mild overfitting, acceptable for small datasets |
| 1.5–2.0 | Moderate: consider reducing epochs |
| > 2.0 | Severe: deploy an earlier checkpoint |

A complementary check compares the final validation loss against the best one seen during training:

```python
# Assumes `checkpoints` was fetched as shown in the next section.
# Sort by step so the last entry really is the final checkpoint.
ordered = sorted(checkpoints, key=lambda cp: cp.step_number)
val_losses = [cp.metrics.valid_loss for cp in ordered if cp.metrics.valid_loss is not None]
best_val = min(val_losses)
final_val = val_losses[-1]
if final_val > best_val * 1.2:
    print(f"⚠️ OVERFIT: Best={best_val:.4f}, final={final_val:.4f}")
```

## Best Checkpoint Selection (SFT)

```python
# Assumes `client` is an initialized client and `job_id` is the fine-tuning job ID.
checkpoints = client.fine_tuning.jobs.checkpoints.list(job_id)
best_cp = min(checkpoints.data, key=lambda cp: cp.metrics.valid_loss or float('inf'))
print(f"Best: step {best_cp.step_number}, valid_loss={best_cp.metrics.valid_loss:.4f}, "
      f"model={best_cp.fine_tuned_model_checkpoint}")
```

## Diagnosis Table

| Observation | Diagnosis | Action |
|-------------|-----------|--------|
| Train loss barely decreases | LR too low or noisy data | Increase LR or clean data |
| Train loss crashes to ~0 | LR too high or easy data | Decrease LR or add harder examples |
| Valid loss rises after epoch 2 | Overfitting | Deploy epoch-2 checkpoint |
| Valid loss plateaus after epoch 1 | Learned quickly | Try epoch=1 or lower LR |
| Valid loss oscillates | Small batch or inconsistent data | Increase batch size or audit data |
| Both losses stay high | Task too hard | Larger model or simplify task |
| Large train-valid gap from start | Insufficient/mismatched data | Add diverse training data |

## RFT Metrics

| Column | What it means |
|--------|---------------|
| `train_mean_reward` | Average reward across rollouts (**primary**; should increase) |
| `full_valid_mean_reward` | Validation reward (overfitting check) |
| `completion_tokens_mean` | Average response length per rollout |
| `reasoning_tokens_mean` | Average reasoning tokens (o-series models) |
| `mean_unresponsive_rewards` | Rollouts with no scoreable output |
| `train_sample_parse_error_count` | Grader couldn't parse output |
| `train_other_error_count` | Grader logic bugs; should be 0 |

## RFT Reward Curve Patterns

- **Reward flat at ~0**: Grader broken or threshold too strict
- **Reward always negative**: `pass_threshold` too high
- **Reward immediately high + flat**: Threshold too lenient
- **Train-valid reward gap > 0.10**: Possible reward hacking

### Token Growth
- **Moderate** (tokens double): Normal; the model is becoming more thorough
- **Excessive** (3x+): Grader may incentivize verbosity; check scoring dimensions
- When comparing checkpoints, equal accuracy at fewer tokens is strictly better

### Parse Errors vs Logic Errors
- `train_sample_parse_error_count`: Often high in agentic RFT (mid-reasoning captures). Training still works if reward is climbing.
- `train_other_error_count`: Bugs in grader logic. Fix before continuing.

## RFT Checkpoint Selection

```python
checkpoints = client.fine_tuning.jobs.checkpoints.list(job_id)
for cp in checkpoints:
    m = cp.metrics
    tr = f"{m.train_mean_reward:.3f}" if m.train_mean_reward is not None else "n/a"
    vr = f"{m.full_valid_mean_reward:.3f}" if m.full_valid_mean_reward is not None else "n/a"
    ct = f"{m.completion_tokens_mean:.0f}" if m.completion_tokens_mean is not None else "n/a"
    print(f"Step {cp.step_number}: train_reward={tr}, valid_reward={vr}, tokens={ct}")
```

Don't rely solely on the validation reward (`full_valid_mean_reward`) for RFT. Deploy 2–3 candidates (peak reward, final, mid-training) and evaluate them with your real task harness, including tool execution.
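
The candidate shortlist described above (peak reward, final, mid-training) could be sketched roughly as follows. This is a minimal illustration, not part of any SDK: `CheckpointSummary` and `pick_rft_candidates` are hypothetical names, the sample values are made up, and "mid-training" is assumed to mean the checkpoint at the middle index once checkpoints are sorted by step.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CheckpointSummary:
    """Hypothetical container; populate it from cp.step_number and cp.metrics."""
    step: int
    valid_reward: Optional[float]  # e.g. full_valid_mean_reward, may be missing


def pick_rft_candidates(summaries: List[CheckpointSummary]) -> Dict[str, CheckpointSummary]:
    """Return up to three distinct checkpoints: peak validation reward, final, mid-training."""
    if not summaries:
        return {}
    ordered = sorted(summaries, key=lambda s: s.step)
    scored = [s for s in ordered if s.valid_reward is not None]

    picks: Dict[str, CheckpointSummary] = {}
    if scored:
        picks["peak"] = max(scored, key=lambda s: s.valid_reward)
    picks["final"] = ordered[-1]
    picks["mid"] = ordered[len(ordered) // 2]  # assumption: middle index = mid-training

    # Keep only the first label for each step so no checkpoint is deployed twice.
    unique: Dict[str, CheckpointSummary] = {}
    seen_steps = set()
    for label, summary in picks.items():
        if summary.step not in seen_steps:
            seen_steps.add(summary.step)
            unique[label] = summary
    return unique


# Illustrative values only.
summaries = [
    CheckpointSummary(step=10, valid_reward=0.42),
    CheckpointSummary(step=20, valid_reward=0.61),
    CheckpointSummary(step=30, valid_reward=0.58),
    CheckpointSummary(step=40, valid_reward=None),
    CheckpointSummary(step=50, valid_reward=0.55),
]
candidates = pick_rft_candidates(summaries)
for label, summary in candidates.items():
    print(f"{label}: step {summary.step}, valid_reward={summary.valid_reward}")
```

The shortlist only narrows down what to deploy; the final choice still depends on how each candidate performs in the real task harness.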