Build and deploy AI applications on Azure AI Foundry using Microsoft's model catalog and AI services
finetuning/references/reward-hacking.md
# Reward Hacking Prevention in RFT

## What Is Reward Hacking?

The model optimizes for the grader's scoring function rather than the actual task. The training grader becomes a proxy reward that diverges from true quality: the model games the proxy instead of improving.

**Core rule: Your training grader MUST produce the same ranking as your evaluation methodology.**

| If you evaluate with… | Then train with… | NOT with… |
|------------------------|------------------|-----------|
| LLM judge (semantic) | LLM judge | AST / regex / structural matching |
| Exact match | Exact match | Fuzzy or partial matching |
| Unit tests | Unit tests | Static analysis alone |

Misaligned graders are the #1 cause of reward hacking.

## Train-Val Gap Thresholds

| Train-Val Gap | Status | Action |
|---------------|--------|--------|
| ≤ 0.05 | ✅ Healthy | Continue training |
| 0.05–0.10 | ⚠️ Warning | Monitor closely, check outputs qualitatively |
| > 0.10 | 🛑 Stop | Stop training; reward hacking is likely |

## Pre-Training Checklist

1. **Baseline the grader**: Run training grader on base model outputs. Record scores as your floor.
2. **Cross-validate graders**: If training grader ≠ eval grader, generate 50 outputs, score with both, compute Spearman ρ. Proceed only if ρ ≥ 0.8. If ρ < 0.6, fix alignment first.
3. **Test hackability**: Generate 5 intentionally bad outputs that might score well. If grader scores any > 5/10, redesign it.
4. **Set gap threshold**: Monitor train-val gap every eval_interval. Stop if > 0.10.

## Grader Iteration Loop

When reward hacking is detected:

```
1. STOP the training run
   ↓
2. COLLECT "hacked" outputs (high train score, low eval score)
   ↓
3. ANALYZE what pattern the model exploited
   (structural mimicry? verbosity? keyword stuffing?)
   ↓
4. UPDATE the grader to penalize that pattern
   ↓
5. RE-BASELINE the updated grader on base model outputs
   ↓
6. RESTART training with the improved grader
```

## Red Flags Checklist

Investigate immediately if **any** are true:

- [ ] Train-val gap > 0.10
- [ ] Training reward increasing but eval quality stable or declining
- [ ] Model outputs are longer/more verbose than base model
- [ ] Outputs structurally match references but are semantically wrong
- [ ] Different LLM judges disagree on quality
- [ ] Conciseness/style scores dropping while correctness climbs
- [ ] Model produces "template" responses

## Key Principles

| Principle | Action |
|-----------|--------|
| Align graders | Training grader must rank outputs same as eval |
| Cross-validate first | Spearman ρ ≥ 0.8 between training and eval graders |
| Monitor train-val gap | ≤ 0.05 healthy, > 0.10 stop |
| Test hackability | Bad outputs should score < 5/10 |
| Prefer SFT when possible | Use RFT only for verifiable-answer tasks |
| Iterate graders, not models | Fix grader before restarting training |
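
The two numeric gates in this document, the Spearman ρ cross-validation check and the train-val gap thresholds, can be sketched in plain Python as below. This is a minimal illustration, not part of any Azure AI Foundry API: the function names (`average_ranks`, `spearman_rho`, `grader_alignment`, `gap_status`) are hypothetical, the gap is assumed to mean training reward minus validation reward, and the `"borderline"` result for 0.6 ≤ ρ < 0.8 is an assumption because the checklist gives no rule for that band.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman rho, computed as the Pearson correlation of the rank vectors."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    ra, rb = average_ranks(a), average_ranks(b)
    mean_a, mean_b = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        raise ValueError("rho is undefined when a grader scores every output the same")
    return cov / (var_a * var_b) ** 0.5


def grader_alignment(rho: float) -> str:
    """Apply the checklist rule: proceed at rho >= 0.8, fix alignment below 0.6."""
    if rho >= 0.8:
        return "proceed"
    if rho < 0.6:
        return "fix alignment first"
    return "borderline"  # assumption: the source does not cover 0.6 <= rho < 0.8


def gap_status(train_reward: float, val_reward: float) -> str:
    """Classify the train-val gap (assumed: train minus val) per the threshold table."""
    gap = train_reward - val_reward
    if gap <= 0.05:
        return "healthy"
    if gap <= 0.10:
        return "warning"
    return "stop"
```

In use, `spearman_rho` would take the training grader's and the eval grader's scores for the same generated outputs (the checklist suggests 50) and pass the result to `grader_alignment`, while `gap_status` would run at each evaluation interval. If SciPy is available, `scipy.stats.spearmanr` computes the same statistic.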