Deploy, evaluate, and manage AI agents end-to-end on Microsoft Azure AI Foundry
finetuning/references/reward-hacking.md
# Reward Hacking Prevention in RFT

## What Is Reward Hacking?

The model optimizes for the grader's scoring function rather than the actual task. The training grader becomes a proxy reward that diverges from true quality: the model games the proxy instead of improving.

**Core rule: Your training grader MUST produce the same ranking as your evaluation methodology.**

| If you evaluate with… | Then train with… | NOT with… |
|------------------------|------------------|-----------|
| LLM judge (semantic) | LLM judge | AST / regex / structural matching |
| Exact match | Exact match | Fuzzy or partial matching |
| Unit tests | Unit tests | Static analysis alone |

Misaligned graders are the #1 cause of reward hacking.

## Train-Val Gap Thresholds

| Train-Val Gap | Status | Action |
|---------------|--------|--------|
| ≤ 0.05 | ✅ Healthy | Continue training |
| 0.05–0.10 | ⚠️ Warning | Monitor closely, check outputs qualitatively |
| > 0.10 | 🛑 Stop | Stop training; reward hacking is likely |

## Pre-Training Checklist

1. **Baseline the grader**: Run training grader on base model outputs. Record scores as your floor.
2. **Cross-validate graders**: If training grader ≠ eval grader, generate 50 outputs, score with both, compute Spearman ρ. Proceed only if ρ ≥ 0.8. If ρ < 0.6, fix alignment first.
3. **Test hackability**: Generate 5 intentionally bad outputs that might score well. If grader scores any > 5/10, redesign it.
4. **Set gap threshold**: Monitor train-val gap every eval_interval. Stop if > 0.10.

## Grader Iteration Loop

When reward hacking is detected:

```
1. STOP the training run
↓
2. COLLECT "hacked" outputs (high train score, low eval score)
↓
3. ANALYZE what pattern the model exploited
(structural mimicry? verbosity? keyword stuffing?)
↓
4. UPDATE the grader to penalize that pattern
↓
5. RE-BASELINE the updated grader on base model outputs
↓
6. RESTART training with the improved grader
```

## Red Flags Checklist

Investigate immediately if **any** are true:

- [ ] Train-val gap > 0.10
- [ ] Training reward increasing but eval quality stable or declining
- [ ] Model outputs are longer/more verbose than base model
- [ ] Outputs structurally match references but are semantically wrong
- [ ] Different LLM judges disagree on quality
- [ ] Conciseness/style scores dropping while correctness climbs
- [ ] Model produces "template" responses

## Key Principles

| Principle | Action |
|-----------|--------|
| Align graders | Training grader must rank outputs same as eval |
| Cross-validate first | Spearman ρ ≥ 0.8 between training and eval graders |
| Monitor train-val gap | ≤ 0.05 healthy, > 0.10 stop |
| Test hackability | Bad outputs should score < 5/10 |
| Prefer SFT when possible | Use RFT only for verifiable-answer tasks |
| Iterate graders, not models | Fix grader before restarting training |
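
The train-val gap thresholds and the grader cross-validation step can be automated with a few lines of code. The following is a minimal sketch, not part of the skill package. It assumes the gap is the mean training reward minus the mean validation reward on a 0–1 scale, which the text above does not define explicitly. The function names (`gap_status`, `spearman_rho`, `grader_alignment`) are hypothetical, and the label returned for 0.6 ≤ ρ < 0.8 is an assumption, since the checklist only says not to proceed in that range.

```python
from statistics import mean


def gap_status(train_reward: float, val_reward: float) -> str:
    """Map a train-val gap to the statuses in the thresholds table.

    Assumption: gap = train_reward - val_reward, both on a 0-1 scale.
    """
    gap = round(train_reward - val_reward, 6)  # avoid float noise at the boundaries
    if gap <= 0.05:
        return "healthy"
    if gap <= 0.10:
        return "warning"
    return "stop"


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(train_scores, eval_scores):
    """Spearman rho, computed as the Pearson correlation of the rank vectors."""
    if len(train_scores) != len(eval_scores) or len(train_scores) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    rx, ry = average_ranks(train_scores), average_ranks(eval_scores)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when a grader gives constant scores")
    return cov / (var_x * var_y) ** 0.5


def grader_alignment(rho: float) -> str:
    """Apply the checklist cutoffs: proceed at rho >= 0.8, fix alignment below 0.6."""
    if rho >= 0.8:
        return "proceed"
    if rho < 0.6:
        return "fix alignment first"
    # Assumption: the checklist gives no specific action for 0.6 <= rho < 0.8.
    return "do not proceed"
```

In practice, `train_scores` and `eval_scores` would hold the two graders' scores for the same 50 generated outputs, in the same order. A library routine such as `scipy.stats.spearmanr` could replace the hand-rolled `spearman_rho` when SciPy is available.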