Microsoft Foundry Skill

Build and deploy AI applications on Azure AI Foundry using Microsoft's model catalog and AI services

Files: 155
Skill: n/a
Size: 976.3 KB
Entrypoint: SKILL.md
Format: git-repo

finetuning/references/reward-hacking.md

# Reward Hacking Prevention in RFT

## What Is Reward Hacking?

In reinforcement fine-tuning (RFT), reward hacking occurs when the model optimizes for the grader's scoring function rather than the actual task. The training grader becomes a proxy reward that diverges from true quality, and the model games the proxy instead of improving at the task.

**Core rule: Your training grader MUST produce the same ranking as your evaluation methodology.**

| If you evaluate with… | Then train with… | NOT with… |
|------------------------|------------------|-----------|
| LLM judge (semantic) | LLM judge | AST / regex / structural matching |
| Exact match | Exact match | Fuzzy or partial matching |
| Unit tests | Unit tests | Static analysis alone |

Misaligned graders are the #1 cause of reward hacking.

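To make the "Exact match" row of the table concrete, here is a minimal sketch of a partial-matching training grader being gamed. Both grader functions and the sample strings are hypothetical illustrations, not part of any Foundry API.

```python
def keyword_grader(output: str, keywords: list[str]) -> float:
    """Hypothetical partial-match grader: fraction of expected keywords present."""
    text = output.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords)


def exact_match_grader(output: str, reference: str) -> float:
    """Hypothetical eval grader: 1.0 only if the whitespace/case-normalized text matches."""
    def normalize(s: str) -> str:
        return " ".join(s.lower().split())

    return 1.0 if normalize(output) == normalize(reference) else 0.0


reference = "The capital of France is Paris"
keywords = ["capital", "France", "Paris"]

correct = "The capital of France is Paris"
stuffed = "capital France Paris capital France Paris"  # keyword stuffing, not an answer

for name, out in [("correct", correct), ("stuffed", stuffed)]:
    # Print: label, training-grader score, eval-grader score
    print(name, keyword_grader(out, keywords), exact_match_grader(out, reference))
```

The stuffed output earns full marks from the keyword grader and zero from exact match, so the two graders rank the outputs differently. A model trained against the keyword grader is rewarded for exactly the behavior the evaluation penalizes.
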
## Train-Val Gap Thresholds

| Train-Val Gap | Status | Action |
|---------------|--------|--------|
| ≤ 0.05 | ✅ Healthy | Continue training |
| > 0.05 and ≤ 0.10 | ⚠️ Warning | Monitor closely, check outputs qualitatively |
| > 0.10 | 🛑 Stop | Stop training; reward hacking is likely |

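The thresholds above can be sketched as a small helper. This assumes the gap is the training reward minus the validation reward, both on the same 0 to 1 scale. The source does not define the gap explicitly, and the function name and status strings are hypothetical.

```python
def classify_gap(train_reward: float, val_reward: float) -> str:
    """Map a train-val gap onto the three statuses in the table.

    Assumption: gap = train_reward - val_reward, with both on a 0-1 scale.
    """
    # Round to suppress floating-point noise right at the 0.05 / 0.10 boundaries.
    gap = round(train_reward - val_reward, 6)
    if gap <= 0.05:
        return "healthy"
    if gap <= 0.10:
        return "warning"
    return "stop"


for train, val in [(0.82, 0.80), (0.88, 0.80), (0.95, 0.80)]:
    print(f"train={train:.2f} val={val:.2f} -> {classify_gap(train, val)}")
```

A negative gap (validation above training) falls into the healthy bucket under this sketch, since the table only describes the case where training reward runs ahead.
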
## Pre-Training Checklist

1. **Baseline the grader**: Run the training grader on base model outputs. Record the scores as your floor.
2. **Cross-validate graders**: If the training grader ≠ the eval grader, generate 50 outputs, score them with both, and compute Spearman ρ. Proceed only if ρ ≥ 0.8. If ρ < 0.6, fix alignment first.
3. **Test hackability**: Generate 5 intentionally bad outputs that might score well. If the grader scores any of them > 5/10, redesign it.
4. **Set gap threshold**: Monitor the train-val gap at every `eval_interval`. Stop if it exceeds 0.10.

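Checklist items 2 and 3 can be sketched with the standard library alone. The helper names below are hypothetical, and the `"borderline"` label for 0.6 ≤ ρ < 0.8 is an assumption, since the checklist gives no rule for that range. In practice a library routine such as SciPy's `spearmanr` would likely replace the hand-rolled version.

```python
from statistics import mean


def average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(train_scores: list[float], eval_scores: list[float]) -> float:
    """Spearman rho, computed as the Pearson correlation of the rank vectors."""
    if len(train_scores) != len(eval_scores) or len(train_scores) < 2:
        raise ValueError("need two equally sized score lists with at least 2 items")
    rx, ry = average_ranks(train_scores), average_ranks(eval_scores)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when a grader gives every output the same score")
    return cov / (var_x * var_y) ** 0.5


def alignment_decision(rho: float) -> str:
    """Apply the checklist cutoffs to a Spearman rho value."""
    if rho >= 0.8:
        return "proceed"
    if rho < 0.6:
        return "fix alignment first"
    return "borderline"  # assumption: the checklist does not cover 0.6 <= rho < 0.8


def failed_probes(probe_scores: dict[str, float], limit: float = 5.0) -> list[str]:
    """Names of intentionally bad outputs that scored above the limit (0-10 scale)."""
    return [name for name, score in probe_scores.items() if score > limit]
```

If `failed_probes` returns anything for the five intentionally bad outputs, the checklist calls for redesigning the grader before training starts.
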
## Grader Iteration Loop

When reward hacking is detected:

```
1. STOP the training run
        ↓
2. COLLECT "hacked" outputs (high train score, low eval score)
        ↓
3. ANALYZE what pattern the model exploited
   (structural mimicry? verbosity? keyword stuffing?)
        ↓
4. UPDATE the grader to penalize that pattern
        ↓
5. RE-BASELINE the updated grader on base model outputs
        ↓
6. RESTART training with the improved grader
```

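Step 2 of the loop, collecting outputs with a high train score and a low eval score, might look like the sketch below. The `ScoredOutput` record, the 0 to 1 score scale, and the `high`/`low` cutoffs are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScoredOutput:
    prompt: str
    output: str
    train_score: float  # score from the training grader (assumed 0-1 scale)
    eval_score: float   # score from the evaluation grader (assumed 0-1 scale)


def collect_hacked(
    samples: list[ScoredOutput], high: float = 0.8, low: float = 0.5
) -> list[ScoredOutput]:
    """Return samples the training grader likes but the eval grader does not.

    Results are sorted by disagreement (train minus eval), largest first,
    so the most blatant exploits are reviewed first.
    """
    hacked = [s for s in samples if s.train_score >= high and s.eval_score <= low]
    return sorted(hacked, key=lambda s: s.train_score - s.eval_score, reverse=True)


samples = [
    ScoredOutput("q1", "concise correct answer", 0.90, 0.90),
    ScoredOutput("q2", "keyword keyword keyword", 0.95, 0.10),
    ScoredOutput("q3", "template-shaped but wrong", 0.85, 0.40),
    ScoredOutput("q4", "plainly wrong", 0.20, 0.10),
]

for s in collect_hacked(samples):
    print(s.prompt, s.train_score, s.eval_score)
```

Reading the collected outputs side by side is what feeds step 3: the shared pattern (verbosity, structural mimicry, keyword stuffing) is usually easier to spot in the sorted list than in aggregate metrics.
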
## Red Flags Checklist

Investigate immediately if **any** are true:

- [ ] Train-val gap > 0.10
- [ ] Training reward increasing but eval quality stable or declining
- [ ] Model outputs are longer/more verbose than base model
- [ ] Outputs structurally match references but are semantically wrong
- [ ] Different LLM judges disagree on quality
- [ ] Conciseness/style scores dropping while correctness climbs
- [ ] Model produces "template" responses

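Two of the red flags above lend themselves to automated checks: training reward rising while eval quality stays flat or falls, and outputs growing longer than the base model's. The sketch below is one rough way to compute them. The function names, the first-versus-last-checkpoint comparison, and the use of word counts as a verbosity proxy are all assumptions.

```python
from statistics import mean


def reward_diverging(
    train_rewards: list[float], eval_scores: list[float], tolerance: float = 0.0
) -> bool:
    """True if training reward rose from the first to the last checkpoint
    while the eval score stayed flat or fell over the same span."""
    if len(train_rewards) < 2 or len(train_rewards) != len(eval_scores):
        raise ValueError("need two equally sized series with at least 2 checkpoints")
    train_delta = train_rewards[-1] - train_rewards[0]
    eval_delta = eval_scores[-1] - eval_scores[0]
    return train_delta > tolerance and eval_delta <= tolerance


def length_ratio(tuned_outputs: list[str], base_outputs: list[str]) -> float:
    """Mean word count of tuned outputs divided by that of base model outputs."""
    tuned = mean(len(o.split()) for o in tuned_outputs)
    base = mean(len(o.split()) for o in base_outputs)
    if base == 0:
        raise ValueError("base outputs contain no words")
    return tuned / base


print(reward_diverging([0.50, 0.60, 0.70], [0.50, 0.50, 0.48]))
print(length_ratio(["a much longer rambling answer here"], ["short answer here"]))
```

A ratio well above 1.0 does not prove reward hacking on its own. It is a prompt to read the outputs and check whether the extra length carries information or just pleases the grader.
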
## Key Principles

| Principle | Action |
|-----------|--------|
| Align graders | Training grader must rank outputs the same way as your evaluation methodology |
| Cross-validate first | Spearman ρ ≥ 0.8 between training and eval graders |
| Monitor train-val gap | ≤ 0.05 healthy, > 0.10 stop |
| Test hackability | Intentionally bad outputs should score ≤ 5/10 |
| Prefer SFT when possible | Use RFT only for verifiable-answer tasks |
| Iterate graders, not models | Fix grader before restarting training |
