Microsoft Foundry Skill

Deploy, evaluate, and manage AI agents end-to-end on Microsoft Azure AI Foundry

finetuning/references/reward-hacking.md

# Reward Hacking Prevention in RFT

## What Is Reward Hacking?

Reward hacking occurs when the model optimizes for the grader's scoring function rather than the actual task. The training grader is only a proxy reward; when it diverges from true quality, the model games the proxy instead of improving.

**Core rule: Your training grader MUST produce the same ranking as your evaluation methodology.**

| If you evaluate with… | Then train with… | NOT with… |
|------------------------|------------------|-----------|
| LLM judge (semantic) | LLM judge | AST / regex / structural matching |
| Exact match | Exact match | Fuzzy or partial matching |
| Unit tests | Unit tests | Static analysis alone |

Misaligned graders are the #1 cause of reward hacking.
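One way to spot-check the core rule is to list the pairs of outputs that the two graders order differently. The sketch below is illustrative only: `discordant_pairs` and the sample scores are hypothetical and not part of any Foundry API.

```python
from itertools import combinations


def discordant_pairs(train_scores, eval_scores):
    """Return index pairs (i, j) that the two graders rank in opposite order."""
    if len(train_scores) != len(eval_scores):
        raise ValueError("score lists must have the same length")
    pairs = []
    for i, j in combinations(range(len(train_scores)), 2):
        train_diff = train_scores[i] - train_scores[j]
        eval_diff = eval_scores[i] - eval_scores[j]
        # Opposite signs mean the graders disagree on which output is better.
        if train_diff * eval_diff < 0:
            pairs.append((i, j))
    return pairs


# Hypothetical scores for four outputs, one list per grader
train_scores = [0.9, 0.4, 0.7, 0.2]
eval_scores = [0.3, 0.5, 0.8, 0.1]
print(discordant_pairs(train_scores, eval_scores))  # [(0, 1), (0, 2)]
```

Here output 0 is the training grader's favorite but ranks below outputs 1 and 2 under evaluation, which is exactly the kind of output worth reading by hand.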

## Train-Val Gap Thresholds

| Train-Val Gap | Status | Action |
|---------------|--------|--------|
| ≤ 0.05 | ✅ Healthy | Continue training |
| > 0.05 to 0.10 | ⚠️ Warning | Monitor closely, check outputs qualitatively |
| > 0.10 | 🛑 Stop | Stop training; reward hacking is likely |
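The thresholds can be expressed as a small helper. This is a minimal sketch that assumes the gap is the mean training reward minus the mean validation reward, which the source does not define explicitly; the function names are hypothetical.

```python
def train_val_gap(train_reward: float, val_reward: float) -> float:
    """Gap as assumed here: mean training reward minus mean validation reward."""
    return train_reward - val_reward


def gap_status(gap: float) -> str:
    """Map a gap to the status bands in the threshold table."""
    if gap > 0.10:
        return "stop"
    if gap > 0.05:
        return "warning"
    return "healthy"


# Hypothetical rewards at one evaluation step
print(gap_status(train_val_gap(0.80, 0.72)))  # warning
```

Under this definition a negative gap (validation reward above training reward) falls in the healthy band.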

## Pre-Training Checklist

1. **Baseline the grader**: Run the training grader on base model outputs. Record the scores as your floor.
2. **Cross-validate graders**: If the training grader differs from the eval grader, generate 50 outputs, score them with both, and compute Spearman's ρ. Proceed only if ρ ≥ 0.8. If ρ < 0.6, fix alignment first.
3. **Test hackability**: Generate 5 intentionally bad outputs that might score well. If the grader scores any of them above 5/10, redesign it.
4. **Set gap threshold**: Monitor the train-val gap every `eval_interval`. Stop if it exceeds 0.10.
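Steps 2 and 3 of the checklist can be sketched with the standard library alone. All function names below are hypothetical, the sample scores are made up, and the "undecided" label for 0.6 ≤ ρ < 0.8 is an assumption, since the source gives no rule for that band.

```python
from statistics import mean


def average_ranks(values):
    """Assign 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when a grader gives constant scores")
    return cov / (var_x * var_y) ** 0.5


def alignment_decision(rho):
    """Apply the checklist thresholds to a computed rho."""
    if rho >= 0.8:
        return "proceed"
    if rho < 0.6:
        return "fix alignment first"
    return "undecided"  # assumption: the source is silent on this band


def is_hackable(bad_output_scores, limit=5):
    """True if any intentionally bad output scores above limit (out of 10)."""
    return any(score > limit for score in bad_output_scores)


# Hypothetical scores; the checklist calls for 50 outputs, 5 are shown for brevity
train_grader = [1, 2, 3, 4, 5]
eval_grader = [1, 3, 2, 4, 5]
rho = spearman_rho(train_grader, eval_grader)
print(round(rho, 3), alignment_decision(rho))  # 0.9 proceed
print(is_hackable([2, 3, 6, 1, 4]))  # True
```

If SciPy is available, `scipy.stats.spearmanr` computes the same statistic and could replace the hand-rolled version.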

## Grader Iteration Loop

When reward hacking is detected:

```
1. STOP the training run
        ↓
2. COLLECT "hacked" outputs (high train score, low eval score)
        ↓
3. ANALYZE what pattern the model exploited
   (structural mimicry? verbosity? keyword stuffing?)
        ↓
4. UPDATE the grader to penalize that pattern
        ↓
5. RE-BASELINE the updated grader on base model outputs
        ↓
6. RESTART training with the improved grader
```
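Step 2 of the loop amounts to a filter over scored outputs. The sketch below assumes both graders score on a 0 to 1 scale; the `ScoredOutput` type and the `high_train` and `low_eval` cutoffs are hypothetical, since the source does not say what counts as "high" or "low".

```python
from dataclasses import dataclass


@dataclass
class ScoredOutput:
    text: str
    train_score: float  # score from the training grader
    eval_score: float   # score from the evaluation grader


def collect_hacked(outputs, high_train=0.8, low_eval=0.5):
    """Return outputs with a high train score and a low eval score, worst gap first."""
    hacked = [
        o for o in outputs
        if o.train_score >= high_train and o.eval_score <= low_eval
    ]
    return sorted(hacked, key=lambda o: o.train_score - o.eval_score, reverse=True)


# Hypothetical scored outputs
outputs = [
    ScoredOutput("a", 0.95, 0.20),
    ScoredOutput("b", 0.90, 0.85),
    ScoredOutput("c", 0.85, 0.40),
    ScoredOutput("d", 0.30, 0.25),
]
print([o.text for o in collect_hacked(outputs)])  # ['a', 'c']
```

Sorting by the score difference puts the most clearly exploited outputs first, which is a convenient starting point for the analysis in step 3.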

## Red Flags Checklist

Investigate immediately if **any** are true:

- [ ] Train-val gap > 0.10
- [ ] Training reward increasing but eval quality stable or declining
- [ ] Model outputs are longer/more verbose than base model
- [ ] Outputs structurally match references but are semantically wrong
- [ ] Different LLM judges disagree on quality
- [ ] Conciseness/style scores dropping while correctness climbs
- [ ] Model produces "template" responses
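The verbosity flag is one of the easier items to check automatically. This is a rough sketch that compares mean word counts on the same prompts; `verbosity_ratio` is a hypothetical helper, and the source sets no numeric cutoff, so any threshold applied to the ratio is a judgment call.

```python
from statistics import mean


def verbosity_ratio(base_outputs, tuned_outputs):
    """Mean word count of tuned outputs divided by that of base model outputs."""
    base_len = mean(len(text.split()) for text in base_outputs)
    tuned_len = mean(len(text.split()) for text in tuned_outputs)
    if base_len == 0:
        raise ValueError("base outputs contain no words")
    return tuned_len / base_len


# Hypothetical outputs for the same two prompts
base = ["The answer is 4.", "Paris is the capital."]
tuned = [
    "After careful step by step analysis, the answer is 4.",
    "Considering all the evidence, Paris is the capital city.",
]
print(verbosity_ratio(base, tuned))
```

A ratio well above 1 on matched prompts suggests the model may be earning reward through length rather than correctness.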

## Key Principles

| Principle | Action |
|-----------|--------|
| Align graders | Training grader must rank outputs the same way as the eval grader |
| Cross-validate first | Spearman ρ ≥ 0.8 between training and eval graders |
| Monitor train-val gap | ≤ 0.05 healthy, > 0.10 stop |
| Test hackability | Bad outputs should score ≤ 5/10 |
| Prefer SFT when possible | Use RFT only for verifiable-answer tasks |
| Iterate graders, not models | Fix grader before restarting training |