RFT Grader Design Guide
Grader Type Selection
| Grader Type | Best For | Tradeoffs |
|---|---|---|
| Python grader (default) | Most tasks incl. tool-calling. Accesses output_text and output_tools. | Can't call external APIs or execute code. |
| Multi grader | Combining multiple scoring dimensions. | score_model component adds LLM cost per rollout. |
| Endpoint grader | Tasks requiring external API calls (test suites, DB queries). | HTTP latency, scaling risk. Under-provisioned endpoints can hang jobs. |
| String check | Exact-match tasks (classification, yes/no, numeric). | Binary 0/1 only — no partial credit. |
Start with a Python grader unless you need external API calls. Python graders are fast, deterministic, reliable, and tool-aware: sample["output_tools"] provides tool call metadata.
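To illustrate tool awareness, here is a minimal sketch of a grader that checks which tools were called. The shape of each output_tools entry (a dict with a nested function name, as in OpenAI-style tool calls) and the expected_tool field on item are assumptions made for illustration; inspect a real sample before relying on either.

```python
def tool_names(output_tools):
    """Collect tool names from a list of tool-call dicts.

    Assumes OpenAI-style entries: {"function": {"name": ...}}.
    Falls back to a top-level "name" key and skips anything else.
    """
    names = []
    for call in output_tools or []:
        if not isinstance(call, dict):
            continue
        fn = call.get("function")
        name = fn.get("name") if isinstance(fn, dict) else call.get("name")
        if name:
            names.append(name)
    return names


def grade(sample, item):
    # "expected_tool" is a hypothetical dataset field, not a platform field.
    expected = item.get("expected_tool")
    called = tool_names(sample.get("output_tools"))
    return 1.0 if expected in called else 0.0
```

Returning 0.0 when output_tools is missing or empty keeps the default behavior explicit, which matters once several graders have to agree.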
Partial Credit Pattern
Binary pass/fail scoring gives a sparse reward signal: a nearly correct answer earns the same 0 as a completely wrong one. Decompose the score into 2–4 weighted dimensions:
```python
def grade(sample, item):
    # correct_action, precision_score, reasoning_score, and used_correct_tools
    # are task-specific helpers that you define alongside grade().
    output_text = sample.get("output_text", "") or ""
    expected = item.get("expected_answer", "")
    score = 0.0

    # Core correctness (highest weight)
    if correct_action(output_text, expected):
        score += 0.4

    # Precision (exact amounts, specific values)
    score += 0.3 * precision_score(output_text, expected)

    # Reasoning quality (cited correct rules/facts)
    score += 0.2 * reasoning_score(output_text, expected)

    # Process quality (used the right tools)
    if used_correct_tools(sample.get("output_tools") or []):
        score += 0.1

    return round(min(score, 1.0), 3)
```
Weight Guidelines
| Dimension | Typical Weight | Examples |
|---|---|---|
| Core correctness | 0.3–0.5 | Right action/answer/classification |
| Precision | 0.2–0.3 | Exact amounts, correct format |
| Reasoning | 0.1–0.2 | Cited correct rules, justified decision |
| Process quality | 0.05–0.1 | Used right tools, followed steps |
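One way to keep the weights inside these ranges and summing to 1.0 is to hold them in a single table and validate it once. The sketch below is illustrative only: the names WEIGHTS and combine are hypothetical, and the specific values are just one choice within the ranges above.

```python
import math

# Hypothetical weights: one choice within the guideline ranges.
WEIGHTS = {
    "correctness": 0.4,
    "precision": 0.3,
    "reasoning": 0.2,
    "process": 0.1,
}

# Fail fast if a weight is edited and the total drifts away from 1.0.
assert math.isclose(sum(WEIGHTS.values()), 1.0), "weights must sum to 1.0"


def combine(dimension_scores):
    """Weighted sum of per-dimension scores, each clamped to [0, 1].

    Dimensions missing from dimension_scores count as 0.0.
    """
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = float(dimension_scores.get(name, 0.0))
        total += weight * max(0.0, min(value, 1.0))
    return round(min(total, 1.0), 3)
```

Keeping the weights in one place also makes it easier to copy the exact same values into every grader that scores the task.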
Threshold Calibration Workflow
The pass_threshold sets the minimum score that counts as a pass, and it is the most important RFT hyperparameter. Calibrate it against the base model:
- Run the base model on your training/validation set
- Score every output with your grader
- Compute pass rates at multiple thresholds:
```python
# scores: list of grader scores from the previous step
for threshold in [0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95]:
    pass_rate = sum(1 for s in scores if s >= threshold) / len(scores)
    print(f" @{threshold}: pass={pass_rate:.0%}, fail={1 - pass_rate:.0%}")
```
- Choose the threshold at which 25–50% of base model rollouts fail:
| Failure Rate | Signal Quality |
|---|---|
| < 10% | ❌ Too easy — no learning signal |
| 10–25% | ⚠️ Weak signal |
| 25–50% | ✅ Good — enough failures to learn from |
| 50–70% | ⚠️ Harsh — mostly negative reward |
| > 70% | ❌ Too hard — training may diverge |
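The selection step can also be automated. A minimal sketch, assuming scores is a list of base-model grader scores in [0, 1]; the function names and the example scores are hypothetical and chosen only to show the mechanics.

```python
def failure_rate(scores, threshold):
    """Fraction of scores strictly below the pass threshold."""
    if not scores:
        raise ValueError("scores must be non-empty")
    return sum(1 for s in scores if s < threshold) / len(scores)


def thresholds_in_band(scores, candidates, low=0.25, high=0.50):
    """Return (threshold, failure_rate) pairs whose failure rate is in [low, high]."""
    picks = []
    for t in candidates:
        rate = failure_rate(scores, t)
        if low <= rate <= high:
            picks.append((t, rate))
    return picks


# Made-up scores for illustration only.
scores = [0.35, 0.55, 0.62, 0.70, 0.74, 0.81, 0.86, 0.90, 0.93, 0.97]
candidates = [0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95]
for t, rate in thresholds_in_band(scores, candidates):
    print(f"threshold={t}: fail={rate:.0%}")
```

If no candidate lands in the band, that is a hint to revisit the grader's dimensions or the dataset difficulty before adjusting the threshold further.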
Always re-run calibration when you change your dataset.
Consistency Rules
When the same task is scored by multiple graders (for example, Python for training, endpoint for debugging, local script for eval), keep them in sync:
- Identical scoring logic: same weights, keywords, and dimension breakdown
- Identical default scores: same behavior when no action is found or no amounts are expected
- Test with the same examples: run 10 samples through all graders and verify that the scores match
Mismatched scoring causes the model to learn different behavior than what your evaluation measures.
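The third rule can be turned into a small regression check. A sketch, assuming each grader is wrapped as a Python callable that takes (sample, item) and returns a float; the names find_mismatches, python_grader, and eval_grader are hypothetical, and an endpoint grader would need its own wrapper around the HTTP call.

```python
import math


def find_mismatches(graders, examples, tol=1e-6):
    """Run every example through every grader and report disagreements.

    graders: dict mapping a label to a callable(sample, item) -> float.
    examples: list of (sample, item) pairs.
    Returns a list of (example_index, {label: score}) for mismatched examples.
    """
    mismatches = []
    for i, (sample, item) in enumerate(examples):
        scores = {label: fn(sample, item) for label, fn in graders.items()}
        values = list(scores.values())
        if any(not math.isclose(v, values[0], abs_tol=tol) for v in values[1:]):
            mismatches.append((i, scores))
    return mismatches


# Toy graders that differ only in their default score for empty output.
def python_grader(sample, item):
    text = sample.get("output_text") or ""
    return 1.0 if text == item["expected_answer"] else 0.0


def eval_grader(sample, item):
    text = sample.get("output_text") or ""
    if not text:
        return 0.5  # inconsistent default: the kind of bug this check catches
    return 1.0 if text == item["expected_answer"] else 0.0


examples = [
    ({"output_text": "yes"}, {"expected_answer": "yes"}),
    ({"output_text": ""}, {"expected_answer": "no"}),
]
graders = {"train": python_grader, "eval": eval_grader}
for index, scores in find_mismatches(graders, examples):
    print(f"example {index}: {scores}")
```

Including edge cases such as empty output in the sample set is what exposes differing default scores, which ordinary examples tend to hide.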