Deploy, evaluate, and manage AI agents end-to-end on Microsoft Azure AI Foundry
finetuning/references/grader-design.md
# RFT Grader Design Guide

## Grader Type Selection

| Grader Type | Best For | Tradeoffs |
|------------|---------|-----------|
| **Python grader** (default) | Most tasks incl. tool-calling. Accesses `output_text` and `output_tools`. | Can't call external APIs or execute code. |
| **Multi grader** | Combining multiple scoring dimensions. | `score_model` component adds LLM cost per rollout. |
| **Endpoint grader** | Tasks requiring external API calls (test suites, DB queries). | HTTP latency, scaling risk. Under-provisioned endpoints can hang jobs. |
| **String check** | Exact-match tasks (classification, yes/no, numeric). | Binary 0/1 only; no partial credit. |

Start with the Python grader unless you need external API calls. Python graders are fast, deterministic, reliable, and tool-aware (the `output_tools` field of `sample` provides tool call metadata).

## Partial Credit Pattern

Binary pass/fail gives sparse reward. Decompose the score into 2–4 scored dimensions:

```python
# correct_action, precision_score, reasoning_score, and used_correct_tools
# are task-specific helpers that you define for your own task.
def grade(sample, item):
    output_text = sample.get("output_text", "") or ""
    expected = item.get("expected_answer", "")

    score = 0.0

    # Core correctness (highest weight)
    if correct_action(output_text, expected):
        score += 0.4

    # Precision (exact amounts, specific values)
    score += 0.3 * precision_score(output_text, expected)

    # Reasoning quality (cited correct rules/facts)
    score += 0.2 * reasoning_score(output_text, expected)

    # Process quality (used the right tools)
    if used_correct_tools(sample.get("output_tools", [])):
        score += 0.1

    # Clamp to 1.0 to absorb floating-point overshoot
    return round(min(score, 1.0), 3)
```

### Weight Guidelines

| Dimension | Typical Weight | Examples |
|-----------|---------------|----------|
| Core correctness | 0.3–0.5 | Right action/answer/classification |
| Precision | 0.2–0.3 | Exact amounts, correct format |
| Reasoning | 0.1–0.2 | Cited correct rules, justified decision |
| Process quality | 0.05–0.1 | Used right tools, followed steps |

## Threshold Calibration Workflow

The `pass_threshold` determines what score counts as pass vs. fail. It is the most important RFT hyperparameter.

1. Run the **base model** on your training/validation set
2. Score every output with your grader
3. Compute pass rates at multiple thresholds:

```python
# scores: the grader scores from step 2, one float per base-model output
for threshold in [0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95]:
    pass_rate = sum(1 for s in scores if s >= threshold) / len(scores)
    print(f"  @{threshold}: pass={pass_rate:.0%}, fail={1 - pass_rate:.0%}")
```

4. Choose the threshold where **25–50% of base model rollouts fail**:

| Failure Rate | Signal Quality |
|-------------|----------------|
| < 10% | ❌ Too easy: no learning signal |
| 10–25% | ⚠️ Weak signal |
| **25–50%** | ✅ Good: enough failures to learn from |
| 50–70% | ⚠️ Harsh: mostly negative reward |
| > 70% | ❌ Too hard: training may diverge |

**Always re-run calibration when you change your dataset.**

## Consistency Rules

When using multiple graders (Python for training, endpoint for debugging, local script for eval):

1. **Identical scoring logic**: same weights, keywords, dimension breakdown
2. **Identical default scores**: same behavior when no action is found or no amounts are expected
3. **Test with same examples**: run 10 samples through all graders and verify that the scores match

Mismatched scoring causes the model to learn different behavior than what your evaluation measures.
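
The "test with same examples" rule can be automated with a small comparison harness. The sketch below is illustrative only: `find_mismatches`, `training_grader`, and `eval_grader` are hypothetical names, and it assumes every grader can be wrapped as a local callable with the same `(sample, item)` signature as `grade`. The two toy graders deliberately differ in their default score for empty output, which is the kind of drift the second rule warns about.

```python
from typing import Callable, Dict, List, Tuple

# Assumed shape: every grader takes (sample, item) and returns a float score.
Grader = Callable[[dict, dict], float]


def find_mismatches(
    graders: Dict[str, Grader],
    examples: List[Tuple[dict, dict]],
    tolerance: float = 1e-6,
) -> List[Tuple[int, Dict[str, float]]]:
    """Return (example index, per-grader scores) where the graders disagree."""
    if not graders:
        raise ValueError("at least one grader is required")
    mismatches = []
    for index, (sample, item) in enumerate(examples):
        scores = {name: grader(sample, item) for name, grader in graders.items()}
        # Graders agree when the spread of their scores is within tolerance.
        if max(scores.values()) - min(scores.values()) > tolerance:
            mismatches.append((index, scores))
    return mismatches


# Hypothetical toy graders: same matching logic, different default scores.
def training_grader(sample: dict, item: dict) -> float:
    text = sample.get("output_text", "") or ""
    if not text:
        return 0.0  # default score when there is no output
    return 1.0 if item.get("expected_answer", "") in text else 0.0


def eval_grader(sample: dict, item: dict) -> float:
    text = sample.get("output_text", "") or ""
    if not text:
        return 0.5  # inconsistent default: differs from training_grader
    return 1.0 if item.get("expected_answer", "") in text else 0.0


examples = [
    ({"output_text": "refund approved"}, {"expected_answer": "refund"}),
    ({"output_text": ""}, {"expected_answer": "refund"}),
]

mismatches = find_mismatches(
    {"training": training_grader, "eval": eval_grader}, examples
)
for index, scores in mismatches:
    print(f"example {index}: {scores}")
```

Only the empty-output example is reported, because that is the one case where the two defaults diverge. For an endpoint grader, the callable would wrap the HTTP request; that wrapper is not shown here.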