Loading source
Pulling the file list, source metadata, and syntax-aware rendering for this listing.
Source from repo
Build and deploy AI applications on Azure AI Foundry using Microsoft's model catalog and AI services
Files
Skill
Size
Entrypoint
Format
Open file
Syntax-highlighted preview of this file as included in the skill package.
finetuning/references/grader-design.md
1# RFT Grader Design Guide23## Grader Type Selection45| Grader Type | Best For | Tradeoffs |6|------------|---------|-----------|7| **Python grader** (default) | Most tasks incl. tool-calling. Accesses `output_text` and `output_tools`. | Can't call external APIs or execute code. |8| **Multi grader** | Combining multiple scoring dimensions. | `score_model` component adds LLM cost per rollout. |9| **Endpoint grader** | Tasks requiring external API calls (test suites, DB queries). | HTTP latency, scaling risk. Under-provisioned endpoints can hang jobs. |10| **String check** | Exact-match tasks (classification, yes/no, numeric). | Binary 0/1 only — no partial credit. |1112Start with Python grader unless you need external API calls. Python graders are fast, deterministic, reliable, and tool-aware (`sample.output_tools` provides tool call metadata).1314## Partial Credit Pattern1516Binary pass/fail gives sparse reward. Decompose into 2–4 scored dimensions:1718```python19def grade(sample, item):20output_text = sample.get("output_text", "") or ""21expected = item.get("expected_answer", "")2223score = 0.02425# Core correctness (highest weight)26if correct_action(output_text, expected):27score += 0.42829# Precision (exact amounts, specific values)30score += 0.3 * precision_score(output_text, expected)3132# Reasoning quality (cited correct rules/facts)33score += 0.2 * reasoning_score(output_text, expected)3435# Process quality (used the right tools)36if used_correct_tools(sample.get("output_tools", [])):37score += 0.13839return round(min(score, 1.0), 3)40```4142### Weight Guidelines4344| Dimension | Typical Weight | Examples |45|-----------|---------------|----------|46| Core correctness | 0.3–0.5 | Right action/answer/classification |47| Precision | 0.2–0.3 | Exact amounts, correct format |48| Reasoning | 0.1–0.2 | Cited correct rules, justified decision |49| Process quality | 0.05–0.1 | Used right tools, followed steps |5051## Threshold Calibration Workflow5253The `pass_threshold` determines what score counts as pass vs fail — the most important RFT hyperparameter.54551. Run the **base model** on your training/validation set562. Score every output with your grader573. Compute pass rates at multiple thresholds:5859```python60for threshold in [0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95]:61pass_rate = sum(1 for s in scores if s >= threshold) / len(scores)62print(f" @{threshold}: pass={pass_rate:.0%}, fail={1 - pass_rate:.0%}")63```64654. Choose where **25–50% of base model rollouts fail**:6667| Failure Rate | Signal Quality |68|-------------|----------------|69| < 10% | ❌ Too easy — no learning signal |70| 10–25% | ⚠️ Weak signal |71| **25–50%** | ✅ Good — enough failures to learn from |72| 50–70% | ⚠️ Harsh — mostly negative reward |73| > 70% | ❌ Too hard — training may diverge |7475**Always re-run calibration when you change your dataset.**7677## Consistency Rules7879When using multiple graders (Python for training, endpoint for debugging, local script for eval):80811. **Identical scoring logic** — same weights, keywords, dimension breakdown822. **Identical default scores** — same behavior when no action found, no amounts expected833. **Test with same examples** — run 10 samples through all graders and verify scores match8485Mismatched scoring causes the model to learn different behavior than what your evaluation measures.86