Microsoft Foundry Skill

Build and deploy AI applications on Azure AI Foundry using Microsoft's model catalog and AI services

Files: 155

Skill: n/a

Size: 976.3 KB

Entrypoint: SKILL.md

Format: git-repo

finetuning/references/grader-design.md

# RFT Grader Design Guide

## Grader Type Selection

| Grader Type | Best For | Tradeoffs |
|------------|---------|-----------|
| **Python grader** (default) | Most tasks including tool-calling. Accesses `output_text` and `output_tools`. | Can't call external APIs or execute code. |
| **Multi grader** | Combining multiple scoring dimensions. | `score_model` component adds LLM cost per rollout. |
| **Endpoint grader** | Tasks requiring external API calls (test suites, DB queries). | HTTP latency, scaling risk. Under-provisioned endpoints can hang jobs. |
| **String check** | Exact-match tasks (classification, yes/no, numeric). | Binary 0/1 only; no partial credit. |

Start with the Python grader unless you need external API calls. Python graders are fast, deterministic, reliable, and tool-aware (`sample["output_tools"]` provides tool call metadata).

## Partial Credit Pattern

Binary pass/fail gives a sparse reward signal. Decompose the score into 2–4 weighted dimensions:

```python
def grade(sample, item):
    # correct_action, precision_score, reasoning_score, and used_correct_tools
    # are task-specific helpers that must be defined alongside grade().
    output_text = sample.get("output_text", "") or ""
    expected = item.get("expected_answer", "")

    score = 0.0

    # Core correctness (highest weight)
    if correct_action(output_text, expected):
        score += 0.4

    # Precision (exact amounts, specific values)
    score += 0.3 * precision_score(output_text, expected)

    # Reasoning quality (cited correct rules/facts)
    score += 0.2 * reasoning_score(output_text, expected)

    # Process quality (used the right tools); guard against a missing or null list
    if used_correct_tools(sample.get("output_tools") or []):
        score += 0.1

    return round(min(score, 1.0), 3)
```
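
The four helpers called by `grade` are not defined in this guide. As a minimal sketch of what they might look like, the versions below use simple string and regex checks. Everything here is an assumption for illustration: the helper bodies, the `REASONING_TERMS` and `EXPECTED_TOOLS` constants, and the shape of each tool call entry (a dict with a nested `function.name`) should be replaced with logic that fits your task and the payload your grader actually receives.

```python
import re

# Hypothetical task configuration; replace with your own values.
REASONING_TERMS = ("because", "per policy")
EXPECTED_TOOLS = {"lookup_order"}

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def correct_action(output_text, expected):
    """True when the expected answer appears in the output (case-insensitive)."""
    return bool(expected) and expected.lower() in output_text.lower()


def precision_score(output_text, expected):
    """Fraction of numbers in the expected answer that also appear in the output."""
    wanted = NUMBER.findall(expected)
    if not wanted:
        return 1.0  # nothing numeric to check
    found = set(NUMBER.findall(output_text))
    return sum(1 for n in wanted if n in found) / len(wanted)


def reasoning_score(output_text, expected):
    """Fraction of REASONING_TERMS present in the output (`expected` is unused here)."""
    text = output_text.lower()
    return sum(1 for term in REASONING_TERMS if term in text) / len(REASONING_TERMS)


def used_correct_tools(tool_calls):
    """True when every tool in EXPECTED_TOOLS was called at least once.

    Assumes each entry looks like {"function": {"name": ...}}.
    """
    names = {
        (call.get("function") or {}).get("name")
        for call in tool_calls
        if isinstance(call, dict)
    }
    return EXPECTED_TOOLS.issubset(names)
```

Each helper returns either a boolean or a value in the range 0 to 1, so the weighted sum in `grade` stays within 0 to 1 before the final `min` clamp.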

### Weight Guidelines

| Dimension | Typical Weight | Examples |
|-----------|---------------|----------|
| Core correctness | 0.3–0.5 | Right action/answer/classification |
| Precision | 0.2–0.3 | Exact amounts, correct format |
| Reasoning | 0.1–0.2 | Cited correct rules, justified decision |
| Process quality | 0.05–0.1 | Used right tools, followed steps |
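
One possible way to keep the weights in a single place is to store them in a dict and check that they sum to 1.0 before combining per-dimension scores. The sketch below is an illustration only: the `WEIGHTS` values are one choice within the ranges above, and `combine` is a hypothetical helper, not part of any grader API.

```python
import math

# One choice of weights within the typical ranges; adjust for your task.
WEIGHTS = {"core": 0.4, "precision": 0.3, "reasoning": 0.2, "process": 0.1}


def combine(dimension_scores, weights=WEIGHTS):
    """Weighted sum of per-dimension scores, each clamped to [0, 1]."""
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError(f"weights must sum to 1.0, got {sum(weights.values())}")
    total = 0.0
    for name, weight in weights.items():
        value = dimension_scores.get(name, 0.0)  # a missing dimension scores 0
        total += weight * min(max(value, 0.0), 1.0)
    return round(min(total, 1.0), 3)
```

Raising on a bad weight sum surfaces a miscalibrated grader during local testing rather than partway through a training job.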

## Threshold Calibration Workflow

The `pass_threshold` determines what score counts as a pass versus a fail, and it is the most important RFT hyperparameter.

1. Run the **base model** on your training/validation set
2. Score every output with your grader
3. Compute pass rates at multiple thresholds:

```python
# `scores` is the list of grader scores from step 2, one float per base-model output.
for threshold in [0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95]:
    pass_rate = sum(1 for s in scores if s >= threshold) / len(scores)
    print(f"  @{threshold}: pass={pass_rate:.0%}, fail={1 - pass_rate:.0%}")
```

4. Choose the threshold where **25–50% of base model rollouts fail**:

| Failure Rate | Signal Quality |
|-------------|----------------|
| < 10% | ❌ Too easy: no learning signal |
| 10–25% | ⚠️ Weak signal |
| **25–50%** | ✅ Good: enough failures to learn from |
| 50–70% | ⚠️ Harsh: mostly negative reward |
| > 70% | ❌ Too hard: training may diverge |

**Always re-run calibration when you change your dataset.**
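
Step 4 can be automated with a small selection helper. The sketch below is one possible approach, not a prescribed method: `failure_rates` and `pick_threshold` are hypothetical names, and returning the lowest qualifying threshold is an arbitrary tie-break that you may want to change.

```python
def failure_rates(scores, thresholds):
    """Map each threshold to the fraction of scores that fall below it."""
    return {t: sum(1 for s in scores if s < t) / len(scores) for t in thresholds}


def pick_threshold(scores, thresholds, low=0.25, high=0.50):
    """Return the lowest threshold whose failure rate lies in [low, high], or None."""
    rates = failure_rates(scores, thresholds)
    candidates = [t for t in sorted(thresholds) if low <= rates[t] <= high]
    return candidates[0] if candidates else None


# Example with made-up base-model scores.
example_scores = [0.2, 0.55, 0.65, 0.75, 0.8, 0.85, 0.9, 0.92, 0.96, 1.0]
example_thresholds = [0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95]
chosen = pick_threshold(example_scores, example_thresholds)
```

A return value of `None` means no candidate threshold lands in the target band, which suggests the grader or the dataset needs adjusting rather than the threshold.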

## Consistency Rules

When using multiple graders (Python for training, endpoint for debugging, local script for eval):

1. **Identical scoring logic**: same weights, keywords, dimension breakdown
2. **Identical default scores**: same behavior when no action is found or no amounts are expected
3. **Test with same examples**: run 10 samples through all graders and verify scores match

Mismatched scoring causes the model to learn different behavior than what your evaluation measures.
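
The third rule can be checked mechanically. As a rough sketch, assuming each grader can be wrapped as a local callable with the same `(sample, item)` signature as `grade` (for an endpoint grader that would mean a thin wrapper around the HTTP call), the hypothetical `find_mismatches` helper below reports every example on which the graders disagree.

```python
def find_mismatches(graders, examples, tolerance=1e-6):
    """Return (index, scores) for each example where graders disagree.

    graders:  dict mapping a grader name to a callable(sample, item) -> float
    examples: list of (sample, item) pairs
    """
    mismatches = []
    for index, (sample, item) in enumerate(examples):
        scores = {name: fn(sample, item) for name, fn in graders.items()}
        if max(scores.values()) - min(scores.values()) > tolerance:
            mismatches.append((index, scores))
    return mismatches
```

An empty result on your sample set is what you want before launching a job; any entry in the list identifies the example and the per-grader scores to investigate.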