Microsoft Foundry Skill

Deploy, evaluate, and manage AI agents end-to-end on Microsoft Azure AI Foundry

finetuning/references/grader-design.md

# RFT Grader Design Guide

## Grader Type Selection

| Grader Type | Best For | Tradeoffs |
|------------|---------|-----------|
| **Python grader** (default) | Most tasks, including tool-calling. Accesses `output_text` and `output_tools`. | Can't call external APIs or execute code. |
| **Multi grader** | Combining multiple scoring dimensions. | `score_model` component adds LLM cost per rollout. |
| **Endpoint grader** | Tasks requiring external API calls (test suites, DB queries). | HTTP latency, scaling risk. Under-provisioned endpoints can hang jobs. |
| **String check** | Exact-match tasks (classification, yes/no, numeric). | Binary 0/1 only, no partial credit. |

Start with a Python grader unless you need external API calls. Python graders are fast, deterministic, reliable, and tool-aware (`sample.output_tools` provides tool call metadata).
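
As a rough illustration of a tool-aware Python grader, the sketch below reads both fields named in the table. The shape of each `output_tools` entry (a dict with a nested `function.name`) and the `expected_tool` dataset field are assumptions made for this example, not something this guide specifies; adjust them to match your actual rollouts.

```python
def grade(sample, item):
    """Return 1.0 if the expected tool was called, 0.5 for a text-only match, else 0.0."""
    output_text = sample.get("output_text", "") or ""
    tool_calls = sample.get("output_tools") or []

    # Assumed entry shape: {"function": {"name": ..., "arguments": ...}}
    called = {
        (call.get("function") or {}).get("name")
        for call in tool_calls
        if isinstance(call, dict)
    }

    expected_tool = item.get("expected_tool")  # hypothetical dataset field
    expected_answer = item.get("expected_answer", "")

    if expected_tool and expected_tool in called:
        return 1.0
    if expected_answer and expected_answer.lower() in output_text.lower():
        return 0.5
    return 0.0
```

The grader only inspects data it is handed, which is what keeps it fast and deterministic.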

## Partial Credit Pattern

Binary pass/fail gives a sparse reward signal. Decompose the score into 2–4 weighted dimensions:

```python
def grade(sample, item):
    # correct_action, precision_score, reasoning_score, and used_correct_tools
    # are task-specific helpers that you define alongside grade().
    output_text = sample.get("output_text", "") or ""
    expected = item.get("expected_answer", "")

    score = 0.0

    # Core correctness (highest weight)
    if correct_action(output_text, expected):
        score += 0.4

    # Precision (exact amounts, specific values); helper returns 0.0-1.0
    score += 0.3 * precision_score(output_text, expected)

    # Reasoning quality (cited correct rules/facts); helper returns 0.0-1.0
    score += 0.2 * reasoning_score(output_text, expected)

    # Process quality (used the right tools)
    if used_correct_tools(sample.get("output_tools") or []):
        score += 0.1

    # Clamp to 1.0 and round so scores are stable and comparable
    return round(min(score, 1.0), 3)
```
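
The helpers called by `grade` are left to the task. As one possible sketch of a fractional dimension, `precision_score` below returns the share of numeric values in the expected answer that also appear in the output. Treating "precision" as number matching, and returning full credit when nothing numeric is expected, are assumptions for illustration only.

```python
import re

# Matches integers and decimals such as 25, 25.00, or -3.5
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

def precision_score(output_text, expected):
    """Fraction of the numbers in `expected` that also appear in `output_text`."""
    wanted = NUMBER.findall(expected or "")
    if not wanted:
        # Nothing numeric to check: default to full credit (an assumed default)
        return 1.0
    found = set(NUMBER.findall(output_text or ""))
    return sum(1 for number in wanted if number in found) / len(wanted)
```

Whatever default you pick for the "nothing expected" case, keep it the same in every grader that scores the same task.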

### Weight Guidelines

| Dimension | Typical Weight | Examples |
|-----------|---------------|----------|
| Core correctness | 0.3–0.5 | Right action/answer/classification |
| Precision | 0.2–0.3 | Exact amounts, correct format |
| Reasoning | 0.1–0.2 | Cited correct rules, justified decision |
| Process quality | 0.05–0.1 | Used right tools, followed steps |
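
One way to keep the weights honest is to store them in a single table and verify that they sum to 1.0 before combining per-dimension scores. The sketch below assumes each dimension has already been scored in the 0.0–1.0 range; the `WEIGHTS` values and the `combine` function name are illustrative choices within the ranges above.

```python
import math

# Illustrative weights chosen from the typical ranges; they must sum to 1.0
WEIGHTS = {"correctness": 0.4, "precision": 0.3, "reasoning": 0.2, "process": 0.1}

def combine(dimension_scores, weights=WEIGHTS):
    """Weighted sum of per-dimension scores, each clamped to [0, 1]."""
    total = sum(weights.values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"weights sum to {total}, expected 1.0")
    score = sum(
        weight * max(0.0, min(1.0, dimension_scores.get(name, 0.0)))
        for name, weight in weights.items()
    )
    return round(min(score, 1.0), 3)
```

A missing dimension counts as 0.0 here, so an incomplete score dictionary lowers the result instead of raising an error.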

## Threshold Calibration Workflow

The `pass_threshold` sets the score at which a rollout counts as a pass rather than a fail. It is the most important RFT hyperparameter.

1. Run the **base model** on your training/validation set
2. Score every output with your grader
3. Compute pass rates at multiple thresholds:

```python
# `scores` is the list of grader scores collected in step 2
for threshold in [0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95]:
    pass_rate = sum(1 for s in scores if s >= threshold) / len(scores)
    print(f"  @{threshold}: pass={pass_rate:.0%}, fail={1 - pass_rate:.0%}")
```

4. Choose the threshold at which **25–50% of base model rollouts fail**:

| Failure Rate | Signal Quality |
|-------------|----------------|
| < 10% | ❌ Too easy: no learning signal |
| 10–25% | ⚠️ Weak signal |
| **25–50%** | ✅ Good: enough failures to learn from |
| 50–70% | ⚠️ Harsh: mostly negative reward |
| > 70% | ❌ Too hard: training may diverge |
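
The selection in step 4 can be automated. The sketch below returns the candidate threshold whose failure rate falls inside the 25–50% band; preferring the candidate closest to the middle of the band is a tie-breaking assumption of this example, not a rule from the guide, and `pick_threshold` is a hypothetical helper name.

```python
def failure_rate(scores, threshold):
    """Share of scores that fall below the threshold."""
    return sum(1 for s in scores if s < threshold) / len(scores)

def pick_threshold(scores, candidates, low=0.25, high=0.50):
    """Return the candidate with a failure rate in [low, high], nearest the band midpoint.

    Returns None when no candidate lands in the band.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    target = (low + high) / 2
    in_band = [
        (abs(failure_rate(scores, t) - target), t)
        for t in candidates
        if low <= failure_rate(scores, t) <= high
    ]
    return min(in_band)[1] if in_band else None
```

A `None` result suggests the grader itself is too lenient or too strict for the base model, so adjust the scoring dimensions before adjusting the threshold grid.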

**Always re-run calibration when you change your dataset.**

## Consistency Rules

When using multiple graders (Python for training, endpoint for debugging, local script for eval):

1. **Identical scoring logic**: same weights, keywords, and dimension breakdown
2. **Identical default scores**: same behavior when no action is found or no amounts are expected
3. **Test with the same examples**: run 10 samples through all graders and verify that the scores match

Mismatched scoring causes the model to learn different behavior than what your evaluation measures.
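
Rule 3 can be checked with a small harness. The sketch below assumes every grader, including an endpoint grader, has been wrapped as a Python callable taking `(sample, item)` and returning a float; `check_consistency` and the `tolerance` parameter are hypothetical names introduced for this example.

```python
def check_consistency(graders, examples, tolerance=1e-6):
    """Return (example_index, {grader_name: score}) for every example where graders disagree.

    graders:  dict mapping a grader name to a callable (sample, item) -> float
    examples: list of (sample, item) pairs
    """
    mismatches = []
    for index, (sample, item) in enumerate(examples):
        scores = {name: fn(sample, item) for name, fn in graders.items()}
        if max(scores.values()) - min(scores.values()) > tolerance:
            mismatches.append((index, scores))
    return mismatches
```

An empty result on a representative batch of samples is the signal that training and evaluation are scoring the same behavior.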