Source from repo
Microsoft Foundry Skill

Deploy, evaluate, and manage AI agents end-to-end on Microsoft Azure AI Foundry
microsoftGitHub microsoftOfficialSource repo Original GitHub link Publisher page
Files
154
Skill
n/a
Size
976.2 KB
Entrypoint
SKILL.md
Format
git-repo
Open file
finetuning/references/evaluation.md

Syntax-highlighted preview of this file as included in the skill package.
Rendered Source
markdown162 linesFree
finetuning/references/evaluation.md
1# Evaluation Methodology
2 
3## Principles
4 
51. **Always establish a baseline**: Evaluate the base (un-tuned) model first. Without a baseline, you can't measure improvement.
62. **Use a held-out test set**: Never evaluate on training or validation data. The model has seen those.
73. **Use the same test set for every model**: This is the only way to compare results fairly.
84. **Use task-specific graders**: Built-in generic evaluators (Coherence, Fluency) measure general quality and won't detect fine-tuning improvements. Use custom graders (Python, score model, string check) for task-specific evaluation.
95. **Measure cost alongside accuracy**: Report completion tokens per response when comparing models or checkpoints. A model that achieves the same accuracy with fewer tokens is strictly better — cheaper inference and lower latency.
10 
11## Two-Layer Evaluation Strategy
12 
13Use the **Azure AI Evaluation SDK** (`azure-ai-evaluation`) for all evaluation.
14 
15| Layer | Purpose | Grader Type | When |
16|-------|---------|-------------|------|
17| **Task-specific** (primary) | Measure FT improvement | `AzureOpenAIScoreModelGrader`, `AzureOpenAIPythonGrader`, `AzureOpenAIStringCheckGrader` | Every eval |
18| **General quality** (guardrail) | Verify model didn't degrade | `CoherenceEvaluator`, `FluencyEvaluator` | Spot-check only |
19 
20Generic built-in evaluators (Coherence, Fluency, TaskAdherence) are guardrails, not metrics — they often show no difference between base and fine-tuned models even when domain-specific evaluation reveals clear improvement.
21 
22## Custom Graders (Primary FT Evaluation)
23 
24### 1. Score Model Grader (LLM judge with task-specific rubric)
25 
26Best for: subjective tasks (summarization, alignment, style).
27 
28```python
29from azure.ai.evaluation import AzureOpenAIScoreModelGrader
30 
31summarization_grader = AzureOpenAIScoreModelGrader(
32    model_config=model_config,
33    name="summarization_quality",
34    prompt="""Rate this news summary on a scale of 1-5.
35 
36Article: {{item.article}}
37Summary: {{sample.output_text}}
38 
39Criteria:
40- Captures ALL key facts (who, what, when, where)
41- No hallucinated information not in the article
42- Concise (under 3 sentences)
43 
44Score 1: Missing key facts or hallucinations
45Score 3: Captures main point but misses details
46Score 5: Perfect summary — all facts, no extras, concise
47 
48Return ONLY a number 1-5.""",
49    output_type="numeric",
50    pass_threshold=3,
51)
52```
53 
54### 2. Python Grader (programmatic/exact-match evaluation)
55 
56Best for: code generation, math, entity extraction, structured output.
57 
58```python
59from azure.ai.evaluation import AzureOpenAIPythonGrader
60 
61entity_grader = AzureOpenAIPythonGrader(
62    name="entity_extraction_accuracy",
63    source="""
64import json
65 
66def grade(item, sample):
67    try:
68        extracted = json.loads(sample["output_text"])
69        reference = json.loads(item["ground_truth"])
70    except (json.JSONDecodeError, KeyError):
71        return {"score": 0, "reason": "Invalid JSON output"}
72 
73    required_keys = ["people", "organizations", "locations", "dates"]
74    missing = [k for k in required_keys if k not in extracted]
75    if missing:
76        return {"score": 0.5, "reason": f"Missing keys: {missing}"}
77 
78    total, matched = 0, 0
79    for key in required_keys:
80        ref_set = set(str(v).lower() for v in reference.get(key, []))
81        ext_set = set(str(v).lower() for v in extracted.get(key, []))
82        total += len(ref_set)
83        matched += len(ref_set & ext_set)
84 
85    score = matched / total if total > 0 else 1.0
86    return {"score": score, "reason": f"{matched}/{total} entities matched"}
87""",
88    pass_threshold=0.7,
89)
90```
91 
92### 3. String Check Grader (pattern matching)
93 
94Best for: classification, format compliance, tool calling format.
95 
96```python
97from azure.ai.evaluation import AzureOpenAIStringCheckGrader
98 
99tool_format_grader = AzureOpenAIStringCheckGrader(
100    name="tool_call_format",
101    input="{{sample.output_text}}",
102    operation="like",          # or "eq", "starts_with", "contains"
103    reference="function_call",
104    pass_threshold=1,
105)
106 
107classification_grader = AzureOpenAIStringCheckGrader(
108    name="classification_accuracy",
109    input="{{sample.output_text}}",
110    operation="eq",
111    reference="{{item.expected_label}}",
112    pass_threshold=1,
113)
114```
115 
116## Running an Evaluation
117 
118The `evaluate()` function runs multiple graders over an entire dataset:
119 
120```python
121from azure.ai.evaluation import evaluate, F1ScoreEvaluator
122 
123result = evaluate(
124    data="eval_data.jsonl",
125    evaluators={
126        "task_grader": my_custom_score_grader,   # primary
127        "f1": F1ScoreEvaluator(),                 # token overlap
128    },
129    output_path="./eval_results.json",
130)
131 
132for metric, value in result["metrics"].items():
133    print(f"{metric}: {value}")
134```
135 
136## Test Set Design
137 
138- **Size**: 30–100 examples is sufficient.
139- **Diversity**: Cover easy/medium/hard, edge cases, and different sub-categories.
140- **Quality**: Reference answers must be gold-standard correct. A wrong reference penalizes correct outputs.
141 
142## Interpreting Results
143 
144| Score Type | Range | Meaning |
145|-----------|-------|---------|
146| AI quality (1–5) | 1–2 Poor, 3 Adequate, 4 Good, 5 Excellent | |
147| NLP (0–1) | <0.3 Wrong, 0.3–0.6 Partial, 0.6–0.8 Good, >0.8 Strong | |
148 
149With 50+ eval examples, a difference of ~0.3 points (on 1–5 scale) is usually meaningful.
150 
151## Evaluating RFT Models
152 
1531. **Evaluate with a DIFFERENT rubric than the training grader** — otherwise you measure overfitting to the grader.
1542. Use `F1ScoreEvaluator` for exact-match accuracy.
1553. Use `SimilarityEvaluator` to catch semantically correct but differently formatted answers.
1564. **Compare against the base model**, not just other fine-tunes.
157 
158## Reference
159 
160- [Azure AI Evaluation SDK docs](https://learn.microsoft.com/en-us/python/api/overview/azure/ai-evaluation-readme)
161- [Evaluation samples](https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate)
162
Preparing the source view

Microsoft Foundry Skill

finetuning/references/evaluation.md