Source from repo

Microsoft Foundry Skill

Build and deploy AI applications on Azure AI Foundry using Microsoft's model catalog and AI services

microsoftGitHub microsoftOfficialSource repo Original GitHub link Publisher page

Files

155

Skill

n/a

Size

976.3 KB

Entrypoint

SKILL.md

Format

git-repo

Open file

finetuning/references/evaluation.md

Syntax-highlighted preview of this file as included in the skill package.

Rendered Source

markdown162 linesFree

finetuning/references/evaluation.md

1# Evaluation Methodology
2 
3## Principles
4 
51. **Always establish a baseline**: Evaluate the base (un-tuned) model first. Without a baseline, you can't measure improvement.
62. **Use a held-out test set**: Never evaluate on training or validation data. The model has seen those.
73. **Use the same test set for every model**: This is the only way to compare results fairly.
84. **Use task-specific graders**: Built-in generic evaluators (Coherence, Fluency) measure general quality and won't detect fine-tuning improvements. Use custom graders (Python, score model, string check) for task-specific evaluation.
95. **Measure cost alongside accuracy**: Report completion tokens per response when comparing models or checkpoints. A model that achieves the same accuracy with fewer tokens is strictly better — cheaper inference and lower latency.
10 
11## Two-Layer Evaluation Strategy
12 
13Use the **Azure AI Evaluation SDK** (`azure-ai-evaluation`) for all evaluation.
14 
15| Layer | Purpose | Grader Type | When |
16|-------|---------|-------------|------|
17| **Task-specific** (primary) | Measure FT improvement | `AzureOpenAIScoreModelGrader`, `AzureOpenAIPythonGrader`, `AzureOpenAIStringCheckGrader` | Every eval |
18| **General quality** (guardrail) | Verify model didn't degrade | `CoherenceEvaluator`, `FluencyEvaluator` | Spot-check only |
19 
20Generic built-in evaluators (Coherence, Fluency, TaskAdherence) are guardrails, not metrics — they often show no difference between base and fine-tuned models even when domain-specific evaluation reveals clear improvement.
21 
22## Custom Graders (Primary FT Evaluation)
23 
24### 1. Score Model Grader (LLM judge with task-specific rubric)
25 
26Best for: subjective tasks (summarization, alignment, style).
27 
28```python
29from azure.ai.evaluation import AzureOpenAIScoreModelGrader
30 
31summarization_grader = AzureOpenAIScoreModelGrader(
32    model_config=model_config,
33    name="summarization_quality",
34    prompt="""Rate this news summary on a scale of 1-5.
35 
36Article: {{item.article}}
37Summary: {{sample.output_text}}
38 
39Criteria:
40- Captures ALL key facts (who, what, when, where)
41- No hallucinated information not in the article
42- Concise (under 3 sentences)
43 
44Score 1: Missing key facts or hallucinations
45Score 3: Captures main point but misses details
46Score 5: Perfect summary — all facts, no extras, concise
47 
48Return ONLY a number 1-5.""",
49    output_type="numeric",
50    pass_threshold=3,
51)
52```
53 
54### 2. Python Grader (programmatic/exact-match evaluation)
55 
56Best for: code generation, math, entity extraction, structured output.
57 
58```python
59from azure.ai.evaluation import AzureOpenAIPythonGrader
60 
61entity_grader = AzureOpenAIPythonGrader(
62    name="entity_extraction_accuracy",
63    source="""
64import json
65 
66def grade(item, sample):
67    try:
68        extracted = json.loads(sample["output_text"])
69        reference = json.loads(item["ground_truth"])
70    except (json.JSONDecodeError, KeyError):
71        return {"score": 0, "reason": "Invalid JSON output"}
72 
73    required_keys = ["people", "organizations", "locations", "dates"]
74    missing = [k for k in required_keys if k not in extracted]
75    if missing:
76        return {"score": 0.5, "reason": f"Missing keys: {missing}"}
77 
78    total, matched = 0, 0
79    for key in required_keys:
80        ref_set = set(str(v).lower() for v in reference.get(key, []))
81        ext_set = set(str(v).lower() for v in extracted.get(key, []))
82        total += len(ref_set)
83        matched += len(ref_set & ext_set)
84 
85    score = matched / total if total > 0 else 1.0
86    return {"score": score, "reason": f"{matched}/{total} entities matched"}
87""",
88    pass_threshold=0.7,
89)
90```
91 
92### 3. String Check Grader (pattern matching)
93 
94Best for: classification, format compliance, tool calling format.
95 
96```python
97from azure.ai.evaluation import AzureOpenAIStringCheckGrader
98 
99tool_format_grader = AzureOpenAIStringCheckGrader(
100    name="tool_call_format",
101    input="{{sample.output_text}}",
102    operation="like",          # or "eq", "starts_with", "contains"
103    reference="function_call",
104    pass_threshold=1,
105)
106 
107classification_grader = AzureOpenAIStringCheckGrader(
108    name="classification_accuracy",
109    input="{{sample.output_text}}",
110    operation="eq",
111    reference="{{item.expected_label}}",
112    pass_threshold=1,
113)
114```
115 
116## Running an Evaluation
117 
118The `evaluate()` function runs multiple graders over an entire dataset:
119 
120```python
121from azure.ai.evaluation import evaluate, F1ScoreEvaluator
122 
123result = evaluate(
124    data="eval_data.jsonl",
125    evaluators={
126        "task_grader": my_custom_score_grader,   # primary
127        "f1": F1ScoreEvaluator(),                 # token overlap
128    },
129    output_path="./eval_results.json",
130)
131 
132for metric, value in result["metrics"].items():
133    print(f"{metric}: {value}")
134```
135 
136## Test Set Design
137 
138- **Size**: 30–100 examples is sufficient.
139- **Diversity**: Cover easy/medium/hard, edge cases, and different sub-categories.
140- **Quality**: Reference answers must be gold-standard correct. A wrong reference penalizes correct outputs.
141 
142## Interpreting Results
143 
144| Score Type | Range | Meaning |
145|-----------|-------|---------|
146| AI quality (1–5) | 1–2 Poor, 3 Adequate, 4 Good, 5 Excellent | |
147| NLP (0–1) | <0.3 Wrong, 0.3–0.6 Partial, 0.6–0.8 Good, >0.8 Strong | |
148 
149With 50+ eval examples, a difference of ~0.3 points (on 1–5 scale) is usually meaningful.
150 
151## Evaluating RFT Models
152 
1531. **Evaluate with a DIFFERENT rubric than the training grader** — otherwise you measure overfitting to the grader.
1542. Use `F1ScoreEvaluator` for exact-match accuracy.
1553. Use `SimilarityEvaluator` to catch semantically correct but differently formatted answers.
1564. **Compare against the base model**, not just other fine-tunes.
157 
158## Reference
159 
160- [Azure AI Evaluation SDK docs](https://learn.microsoft.com/en-us/python/api/overview/azure/ai-evaluation-readme)
161- [Evaluation samples](https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate)
162

Loading source

Preparing the source view

Pulling the file list, source metadata, and syntax-aware rendering for this listing.

Marketplace

Source from repo

Microsoft Foundry Skill

Build and deploy AI applications on Azure AI Foundry using Microsoft's model catalog and AI services

microsoftGitHub microsoftOfficialSource repo Original GitHub link Publisher page

Files

155

Skill

n/a

Size

976.3 KB

Entrypoint

SKILL.md

Format

git-repo

Open file

finetuning/references/evaluation.md

Syntax-highlighted preview of this file as included in the skill package.

Rendered Source

markdown162 linesFree

finetuning/references/evaluation.md

1# Evaluation Methodology
2 
3## Principles
4 
51. **Always establish a baseline**: Evaluate the base (un-tuned) model first. Without a baseline, you can't measure improvement.
62. **Use a held-out test set**: Never evaluate on training or validation data. The model has seen those.
73. **Use the same test set for every model**: This is the only way to compare results fairly.
84. **Use task-specific graders**: Built-in generic evaluators (Coherence, Fluency) measure general quality and won't detect fine-tuning improvements. Use custom graders (Python, score model, string check) for task-specific evaluation.
95. **Measure cost alongside accuracy**: Report completion tokens per response when comparing models or checkpoints. A model that achieves the same accuracy with fewer tokens is strictly better — cheaper inference and lower latency.
10 
11## Two-Layer Evaluation Strategy
12 
13Use the **Azure AI Evaluation SDK** (`azure-ai-evaluation`) for all evaluation.
14 
15| Layer | Purpose | Grader Type | When |
16|-------|---------|-------------|------|
17| **Task-specific** (primary) | Measure FT improvement | `AzureOpenAIScoreModelGrader`, `AzureOpenAIPythonGrader`, `AzureOpenAIStringCheckGrader` | Every eval |
18| **General quality** (guardrail) | Verify model didn't degrade | `CoherenceEvaluator`, `FluencyEvaluator` | Spot-check only |
19 
20Generic built-in evaluators (Coherence, Fluency, TaskAdherence) are guardrails, not metrics — they often show no difference between base and fine-tuned models even when domain-specific evaluation reveals clear improvement.
21 
22## Custom Graders (Primary FT Evaluation)
23 
24### 1. Score Model Grader (LLM judge with task-specific rubric)
25 
26Best for: subjective tasks (summarization, alignment, style).
27 
28```python
29from azure.ai.evaluation import AzureOpenAIScoreModelGrader
30 
31summarization_grader = AzureOpenAIScoreModelGrader(
32    model_config=model_config,
33    name="summarization_quality",
34    prompt="""Rate this news summary on a scale of 1-5.
35 
36Article: {{item.article}}
37Summary: {{sample.output_text}}
38 
39Criteria:
40- Captures ALL key facts (who, what, when, where)
41- No hallucinated information not in the article
42- Concise (under 3 sentences)
43 
44Score 1: Missing key facts or hallucinations
45Score 3: Captures main point but misses details
46Score 5: Perfect summary — all facts, no extras, concise
47 
48Return ONLY a number 1-5.""",
49    output_type="numeric",
50    pass_threshold=3,
51)
52```
53 
54### 2. Python Grader (programmatic/exact-match evaluation)
55 
56Best for: code generation, math, entity extraction, structured output.
57 
58```python
59from azure.ai.evaluation import AzureOpenAIPythonGrader
60 
61entity_grader = AzureOpenAIPythonGrader(
62    name="entity_extraction_accuracy",
63    source="""
64import json
65 
66def grade(item, sample):
67    try:
68        extracted = json.loads(sample["output_text"])
69        reference = json.loads(item["ground_truth"])
70    except (json.JSONDecodeError, KeyError):
71        return {"score": 0, "reason": "Invalid JSON output"}
72 
73    required_keys = ["people", "organizations", "locations", "dates"]
74    missing = [k for k in required_keys if k not in extracted]
75    if missing:
76        return {"score": 0.5, "reason": f"Missing keys: {missing}"}
77 
78    total, matched = 0, 0
79    for key in required_keys:
80        ref_set = set(str(v).lower() for v in reference.get(key, []))
81        ext_set = set(str(v).lower() for v in extracted.get(key, []))
82        total += len(ref_set)
83        matched += len(ref_set & ext_set)
84 
85    score = matched / total if total > 0 else 1.0
86    return {"score": score, "reason": f"{matched}/{total} entities matched"}
87""",
88    pass_threshold=0.7,
89)
90```
91 
92### 3. String Check Grader (pattern matching)
93 
94Best for: classification, format compliance, tool calling format.
95 
96```python
97from azure.ai.evaluation import AzureOpenAIStringCheckGrader
98 
99tool_format_grader = AzureOpenAIStringCheckGrader(
100    name="tool_call_format",
101    input="{{sample.output_text}}",
102    operation="like",          # or "eq", "starts_with", "contains"
103    reference="function_call",
104    pass_threshold=1,
105)
106 
107classification_grader = AzureOpenAIStringCheckGrader(
108    name="classification_accuracy",
109    input="{{sample.output_text}}",
110    operation="eq",
111    reference="{{item.expected_label}}",
112    pass_threshold=1,
113)
114```
115 
116## Running an Evaluation
117 
118The `evaluate()` function runs multiple graders over an entire dataset:
119 
120```python
121from azure.ai.evaluation import evaluate, F1ScoreEvaluator
122 
123result = evaluate(
124    data="eval_data.jsonl",
125    evaluators={
126        "task_grader": my_custom_score_grader,   # primary
127        "f1": F1ScoreEvaluator(),                 # token overlap
128    },
129    output_path="./eval_results.json",
130)
131 
132for metric, value in result["metrics"].items():
133    print(f"{metric}: {value}")
134```
135 
136## Test Set Design
137 
138- **Size**: 30–100 examples is sufficient.
139- **Diversity**: Cover easy/medium/hard, edge cases, and different sub-categories.
140- **Quality**: Reference answers must be gold-standard correct. A wrong reference penalizes correct outputs.
141 
142## Interpreting Results
143 
144| Score Type | Range | Meaning |
145|-----------|-------|---------|
146| AI quality (1–5) | 1–2 Poor, 3 Adequate, 4 Good, 5 Excellent | |
147| NLP (0–1) | <0.3 Wrong, 0.3–0.6 Partial, 0.6–0.8 Good, >0.8 Strong | |
148 
149With 50+ eval examples, a difference of ~0.3 points (on 1–5 scale) is usually meaningful.
150 
151## Evaluating RFT Models
152 
1531. **Evaluate with a DIFFERENT rubric than the training grader** — otherwise you measure overfitting to the grader.
1542. Use `F1ScoreEvaluator` for exact-match accuracy.
1553. Use `SimilarityEvaluator` to catch semantically correct but differently formatted answers.
1564. **Compare against the base model**, not just other fine-tunes.
157 
158## Reference
159 
160- [Azure AI Evaluation SDK docs](https://learn.microsoft.com/en-us/python/api/overview/azure/ai-evaluation-readme)
161- [Evaluation samples](https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate)
162