# Evaluation Methodology

## Principles

1. **Always establish a baseline**: Evaluate the base (un-tuned) model first. Without a baseline, you can't measure improvement.
2. **Use a held-out test set**: Never evaluate on training or validation data. The model has already seen those examples.
3. **Use the same test set for every model**: This is the only way to compare results fairly.
4. **Use task-specific graders**: Built-in generic evaluators (Coherence, Fluency) measure general quality and rarely detect fine-tuning improvements. Use custom graders (Python, score model, string check) for task-specific evaluation.
5. **Measure cost alongside accuracy**: Report completion tokens per response when comparing models or checkpoints. A model that achieves the same accuracy with fewer tokens is strictly better: inference is cheaper and latency is lower.

## Two-Layer Evaluation Strategy

Use the **Azure AI Evaluation SDK** (`azure-ai-evaluation`) for all evaluation.

| Layer | Purpose | Grader Type | When |
|-------|---------|-------------|------|
| **Task-specific** (primary) | Measure FT improvement | `AzureOpenAIScoreModelGrader`, `AzureOpenAIPythonGrader`, `AzureOpenAIStringCheckGrader` | Every eval |
| **General quality** (guardrail) | Verify model didn't degrade | `CoherenceEvaluator`, `FluencyEvaluator` | Spot-check only |

Generic built-in evaluators (Coherence, Fluency, TaskAdherence) are guardrails, not metrics. They often show no difference between base and fine-tuned models even when domain-specific evaluation reveals clear improvement.

## Custom Graders (Primary FT Evaluation)

### 1. Score Model Grader (LLM judge with task-specific rubric)

Best for: subjective tasks (summarization, alignment, style).

```python
from azure.ai.evaluation import AzureOpenAIScoreModelGrader

# model_config is the Azure OpenAI model configuration for the judge model,
# defined elsewhere in your setup.
summarization_grader = AzureOpenAIScoreModelGrader(
    model_config=model_config,
    name="summarization_quality",
    prompt="""Rate this news summary on a scale of 1-5.

Article: {{item.article}}
Summary: {{sample.output_text}}

Criteria:
- Captures ALL key facts (who, what, when, where)
- No hallucinated information not in the article
- Concise (under 3 sentences)

Score 1: Missing key facts or hallucinations
Score 3: Captures main point but misses details
Score 5: Perfect summary: all facts, no extras, concise

Return ONLY a number 1-5.""",
    output_type="numeric",
    pass_threshold=3,
)
```

### 2. Python Grader (programmatic/exact-match evaluation)

Best for: code generation, math, entity extraction, structured output.

```python
from azure.ai.evaluation import AzureOpenAIPythonGrader

entity_grader = AzureOpenAIPythonGrader(
    name="entity_extraction_accuracy",
    source="""
import json

def grade(item, sample):
    # Both the model output and the reference must parse as JSON.
    try:
        extracted = json.loads(sample["output_text"])
        reference = json.loads(item["ground_truth"])
    except (json.JSONDecodeError, KeyError):
        return {"score": 0, "reason": "Invalid JSON output"}

    # Partial credit if the structure is incomplete.
    required_keys = ["people", "organizations", "locations", "dates"]
    missing = [k for k in required_keys if k not in extracted]
    if missing:
        return {"score": 0.5, "reason": f"Missing keys: {missing}"}

    # Case-insensitive recall of reference entities across all keys.
    total, matched = 0, 0
    for key in required_keys:
        ref_set = set(str(v).lower() for v in reference.get(key, []))
        ext_set = set(str(v).lower() for v in extracted.get(key, []))
        total += len(ref_set)
        matched += len(ref_set & ext_set)

    score = matched / total if total > 0 else 1.0
    return {"score": score, "reason": f"{matched}/{total} entities matched"}
""",
    pass_threshold=0.7,
)
```

### 3. String Check Grader (pattern matching)

Best for: classification, format compliance, tool calling format.

```python
from azure.ai.evaluation import AzureOpenAIStringCheckGrader

# Passes when the output contains the reference string.
tool_format_grader = AzureOpenAIStringCheckGrader(
    name="tool_call_format",
    input="{{sample.output_text}}",
    operation="like",  # or "eq", "ne", "ilike"
    reference="function_call",
    pass_threshold=1,
)

# Passes only when the output equals the expected label exactly.
classification_grader = AzureOpenAIStringCheckGrader(
    name="classification_accuracy",
    input="{{sample.output_text}}",
    operation="eq",
    reference="{{item.expected_label}}",
    pass_threshold=1,
)
```

## Running an Evaluation

The `evaluate()` function runs multiple graders over an entire dataset:

```python
from azure.ai.evaluation import evaluate, F1ScoreEvaluator

result = evaluate(
    data="eval_data.jsonl",
    evaluators={
        "task_grader": my_custom_score_grader,  # primary: one of the custom graders above
        "f1": F1ScoreEvaluator(),  # token overlap
    },
    output_path="./eval_results.json",
)

# Print the aggregate metrics for the run.
for metric, value in result["metrics"].items():
    print(f"{metric}: {value}")
```

## Test Set Design

- **Size**: 30–100 examples is usually sufficient.
- **Diversity**: Cover easy/medium/hard, edge cases, and different sub-categories.
- **Quality**: Reference answers must be gold-standard correct. A wrong reference penalizes correct outputs.

## Interpreting Results

| Score Type | Range | Interpretation |
|-----------|-------|---------|
| AI quality | 1–5 | 1–2 Poor, 3 Adequate, 4 Good, 5 Excellent |
| NLP | 0–1 | <0.3 Wrong, 0.3–0.6 Partial, 0.6–0.8 Good, >0.8 Strong |

With 50+ eval examples, a difference of ~0.3 points (on the 1–5 scale) is usually meaningful.

## Evaluating RFT Models

1. **Evaluate with a DIFFERENT rubric than the training grader**. Otherwise you measure overfitting to the grader.
2. Use `F1ScoreEvaluator` to measure token overlap with the reference answer.
3. Use `SimilarityEvaluator` to catch semantically correct but differently formatted answers.
4. **Compare against the base model**, not just other fine-tunes.

## Reference

- [Azure AI Evaluation SDK docs](https://learn.microsoft.com/en-us/python/api/overview/azure/ai-evaluation-readme)
- [Evaluation samples](https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate)
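
## Example: Comparing Base and Fine-Tuned Scores

The principles above (baseline first, same test set for every model, cost alongside accuracy) can be illustrated with a small standard-library sketch. It is not part of the Azure AI Evaluation SDK: the helper names `summarize` and `paired_bootstrap_ci`, the score lists, and the token counts are all hypothetical, and the paired bootstrap is one possible way to judge whether a score difference is meaningful, not a method prescribed by this guide.

```python
import random
from statistics import mean


def summarize(scores, completion_tokens, pass_threshold):
    """Mean score, pass rate, and mean completion tokens for one model."""
    return {
        "mean_score": mean(scores),
        "pass_rate": sum(s >= pass_threshold for s in scores) / len(scores),
        "mean_completion_tokens": mean(completion_tokens),
    }


def paired_bootstrap_ci(base_scores, tuned_scores, n_resamples=2000,
                        alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean per-example difference.

    Scores must be aligned: index i in both lists refers to the same
    test example, which is why the same test set is required.
    """
    if len(base_scores) != len(tuned_scores):
        raise ValueError("Both models must be scored on the same test set")
    diffs = [t - b for b, t in zip(base_scores, tuned_scores)]
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(diffs)
    # Resample the per-example differences with replacement.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    low = means[int((alpha / 2) * n_resamples)]
    high = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return mean(diffs), low, high


# Hypothetical per-example grader scores (1-5 scale) on the same 10 examples.
base_scores = [3, 2, 4, 3, 3, 2, 4, 3, 2, 3]
tuned_scores = [4, 3, 4, 4, 3, 3, 5, 4, 3, 4]
# Hypothetical completion-token counts per response.
base_tokens = [120, 135, 110, 140, 125, 130, 115, 138, 122, 128]
tuned_tokens = [90, 95, 88, 102, 91, 97, 85, 99, 93, 94]

base_summary = summarize(base_scores, base_tokens, pass_threshold=3)
tuned_summary = summarize(tuned_scores, tuned_tokens, pass_threshold=3)
delta, ci_low, ci_high = paired_bootstrap_ci(base_scores, tuned_scores)

print("base :", base_summary)
print("tuned:", tuned_summary)
print(f"mean difference: {delta:.2f} (95% interval {ci_low:.2f} to {ci_high:.2f})")
```

If the interval excludes zero, the improvement is unlikely to be noise from the particular examples in the test set. With a real test set, the per-example scores would come from the grader results rather than hand-written lists.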