Loading source
Pulling the file list, source metadata, and syntax-aware rendering for this listing.
Source from repo
Deploy, evaluate, and manage AI agents end-to-end on Microsoft Azure AI Foundry
Files
Skill
Size
Entrypoint
Format
Open file
Syntax-highlighted preview of this file as included in the skill package.
finetuning/references/evaluation.md
1# Evaluation Methodology23## Principles451. **Always establish a baseline**: Evaluate the base (un-tuned) model first. Without a baseline, you can't measure improvement.62. **Use a held-out test set**: Never evaluate on training or validation data. The model has seen those.73. **Use the same test set for every model**: This is the only way to compare results fairly.84. **Use task-specific graders**: Built-in generic evaluators (Coherence, Fluency) measure general quality and won't detect fine-tuning improvements. Use custom graders (Python, score model, string check) for task-specific evaluation.95. **Measure cost alongside accuracy**: Report completion tokens per response when comparing models or checkpoints. A model that achieves the same accuracy with fewer tokens is strictly better — cheaper inference and lower latency.1011## Two-Layer Evaluation Strategy1213Use the **Azure AI Evaluation SDK** (`azure-ai-evaluation`) for all evaluation.1415| Layer | Purpose | Grader Type | When |16|-------|---------|-------------|------|17| **Task-specific** (primary) | Measure FT improvement | `AzureOpenAIScoreModelGrader`, `AzureOpenAIPythonGrader`, `AzureOpenAIStringCheckGrader` | Every eval |18| **General quality** (guardrail) | Verify model didn't degrade | `CoherenceEvaluator`, `FluencyEvaluator` | Spot-check only |1920Generic built-in evaluators (Coherence, Fluency, TaskAdherence) are guardrails, not metrics — they often show no difference between base and fine-tuned models even when domain-specific evaluation reveals clear improvement.2122## Custom Graders (Primary FT Evaluation)2324### 1. Score Model Grader (LLM judge with task-specific rubric)2526Best for: subjective tasks (summarization, alignment, style).2728```python29from azure.ai.evaluation import AzureOpenAIScoreModelGrader3031summarization_grader = AzureOpenAIScoreModelGrader(32model_config=model_config,33name="summarization_quality",34prompt="""Rate this news summary on a scale of 1-5.3536Article: {{item.article}}37Summary: {{sample.output_text}}3839Criteria:40- Captures ALL key facts (who, what, when, where)41- No hallucinated information not in the article42- Concise (under 3 sentences)4344Score 1: Missing key facts or hallucinations45Score 3: Captures main point but misses details46Score 5: Perfect summary — all facts, no extras, concise4748Return ONLY a number 1-5.""",49output_type="numeric",50pass_threshold=3,51)52```5354### 2. Python Grader (programmatic/exact-match evaluation)5556Best for: code generation, math, entity extraction, structured output.5758```python59from azure.ai.evaluation import AzureOpenAIPythonGrader6061entity_grader = AzureOpenAIPythonGrader(62name="entity_extraction_accuracy",63source="""64import json6566def grade(item, sample):67try:68extracted = json.loads(sample["output_text"])69reference = json.loads(item["ground_truth"])70except (json.JSONDecodeError, KeyError):71return {"score": 0, "reason": "Invalid JSON output"}7273required_keys = ["people", "organizations", "locations", "dates"]74missing = [k for k in required_keys if k not in extracted]75if missing:76return {"score": 0.5, "reason": f"Missing keys: {missing}"}7778total, matched = 0, 079for key in required_keys:80ref_set = set(str(v).lower() for v in reference.get(key, []))81ext_set = set(str(v).lower() for v in extracted.get(key, []))82total += len(ref_set)83matched += len(ref_set & ext_set)8485score = matched / total if total > 0 else 1.086return {"score": score, "reason": f"{matched}/{total} entities matched"}87""",88pass_threshold=0.7,89)90```9192### 3. String Check Grader (pattern matching)9394Best for: classification, format compliance, tool calling format.9596```python97from azure.ai.evaluation import AzureOpenAIStringCheckGrader9899tool_format_grader = AzureOpenAIStringCheckGrader(100name="tool_call_format",101input="{{sample.output_text}}",102operation="like", # or "eq", "starts_with", "contains"103reference="function_call",104pass_threshold=1,105)106107classification_grader = AzureOpenAIStringCheckGrader(108name="classification_accuracy",109input="{{sample.output_text}}",110operation="eq",111reference="{{item.expected_label}}",112pass_threshold=1,113)114```115116## Running an Evaluation117118The `evaluate()` function runs multiple graders over an entire dataset:119120```python121from azure.ai.evaluation import evaluate, F1ScoreEvaluator122123result = evaluate(124data="eval_data.jsonl",125evaluators={126"task_grader": my_custom_score_grader, # primary127"f1": F1ScoreEvaluator(), # token overlap128},129output_path="./eval_results.json",130)131132for metric, value in result["metrics"].items():133print(f"{metric}: {value}")134```135136## Test Set Design137138- **Size**: 30–100 examples is sufficient.139- **Diversity**: Cover easy/medium/hard, edge cases, and different sub-categories.140- **Quality**: Reference answers must be gold-standard correct. A wrong reference penalizes correct outputs.141142## Interpreting Results143144| Score Type | Range | Meaning |145|-----------|-------|---------|146| AI quality (1–5) | 1–2 Poor, 3 Adequate, 4 Good, 5 Excellent | |147| NLP (0–1) | <0.3 Wrong, 0.3–0.6 Partial, 0.6–0.8 Good, >0.8 Strong | |148149With 50+ eval examples, a difference of ~0.3 points (on 1–5 scale) is usually meaningful.150151## Evaluating RFT Models1521531. **Evaluate with a DIFFERENT rubric than the training grader** — otherwise you measure overfitting to the grader.1542. Use `F1ScoreEvaluator` for exact-match accuracy.1553. Use `SimilarityEvaluator` to catch semantically correct but differently formatted answers.1564. **Compare against the base model**, not just other fine-tunes.157158## Reference159160- [Azure AI Evaluation SDK docs](https://learn.microsoft.com/en-us/python/api/overview/azure/ai-evaluation-readme)161- [Evaluation samples](https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate)162