`skills/advanced-evaluation/SKILL.md`
---
name: advanced-evaluation
description: This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment.
---

# Advanced Evaluation

This skill covers production-grade techniques for evaluating LLM outputs using LLMs as judges. It synthesizes research from academic papers, industry practices, and practical implementation experience into actionable patterns for building reliable evaluation systems.

**Key insight**: LLM-as-a-Judge is not a single technique but a family of approaches, each suited to different evaluation contexts. Choosing the right approach and mitigating known biases is the core competency this skill develops.

## When to Activate

Activate this skill when:

- Building automated evaluation pipelines for LLM outputs
- Comparing multiple model responses to select the best one
- Establishing consistent quality standards across evaluation teams
- Debugging evaluation systems that show inconsistent results
- Designing A/B tests for prompt or model changes
- Creating rubrics for human or automated evaluation
- Analyzing correlation between automated and human judgments

## Core Concepts

### The Evaluation Taxonomy

Select between two primary approaches based on whether ground truth exists:

**Direct Scoring** — Use when objective criteria exist (factual accuracy, instruction following, toxicity). A single LLM rates one response on a defined scale. Achieves moderate-to-high reliability for well-defined criteria. Watch for score calibration drift and inconsistent scale interpretation.

**Pairwise Comparison** — Use for subjective preferences (tone, style, persuasiveness). An LLM compares two responses and selects the better one. Achieves higher human-judge agreement than direct scoring for preference tasks (Zheng et al., 2023). Watch for position bias and length bias.

### The Bias Landscape

Mitigate these systematic biases in every evaluation system:

**Position Bias**: First-position responses get preferential treatment. Mitigate by evaluating twice with swapped positions, then apply majority vote or consistency check.

**Length Bias**: Longer responses score higher regardless of quality. Mitigate by explicitly prompting to ignore length and applying length-normalized scoring.

**Self-Enhancement Bias**: Models rate their own outputs higher. Mitigate by using different models for generation and evaluation.

**Verbosity Bias**: Excessive detail scores higher even when unnecessary. Mitigate with criteria-specific rubrics that penalize irrelevant detail.

**Authority Bias**: Confident tone scores higher regardless of accuracy. Mitigate by requiring evidence citation and adding a fact-checking layer.

### Metric Selection Framework

Match metrics to the evaluation task structure:

| Task Type | Primary Metrics | Secondary Metrics |
|-----------|-----------------|-------------------|
| Binary classification (pass/fail) | Recall, Precision, F1 | Cohen's kappa |
| Ordinal scale (1-5 rating) | Spearman's rho, Kendall's tau | Cohen's kappa (weighted) |
| Pairwise preference | Agreement rate, Position consistency | Confidence calibration |
| Multi-label | Macro-F1, Micro-F1 | Per-label precision/recall |

Prioritize systematic disagreement patterns over absolute agreement rates because a judge that consistently disagrees with humans on specific criteria is more problematic than one with random noise.
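To make the binary row of the table concrete, the sketch below computes raw agreement rate and Cohen's kappa between judge and human pass/fail labels. The `LabeledPair` shape is an illustrative assumption, not part of this skill's interfaces.

```typescript
// Minimal sketch: validating a judge against human labels on a binary
// pass/fail task. The LabeledPair shape is hypothetical; plug in real
// evaluation records.
type Label = "pass" | "fail";

interface LabeledPair {
  judge: Label;
  human: Label;
}

// Raw agreement rate: fraction of items where judge and human match.
function agreementRate(pairs: LabeledPair[]): number {
  const matches = pairs.filter((p) => p.judge === p.human).length;
  return matches / pairs.length;
}

// Cohen's kappa: agreement corrected for chance.
// kappa = (pObserved - pExpected) / (1 - pExpected)
function cohensKappa(pairs: LabeledPair[]): number {
  const n = pairs.length;
  const labels: Label[] = ["pass", "fail"];
  const pObserved = agreementRate(pairs);
  // Expected chance agreement from each rater's marginal distribution.
  let pExpected = 0;
  for (const label of labels) {
    const judgeFrac = pairs.filter((p) => p.judge === label).length / n;
    const humanFrac = pairs.filter((p) => p.human === label).length / n;
    pExpected += judgeFrac * humanFrac;
  }
  // Degenerate case: both raters always emit the same single label.
  if (pExpected === 1) return 1;
  return (pObserved - pExpected) / (1 - pExpected);
}
```

A kappa well below the raw agreement rate usually means the judge is riding the base rate rather than discriminating, which is exactly the systematic-disagreement signal to investigate first.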
## Evaluation Approaches

### Direct Scoring Implementation

Build direct scoring with three components: clear criteria, a calibrated scale, and structured output format.

**Criteria Definition Pattern**:
```
Criterion: [Name]
Description: [What this criterion measures]
Weight: [Relative importance, 0-1]
```

**Scale Calibration** — Choose scale granularity based on rubric detail:
- 1-3: Binary with neutral option, lowest cognitive load
- 1-5: Standard Likert, best balance of granularity and reliability
- 1-10: Use only with detailed per-level rubrics because calibration is harder

**Prompt Structure for Direct Scoring**:
```
You are an expert evaluator assessing response quality.

## Task
Evaluate the following response against each criterion.

## Original Prompt
{prompt}

## Response to Evaluate
{response}

## Criteria
{for each criterion: name, description, weight}

## Instructions
For each criterion:
1. Find specific evidence in the response
2. Score according to the rubric (1-{max} scale)
3. Justify your score with evidence
4. Suggest one specific improvement

## Output Format
Respond with structured JSON containing scores, justifications, and summary.
```

Always require justification before the score in all scoring prompts because research shows this improves reliability by 15-25% compared to score-first approaches.
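A minimal TypeScript sketch of this direct-scoring flow, assuming a generic `callModel(prompt)` wrapper around whatever LLM client is in use; the `Criterion` and `CriterionScore` shapes are illustrative, and a production pipeline should validate the parsed JSON against a schema.

```typescript
// Illustrative direct-scoring helper. callModel is assumed to be an
// existing wrapper that sends a prompt and returns the model's text.
interface Criterion {
  name: string;
  description: string;
  weight: number; // relative importance, 0-1
}

interface CriterionScore {
  criterion: string;
  justification: string; // justification comes before the score
  score: number;
  improvement: string;
}

function buildScoringPrompt(
  prompt: string,
  response: string,
  criteria: Criterion[],
  maxScore = 5,
): string {
  const criteriaBlock = criteria
    .map((c) => `- ${c.name} (weight ${c.weight}): ${c.description}`)
    .join("\n");
  return [
    "You are an expert evaluator assessing response quality.",
    "## Original Prompt",
    prompt,
    "## Response to Evaluate",
    response,
    "## Criteria",
    criteriaBlock,
    "## Instructions",
    "For each criterion, cite evidence and justify before scoring on a " +
      `1-${maxScore} scale. Respond with a JSON array of ` +
      "{criterion, justification, score, improvement} objects.",
  ].join("\n\n");
}

async function scoreResponse(
  callModel: (prompt: string) => Promise<string>,
  prompt: string,
  response: string,
  criteria: Criterion[],
): Promise<CriterionScore[]> {
  const raw = await callModel(buildScoringPrompt(prompt, response, criteria));
  // Sketch-level parsing; real pipelines should schema-validate and retry.
  return JSON.parse(raw) as CriterionScore[];
}
```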
### Pairwise Comparison Implementation

Apply position bias mitigation in every pairwise evaluation:

1. First pass: Response A in first position, Response B in second
2. Second pass: Response B in first position, Response A in second
3. Consistency check: If passes disagree, return TIE with reduced confidence
4. Final verdict: Consistent winner with averaged confidence

**Prompt Structure for Pairwise Comparison**:
```
You are an expert evaluator comparing two AI responses.

## Critical Instructions
- Do NOT prefer responses because they are longer
- Do NOT prefer responses based on position (first vs second)
- Focus ONLY on quality according to the specified criteria
- Ties are acceptable when responses are genuinely equivalent

## Original Prompt
{prompt}

## Response A
{response_a}

## Response B
{response_b}

## Comparison Criteria
{criteria list}

## Instructions
1. Analyze each response independently first
2. Compare them on each criterion
3. Determine overall winner with confidence level

## Output Format
JSON with per-criterion comparison, overall winner, confidence (0-1), and reasoning.
```

**Confidence Calibration** — Map confidence to position consistency:
- Both passes agree: confidence = average of individual confidences
- Passes disagree: confidence = 0.5, verdict = TIE
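The following TypeScript sketch implements the two-pass, position-swapped comparison and the confidence mapping above. `judgeOnce` is a placeholder for a single pairwise evaluation call that returns a winner by position; its exact signature is an assumption.

```typescript
// Sketch of position-swapped pairwise comparison. judgeOnce stands in
// for one evaluation call and reports the winner by *position*.
type PositionVerdict = { winner: "first" | "second" | "tie"; confidence: number };
type FinalVerdict = { winner: "A" | "B" | "TIE"; confidence: number };

async function comparePair(
  judgeOnce: (first: string, second: string) => Promise<PositionVerdict>,
  responseA: string,
  responseB: string,
): Promise<FinalVerdict> {
  // Pass 1: A in first position. Pass 2: B in first position.
  const pass1 = await judgeOnce(responseA, responseB);
  const pass2 = await judgeOnce(responseB, responseA);

  // Map position-based verdicts back to the original A/B labels.
  const toLabel = (v: PositionVerdict, firstIsA: boolean): "A" | "B" | "TIE" =>
    v.winner === "tie" ? "TIE" : (v.winner === "first") === firstIsA ? "A" : "B";

  const winner1 = toLabel(pass1, true);
  const winner2 = toLabel(pass2, false);

  // Consistency check: disagreement collapses to a low-confidence TIE.
  if (winner1 !== winner2) {
    return { winner: "TIE", confidence: 0.5 };
  }
  return { winner: winner1, confidence: (pass1.confidence + pass2.confidence) / 2 };
}
```

Example 2 below walks through the same remapping with concrete verdicts.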
### Rubric Generation

Generate rubrics to reduce evaluation variance by 40-60% compared to open-ended scoring.

**Include these rubric components**:
1. **Level descriptions**: Clear boundaries for each score level
2. **Characteristics**: Observable features that define each level
3. **Examples**: Representative text for each level (optional but valuable)
4. **Edge cases**: Guidance for ambiguous situations
5. **Scoring guidelines**: General principles for consistent application

**Set strictness calibration** for the use case:
- **Lenient**: Lower passing bar, appropriate for encouraging iteration
- **Balanced**: Typical production expectations
- **Strict**: High standards for safety-critical or high-stakes evaluation

Adapt rubrics to the domain — use domain-specific terminology. A code readability rubric mentions variables, functions, and comments. A medical accuracy rubric references clinical terminology and evidence standards.
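One possible way to represent a generated rubric in TypeScript, mirroring the components listed above and the JSON shown in Example 3 below; the field names are illustrative rather than a fixed schema.

```typescript
// Illustrative rubric data shape; adjust names and fields to taste.
type Strictness = "lenient" | "balanced" | "strict";

interface RubricLevel {
  score: number;             // position on the chosen scale, e.g. 1-5
  label: string;             // short name such as "Poor" or "Excellent"
  description: string;       // boundary definition for this level
  characteristics: string[]; // observable features that define the level
  examples?: string[];       // optional representative text
}

interface EdgeCase {
  situation: string; // the ambiguous case
  guidance: string;  // how to resolve it consistently
}

interface Rubric {
  criterion: string;
  domain: string;
  strictness: Strictness;
  levels: RubricLevel[];
  edgeCases: EdgeCase[];
  scoringGuidelines: string[];
}
```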
## Practical Guidance

### Evaluation Pipeline Design

Build production evaluation systems with these layers: Criteria Loader (rubrics + weights) -> Primary Scorer (direct or pairwise) -> Bias Mitigation (position swap, etc.) -> Confidence Scoring (calibration) -> Output (scores + justifications + confidence). See [Evaluation Pipeline Diagram](./references/evaluation-pipeline.md) for the full visual layout.

### Decision Framework: Direct vs. Pairwise

Apply this decision tree:

```
Is there an objective ground truth?
+-- Yes -> Direct Scoring
|          Examples: factual accuracy, instruction following, format compliance
|
+-- No -> Is it a preference or quality judgment?
          +-- Yes -> Pairwise Comparison
          |          Examples: tone, style, persuasiveness, creativity
          |
          +-- No -> Consider reference-based evaluation
                    Examples: summarization (compare to source), translation (compare to reference)
```

### Scaling Evaluation

For high-volume evaluation, apply one of these strategies:

1. **Panel of LLMs (PoLL)**: Use multiple models as judges and aggregate votes to reduce individual model bias. More expensive but more reliable for high-stakes decisions.

2. **Hierarchical evaluation**: Use a fast, cheap model for screening and an expensive model for edge cases (see the sketch after this list). Requires calibration of the screening threshold.

3. **Human-in-the-loop**: Automate clear cases and route low-confidence decisions to human review. Design feedback loops to improve automated evaluation over time.
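A minimal sketch of the hierarchical strategy: a cheap judge screens every item and a stronger judge re-evaluates anything below a confidence threshold. Both judge functions and the 0.75 default threshold are assumptions to be calibrated per deployment.

```typescript
// Sketch of hierarchical evaluation with confidence-based escalation.
// cheapJudge and strongJudge are assumed wrappers around two models.
interface JudgeResult {
  score: number;      // e.g. a 1-5 direct score
  confidence: number; // 0-1, reported by the judge
}

type Judge = (prompt: string, response: string) => Promise<JudgeResult>;

async function hierarchicalScore(
  cheapJudge: Judge,
  strongJudge: Judge,
  prompt: string,
  response: string,
  confidenceThreshold = 0.75, // assumed default; calibrate on held-out data
): Promise<JudgeResult & { escalated: boolean }> {
  const screening = await cheapJudge(prompt, response);
  // Clear cases stop at the screening layer.
  if (screening.confidence >= confidenceThreshold) {
    return { ...screening, escalated: false };
  }
  // Ambiguous cases are re-evaluated by the stronger judge.
  const review = await strongJudge(prompt, response);
  return { ...review, escalated: true };
}
```

The same escalation shape works for the human-in-the-loop strategy by swapping the strong judge for a human review queue.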
## Examples

### Example 1: Direct Scoring for Accuracy

**Input**:
```
Prompt: "What causes seasons on Earth?"
Response: "Seasons are caused by Earth's tilted axis. As Earth orbits the Sun,
different hemispheres receive more direct sunlight at different times of year."
Criterion: Factual Accuracy (weight: 1.0)
Scale: 1-5
```

**Output**:
```json
{
  "criterion": "Factual Accuracy",
  "score": 5,
  "evidence": [
    "Correctly identifies axial tilt as primary cause",
    "Correctly explains differential sunlight by hemisphere",
    "No factual errors present"
  ],
  "justification": "Response accurately explains the cause of seasons with correct scientific reasoning. Both the axial tilt and its effect on sunlight distribution are correctly described.",
  "improvement": "Could add the specific tilt angle (23.5 degrees) for completeness."
}
```

### Example 2: Pairwise Comparison with Position Swap

**Input**:
```
Prompt: "Explain machine learning to a beginner"
Response A: [Technical explanation with jargon]
Response B: [Simple analogy-based explanation]
Criteria: ["clarity", "accessibility"]
```

**First Pass (A first)**:
```json
{ "winner": "B", "confidence": 0.8 }
```

**Second Pass (B first)**:
```json
{ "winner": "A", "confidence": 0.6 }
```
(Note: the judge reports "A" because that label refers to the first position, which now holds original Response B.)

**Mapped Second Pass**:
```json
{ "winner": "B", "confidence": 0.6 }
```

**Final Result**:
```json
{
  "winner": "B",
  "confidence": 0.7,
  "positionConsistency": {
    "consistent": true,
    "firstPassWinner": "B",
    "secondPassWinner": "B"
  }
}
```

### Example 3: Rubric Generation

**Input**:
```
criterionName: "Code Readability"
criterionDescription: "How easy the code is to understand and maintain"
domain: "software engineering"
scale: "1-5"
strictness: "balanced"
```

**Output** (abbreviated):
```json
{
  "levels": [
    {
      "score": 1,
      "label": "Poor",
      "description": "Code is difficult to understand without significant effort",
      "characteristics": [
        "No meaningful variable or function names",
        "No comments or documentation",
        "Deeply nested or convoluted logic"
      ]
    },
    {
      "score": 3,
      "label": "Adequate",
      "description": "Code is understandable with some effort",
      "characteristics": [
        "Most variables have meaningful names",
        "Basic comments present for complex sections",
        "Logic is followable but could be cleaner"
      ]
    },
    {
      "score": 5,
      "label": "Excellent",
      "description": "Code is immediately clear and maintainable",
      "characteristics": [
        "All names are descriptive and consistent",
        "Comprehensive documentation",
        "Clean, modular structure"
      ]
    }
  ],
  "edgeCases": [
    {
      "situation": "Code is well-structured but uses domain-specific abbreviations",
      "guidance": "Score based on readability for domain experts, not general audience"
    }
  ]
}
```

## Guidelines

1. **Always require justification before scores** - Chain-of-thought prompting improves reliability by 15-25%

2. **Always swap positions in pairwise comparison** - Single-pass comparison is corrupted by position bias

3. **Match scale granularity to rubric specificity** - Don't use 1-10 without detailed level descriptions

4. **Separate objective and subjective criteria** - Use direct scoring for objective, pairwise for subjective

5. **Include confidence scores** - Calibrate to position consistency and evidence strength

6. **Define edge cases explicitly** - Ambiguous situations cause the most evaluation variance

7. **Use domain-specific rubrics** - Generic rubrics produce generic (less useful) evaluations

8. **Validate against human judgments** - Automated evaluation is only valuable if it correlates with human assessment

9. **Monitor for systematic bias** - Track disagreement patterns by criterion, response type, and model

10. **Design for iteration** - Evaluation systems improve with feedback loops

## Gotchas

1. **Scoring without justification**: Scores lack grounding and are difficult to debug. Always require evidence-based justification before the score.

2. **Single-pass pairwise comparison**: Position bias corrupts results when positions are not swapped. Always evaluate twice with swapped positions and check consistency.

3. **Overloaded criteria**: Criteria that measure multiple things at once produce unreliable scores. Enforce one criterion = one measurable aspect.

4. **Missing edge case guidance**: Evaluators handle ambiguous cases inconsistently without explicit instructions. Include edge cases in rubrics with clear resolution rules.

5. **Ignoring confidence calibration**: High-confidence wrong judgments are worse than low-confidence ones. Calibrate confidence to position consistency and evidence strength.

6. **Rubric drift**: Rubrics become miscalibrated as quality standards evolve or model capabilities improve. Schedule periodic rubric reviews and re-anchor score levels against fresh human-annotated examples.

7. **Evaluation prompt sensitivity**: Minor wording changes in evaluation prompts (e.g., reordering instructions, changing phrasing) can cause 10-20% score swings. Version-control evaluation prompts and run regression tests before deploying prompt changes.

8. **Uncontrolled length bias**: Longer responses systematically score higher even when conciseness is preferred. Add explicit length-neutrality instructions to evaluation prompts and validate with length-controlled test pairs.

## Integration

This skill integrates with:

- **context-fundamentals** - Evaluation prompts require effective context structure
- **tool-design** - Evaluation tools need proper schemas and error handling
- **context-optimization** - Evaluation prompts can be optimized for token efficiency
- **evaluation** (foundational) - This skill extends the foundational evaluation concepts

## References

Internal references:
- [LLM-as-Judge Implementation Patterns](./references/implementation-patterns.md) - Read when: building an evaluation pipeline from scratch or integrating LLM judges into CI/CD
- [Bias Mitigation Techniques](./references/bias-mitigation.md) - Read when: evaluation results show inconsistent or suspicious scoring patterns
- [Metric Selection Guide](./references/metrics-guide.md) - Read when: choosing statistical metrics to validate evaluation reliability
- [Evaluation Pipeline Diagram](./references/evaluation-pipeline.md) - Read when: designing the architecture of a multi-stage evaluation system

External research:
- [Eugene Yan: Evaluating the Effectiveness of LLM-Evaluators](https://eugeneyan.com/writing/llm-evaluators/) - Read when: surveying the state of the art in LLM evaluation
- [Judging LLM-as-a-Judge (Zheng et al., 2023)](https://arxiv.org/abs/2306.05685) - Read when: understanding position bias and MT-Bench methodology
- [G-Eval: NLG Evaluation using GPT-4 (Liu et al., 2023)](https://arxiv.org/abs/2303.16634) - Read when: implementing chain-of-thought evaluation scoring
- [Large Language Models are not Fair Evaluators (Wang et al., 2023)](https://arxiv.org/abs/2305.17926) - Read when: diagnosing systematic bias in evaluation outputs

Related skills in this collection:
- evaluation - Foundational evaluation concepts
- context-fundamentals - Context structure for evaluation prompts
- tool-design - Building evaluation tools

---

## Skill Metadata

**Created**: 2025-12-24
**Last Updated**: 2026-03-17
**Author**: Agent Skills for Context Engineering Contributors
**Version**: 2.0.0