`skills/advanced-evaluation/SKILL.md`
---
name: advanced-evaluation
description: This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment.
---

# Advanced Evaluation

This skill covers production-grade techniques for evaluating LLM outputs using LLMs as judges. It synthesizes research from academic papers, industry practices, and practical implementation experience into actionable patterns for building reliable evaluation systems.

**Key insight**: LLM-as-a-Judge is not a single technique but a family of approaches, each suited to different evaluation contexts. Choosing the right approach and mitigating known biases is the core competency this skill develops.

## When to Activate

Activate this skill when:

- Building automated evaluation pipelines for LLM outputs
- Comparing multiple model responses to select the best one
- Establishing consistent quality standards across evaluation teams
- Debugging evaluation systems that show inconsistent results
- Designing A/B tests for prompt or model changes
- Creating rubrics for human or automated evaluation
- Analyzing correlation between automated and human judgments

## Core Concepts

### The Evaluation Taxonomy

Select between two primary approaches based on whether ground truth exists:

**Direct Scoring** — Use when objective criteria exist (factual accuracy, instruction following, toxicity). A single LLM rates one response on a defined scale. Achieves moderate-to-high reliability for well-defined criteria. Watch for score calibration drift and inconsistent scale interpretation.

**Pairwise Comparison** — Use for subjective preferences (tone, style, persuasiveness). An LLM compares two responses and selects the better one. Achieves higher human-judge agreement than direct scoring for preference tasks (Zheng et al., 2023). Watch for position bias and length bias.

### The Bias Landscape

Mitigate these systematic biases in every evaluation system:

**Position Bias**: First-position responses get preferential treatment. Mitigate by evaluating twice with swapped positions, then apply majority vote or consistency check.

**Length Bias**: Longer responses score higher regardless of quality. Mitigate by explicitly prompting to ignore length and applying length-normalized scoring.

**Self-Enhancement Bias**: Models rate their own outputs higher. Mitigate by using different models for generation and evaluation.

**Verbosity Bias**: Excessive detail scores higher even when unnecessary. Mitigate with criteria-specific rubrics that penalize irrelevant detail.

**Authority Bias**: Confident tone scores higher regardless of accuracy. Mitigate by requiring evidence citation and adding a fact-checking layer.

### Metric Selection Framework

Match metrics to the evaluation task structure:

| Task Type | Primary Metrics | Secondary Metrics |
|-----------|-----------------|-------------------|
| Binary classification (pass/fail) | Recall, Precision, F1 | Cohen's kappa |
| Ordinal scale (1-5 rating) | Spearman's rho, Kendall's tau | Cohen's kappa (weighted) |
| Pairwise preference | Agreement rate, Position consistency | Confidence calibration |
| Multi-label | Macro-F1, Micro-F1 | Per-label precision/recall |

Prioritize systematic disagreement patterns over absolute agreement rates because a judge that consistently disagrees with humans on specific criteria is more problematic than one with random noise.
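To make the binary row of the table concrete, the sketch below computes raw agreement rate and Cohen's kappa between judge and human pass/fail labels. The `LabeledPair` shape is an illustrative assumption, not part of this skill's interfaces.

```typescript
// Minimal sketch: validating a judge against human labels on a binary
// pass/fail task. The LabeledPair shape is hypothetical; plug in real
// evaluation records.
type Label = "pass" | "fail";

interface LabeledPair {
  judge: Label;
  human: Label;
}

// Raw agreement rate: fraction of items where judge and human match.
function agreementRate(pairs: LabeledPair[]): number {
  const matches = pairs.filter((p) => p.judge === p.human).length;
  return matches / pairs.length;
}

// Cohen's kappa: agreement corrected for chance.
// kappa = (pObserved - pExpected) / (1 - pExpected)
function cohensKappa(pairs: LabeledPair[]): number {
  const n = pairs.length;
  const labels: Label[] = ["pass", "fail"];
  const pObserved = agreementRate(pairs);
  // Expected chance agreement from each rater's marginal distribution.
  let pExpected = 0;
  for (const label of labels) {
    const judgeFrac = pairs.filter((p) => p.judge === label).length / n;
    const humanFrac = pairs.filter((p) => p.human === label).length / n;
    pExpected += judgeFrac * humanFrac;
  }
  // Degenerate case: both raters always emit the same single label.
  if (pExpected === 1) return 1;
  return (pObserved - pExpected) / (1 - pExpected);
}
```

A kappa well below the raw agreement rate usually means the judge is riding the base rate rather than discriminating, which is exactly the systematic-disagreement signal to investigate first.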
## Evaluation Approaches

### Direct Scoring Implementation

Build direct scoring with three components: clear criteria, a calibrated scale, and structured output format.

**Criteria Definition Pattern**:
```
Criterion: [Name]
Description: [What this criterion measures]
Weight: [Relative importance, 0-1]
```

**Scale Calibration** — Choose scale granularity based on rubric detail:
- 1-3: Binary with neutral option, lowest cognitive load
- 1-5: Standard Likert, best balance of granularity and reliability
- 1-10: Use only with detailed per-level rubrics because calibration is harder

**Prompt Structure for Direct Scoring**:
```
You are an expert evaluator assessing response quality.

## Task
Evaluate the following response against each criterion.

## Original Prompt
{prompt}

## Response to Evaluate
{response}

## Criteria
{for each criterion: name, description, weight}

## Instructions
For each criterion:
1. Find specific evidence in the response
2. Score according to the rubric (1-{max} scale)
3. Justify your score with evidence
4. Suggest one specific improvement

## Output Format
Respond with structured JSON containing scores, justifications, and summary.
```

Always require justification before the score in all scoring prompts because research shows this improves reliability by 15-25% compared to score-first approaches.
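A minimal TypeScript sketch of this direct-scoring flow, assuming a generic `callModel(prompt)` wrapper around whatever LLM client is in use; the `Criterion` and `CriterionScore` shapes are illustrative, and a production pipeline should validate the parsed JSON against a schema.

```typescript
// Illustrative direct-scoring helper. callModel is assumed to be an
// existing wrapper that sends a prompt and returns the model's text.
interface Criterion {
  name: string;
  description: string;
  weight: number; // relative importance, 0-1
}

interface CriterionScore {
  criterion: string;
  justification: string; // justification comes before the score
  score: number;
  improvement: string;
}

function buildScoringPrompt(
  prompt: string,
  response: string,
  criteria: Criterion[],
  maxScore = 5,
): string {
  const criteriaBlock = criteria
    .map((c) => `- ${c.name} (weight ${c.weight}): ${c.description}`)
    .join("\n");
  return [
    "You are an expert evaluator assessing response quality.",
    "## Original Prompt",
    prompt,
    "## Response to Evaluate",
    response,
    "## Criteria",
    criteriaBlock,
    "## Instructions",
    "For each criterion, cite evidence and justify before scoring on a " +
      `1-${maxScore} scale. Respond with a JSON array of ` +
      "{criterion, justification, score, improvement} objects.",
  ].join("\n\n");
}

async function scoreResponse(
  callModel: (prompt: string) => Promise<string>,
  prompt: string,
  response: string,
  criteria: Criterion[],
): Promise<CriterionScore[]> {
  const raw = await callModel(buildScoringPrompt(prompt, response, criteria));
  // Sketch-level parsing; real pipelines should schema-validate and retry.
  return JSON.parse(raw) as CriterionScore[];
}
```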
### Pairwise Comparison Implementation

Apply position bias mitigation in every pairwise evaluation:

1. First pass: Response A in first position, Response B in second
2. Second pass: Response B in first position, Response A in second
3. Consistency check: If passes disagree, return TIE with reduced confidence
4. Final verdict: Consistent winner with averaged confidence

**Prompt Structure for Pairwise Comparison**:
```
You are an expert evaluator comparing two AI responses.

## Critical Instructions
- Do NOT prefer responses because they are longer
- Do NOT prefer responses based on position (first vs second)
- Focus ONLY on quality according to the specified criteria
- Ties are acceptable when responses are genuinely equivalent

## Original Prompt
{prompt}

## Response A
{response_a}

## Response B
{response_b}

## Comparison Criteria
{criteria list}

## Instructions
1. Analyze each response independently first
2. Compare them on each criterion
3. Determine overall winner with confidence level

## Output Format
JSON with per-criterion comparison, overall winner, confidence (0-1), and reasoning.
```

**Confidence Calibration** — Map confidence to position consistency:
- Both passes agree: confidence = average of individual confidences
- Passes disagree: confidence = 0.5, verdict = TIE
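The following TypeScript sketch implements the two-pass, position-swapped comparison and the confidence mapping above. `judgeOnce` is a placeholder for a single pairwise evaluation call that returns a winner by position; its exact signature is an assumption.

```typescript
// Sketch of position-swapped pairwise comparison. judgeOnce stands in
// for one evaluation call and reports the winner by *position*.
type PositionVerdict = { winner: "first" | "second" | "tie"; confidence: number };
type FinalVerdict = { winner: "A" | "B" | "TIE"; confidence: number };

async function comparePair(
  judgeOnce: (first: string, second: string) => Promise<PositionVerdict>,
  responseA: string,
  responseB: string,
): Promise<FinalVerdict> {
  // Pass 1: A in first position. Pass 2: B in first position.
  const pass1 = await judgeOnce(responseA, responseB);
  const pass2 = await judgeOnce(responseB, responseA);

  // Map position-based verdicts back to the original A/B labels.
  const toLabel = (v: PositionVerdict, firstIsA: boolean): "A" | "B" | "TIE" =>
    v.winner === "tie" ? "TIE" : (v.winner === "first") === firstIsA ? "A" : "B";

  const winner1 = toLabel(pass1, true);
  const winner2 = toLabel(pass2, false);

  // Consistency check: disagreement collapses to a low-confidence TIE.
  if (winner1 !== winner2) {
    return { winner: "TIE", confidence: 0.5 };
  }
  return { winner: winner1, confidence: (pass1.confidence + pass2.confidence) / 2 };
}
```

Example 2 below walks through the same remapping with concrete verdicts.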
### Rubric Generation

Generate rubrics to reduce evaluation variance by 40-60% compared to open-ended scoring.

**Include these rubric components**:
1. **Level descriptions**: Clear boundaries for each score level
2. **Characteristics**: Observable features that define each level
3. **Examples**: Representative text for each level (optional but valuable)
4. **Edge cases**: Guidance for ambiguous situations
5. **Scoring guidelines**: General principles for consistent application

**Set strictness calibration** for the use case:
- **Lenient**: Lower passing bar, appropriate for encouraging iteration
- **Balanced**: Typical production expectations
- **Strict**: High standards for safety-critical or high-stakes evaluation

Adapt rubrics to the domain — use domain-specific terminology. A code readability rubric mentions variables, functions, and comments. A medical accuracy rubric references clinical terminology and evidence standards.
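One possible way to represent a generated rubric in TypeScript, mirroring the components listed above and the JSON shown in Example 3 below; the field names are illustrative rather than a fixed schema.

```typescript
// Illustrative rubric data shape; adjust names and fields to taste.
type Strictness = "lenient" | "balanced" | "strict";

interface RubricLevel {
  score: number;             // position on the chosen scale, e.g. 1-5
  label: string;             // short name such as "Poor" or "Excellent"
  description: string;       // boundary definition for this level
  characteristics: string[]; // observable features that define the level
  examples?: string[];       // optional representative text
}

interface EdgeCase {
  situation: string; // the ambiguous case
  guidance: string;  // how to resolve it consistently
}

interface Rubric {
  criterion: string;
  domain: string;
  strictness: Strictness;
  levels: RubricLevel[];
  edgeCases: EdgeCase[];
  scoringGuidelines: string[];
}
```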
## Practical Guidance

### Evaluation Pipeline Design

Build production evaluation systems with these layers: Criteria Loader (rubrics + weights) -> Primary Scorer (direct or pairwise) -> Bias Mitigation (position swap, etc.) -> Confidence Scoring (calibration) -> Output (scores + justifications + confidence). See [Evaluation Pipeline Diagram](./references/evaluation-pipeline.md) for the full visual layout.

### Decision Framework: Direct vs. Pairwise

Apply this decision tree:

```
Is there an objective ground truth?
+-- Yes -> Direct Scoring
|          Examples: factual accuracy, instruction following, format compliance
|
+-- No -> Is it a preference or quality judgment?
          +-- Yes -> Pairwise Comparison
          |          Examples: tone, style, persuasiveness, creativity
          |
          +-- No -> Consider reference-based evaluation
                    Examples: summarization (compare to source), translation (compare to reference)
```

### Scaling Evaluation

For high-volume evaluation, apply one of these strategies:

1. **Panel of LLMs (PoLL)**: Use multiple models as judges and aggregate votes to reduce individual model bias. More expensive but more reliable for high-stakes decisions.

2. **Hierarchical evaluation**: Use a fast, cheap model for screening and an expensive model for edge cases (see the sketch after this list). Requires calibration of the screening threshold.

3. **Human-in-the-loop**: Automate clear cases and route low-confidence decisions to human review. Design feedback loops to improve automated evaluation over time.
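A minimal sketch of the hierarchical strategy: a cheap judge screens every item and a stronger judge re-evaluates anything below a confidence threshold. Both judge functions and the 0.75 default threshold are assumptions to be calibrated per deployment.

```typescript
// Sketch of hierarchical evaluation with confidence-based escalation.
// cheapJudge and strongJudge are assumed wrappers around two models.
interface JudgeResult {
  score: number;      // e.g. a 1-5 direct score
  confidence: number; // 0-1, reported by the judge
}

type Judge = (prompt: string, response: string) => Promise<JudgeResult>;

async function hierarchicalScore(
  cheapJudge: Judge,
  strongJudge: Judge,
  prompt: string,
  response: string,
  confidenceThreshold = 0.75, // assumed default; calibrate on held-out data
): Promise<JudgeResult & { escalated: boolean }> {
  const screening = await cheapJudge(prompt, response);
  // Clear cases stop at the screening layer.
  if (screening.confidence >= confidenceThreshold) {
    return { ...screening, escalated: false };
  }
  // Ambiguous cases are re-evaluated by the stronger judge.
  const review = await strongJudge(prompt, response);
  return { ...review, escalated: true };
}
```

The same escalation shape works for the human-in-the-loop strategy by swapping the strong judge for a human review queue.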
## Examples

### Example 1: Direct Scoring for Accuracy

**Input**:
```
Prompt: "What causes seasons on Earth?"
Response: "Seasons are caused by Earth's tilted axis. As Earth orbits the Sun,
different hemispheres receive more direct sunlight at different times of year."
Criterion: Factual Accuracy (weight: 1.0)
Scale: 1-5
```

**Output**:
```json
{
  "criterion": "Factual Accuracy",
  "score": 5,
  "evidence": [
    "Correctly identifies axial tilt as primary cause",
    "Correctly explains differential sunlight by hemisphere",
    "No factual errors present"
  ],
  "justification": "Response accurately explains the cause of seasons with correct scientific reasoning. Both the axial tilt and its effect on sunlight distribution are correctly described.",
  "improvement": "Could add the specific tilt angle (23.5 degrees) for completeness."
}
```

### Example 2: Pairwise Comparison with Position Swap

**Input**:
```
Prompt: "Explain machine learning to a beginner"
Response A: [Technical explanation with jargon]
Response B: [Simple analogy-based explanation]
Criteria: ["clarity", "accessibility"]
```

**First Pass (A first)**:
```json
{ "winner": "B", "confidence": 0.8 }
```

**Second Pass (B first)**:
```json
{ "winner": "A", "confidence": 0.6 }
```
(Note: the judge reports "A" because that label refers to the first position, which now holds original Response B.)

**Mapped Second Pass**:
```json
{ "winner": "B", "confidence": 0.6 }
```

**Final Result**:
```json
{
  "winner": "B",
  "confidence": 0.7,
  "positionConsistency": {
    "consistent": true,
    "firstPassWinner": "B",
    "secondPassWinner": "B"
  }
}
```

### Example 3: Rubric Generation

**Input**:
```
criterionName: "Code Readability"
criterionDescription: "How easy the code is to understand and maintain"
domain: "software engineering"
scale: "1-5"
strictness: "balanced"
```

**Output** (abbreviated):
```json
{
  "levels": [
    {
      "score": 1,
      "label": "Poor",
      "description": "Code is difficult to understand without significant effort",
      "characteristics": [
        "No meaningful variable or function names",
        "No comments or documentation",
        "Deeply nested or convoluted logic"
      ]
    },
    {
      "score": 3,
      "label": "Adequate",
      "description": "Code is understandable with some effort",
      "characteristics": [
        "Most variables have meaningful names",
        "Basic comments present for complex sections",
        "Logic is followable but could be cleaner"
      ]
    },
    {
      "score": 5,
      "label": "Excellent",
      "description": "Code is immediately clear and maintainable",
      "characteristics": [
        "All names are descriptive and consistent",
        "Comprehensive documentation",
        "Clean, modular structure"
      ]
    }
  ],
  "edgeCases": [
    {
      "situation": "Code is well-structured but uses domain-specific abbreviations",
      "guidance": "Score based on readability for domain experts, not general audience"
    }
  ]
}
```

## Guidelines

1. **Always require justification before scores** - Chain-of-thought prompting improves reliability by 15-25%

2. **Always swap positions in pairwise comparison** - Single-pass comparison is corrupted by position bias

3. **Match scale granularity to rubric specificity** - Don't use 1-10 without detailed level descriptions

4. **Separate objective and subjective criteria** - Use direct scoring for objective, pairwise for subjective

5. **Include confidence scores** - Calibrate to position consistency and evidence strength

6. **Define edge cases explicitly** - Ambiguous situations cause the most evaluation variance

7. **Use domain-specific rubrics** - Generic rubrics produce generic (less useful) evaluations

8. **Validate against human judgments** - Automated evaluation is only valuable if it correlates with human assessment

9. **Monitor for systematic bias** - Track disagreement patterns by criterion, response type, and model

10. **Design for iteration** - Evaluation systems improve with feedback loops

## Gotchas

1. **Scoring without justification**: Scores lack grounding and are difficult to debug. Always require evidence-based justification before the score.

2. **Single-pass pairwise comparison**: Position bias corrupts results when positions are not swapped. Always evaluate twice with swapped positions and check consistency.

3. **Overloaded criteria**: Criteria that measure multiple things at once produce unreliable scores. Enforce one criterion = one measurable aspect.

4. **Missing edge case guidance**: Evaluators handle ambiguous cases inconsistently without explicit instructions. Include edge cases in rubrics with clear resolution rules.

5. **Ignoring confidence calibration**: High-confidence wrong judgments are worse than low-confidence ones. Calibrate confidence to position consistency and evidence strength.

6. **Rubric drift**: Rubrics become miscalibrated as quality standards evolve or model capabilities improve. Schedule periodic rubric reviews and re-anchor score levels against fresh human-annotated examples.

7. **Evaluation prompt sensitivity**: Minor wording changes in evaluation prompts (e.g., reordering instructions, changing phrasing) can cause 10-20% score swings. Version-control evaluation prompts and run regression tests before deploying prompt changes.

8. **Uncontrolled length bias**: Longer responses systematically score higher even when conciseness is preferred. Add explicit length-neutrality instructions to evaluation prompts and validate with length-controlled test pairs.

## Integration

This skill integrates with:

- **context-fundamentals** - Evaluation prompts require effective context structure
- **tool-design** - Evaluation tools need proper schemas and error handling
- **context-optimization** - Evaluation prompts can be optimized for token efficiency
- **evaluation** (foundational) - This skill extends the foundational evaluation concepts

## References

Internal references:
- [LLM-as-Judge Implementation Patterns](./references/implementation-patterns.md) - Read when: building an evaluation pipeline from scratch or integrating LLM judges into CI/CD
- [Bias Mitigation Techniques](./references/bias-mitigation.md) - Read when: evaluation results show inconsistent or suspicious scoring patterns
- [Metric Selection Guide](./references/metrics-guide.md) - Read when: choosing statistical metrics to validate evaluation reliability
- [Evaluation Pipeline Diagram](./references/evaluation-pipeline.md) - Read when: designing the architecture of a multi-stage evaluation system

External research:
- [Eugene Yan: Evaluating the Effectiveness of LLM-Evaluators](https://eugeneyan.com/writing/llm-evaluators/) - Read when: surveying the state of the art in LLM evaluation
- [Judging LLM-as-a-Judge (Zheng et al., 2023)](https://arxiv.org/abs/2306.05685) - Read when: understanding position bias and MT-Bench methodology
- [G-Eval: NLG Evaluation using GPT-4 (Liu et al., 2023)](https://arxiv.org/abs/2303.16634) - Read when: implementing chain-of-thought evaluation scoring
- [Large Language Models are not Fair Evaluators (Wang et al., 2023)](https://arxiv.org/abs/2305.17926) - Read when: diagnosing systematic bias in evaluation outputs

Related skills in this collection:
- evaluation - Foundational evaluation concepts
- context-fundamentals - Context structure for evaluation prompts
- tool-design - Building evaluation tools

---

## Skill Metadata

**Created**: 2025-12-24
**Last Updated**: 2026-03-17
**Author**: Agent Skills for Context Engineering Contributors
**Version**: 2.0.0