A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems.
`examples/llm-as-judge-skills/skills/llm-evaluator/llm-evaluator.md`
# LLM-Evaluator Skill

## Overview

LLM-Evaluators (LLM-as-a-Judge) are large language models used to evaluate the quality of another LLM's response to an instruction or query. This skill provides the foundational knowledge for building evaluation systems.

## Key Considerations

### Baseline Selection

- **Human Annotators**: Aim for LLM-human correlation that matches human-human correlation. LLM-evaluators are orders of magnitude faster and cheaper than human annotation.
- **Finetuned Classifiers**: The goal is to match the recall and precision of a finetuned classifier. This is a more challenging baseline, since such classifiers are optimized for specific tasks.

### Scoring Approaches

| Approach | Use Case | Reliability |
|----------|----------|-------------|
| **Direct Scoring** | Objective tasks (factuality, toxicity, instruction-following) | More suitable for binary classification |
| **Pairwise Comparison** | Subjective evaluations (tone, persuasiveness, coherence) | More reliable for preference tasks |
| **Reference-Based** | Comparing against a gold standard | Requires a ground-truth reference |

### Evaluation Metrics

**Classification Metrics** (preferred for binary tasks):

- Recall and Precision
- F1 Score
- Cohen's κ (Kappa)

**Correlation Metrics** (for Likert-scale tasks):

- Spearman's ρ (rho)
- Kendall's τ (tau)

## Known Biases

1. **Position Bias**: LLM-evaluators tend to prefer responses in certain positions during pairwise comparison (usually the first position)
2. **Verbosity Bias**: They favor longer, more verbose responses even when those are not higher quality
3. **Self-Enhancement Bias**: LLM-evaluators prefer answers generated by themselves

## Mitigation Strategies

- Swap response positions and average the results (see the pairwise sketch at the end of this document)
- Normalize for length when evaluating
- Use a Panel of LLMs (PoLL) instead of a single judge
- Include "don't overthink" instructions
- Use CoT + n-shot prompts for reliability

## Implementation Pattern

```typescript
// Metric identifiers, drawn from the Evaluation Metrics section above.
type MetricType =
  | 'recall' | 'precision' | 'f1' | 'cohens-kappa'
  | 'spearman-rho' | 'kendall-tau';

interface EvaluatorConfig {
  scoringApproach: 'direct' | 'pairwise' | 'reference-based';
  criteria: EvaluationCriteria[];
  metrics: MetricType[];
  useCoT: boolean;
  nShot: number;
}

interface EvaluationCriteria {
  name: string;
  description: string;
  rubric: RubricLevel[];
}

interface RubricLevel {
  score: number;
  description: string;
}
```

## References

Key papers reviewed:

- Constitutional AI (Anthropic)
- G-Eval: NLG Evaluation using GPT-4
- SelfCheckGPT: Zero-Resource Hallucination Detection
- Prometheus: Fine-grained Evaluation Capability
- MT-Bench and Chatbot Arena
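
## Example: Evaluator Configuration (Sketch)

To make the implementation pattern concrete, here is a hypothetical configuration built from the interfaces above. The criterion name, rubric wording, metric choices, and `nShot` value are illustrative assumptions rather than prescribed defaults.

```typescript
// Hypothetical direct-scoring configuration for an instruction-following check.
const instructionFollowingEvaluator: EvaluatorConfig = {
  scoringApproach: 'direct',
  criteria: [
    {
      name: 'instruction-following',
      description: 'Does the response follow all explicit instructions in the query?',
      rubric: [
        { score: 0, description: 'Ignores or contradicts the instructions' },
        { score: 1, description: 'Follows some instructions but misses others' },
        { score: 2, description: 'Follows all instructions' },
      ],
    },
  ],
  metrics: ['recall', 'precision', 'cohens-kappa'],
  useCoT: true, // ask the judge to reason before scoring, per the mitigation list
  nShot: 3,     // number of labeled examples included in the judge prompt
};
```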
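
## Example: Position-Swap for Pairwise Comparison (Sketch)

A minimal sketch of the position-swap mitigation, assuming a hypothetical `judgePair` call that asks the LLM-evaluator which of two responses better answers the query. Judging the pair in both orders and discounting inconsistent verdicts is one simple way to average out position bias.

```typescript
type Verdict = 'A' | 'B' | 'tie';

// Hypothetical judge call: returns which of the two responses (in the order shown)
// the LLM-evaluator prefers for the given query.
declare function judgePair(query: string, first: string, second: string): Promise<Verdict>;

// Judge the pair in both orders; accept a preference only when both passes agree,
// otherwise treat the result as a tie (i.e., position-dependent and unreliable).
async function comparePair(query: string, a: string, b: string): Promise<Verdict> {
  const forward = await judgePair(query, a, b);  // a shown first
  const backward = await judgePair(query, b, a); // b shown first
  const backwardMapped: Verdict =
    backward === 'A' ? 'B' : backward === 'B' ? 'A' : 'tie'; // map back to a/b labels
  return forward === backwardMapped ? forward : 'tie';
}
```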
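
## Example: Agreement Metrics for a Binary Task (Sketch)

To ground the classification metrics, a small sketch that compares LLM-evaluator verdicts against human labels on a binary task (1 = pass, 0 = fail) and reports recall, precision, F1, and Cohen's κ. The function name and the 0/1 label encoding are assumptions made for illustration.

```typescript
// Compare evaluator verdicts against human labels; both arrays hold 0/1 values
// of equal length. Returns the classification metrics listed above.
function binaryAgreement(evaluator: number[], human: number[]) {
  let tp = 0, fp = 0, fn = 0, tn = 0;
  for (let i = 0; i < human.length; i++) {
    if (evaluator[i] === 1 && human[i] === 1) tp++;
    else if (evaluator[i] === 1 && human[i] === 0) fp++;
    else if (evaluator[i] === 0 && human[i] === 1) fn++;
    else tn++;
  }
  const n = tp + fp + fn + tn;
  const recall = tp / (tp + fn);
  const precision = tp / (tp + fp);
  const f1 = (2 * precision * recall) / (precision + recall);
  // Cohen's kappa: observed agreement corrected for agreement expected by chance.
  const observed = (tp + tn) / n;
  const expected =
    ((tp + fp) / n) * ((tp + fn) / n) + ((fn + tn) / n) * ((fp + tn) / n);
  const kappa = (observed - expected) / (1 - expected);
  return { recall, precision, f1, kappa };
}
```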