# LLM-Evaluator Skill

## Overview
LLM-Evaluators (LLM-as-a-Judge) are large language models designed to evaluate the quality of another LLM's response to an instruction or query. This skill provides the foundational knowledge for building evaluation systems.
## Key Considerations

### Baseline Selection
- Human Annotators: Aim for LLM-human correlation to match human-human correlation (see the agreement sketch after this list). LLM-evaluators are orders of magnitude faster and cheaper than human annotation.
- Finetuned Classifiers: The goal is to match the recall and precision of a finetuned classifier. This is a more challenging baseline because such classifiers are optimized for a specific task.
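A minimal sketch of the human-baseline check, assuming binary pass/fail labels and placeholder data: Cohen's κ is computed for the LLM-vs-human and human-vs-human pairs, and the two values are compared.

```typescript
// Cohen's kappa for two binary (0/1) label arrays of equal length.
function cohensKappa(a: number[], b: number[]): number {
  const n = a.length;
  let agree = 0;
  let aPos = 0;
  let bPos = 0;
  for (let i = 0; i < n; i++) {
    if (a[i] === b[i]) agree++;
    if (a[i] === 1) aPos++;
    if (b[i] === 1) bPos++;
  }
  const po = agree / n; // observed agreement
  // Expected chance agreement from the marginal label rates.
  const pe = (aPos / n) * (bPos / n) + ((n - aPos) / n) * ((n - bPos) / n);
  return pe === 1 ? 1 : (po - pe) / (1 - pe);
}

// Placeholder labels for illustration (1 = pass, 0 = fail).
const humanA = [1, 0, 1, 1, 0, 1, 0, 1];
const humanB = [1, 0, 1, 0, 0, 1, 0, 1];
const llmLabels = [1, 0, 1, 1, 0, 0, 0, 1];

// Target: LLM-human agreement approaching human-human agreement.
console.log({
  llmVsHuman: cohensKappa(llmLabels, humanA),
  humanVsHuman: cohensKappa(humanA, humanB),
});
```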
### Scoring Approaches
| Approach | Use Case | Notes |
|---|---|---|
| Direct Scoring | Objective tasks (factuality, toxicity, instruction-following) | More reliable when framed as binary classification than as a Likert score |
| Pairwise Comparison | Subjective evaluations (tone, persuasiveness, coherence) | More reliable than direct scoring for preference tasks |
| Reference-Based | Comparing against a gold standard | Requires a ground-truth reference |
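To make the distinction concrete, a prompt builder per approach might look like the sketch below; the wording is illustrative, not a prescribed template.

```typescript
type ScoringApproach = 'direct' | 'pairwise' | 'reference-based';

// Builds an evaluation prompt for the chosen approach. `other` is the second
// response for pairwise comparison, or the gold reference for reference-based.
function buildEvalPrompt(
  approach: ScoringApproach,
  instruction: string,
  response: string,
  other = '',
): string {
  switch (approach) {
    case 'direct':
      // A binary pass/fail verdict tends to be more reliable than a 1-5 score.
      return `Instruction:\n${instruction}\n\nResponse:\n${response}\n\n` +
        `Does the response follow the instruction without factual errors? Answer "pass" or "fail".`;
    case 'pairwise':
      return `Instruction:\n${instruction}\n\nResponse A:\n${response}\n\nResponse B:\n${other}\n\n` +
        `Which response is better overall? Answer "A", "B", or "tie".`;
    case 'reference-based':
      return `Instruction:\n${instruction}\n\nReference answer:\n${other}\n\nCandidate response:\n${response}\n\n` +
        `Is the candidate consistent with the reference? Answer "pass" or "fail".`;
  }
}
```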
### Evaluation Metrics
Classification Metrics (preferred for binary tasks; see the computation sketch below):
- Recall and Precision
- F1 Score
- Cohen's κ (Kappa)
Correlation Metrics (for Likert-scale tasks):
- Spearman's ρ (rho)
- Kendall's τ (tau)
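A sketch of the classification metrics, treating the human label as ground truth; correlation metrics such as Spearman's ρ and Kendall's τ are usually taken from a stats library rather than hand-rolled.

```typescript
// Binary labels: 1 = positive class (e.g. "fails the criterion"), 0 = negative.
function classificationMetrics(predicted: number[], truth: number[]) {
  let tp = 0;
  let fp = 0;
  let fn = 0;
  for (let i = 0; i < predicted.length; i++) {
    if (predicted[i] === 1 && truth[i] === 1) tp++;
    else if (predicted[i] === 1 && truth[i] === 0) fp++;
    else if (predicted[i] === 0 && truth[i] === 1) fn++;
  }
  const precision = tp + fp > 0 ? tp / (tp + fp) : 0;
  const recall = tp + fn > 0 ? tp / (tp + fn) : 0;
  const f1 =
    precision + recall > 0 ? (2 * precision * recall) / (precision + recall) : 0;
  return { precision, recall, f1 };
}

// Example: LLM-evaluator verdicts scored against human labels.
console.log(classificationMetrics([1, 0, 1, 1, 0, 0], [1, 0, 1, 0, 0, 1]));
```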
## Known Biases
- Position Bias: LLM-evaluators tend to prefer responses in certain positions during pairwise comparison (usually the first position); the sketch after this list shows one way to measure it
- Verbosity Bias: Longer, more verbose responses are favored even when they are not higher quality
- Self-Enhancement Bias: LLM-evaluators prefer answers they generated themselves
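One way to quantify position bias, assuming a pairwise judge function (hypothetical signature) that returns "A", "B", or "tie": judge each pair in both orders and report the fraction of order-dependent verdicts.

```typescript
type Verdict = 'A' | 'B' | 'tie';
type PairwiseJudge = (
  instruction: string,
  first: string,
  second: string,
) => Promise<Verdict>;

// Fraction of pairs whose verdict depends on presentation order: judge each
// pair twice with the responses swapped and count inconsistent verdicts.
async function positionBiasRate(
  judge: PairwiseJudge,
  pairs: { instruction: string; a: string; b: string }[],
): Promise<number> {
  let inconsistent = 0;
  for (const { instruction, a, b } of pairs) {
    const forward = await judge(instruction, a, b);
    const reversed = await judge(instruction, b, a);
    // A consistent judge names the same underlying response in both orders.
    const consistent =
      (forward === 'A' && reversed === 'B') ||
      (forward === 'B' && reversed === 'A') ||
      (forward === 'tie' && reversed === 'tie');
    if (!consistent) inconsistent++;
  }
  return inconsistent / pairs.length;
}
```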
## Mitigation Strategies
- Swap response positions and combine the results across both orderings (see the sketch after this list)
- Normalize for length when evaluating
- Use a Panel of LLMs (PoLL) instead of a single judge
- Include "don't overthink" instructions
- Use chain-of-thought (CoT) and n-shot prompts for reliability
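A sketch of how the first and third strategies can be combined (types repeated from the bias check above so the snippet stands alone): each judge sees both orderings, inconsistent verdicts count as ties, and the panel's verdict is a simple majority.

```typescript
type Verdict = 'A' | 'B' | 'tie';
type PairwiseJudge = (
  instruction: string,
  first: string,
  second: string,
) => Promise<Verdict>;

// Position-debiased verdict from one judge: evaluate both orderings and keep
// a preference only when the two runs agree; otherwise record a tie.
async function debiasedVerdict(
  judge: PairwiseJudge,
  instruction: string,
  a: string,
  b: string,
): Promise<Verdict> {
  const forward = await judge(instruction, a, b);
  const reversed = await judge(instruction, b, a);
  if (forward === 'A' && reversed === 'B') return 'A';
  if (forward === 'B' && reversed === 'A') return 'B';
  return 'tie';
}

// Panel of LLMs (PoLL): majority vote over several judges instead of a
// single, possibly self-preferring, model.
async function panelVerdict(
  panel: PairwiseJudge[],
  instruction: string,
  a: string,
  b: string,
): Promise<Verdict> {
  const votes = await Promise.all(
    panel.map((judge) => debiasedVerdict(judge, instruction, a, b)),
  );
  const count = (v: Verdict) => votes.filter((x) => x === v).length;
  if (count('A') > count('B')) return 'A';
  if (count('B') > count('A')) return 'B';
  return 'tie';
}
```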
## Implementation Pattern
```typescript
// Assumed identifiers for MetricType, which the interfaces reference but do
// not define; they mirror the metrics listed above.
type MetricType =
  | 'recall'
  | 'precision'
  | 'f1'
  | 'cohens_kappa'
  | 'spearman_rho'
  | 'kendall_tau';

interface EvaluatorConfig {
  scoringApproach: 'direct' | 'pairwise' | 'reference-based';
  criteria: EvaluationCriteria[];
  metrics: MetricType[];
  useCoT: boolean; // prepend chain-of-thought reasoning before the verdict
  nShot: number; // number of in-context examples in the evaluator prompt
}

interface EvaluationCriteria {
  name: string;
  description: string;
  rubric: RubricLevel[];
}

interface RubricLevel {
  score: number;
  description: string;
}
```
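A hypothetical config instance showing how these interfaces compose for a binary factuality check; the criterion name, rubric wording, and metric identifiers are illustrative.

```typescript
const factualityEvaluator: EvaluatorConfig = {
  scoringApproach: 'direct',
  criteria: [
    {
      name: 'factuality',
      description:
        'Claims in the response are supported and not contradicted by known facts.',
      rubric: [
        { score: 0, description: 'Contains at least one unsupported or false claim.' },
        { score: 1, description: 'All claims are accurate and supported.' },
      ],
    },
  ],
  metrics: ['recall', 'precision', 'f1', 'cohens_kappa'],
  useCoT: true,
  nShot: 3,
};
```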
## References

Key papers reviewed:
- Constitutional AI (Anthropic)
- G-Eval: NLG Evaluation using GPT-4
- SelfCheckGPT: Zero-Resource Hallucination Detection
- Prometheus: Fine-grained Evaluation Capability
- MT-Bench and Chatbot Arena