# LLM-Evaluator Skill

## Overview
LLM-Evaluators (LLM-as-a-Judge) are large language models designed to evaluate the quality of another LLM's response to an instruction or query. This skill provides the foundational knowledge for building evaluation systems.
## Key Considerations

### Baseline Selection
- Human Annotators: Aim for LLM-human correlation to match human-human correlation (see the agreement sketch after this list). LLM-evaluators are orders of magnitude faster and cheaper than human annotation.
- Finetuned Classifiers: The goal is to match the recall and precision of a finetuned classifier. This is a more challenging baseline because such classifiers are optimized for a specific task.
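A minimal sketch of the human-baseline check, assuming binary pass/fail labels and placeholder data: Cohen's κ is computed for the LLM-vs-human and human-vs-human pairs, and the two values are compared.

```typescript
// Cohen's kappa for two binary (0/1) label arrays of equal length.
function cohensKappa(a: number[], b: number[]): number {
  const n = a.length;
  let agree = 0;
  let aPos = 0;
  let bPos = 0;
  for (let i = 0; i < n; i++) {
    if (a[i] === b[i]) agree++;
    if (a[i] === 1) aPos++;
    if (b[i] === 1) bPos++;
  }
  const po = agree / n; // observed agreement
  // Expected chance agreement from the marginal label rates.
  const pe = (aPos / n) * (bPos / n) + ((n - aPos) / n) * ((n - bPos) / n);
  return pe === 1 ? 1 : (po - pe) / (1 - pe);
}

// Placeholder labels for illustration (1 = pass, 0 = fail).
const humanA = [1, 0, 1, 1, 0, 1, 0, 1];
const humanB = [1, 0, 1, 0, 0, 1, 0, 1];
const llmLabels = [1, 0, 1, 1, 0, 0, 0, 1];

// Target: LLM-human agreement approaching human-human agreement.
console.log({
  llmVsHuman: cohensKappa(llmLabels, humanA),
  humanVsHuman: cohensKappa(humanA, humanB),
});
```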
### Scoring Approaches
| Approach | Use Case | Notes |
|---|---|---|
| Direct Scoring | Objective tasks (factuality, toxicity, instruction-following) | More reliable when framed as binary classification than as a Likert score |
| Pairwise Comparison | Subjective evaluations (tone, persuasiveness, coherence) | More reliable than direct scoring for preference tasks |
| Reference-Based | Comparing against a gold standard | Requires a ground-truth reference |
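To make the distinction concrete, a prompt builder per approach might look like the sketch below; the wording is illustrative, not a prescribed template.

```typescript
type ScoringApproach = 'direct' | 'pairwise' | 'reference-based';

// Builds an evaluation prompt for the chosen approach. `other` is the second
// response for pairwise comparison, or the gold reference for reference-based.
function buildEvalPrompt(
  approach: ScoringApproach,
  instruction: string,
  response: string,
  other = '',
): string {
  switch (approach) {
    case 'direct':
      // A binary pass/fail verdict tends to be more reliable than a 1-5 score.
      return `Instruction:\n${instruction}\n\nResponse:\n${response}\n\n` +
        `Does the response follow the instruction without factual errors? Answer "pass" or "fail".`;
    case 'pairwise':
      return `Instruction:\n${instruction}\n\nResponse A:\n${response}\n\nResponse B:\n${other}\n\n` +
        `Which response is better overall? Answer "A", "B", or "tie".`;
    case 'reference-based':
      return `Instruction:\n${instruction}\n\nReference answer:\n${other}\n\nCandidate response:\n${response}\n\n` +
        `Is the candidate consistent with the reference? Answer "pass" or "fail".`;
  }
}
```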
### Evaluation Metrics
Classification Metrics (preferred for binary tasks; see the computation sketch below):
- Recall and Precision
- F1 Score
- Cohen's κ (Kappa)
Correlation Metrics (for Likert-scale tasks):
- Spearman's ρ (rho)
- Kendall's τ (tau)
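A sketch of the classification metrics, treating the human label as ground truth; correlation metrics such as Spearman's ρ and Kendall's τ are usually taken from a stats library rather than hand-rolled.

```typescript
// Binary labels: 1 = positive class (e.g. "fails the criterion"), 0 = negative.
function classificationMetrics(predicted: number[], truth: number[]) {
  let tp = 0;
  let fp = 0;
  let fn = 0;
  for (let i = 0; i < predicted.length; i++) {
    if (predicted[i] === 1 && truth[i] === 1) tp++;
    else if (predicted[i] === 1 && truth[i] === 0) fp++;
    else if (predicted[i] === 0 && truth[i] === 1) fn++;
  }
  const precision = tp + fp > 0 ? tp / (tp + fp) : 0;
  const recall = tp + fn > 0 ? tp / (tp + fn) : 0;
  const f1 =
    precision + recall > 0 ? (2 * precision * recall) / (precision + recall) : 0;
  return { precision, recall, f1 };
}

// Example: LLM-evaluator verdicts scored against human labels.
console.log(classificationMetrics([1, 0, 1, 1, 0, 0], [1, 0, 1, 0, 0, 1]));
```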
## Known Biases
- Position Bias: LLM-evaluators tend to prefer responses in certain positions during pairwise comparison (usually the first position); the sketch after this list shows one way to measure it
- Verbosity Bias: Longer, more verbose responses are favored even when they are not higher quality
- Self-Enhancement Bias: LLM-evaluators prefer answers they generated themselves
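One way to quantify position bias, assuming a pairwise judge function (hypothetical signature) that returns "A", "B", or "tie": judge each pair in both orders and report the fraction of order-dependent verdicts.

```typescript
type Verdict = 'A' | 'B' | 'tie';
type PairwiseJudge = (
  instruction: string,
  first: string,
  second: string,
) => Promise<Verdict>;

// Fraction of pairs whose verdict depends on presentation order: judge each
// pair twice with the responses swapped and count inconsistent verdicts.
async function positionBiasRate(
  judge: PairwiseJudge,
  pairs: { instruction: string; a: string; b: string }[],
): Promise<number> {
  let inconsistent = 0;
  for (const { instruction, a, b } of pairs) {
    const forward = await judge(instruction, a, b);
    const reversed = await judge(instruction, b, a);
    // A consistent judge names the same underlying response in both orders.
    const consistent =
      (forward === 'A' && reversed === 'B') ||
      (forward === 'B' && reversed === 'A') ||
      (forward === 'tie' && reversed === 'tie');
    if (!consistent) inconsistent++;
  }
  return inconsistent / pairs.length;
}
```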
## Mitigation Strategies
- Swap response positions and combine the results across both orderings (see the sketch after this list)
- Normalize for length when evaluating
- Use a Panel of LLMs (PoLL) instead of a single judge
- Include "don't overthink" instructions
- Use chain-of-thought (CoT) and n-shot prompts for reliability
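A sketch of how the first and third strategies can be combined (types repeated from the bias check above so the snippet stands alone): each judge sees both orderings, inconsistent verdicts count as ties, and the panel's verdict is a simple majority.

```typescript
type Verdict = 'A' | 'B' | 'tie';
type PairwiseJudge = (
  instruction: string,
  first: string,
  second: string,
) => Promise<Verdict>;

// Position-debiased verdict from one judge: evaluate both orderings and keep
// a preference only when the two runs agree; otherwise record a tie.
async function debiasedVerdict(
  judge: PairwiseJudge,
  instruction: string,
  a: string,
  b: string,
): Promise<Verdict> {
  const forward = await judge(instruction, a, b);
  const reversed = await judge(instruction, b, a);
  if (forward === 'A' && reversed === 'B') return 'A';
  if (forward === 'B' && reversed === 'A') return 'B';
  return 'tie';
}

// Panel of LLMs (PoLL): majority vote over several judges instead of a
// single, possibly self-preferring, model.
async function panelVerdict(
  panel: PairwiseJudge[],
  instruction: string,
  a: string,
  b: string,
): Promise<Verdict> {
  const votes = await Promise.all(
    panel.map((judge) => debiasedVerdict(judge, instruction, a, b)),
  );
  const count = (v: Verdict) => votes.filter((x) => x === v).length;
  if (count('A') > count('B')) return 'A';
  if (count('B') > count('A')) return 'B';
  return 'tie';
}
```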
## Implementation Pattern
```typescript
// Assumed identifiers for MetricType, which the interfaces reference but do
// not define; they mirror the metrics listed above.
type MetricType =
  | 'recall'
  | 'precision'
  | 'f1'
  | 'cohens_kappa'
  | 'spearman_rho'
  | 'kendall_tau';

interface EvaluatorConfig {
  scoringApproach: 'direct' | 'pairwise' | 'reference-based';
  criteria: EvaluationCriteria[];
  metrics: MetricType[];
  useCoT: boolean; // prepend chain-of-thought reasoning before the verdict
  nShot: number; // number of in-context examples in the evaluator prompt
}

interface EvaluationCriteria {
  name: string;
  description: string;
  rubric: RubricLevel[];
}

interface RubricLevel {
  score: number;
  description: string;
}
```
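A hypothetical config instance showing how these interfaces compose for a binary factuality check; the criterion name, rubric wording, and metric identifiers are illustrative.

```typescript
const factualityEvaluator: EvaluatorConfig = {
  scoringApproach: 'direct',
  criteria: [
    {
      name: 'factuality',
      description:
        'Claims in the response are supported and not contradicted by known facts.',
      rubric: [
        { score: 0, description: 'Contains at least one unsupported or false claim.' },
        { score: 1, description: 'All claims are accurate and supported.' },
      ],
    },
  ],
  metrics: ['recall', 'precision', 'f1', 'cohens_kappa'],
  useCoT: true,
  nShot: 3,
};
```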
## References

Key papers reviewed:
- Constitutional AI (Anthropic)
- G-Eval: NLG Evaluation using GPT-4
- SelfCheckGPT: Zero-Resource Hallucination Detection
- Prometheus: Fine-grained Evaluation Capability
- MT-Bench and Chatbot Arena