# LLM-as-a-Judge Skills

A practical implementation of LLM evaluation skills built using insights from Eugene Yan's LLM-Evaluators research and Vercel AI SDK 6.
## Purpose
This repository demonstrates how to build production-ready LLM evaluation skills as part of the Agent Skills for Context Engineering project. It serves as a practical example of:
- Skill Development: How to transform research insights into executable agent skills
- Tool Design: Best practices for building AI tools with proper schemas and error handling
- Evaluation Patterns: Implementation of LLM-as-a-Judge patterns for quality assessment
### Part of the Context Engineering Ecosystem
This project is an example implementation to be added to the Agent-Skills-for-Context-Engineering repository. It builds upon the foundational skills from:
- `skills/context-fundamentals` - Context engineering principles
- `skills/tool-design` - Tool design best practices
## Background & Research
### The LLM-as-a-Judge Problem
Evaluating AI-generated content is challenging. Traditional metrics (BLEU, ROUGE) often miss nuances that matter. Eugene Yan's research on LLM-Evaluators identifies practical patterns for using LLMs to judge LLM outputs.
Key insights we implemented:
| Insight | Implementation |
|---|---|
| Direct scoring works best for objective criteria | `directScore` tool with rubric support |
| Pairwise comparison is more reliable for preferences | `pairwiseCompare` tool with position swapping |
| Position bias affects pairwise judgments | Automatic position swapping in comparisons |
| Chain-of-thought improves reliability | All evaluations require justification with evidence |
| Clear rubrics reduce variance | `generateRubric` tool for consistent standards |
### Vercel AI SDK 6 Patterns
We leveraged AI SDK 6's new patterns:
- **Agent Abstraction**: Reusable `EvaluatorAgent` class with multiple capabilities
- **Type-safe Tools**: Zod schemas for all inputs/outputs
- **Structured Output**: JSON responses parsed and validated
- **Error Handling**: Graceful degradation when API calls fail
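These patterns combine into a single tool shape: validate the input, do the work, and return a result object rather than throwing. The sketch below is a minimal, self-contained illustration of that shape; the hand-rolled validator stands in for the project's Zod schemas, and `runTool` and `ScoreInput` are hypothetical names, not the actual implementation.

```typescript
// Sketch of the tool pattern: validate input, execute, degrade gracefully.
// A hand-rolled validator stands in for Zod so the example is self-contained.
type ToolResult<T> = { success: true; data: T } | { success: false; error: string };

interface ScoreInput {
  response: string;
  prompt: string;
}

function validateScoreInput(raw: unknown): ScoreInput | null {
  if (typeof raw !== "object" || raw === null) return null;
  const r = raw as Record<string, unknown>;
  if (typeof r.response !== "string" || typeof r.prompt !== "string") return null;
  return { response: r.response, prompt: r.prompt };
}

async function runTool(raw: unknown): Promise<ToolResult<string>> {
  const input = validateScoreInput(raw);
  if (!input) return { success: false, error: "Invalid input" };
  try {
    // A real implementation would call the model API here.
    return { success: true, data: `Evaluated ${input.response.length} chars` };
  } catch (err) {
    // Graceful degradation: never throw past the tool boundary.
    return { success: false, error: String(err) };
  }
}
```

Because the caller always receives a `{ success, ... }` object, a failed API call or malformed input never crashes the surrounding agent loop.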
## What We Built
### Architecture Overview
```
┌──────────────────────────────────────────────────────────┐
│                  LLM-as-a-Judge Skills                   │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  ┌──────────┐   ┌───────────┐   ┌────────────────────┐   │
│  │  Skills  │──▶│  Prompts  │──▶│       Tools        │   │
│  │ (MD docs)│   │(templates)│   │ (TypeScript impl)  │   │
│  └──────────┘   └───────────┘   └─────────┬──────────┘   │
│                                           │              │
│                                           ▼              │
│                              ┌────────────────────────┐  │
│                              │     EvaluatorAgent     │  │
│                              │  ├─ score()            │  │
│                              │  ├─ compare()          │  │
│                              │  ├─ generateRubric()   │  │
│                              │  └─ chat()             │  │
│                              └───────────┬────────────┘  │
│                                          │               │
│                                          ▼               │
│                              ┌────────────────────────┐  │
│                              │   OpenAI GPT-5.2 API   │  │
│                              └────────────────────────┘  │
│                                                          │
└──────────────────────────────────────────────────────────┘
```

### Directory Structure
```
llm-as-judge-skills/
├── skills/                           # Foundational knowledge (MD docs)
│   ├── llm-evaluator/                # LLM-as-a-Judge patterns
│   │   └── llm-evaluator.md          # Evaluation methods, metrics, bias mitigation
│   ├── context-fundamentals/         # Context engineering principles
│   │   └── context-fundamentals.md   # Managing context effectively
│   └── tool-design/                  # Tool design best practices
│       └── tool-design.md            # Schema design, error handling
│
├── prompts/                          # Prompt templates
│   ├── evaluation/
│   │   ├── direct-scoring-prompt.md       # Scoring prompt template
│   │   └── pairwise-comparison-prompt.md  # Comparison prompt template
│   ├── research/
│   │   └── research-synthesis-prompt.md
│   └── agent-system/
│       └── orchestrator-prompt.md
│
├── tools/                            # Tool documentation (MD)
│   ├── evaluation/
│   │   ├── direct-score.md           # Direct scoring tool spec
│   │   ├── pairwise-compare.md       # Pairwise comparison spec
│   │   └── generate-rubric.md        # Rubric generation spec
│   ├── research/
│   │   ├── web-search.md
│   │   └── read-url.md
│   └── orchestration/
│       └── delegate-to-agent.md
│
├── agents/                           # Agent documentation (MD)
│   ├── evaluator-agent/
│   │   └── evaluator-agent.md
│   ├── research-agent/
│   │   └── research-agent.md
│   └── orchestrator-agent/
│       └── orchestrator-agent.md
│
├── src/                              # TypeScript implementation
│   ├── tools/evaluation/
│   │   ├── direct-score.ts           # 165 lines - Direct scoring implementation
│   │   ├── pairwise-compare.ts       # 255 lines - Pairwise with bias mitigation
│   │   └── generate-rubric.ts        # 162 lines - Rubric generation
│   ├── agents/
│   │   └── evaluator.ts              # 112 lines - EvaluatorAgent class
│   ├── config/
│   │   └── index.ts                  # Configuration and validation
│   └── index.ts                      # Main exports
│
├── tests/                            # Test suite
│   ├── evaluation.test.ts            # 9 tests for tools
│   ├── skills.test.ts                # 10 tests for skills
│   └── setup.ts                      # Test configuration
│
└── examples/                         # Usage examples
    ├── basic-evaluation.ts
    ├── pairwise-comparison.ts
    ├── generate-rubric.ts
    └── full-evaluation-workflow.ts
```

## Core Tools Implemented
### 1. Direct Score Tool (`directScore`)
Purpose: Evaluate a single response against defined criteria with numerical scores.
When to Use:
- Factual accuracy checks
- Instruction following assessment
- Content quality grading
- Compliance verification
Implementation Highlights:
```typescript
// From src/tools/evaluation/direct-score.ts
const systemPrompt = `You are an expert evaluator. Assess the response against each criterion.

For each criterion:
1. Find specific evidence in the response
2. Score according to the rubric (1-5 scale)
3. Justify your score
4. Suggest one improvement

Be objective and consistent. Base scores on explicit evidence.`;
```

Key Features:
- Weighted criteria support
- Chain-of-thought justification required
- Evidence extraction from response
- Improvement suggestions per criterion
- Configurable rubrics (1-3, 1-5, 1-10 scales)
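The difference between the two aggregates the tool reports can be sketched as follows. These helpers are hypothetical illustrations of the arithmetic (`overallScore` is a plain mean, `weightedScore` is weight-adjusted), not the actual implementation.

```typescript
// Hypothetical helpers showing overallScore (unweighted mean) vs.
// weightedScore (weight-adjusted mean), rounded to two decimals.
interface CriterionScore { score: number; weight: number; }

function overallScore(scores: CriterionScore[]): number {
  const sum = scores.reduce((acc, s) => acc + s.score, 0);
  return Number((sum / scores.length).toFixed(2));
}

function weightedScore(scores: CriterionScore[]): number {
  const totalWeight = scores.reduce((acc, s) => acc + s.weight, 0);
  const sum = scores.reduce((acc, s) => acc + s.score * s.weight, 0);
  return Number((sum / totalWeight).toFixed(2));
}

const scores = [
  { score: 4, weight: 0.4 }, // Accuracy
  { score: 5, weight: 0.3 }, // Clarity
  { score: 4, weight: 0.3 }, // Engagement
];
console.log(overallScore(scores));  // 4.33
console.log(weightedScore(scores)); // 4.3
```

With unequal weights the two values diverge, which is exactly what the weighted-criteria test checks later on.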
Example Usage:
```typescript
const result = await executeDirectScore({
  response: 'Quantum entanglement is like having two magical coins...',
  prompt: 'Explain quantum entanglement to a high school student',
  criteria: [
    { name: 'Accuracy', description: 'Scientific correctness', weight: 0.4 },
    { name: 'Clarity', description: 'Understandable for audience', weight: 0.3 },
    { name: 'Engagement', description: 'Interesting and memorable', weight: 0.3 }
  ],
  rubric: { scale: '1-5' }
});

// Output:
// {
//   success: true,
//   scores: [
//     { criterion: 'Accuracy', score: 4, justification: '...', evidence: [...] },
//     { criterion: 'Clarity', score: 5, justification: '...', evidence: [...] },
//     { criterion: 'Engagement', score: 4, justification: '...', evidence: [...] }
//   ],
//   overallScore: 4.33,
//   weightedScore: 4.3,
//   summary: { assessment: '...', strengths: [...], weaknesses: [...] }
// }
```

### 2. Pairwise Compare Tool (`pairwiseCompare`)
Purpose: Compare two responses and determine which is better, with position bias mitigation.
When to Use:
- A/B testing responses
- Preference evaluation
- Style and tone assessment
- Ranking quality differences
Implementation Highlights:
```typescript
// Position bias mitigation: evaluate twice with swapped positions
if (input.swapPositions) {
  // First pass: A first, B second
  const pass1 = await evaluatePair(input.responseA, input.responseB, ...);
  // Second pass: B first, A second
  const pass2 = await evaluatePair(input.responseB, input.responseA, ...);

  // Map pass2's result back and check consistency
  const pass2WinnerMapped = pass2.winner === 'A' ? 'B' : pass2.winner === 'B' ? 'A' : 'TIE';
  const consistent = pass1.winner === pass2WinnerMapped;

  // If inconsistent, return TIE with lower confidence
  if (!consistent) {
    finalWinner = 'TIE';
    finalConfidence = 0.5;
  }
}
```

Key Features:
- Position Swapping: Automatically runs evaluation twice with swapped positions
- Consistency Check: Detects when position affects judgment
- Confidence Scoring: 0-1 confidence based on consistency
- Per-criterion Comparison: Detailed breakdown for each aspect
- Bias-aware Prompting: Explicit instructions to ignore length and position
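The swap-and-check logic can be exercised in isolation. The standalone sketch below illustrates the verdict-mapping step; `resolve` and its confidence values are simplified illustrations, not the actual implementation.

```typescript
// Standalone version of the position-swap consistency check: the second
// pass sees the responses in reverse order, so its verdict must be
// mapped back before it can be compared with the first pass.
type Winner = "A" | "B" | "TIE";

function mapSecondPass(winner: Winner): Winner {
  return winner === "A" ? "B" : winner === "B" ? "A" : "TIE";
}

function resolve(pass1: Winner, pass2: Winner): { winner: Winner; confidence: number } {
  const consistent = pass1 === mapSecondPass(pass2);
  // Inconsistent verdicts suggest position bias: fall back to TIE
  // with low confidence (illustrative values).
  return consistent
    ? { winner: pass1, confidence: 0.85 }
    : { winner: "TIE", confidence: 0.5 };
}

console.log(resolve("A", "B")); // consistent: A won under both orderings
console.log(resolve("A", "A")); // inconsistent: same slot won twice, so TIE
```

Note the subtlety: if pass 2 reports "A", the response that was originally B won the second pass, so agreeing verdict labels actually indicate *dis*agreement.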
Example Usage:
```typescript
const result = await executePairwiseCompare({
  responseA: GOOD_RESPONSE,
  responseB: POOR_RESPONSE,
  prompt: 'Explain quantum entanglement',
  criteria: ['accuracy', 'clarity', 'completeness', 'engagement'],
  allowTie: true,
  swapPositions: true // Enable position bias mitigation
});

// Output:
// {
//   success: true,
//   winner: 'A',
//   confidence: 0.85,
//   positionConsistency: { consistent: true, firstPassWinner: 'A', secondPassWinner: 'A' },
//   comparison: [
//     { criterion: 'accuracy', winner: 'A', reasoning: '...' },
//     { criterion: 'clarity', winner: 'A', reasoning: '...' },
//     ...
//   ]
// }
```

### 3. Generate Rubric Tool (`generateRubric`)
Purpose: Create detailed scoring rubrics for consistent evaluation standards.
When to Use:
- Establishing evaluation criteria
- Training human evaluators
- Ensuring consistency across evaluations
- Documenting quality standards
Implementation Highlights:
```typescript
// Strictness affects the generated rubric:
// - lenient:  lower bar for passing scores
// - balanced: fair, typical expectations
// - strict:   high standards, critical evaluation
const userPrompt = `Create a scoring rubric for:

**Criterion**: ${input.criterionName}
**Description**: ${input.criterionDescription}
**Scale**: ${input.scale}
**Domain**: ${input.domain}

Generate:
1. Clear descriptions for each score level
2. Specific characteristics that define each level
3. Brief example text for each level
4. General scoring guidelines
5. Edge cases with guidance`;
```

Key Features:
- Domain-specific terminology
- Configurable strictness levels
- Example generation for each level
- Edge case guidance
- Scoring guidelines
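Because the rubric comes back from a model, it is worth checking that it actually covers the requested scale. The guard below is a hypothetical sanity check (`coversScale` is not part of the tool), sketched under the assumption that each level carries a numeric `score`:

```typescript
// Hypothetical sanity check for a generated rubric: every point on the
// requested scale should appear exactly once among the level definitions.
interface RubricLevel { score: number; label: string; }

function coversScale(levels: RubricLevel[], scale: "1-3" | "1-5" | "1-10"): boolean {
  const max = Number(scale.split("-")[1]);
  const scores = new Set(levels.map((l) => l.score));
  if (scores.size !== max) return false;
  for (let s = 1; s <= max; s++) {
    if (!scores.has(s)) return false;
  }
  return true;
}

const levels = [1, 2, 3, 4, 5].map((score) => ({ score, label: `level ${score}` }));
console.log(coversScale(levels, "1-5")); // true
```

A structured-output parse that passes the Zod schema can still be semantically incomplete (e.g. a missing level 3), which is what a check like this catches.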
Example Usage:
```typescript
const result = await executeGenerateRubric({
  criterionName: 'Code Readability',
  criterionDescription: 'How easy the code is to understand and maintain',
  scale: '1-5',
  domain: 'software engineering',
  includeExamples: true,
  strictness: 'balanced'
});

// Output:
// {
//   success: true,
//   levels: [
//     { score: 1, label: 'Poor', description: '...', characteristics: [...], example: '...' },
//     { score: 2, label: 'Below Average', ... },
//     { score: 3, label: 'Average', ... },
//     { score: 4, label: 'Good', ... },
//     { score: 5, label: 'Excellent', ... }
//   ],
//   scoringGuidelines: [...],
//   edgeCases: [{ situation: '...', guidance: '...' }]
// }
```

### 4. Evaluator Agent
Purpose: High-level agent that combines all evaluation tools with conversational capability.
Implementation:
```typescript
export class EvaluatorAgent {
  private model: string;
  private temperature: number;

  constructor(config?: EvaluatorAgentConfig) {
    this.model = config?.model || 'gpt-5.2';
    this.temperature = config?.temperature || 0.3;
  }

  // Score a response
  async score(input: DirectScoreInput) { ... }

  // Compare two responses
  async compare(input: PairwiseCompareInput) { ... }

  // Generate a rubric
  async generateRubric(input: GenerateRubricInput) { ... }

  // Full workflow: generate rubric, then score
  async evaluateWithGeneratedRubric(response, prompt, criteria) { ... }

  // Chat-based evaluation
  async chat(userMessage: string) { ... }
}
```

## Test Results
All 19 tests pass. Below are the actual logs from our test run:

### Test Output
```
> [email protected] test
> vitest run --testTimeout=120000

 RUN  v2.1.9 /Users/muratcankoylan/app_readwren

 ✓ tests/skills.test.ts (10 tests) 159317ms
   ✓ LLM Evaluator Skill Tests > Direct Scoring Skill > should use chain-of-thought in scoring 4439ms
   ✓ LLM Evaluator Skill Tests > Direct Scoring Skill > should handle multiple weighted criteria 7218ms
   ✓ LLM Evaluator Skill Tests > Pairwise Comparison Skill > should mitigate position bias with swap 13002ms
   ✓ LLM Evaluator Skill Tests > Pairwise Comparison Skill > should identify clear winner for quality difference 25914ms
   ✓ LLM Evaluator Skill Tests > Rubric Generation Skill > should generate domain-specific rubrics 37165ms
   ✓ LLM Evaluator Skill Tests > Rubric Generation Skill > should provide edge case guidance 29088ms
   ✓ LLM Evaluator Skill Tests > Context Fundamentals Skill Application > should utilize provided context in evaluation 11133ms
   ✓ Skill Input/Output Validation > should validate DirectScore input schema 4733ms
   ✓ Skill Input/Output Validation > should validate PairwiseCompare output structure 4123ms
   ✓ Skill Input/Output Validation > should validate GenerateRubric output structure 22500ms
 ✓ tests/evaluation.test.ts (9 tests) 216353ms
   ✓ Direct Score Tool > should score a response against criteria 13219ms
   ✓ Direct Score Tool > should provide lower scores for poor responses 14834ms
   ✓ Pairwise Compare Tool > should correctly identify the better response 29254ms
   ✓ Pairwise Compare Tool > should handle similar responses appropriately 14418ms
   ✓ Pairwise Compare Tool > should provide comparison details for each criterion 9931ms
   ✓ Generate Rubric Tool > should generate a complete rubric 24106ms
   ✓ Generate Rubric Tool > should respect strictness setting 57919ms
   ✓ Evaluator Agent > should provide integrated evaluation workflow 48112ms
   ✓ Evaluator Agent > should support chat-based evaluation 4558ms

 Test Files  2 passed (2)
      Tests  19 passed (19)
   Start at  00:25:16
   Duration  216.66s (transform 68ms, setup 32ms, collect 148ms, tests 375.67s, environment 0ms, prepare 105ms)
```

### Test Coverage Summary
| Test Category | Tests | Pass Rate | Avg Duration |
|---|---|---|---|
| Direct Scoring | 4 | 100% | 9.9s |
| Pairwise Comparison | 4 | 100% | 17.9s |
| Rubric Generation | 4 | 100% | 33.2s |
| Context Integration | 1 | 100% | 11.1s |
| Agent Integration | 2 | 100% | 26.3s |
| Schema Validation | 4 | 100% | 8.8s |
## Key Learnings
### 1. Position Bias is Real
During testing, we confirmed Eugene Yan's research findings:
```
Test: "should mitigate position bias with swap" - 13002ms
Result: Position consistency check correctly detected and mitigated bias
```

When comparing identical responses, the system correctly returns TIE. When comparing responses of clearly different quality, the winner is consistent across position swaps.
### 2. Chain-of-Thought Improves Quality
Tests confirm that requiring justification produces more reliable evaluations:
```
Test: "should use chain-of-thought in scoring" - 4439ms
Result: All scores include justifications >20 characters with specific evidence
```

### 3. Domain-Specific Rubrics Matter
The rubric generator adapts to the specified domain:
```
Test: "should generate domain-specific rubrics" - 37165ms
Result: Software engineering rubric included terms like "variable", "function", "comment"
```

### 4. Weighted Criteria Enable Nuanced Evaluation
```
Test: "should handle multiple weighted criteria" - 7218ms
Result: weightedScore differs from overallScore when weights are unequal
```

### 5. Context Affects Evaluation
The context fundamentals skill proves valuable:
```
Test: "should utilize provided context in evaluation" - 11133ms
Result: Medical context allowed technical terminology to score well
```

## Quick Start
### Installation
```sh
git clone https://github.com/muratcankoylan/llm-as-judge-skills.git
cd llm-as-judge-skills
npm install
```

### Configuration
Create a `.env` file:

```sh
OPENAI_API_KEY=your_openai_api_key_here
OPENAI_MODEL=gpt-5.2
```

### Run Tests
```sh
npm test
```

### Basic Usage
```typescript
import { EvaluatorAgent } from './src/agents/evaluator';

const agent = new EvaluatorAgent();

// Score a response
const scoreResult = await agent.score({
  response: 'Your AI-generated response',
  prompt: 'The original prompt',
  criteria: [
    { name: 'Accuracy', description: 'Factual correctness', weight: 1 }
  ]
});
console.log(`Score: ${scoreResult.overallScore}/5`);

// Compare two responses
const compareResult = await agent.compare({
  responseA: 'First response',
  responseB: 'Second response',
  prompt: 'The prompt',
  criteria: ['quality', 'completeness'],
  allowTie: true,
  swapPositions: true
});
console.log(`Winner: ${compareResult.winner} (confidence: ${compareResult.confidence})`);
```

## Integration with Agent Skills Repository
This project is designed to be added to the examples section of the main repository:
```
Agent-Skills-for-Context-Engineering/
├── skills/
│   ├── context-fundamentals/    # Foundation (referenced by this project)
│   └── tool-design/             # Foundation (referenced by this project)
└── examples/
    └── llm-as-judge-skills/     # ← This project
        ├── README.md
        ├── skills/
        ├── tools/
        ├── agents/
        └── src/
```

### How This Example Demonstrates the Framework
- Skills → Prompts → Tools: Shows the progression from knowledge (MD files) to executable code
- Context Engineering: Applies context fundamentals in evaluation prompts
- Tool Design Patterns: Implements Zod schemas, error handling, and clear interfaces
- Agent Architecture: Uses AI SDK patterns for agent abstraction
## API Reference
### DirectScoreInput

```typescript
interface DirectScoreInput {
  response: string;          // The response to evaluate
  prompt: string;            // Original prompt
  context?: string;          // Additional context
  criteria: Array<{
    name: string;            // Criterion name
    description: string;     // What it measures
    weight: number;          // Relative importance (0-1)
  }>;
  rubric?: {
    scale: '1-3' | '1-5' | '1-10';
    levelDescriptions?: Record<string, string>;
  };
}
```

### PairwiseCompareInput
```typescript
interface PairwiseCompareInput {
  responseA: string;         // First response
  responseB: string;         // Second response
  prompt: string;            // Original prompt
  context?: string;          // Additional context
  criteria: string[];        // Comparison aspects
  allowTie?: boolean;        // Allow tie verdict (default: true)
  swapPositions?: boolean;   // Mitigate position bias (default: true)
}
```

### GenerateRubricInput
```typescript
interface GenerateRubricInput {
  criterionName: string;            // Name of criterion
  criterionDescription: string;     // What it measures
  scale?: '1-3' | '1-5' | '1-10';
  domain?: string;                  // Domain for terminology
  includeExamples?: boolean;        // Generate examples
  strictness?: 'lenient' | 'balanced' | 'strict';
}
```

## Development
### Scripts
```sh
npm run build      # Compile TypeScript
npm run dev        # Watch mode
npm test           # Run tests
npm run lint       # ESLint
npm run format     # Prettier
npm run typecheck  # Type check
```

### Adding New Tools
1. Create `src/tools/<category>/<tool-name>.ts`
2. Define input/output Zod schemas
3. Implement the execute function
4. Export from `src/tools/<category>/index.ts`
5. Add documentation in `tools/<category>/<tool-name>.md`
6. Write tests
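A minimal skeleton for a new tool might look like the following. The `wordCount` tool, its file path, and its types are hypothetical examples; a real tool in this repository would also define Zod schemas and call the model.

```typescript
// src/tools/example/word-count.ts (hypothetical) — skeleton showing the
// input type, output type, and execute-function shape a new tool follows.
export interface WordCountInput {
  text: string;
}

export interface WordCountOutput {
  success: boolean;
  wordCount?: number;
  error?: string;
}

export async function executeWordCount(input: WordCountInput): Promise<WordCountOutput> {
  // Runtime guard in addition to the compile-time type, mirroring the
  // schema-validation step of the real tools.
  if (typeof input.text !== "string") {
    return { success: false, error: "text must be a string" };
  }
  const wordCount = input.text.trim().split(/\s+/).filter(Boolean).length;
  return { success: true, wordCount };
}
```

Keeping every tool to this validated-input, result-object shape is what lets the `EvaluatorAgent` compose them without special-case error handling.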
## License
MIT License - see LICENSE for details.
## Acknowledgments
- Eugene Yan - LLM-as-a-Judge research
- Vercel AI SDK - Agent patterns and tooling
- Agent Skills for Context Engineering - Foundation framework