# LLM-as-a-Judge Skills

A practical implementation of LLM evaluation skills built using insights from Eugene Yan's LLM-Evaluators research and Vercel AI SDK 6.
## Purpose
This repository demonstrates how to build production-ready LLM evaluation skills as part of the Agent Skills for Context Engineering project. It serves as a practical example of:
- Skill Development: How to transform research insights into executable agent skills
- Tool Design: Best practices for building AI tools with proper schemas and error handling
- Evaluation Patterns: Implementation of LLM-as-a-Judge patterns for quality assessment
### Part of the Context Engineering Ecosystem
This project is an example implementation to be added to the Agent-Skills-for-Context-Engineering repository. It builds upon the foundational skills from:
- `skills/context-fundamentals` - Context engineering principles
- `skills/tool-design` - Tool design best practices
## Background & Research
### The LLM-as-a-Judge Problem
Evaluating AI-generated content is challenging. Traditional metrics (BLEU, ROUGE) often miss nuances that matter. Eugene Yan's research on LLM-Evaluators identifies practical patterns for using LLMs to judge LLM outputs.
Key insights we implemented:
| Insight | Implementation |
|---|---|
| Direct scoring works best for objective criteria | `directScore` tool with rubric support |
| Pairwise comparison is more reliable for preferences | `pairwiseCompare` tool with position swapping |
| Position bias affects pairwise judgments | Automatic position swapping in comparisons |
| Chain-of-thought improves reliability | All evaluations require justification with evidence |
| Clear rubrics reduce variance | `generateRubric` tool for consistent standards |
### Vercel AI SDK 6 Patterns
We leveraged AI SDK 6's new patterns:
- **Agent Abstraction**: Reusable `EvaluatorAgent` class with multiple capabilities
- **Type-safe Tools**: Zod schemas for all inputs/outputs
- **Structured Output**: JSON responses parsed and validated
- **Error Handling**: Graceful degradation when API calls fail
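These patterns combine into a single tool shape: validate the input, do the work, and return a result object rather than throwing. The sketch below is a minimal, self-contained illustration of that shape; the hand-rolled validator stands in for the project's Zod schemas, and `runTool` and `ScoreInput` are hypothetical names, not the actual implementation.

```typescript
// Sketch of the tool pattern: validate input, execute, degrade gracefully.
// A hand-rolled validator stands in for Zod so the example is self-contained.
type ToolResult<T> = { success: true; data: T } | { success: false; error: string };

interface ScoreInput {
  response: string;
  prompt: string;
}

function validateScoreInput(raw: unknown): ScoreInput | null {
  if (typeof raw !== "object" || raw === null) return null;
  const r = raw as Record<string, unknown>;
  if (typeof r.response !== "string" || typeof r.prompt !== "string") return null;
  return { response: r.response, prompt: r.prompt };
}

async function runTool(raw: unknown): Promise<ToolResult<string>> {
  const input = validateScoreInput(raw);
  if (!input) return { success: false, error: "Invalid input" };
  try {
    // A real implementation would call the model API here.
    return { success: true, data: `Evaluated ${input.response.length} chars` };
  } catch (err) {
    // Graceful degradation: never throw past the tool boundary.
    return { success: false, error: String(err) };
  }
}
```

Because the caller always receives a `{ success, ... }` object, a failed API call or malformed input never crashes the surrounding agent loop.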
## What We Built
### Architecture Overview
```
┌──────────────────────────────────────────────────────────┐
│                  LLM-as-a-Judge Skills                   │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  ┌──────────┐   ┌───────────┐   ┌────────────────────┐   │
│  │  Skills  │──▶│  Prompts  │──▶│       Tools        │   │
│  │ (MD docs)│   │(templates)│   │ (TypeScript impl)  │   │
│  └──────────┘   └───────────┘   └─────────┬──────────┘   │
│                                           │              │
│                                           ▼              │
│                              ┌────────────────────────┐  │
│                              │     EvaluatorAgent     │  │
│                              │  ├─ score()            │  │
│                              │  ├─ compare()          │  │
│                              │  ├─ generateRubric()   │  │
│                              │  └─ chat()             │  │
│                              └───────────┬────────────┘  │
│                                          │               │
│                                          ▼               │
│                              ┌────────────────────────┐  │
│                              │   OpenAI GPT-5.2 API   │  │
│                              └────────────────────────┘  │
│                                                          │
└──────────────────────────────────────────────────────────┘
```

### Directory Structure
```
llm-as-judge-skills/
├── skills/                           # Foundational knowledge (MD docs)
│   ├── llm-evaluator/                # LLM-as-a-Judge patterns
│   │   └── llm-evaluator.md          # Evaluation methods, metrics, bias mitigation
│   ├── context-fundamentals/         # Context engineering principles
│   │   └── context-fundamentals.md   # Managing context effectively
│   └── tool-design/                  # Tool design best practices
│       └── tool-design.md            # Schema design, error handling
│
├── prompts/                          # Prompt templates
│   ├── evaluation/
│   │   ├── direct-scoring-prompt.md       # Scoring prompt template
│   │   └── pairwise-comparison-prompt.md  # Comparison prompt template
│   ├── research/
│   │   └── research-synthesis-prompt.md
│   └── agent-system/
│       └── orchestrator-prompt.md
│
├── tools/                            # Tool documentation (MD)
│   ├── evaluation/
│   │   ├── direct-score.md           # Direct scoring tool spec
│   │   ├── pairwise-compare.md       # Pairwise comparison spec
│   │   └── generate-rubric.md        # Rubric generation spec
│   ├── research/
│   │   ├── web-search.md
│   │   └── read-url.md
│   └── orchestration/
│       └── delegate-to-agent.md
│
├── agents/                           # Agent documentation (MD)
│   ├── evaluator-agent/
│   │   └── evaluator-agent.md
│   ├── research-agent/
│   │   └── research-agent.md
│   └── orchestrator-agent/
│       └── orchestrator-agent.md
│
├── src/                              # TypeScript implementation
│   ├── tools/evaluation/
│   │   ├── direct-score.ts           # 165 lines - Direct scoring implementation
│   │   ├── pairwise-compare.ts       # 255 lines - Pairwise with bias mitigation
│   │   └── generate-rubric.ts        # 162 lines - Rubric generation
│   ├── agents/
│   │   └── evaluator.ts              # 112 lines - EvaluatorAgent class
│   ├── config/
│   │   └── index.ts                  # Configuration and validation
│   └── index.ts                      # Main exports
│
├── tests/                            # Test suite
│   ├── evaluation.test.ts            # 9 tests for tools
│   ├── skills.test.ts                # 10 tests for skills
│   └── setup.ts                      # Test configuration
│
└── examples/                         # Usage examples
    ├── basic-evaluation.ts
    ├── pairwise-comparison.ts
    ├── generate-rubric.ts
    └── full-evaluation-workflow.ts
```

## Core Tools Implemented
### 1. Direct Score Tool (`directScore`)
Purpose: Evaluate a single response against defined criteria with numerical scores.
When to Use:
- Factual accuracy checks
- Instruction following assessment
- Content quality grading
- Compliance verification
Implementation Highlights:
```typescript
// From src/tools/evaluation/direct-score.ts
const systemPrompt = `You are an expert evaluator. Assess the response against each criterion.

For each criterion:
1. Find specific evidence in the response
2. Score according to the rubric (1-5 scale)
3. Justify your score
4. Suggest one improvement

Be objective and consistent. Base scores on explicit evidence.`;
```

Key Features:
- Weighted criteria support
- Chain-of-thought justification required
- Evidence extraction from response
- Improvement suggestions per criterion
- Configurable rubrics (1-3, 1-5, 1-10 scales)
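The difference between the two aggregates the tool reports can be sketched as follows. These helpers are hypothetical illustrations of the arithmetic (`overallScore` is a plain mean, `weightedScore` is weight-adjusted), not the actual implementation.

```typescript
// Hypothetical helpers showing overallScore (unweighted mean) vs.
// weightedScore (weight-adjusted mean), rounded to two decimals.
interface CriterionScore { score: number; weight: number; }

function overallScore(scores: CriterionScore[]): number {
  const sum = scores.reduce((acc, s) => acc + s.score, 0);
  return Number((sum / scores.length).toFixed(2));
}

function weightedScore(scores: CriterionScore[]): number {
  const totalWeight = scores.reduce((acc, s) => acc + s.weight, 0);
  const sum = scores.reduce((acc, s) => acc + s.score * s.weight, 0);
  return Number((sum / totalWeight).toFixed(2));
}

const scores = [
  { score: 4, weight: 0.4 }, // Accuracy
  { score: 5, weight: 0.3 }, // Clarity
  { score: 4, weight: 0.3 }, // Engagement
];
console.log(overallScore(scores));  // 4.33
console.log(weightedScore(scores)); // 4.3
```

With unequal weights the two values diverge, which is exactly what the weighted-criteria test checks later on.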
Example Usage:
```typescript
const result = await executeDirectScore({
  response: 'Quantum entanglement is like having two magical coins...',
  prompt: 'Explain quantum entanglement to a high school student',
  criteria: [
    { name: 'Accuracy', description: 'Scientific correctness', weight: 0.4 },
    { name: 'Clarity', description: 'Understandable for audience', weight: 0.3 },
    { name: 'Engagement', description: 'Interesting and memorable', weight: 0.3 }
  ],
  rubric: { scale: '1-5' }
});

// Output:
// {
//   success: true,
//   scores: [
//     { criterion: 'Accuracy', score: 4, justification: '...', evidence: [...] },
//     { criterion: 'Clarity', score: 5, justification: '...', evidence: [...] },
//     { criterion: 'Engagement', score: 4, justification: '...', evidence: [...] }
//   ],
//   overallScore: 4.33,
//   weightedScore: 4.3,
//   summary: { assessment: '...', strengths: [...], weaknesses: [...] }
// }
```

### 2. Pairwise Compare Tool (`pairwiseCompare`)
Purpose: Compare two responses and determine which is better, with position bias mitigation.
When to Use:
- A/B testing responses
- Preference evaluation
- Style and tone assessment
- Ranking quality differences
Implementation Highlights:
```typescript
// Position bias mitigation: evaluate twice with swapped positions
if (input.swapPositions) {
  // First pass: A first, B second
  const pass1 = await evaluatePair(input.responseA, input.responseB, ...);
  // Second pass: B first, A second
  const pass2 = await evaluatePair(input.responseB, input.responseA, ...);

  // Map pass2's result back and check consistency
  const pass2WinnerMapped = pass2.winner === 'A' ? 'B' : pass2.winner === 'B' ? 'A' : 'TIE';
  const consistent = pass1.winner === pass2WinnerMapped;

  // If inconsistent, return TIE with lower confidence
  if (!consistent) {
    finalWinner = 'TIE';
    finalConfidence = 0.5;
  }
}
```

Key Features:
- Position Swapping: Automatically runs evaluation twice with swapped positions
- Consistency Check: Detects when position affects judgment
- Confidence Scoring: 0-1 confidence based on consistency
- Per-criterion Comparison: Detailed breakdown for each aspect
- Bias-aware Prompting: Explicit instructions to ignore length and position
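The swap-and-check logic can be exercised in isolation. The standalone sketch below illustrates the verdict-mapping step; `resolve` and its confidence values are simplified illustrations, not the actual implementation.

```typescript
// Standalone version of the position-swap consistency check: the second
// pass sees the responses in reverse order, so its verdict must be
// mapped back before it can be compared with the first pass.
type Winner = "A" | "B" | "TIE";

function mapSecondPass(winner: Winner): Winner {
  return winner === "A" ? "B" : winner === "B" ? "A" : "TIE";
}

function resolve(pass1: Winner, pass2: Winner): { winner: Winner; confidence: number } {
  const consistent = pass1 === mapSecondPass(pass2);
  // Inconsistent verdicts suggest position bias: fall back to TIE
  // with low confidence (illustrative values).
  return consistent
    ? { winner: pass1, confidence: 0.85 }
    : { winner: "TIE", confidence: 0.5 };
}

console.log(resolve("A", "B")); // consistent: A won under both orderings
console.log(resolve("A", "A")); // inconsistent: same slot won twice, so TIE
```

Note the subtlety: if pass 2 reports "A", the response that was originally B won the second pass, so agreeing verdict labels actually indicate *dis*agreement.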
Example Usage:
```typescript
const result = await executePairwiseCompare({
  responseA: GOOD_RESPONSE,
  responseB: POOR_RESPONSE,
  prompt: 'Explain quantum entanglement',
  criteria: ['accuracy', 'clarity', 'completeness', 'engagement'],
  allowTie: true,
  swapPositions: true // Enable position bias mitigation
});

// Output:
// {
//   success: true,
//   winner: 'A',
//   confidence: 0.85,
//   positionConsistency: { consistent: true, firstPassWinner: 'A', secondPassWinner: 'A' },
//   comparison: [
//     { criterion: 'accuracy', winner: 'A', reasoning: '...' },
//     { criterion: 'clarity', winner: 'A', reasoning: '...' },
//     ...
//   ]
// }
```

### 3. Generate Rubric Tool (`generateRubric`)
Purpose: Create detailed scoring rubrics for consistent evaluation standards.
When to Use:
- Establishing evaluation criteria
- Training human evaluators
- Ensuring consistency across evaluations
- Documenting quality standards
Implementation Highlights:
```typescript
// Strictness affects the generated rubric:
// - lenient:  lower bar for passing scores
// - balanced: fair, typical expectations
// - strict:   high standards, critical evaluation
const userPrompt = `Create a scoring rubric for:

**Criterion**: ${input.criterionName}
**Description**: ${input.criterionDescription}
**Scale**: ${input.scale}
**Domain**: ${input.domain}

Generate:
1. Clear descriptions for each score level
2. Specific characteristics that define each level
3. Brief example text for each level
4. General scoring guidelines
5. Edge cases with guidance`;
```

Key Features:
- Domain-specific terminology
- Configurable strictness levels
- Example generation for each level
- Edge case guidance
- Scoring guidelines
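Because the rubric comes back from a model, it is worth checking that it actually covers the requested scale. The guard below is a hypothetical sanity check (`coversScale` is not part of the tool), sketched under the assumption that each level carries a numeric `score`:

```typescript
// Hypothetical sanity check for a generated rubric: every point on the
// requested scale should appear exactly once among the level definitions.
interface RubricLevel { score: number; label: string; }

function coversScale(levels: RubricLevel[], scale: "1-3" | "1-5" | "1-10"): boolean {
  const max = Number(scale.split("-")[1]);
  const scores = new Set(levels.map((l) => l.score));
  if (scores.size !== max) return false;
  for (let s = 1; s <= max; s++) {
    if (!scores.has(s)) return false;
  }
  return true;
}

const levels = [1, 2, 3, 4, 5].map((score) => ({ score, label: `level ${score}` }));
console.log(coversScale(levels, "1-5")); // true
```

A structured-output parse that passes the Zod schema can still be semantically incomplete (e.g. a missing level 3), which is what a check like this catches.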
Example Usage:
```typescript
const result = await executeGenerateRubric({
  criterionName: 'Code Readability',
  criterionDescription: 'How easy the code is to understand and maintain',
  scale: '1-5',
  domain: 'software engineering',
  includeExamples: true,
  strictness: 'balanced'
});

// Output:
// {
//   success: true,
//   levels: [
//     { score: 1, label: 'Poor', description: '...', characteristics: [...], example: '...' },
//     { score: 2, label: 'Below Average', ... },
//     { score: 3, label: 'Average', ... },
//     { score: 4, label: 'Good', ... },
//     { score: 5, label: 'Excellent', ... }
//   ],
//   scoringGuidelines: [...],
//   edgeCases: [{ situation: '...', guidance: '...' }]
// }
```

### 4. Evaluator Agent
Purpose: High-level agent that combines all evaluation tools with conversational capability.
Implementation:
```typescript
export class EvaluatorAgent {
  private model: string;
  private temperature: number;

  constructor(config?: EvaluatorAgentConfig) {
    this.model = config?.model || 'gpt-5.2';
    this.temperature = config?.temperature || 0.3;
  }

  // Score a response
  async score(input: DirectScoreInput) { ... }

  // Compare two responses
  async compare(input: PairwiseCompareInput) { ... }

  // Generate a rubric
  async generateRubric(input: GenerateRubricInput) { ... }

  // Full workflow: generate rubric, then score
  async evaluateWithGeneratedRubric(response, prompt, criteria) { ... }

  // Chat-based evaluation
  async chat(userMessage: string) { ... }
}
```

## Test Results
All 19 tests pass. Below are the actual logs from our test run:

### Test Output
```
> [email protected] test
> vitest run --testTimeout=120000

 RUN  v2.1.9 /Users/muratcankoylan/app_readwren

 ✓ tests/skills.test.ts (10 tests) 159317ms
   ✓ LLM Evaluator Skill Tests > Direct Scoring Skill > should use chain-of-thought in scoring 4439ms
   ✓ LLM Evaluator Skill Tests > Direct Scoring Skill > should handle multiple weighted criteria 7218ms
   ✓ LLM Evaluator Skill Tests > Pairwise Comparison Skill > should mitigate position bias with swap 13002ms
   ✓ LLM Evaluator Skill Tests > Pairwise Comparison Skill > should identify clear winner for quality difference 25914ms
   ✓ LLM Evaluator Skill Tests > Rubric Generation Skill > should generate domain-specific rubrics 37165ms
   ✓ LLM Evaluator Skill Tests > Rubric Generation Skill > should provide edge case guidance 29088ms
   ✓ LLM Evaluator Skill Tests > Context Fundamentals Skill Application > should utilize provided context in evaluation 11133ms
   ✓ Skill Input/Output Validation > should validate DirectScore input schema 4733ms
   ✓ Skill Input/Output Validation > should validate PairwiseCompare output structure 4123ms
   ✓ Skill Input/Output Validation > should validate GenerateRubric output structure 22500ms
 ✓ tests/evaluation.test.ts (9 tests) 216353ms
   ✓ Direct Score Tool > should score a response against criteria 13219ms
   ✓ Direct Score Tool > should provide lower scores for poor responses 14834ms
   ✓ Pairwise Compare Tool > should correctly identify the better response 29254ms
   ✓ Pairwise Compare Tool > should handle similar responses appropriately 14418ms
   ✓ Pairwise Compare Tool > should provide comparison details for each criterion 9931ms
   ✓ Generate Rubric Tool > should generate a complete rubric 24106ms
   ✓ Generate Rubric Tool > should respect strictness setting 57919ms
   ✓ Evaluator Agent > should provide integrated evaluation workflow 48112ms
   ✓ Evaluator Agent > should support chat-based evaluation 4558ms

 Test Files  2 passed (2)
      Tests  19 passed (19)
   Start at  00:25:16
   Duration  216.66s (transform 68ms, setup 32ms, collect 148ms, tests 375.67s, environment 0ms, prepare 105ms)
```

### Test Coverage Summary
| Test Category | Tests | Pass Rate | Avg Duration |
|---|---|---|---|
| Direct Scoring | 4 | 100% | 9.9s |
| Pairwise Comparison | 4 | 100% | 17.9s |
| Rubric Generation | 4 | 100% | 33.2s |
| Context Integration | 1 | 100% | 11.1s |
| Agent Integration | 2 | 100% | 26.3s |
| Schema Validation | 4 | 100% | 8.8s |
## Key Learnings
### 1. Position Bias is Real
During testing, we confirmed Eugene Yan's research findings:
```
Test: "should mitigate position bias with swap" - 13002ms
Result: Position consistency check correctly detected and mitigated bias
```

When comparing identical responses, the system correctly returns TIE. When comparing responses of clearly different quality, the winner is consistent across position swaps.
### 2. Chain-of-Thought Improves Quality
Tests confirm that requiring justification produces more reliable evaluations:
```
Test: "should use chain-of-thought in scoring" - 4439ms
Result: All scores include justifications >20 characters with specific evidence
```

### 3. Domain-Specific Rubrics Matter
The rubric generator adapts to the specified domain:
```
Test: "should generate domain-specific rubrics" - 37165ms
Result: Software engineering rubric included terms like "variable", "function", "comment"
```

### 4. Weighted Criteria Enable Nuanced Evaluation
```
Test: "should handle multiple weighted criteria" - 7218ms
Result: weightedScore differs from overallScore when weights are unequal
```

### 5. Context Affects Evaluation
The context fundamentals skill proves valuable:
```
Test: "should utilize provided context in evaluation" - 11133ms
Result: Medical context allowed technical terminology to score well
```

## Quick Start
### Installation
```sh
git clone https://github.com/muratcankoylan/llm-as-judge-skills.git
cd llm-as-judge-skills
npm install
```

### Configuration
Create a `.env` file:

```sh
OPENAI_API_KEY=your_openai_api_key_here
OPENAI_MODEL=gpt-5.2
```

### Run Tests
```sh
npm test
```

### Basic Usage
```typescript
import { EvaluatorAgent } from './src/agents/evaluator';

const agent = new EvaluatorAgent();

// Score a response
const scoreResult = await agent.score({
  response: 'Your AI-generated response',
  prompt: 'The original prompt',
  criteria: [
    { name: 'Accuracy', description: 'Factual correctness', weight: 1 }
  ]
});
console.log(`Score: ${scoreResult.overallScore}/5`);

// Compare two responses
const compareResult = await agent.compare({
  responseA: 'First response',
  responseB: 'Second response',
  prompt: 'The prompt',
  criteria: ['quality', 'completeness'],
  allowTie: true,
  swapPositions: true
});
console.log(`Winner: ${compareResult.winner} (confidence: ${compareResult.confidence})`);
```

## Integration with Agent Skills Repository
This project is designed to be added to the examples section of the main repository:
```
Agent-Skills-for-Context-Engineering/
├── skills/
│   ├── context-fundamentals/    # Foundation (referenced by this project)
│   └── tool-design/             # Foundation (referenced by this project)
└── examples/
    └── llm-as-judge-skills/     # ← This project
        ├── README.md
        ├── skills/
        ├── tools/
        ├── agents/
        └── src/
```

### How This Example Demonstrates the Framework
- Skills → Prompts → Tools: Shows the progression from knowledge (MD files) to executable code
- Context Engineering: Applies context fundamentals in evaluation prompts
- Tool Design Patterns: Implements Zod schemas, error handling, and clear interfaces
- Agent Architecture: Uses AI SDK patterns for agent abstraction
## API Reference
### DirectScoreInput

```typescript
interface DirectScoreInput {
  response: string;          // The response to evaluate
  prompt: string;            // Original prompt
  context?: string;          // Additional context
  criteria: Array<{
    name: string;            // Criterion name
    description: string;     // What it measures
    weight: number;          // Relative importance (0-1)
  }>;
  rubric?: {
    scale: '1-3' | '1-5' | '1-10';
    levelDescriptions?: Record<string, string>;
  };
}
```

### PairwiseCompareInput
```typescript
interface PairwiseCompareInput {
  responseA: string;         // First response
  responseB: string;         // Second response
  prompt: string;            // Original prompt
  context?: string;          // Additional context
  criteria: string[];        // Comparison aspects
  allowTie?: boolean;        // Allow tie verdict (default: true)
  swapPositions?: boolean;   // Mitigate position bias (default: true)
}
```

### GenerateRubricInput
```typescript
interface GenerateRubricInput {
  criterionName: string;            // Name of criterion
  criterionDescription: string;     // What it measures
  scale?: '1-3' | '1-5' | '1-10';
  domain?: string;                  // Domain for terminology
  includeExamples?: boolean;        // Generate examples
  strictness?: 'lenient' | 'balanced' | 'strict';
}
```

## Development
### Scripts
```sh
npm run build      # Compile TypeScript
npm run dev        # Watch mode
npm test           # Run tests
npm run lint       # ESLint
npm run format     # Prettier
npm run typecheck  # Type check
```

### Adding New Tools
1. Create `src/tools/<category>/<tool-name>.ts`
2. Define input/output Zod schemas
3. Implement the execute function
4. Export from `src/tools/<category>/index.ts`
5. Add documentation in `tools/<category>/<tool-name>.md`
6. Write tests
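A minimal skeleton for a new tool might look like the following. The `wordCount` tool, its file path, and its types are hypothetical examples; a real tool in this repository would also define Zod schemas and call the model.

```typescript
// src/tools/example/word-count.ts (hypothetical) — skeleton showing the
// input type, output type, and execute-function shape a new tool follows.
export interface WordCountInput {
  text: string;
}

export interface WordCountOutput {
  success: boolean;
  wordCount?: number;
  error?: string;
}

export async function executeWordCount(input: WordCountInput): Promise<WordCountOutput> {
  // Runtime guard in addition to the compile-time type, mirroring the
  // schema-validation step of the real tools.
  if (typeof input.text !== "string") {
    return { success: false, error: "text must be a string" };
  }
  const wordCount = input.text.trim().split(/\s+/).filter(Boolean).length;
  return { success: true, wordCount };
}
```

Keeping every tool to this validated-input, result-object shape is what lets the `EvaluatorAgent` compose them without special-case error handling.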
## License
MIT License - see LICENSE for details.
## Acknowledgments
- Eugene Yan - LLM-as-a-Judge research
- Vercel AI SDK - Agent patterns and tooling
- Agent Skills for Context Engineering - Foundation framework