Loading source
Pulling the file list, source metadata, and syntax-aware rendering for this listing.
Source from repo
A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems.
Files
Skill
Size
Entrypoint
Format
Open file
Syntax-highlighted preview of this file as included in the skill package.
examples/llm-as-judge-skills/README.md
1# LLM-as-a-Judge Skills23> A practical implementation of LLM evaluation skills built using insights from [Eugene Yan's LLM-Evaluators research](https://eugeneyan.com/writing/llm-evaluators/) and [Vercel AI SDK 6](https://vercel.com/blog/ai-sdk-6).45[](https://opensource.org/licenses/MIT)6[](https://www.typescriptlang.org/)7[](https://sdk.vercel.ai/)8[](#test-results)910## 🎯 Purpose1112This repository demonstrates how to build **production-ready LLM evaluation skills** as part of the [Agent Skills for Context Engineering](https://github.com/muratcankoylan/Agent-Skills-for-Context-Engineering) project. It serves as a practical example of:13141. **Skill Development**: How to transform research insights into executable agent skills152. **Tool Design**: Best practices for building AI tools with proper schemas and error handling163. **Evaluation Patterns**: Implementation of LLM-as-a-Judge patterns for quality assessment1718### Part of the Context Engineering Ecosystem1920This project is an example implementation to be added to:21- 📁 [`Agent-Skills-for-Context-Engineering/examples/`](https://github.com/muratcankoylan/Agent-Skills-for-Context-Engineering/tree/main/examples)2223It builds upon the foundational skills from:24- 📚 [`skills/context-fundamentals`](https://github.com/muratcankoylan/Agent-Skills-for-Context-Engineering/tree/main/skills/context-fundamentals) - Context engineering principles25- 🔧 [`skills/tool-design`](https://github.com/muratcankoylan/Agent-Skills-for-Context-Engineering/tree/main/skills/tool-design) - Tool design best practices2627---2829## 📖 Background & Research3031### The LLM-as-a-Judge Problem3233Evaluating AI-generated content is challenging. Traditional metrics (BLEU, ROUGE) often miss nuances that matter. Eugene Yan's research on [LLM-Evaluators](https://eugeneyan.com/writing/llm-evaluators/) identifies practical patterns for using LLMs to judge LLM outputs.3435**Key insights we implemented:**3637| Insight | Implementation |38|---------|----------------|39| Direct scoring works best for objective criteria | `directScore` tool with rubric support |40| Pairwise comparison is more reliable for preferences | `pairwiseCompare` tool with position swapping |41| Position bias affects pairwise judgments | Automatic position swapping in comparisons |42| Chain-of-thought improves reliability | All evaluations require justification with evidence |43| Clear rubrics reduce variance | `generateRubric` tool for consistent standards |4445### Vercel AI SDK 6 Patterns4647We leveraged AI SDK 6's new patterns:4849- **Agent Abstraction**: Reusable `EvaluatorAgent` class with multiple capabilities50- **Type-safe Tools**: Zod schemas for all inputs/outputs51- **Structured Output**: JSON responses parsed and validated52- **Error Handling**: Graceful degradation when API calls fail5354---5556## 🏗️ What We Built5758### Architecture Overview5960```61┌─────────────────────────────────────────────────────────────────────┐62│ LLM-as-a-Judge Skills │63├─────────────────────────────────────────────────────────────────────┤64│ │65│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │66│ │ Skills │ │ Prompts │ │ Tools │ │67│ │ (MD docs) │───▶│ (templates)│───▶│ (TypeScript impl) │ │68│ └─────────────┘ └─────────────┘ └─────────────────────────┘ │69│ │ │ │70│ │ ▼ │71│ │ ┌─────────────────────────┐ │72│ └─────────────────────────────▶│ EvaluatorAgent │ │73│ │ ├── score() │ │74│ │ ├── compare() │ │75│ │ ├── generateRubric() │ │76│ │ └── chat() │ │77│ └─────────────────────────┘ │78│ │ │79│ ▼ │80│ ┌─────────────────────────┐ │81│ │ OpenAI GPT-5.2 API │ │82│ └─────────────────────────┘ │83│ │84└─────────────────────────────────────────────────────────────────────┘85```8687### Directory Structure8889```90llm-as-judge-skills/91├── skills/ # Foundational knowledge (MD docs)92│ ├── llm-evaluator/ # LLM-as-a-Judge patterns93│ │ └── llm-evaluator.md # Evaluation methods, metrics, bias mitigation94│ ├── context-fundamentals/ # Context engineering principles95│ │ └── context-fundamentals.md # Managing context effectively96│ └── tool-design/ # Tool design best practices97│ └── tool-design.md # Schema design, error handling98│99├── prompts/ # Prompt templates100│ ├── evaluation/101│ │ ├── direct-scoring-prompt.md # Scoring prompt template102│ │ └── pairwise-comparison-prompt.md # Comparison prompt template103│ ├── research/104│ │ └── research-synthesis-prompt.md105│ └── agent-system/106│ └── orchestrator-prompt.md107│108├── tools/ # Tool documentation (MD)109│ ├── evaluation/110│ │ ├── direct-score.md # Direct scoring tool spec111│ │ ├── pairwise-compare.md # Pairwise comparison spec112│ │ └── generate-rubric.md # Rubric generation spec113│ ├── research/114│ │ ├── web-search.md115│ │ └── read-url.md116│ └── orchestration/117│ └── delegate-to-agent.md118│119├── agents/ # Agent documentation (MD)120│ ├── evaluator-agent/121│ │ └── evaluator-agent.md122│ ├── research-agent/123│ │ └── research-agent.md124│ └── orchestrator-agent/125│ └── orchestrator-agent.md126│127├── src/ # TypeScript implementation128│ ├── tools/evaluation/129│ │ ├── direct-score.ts # 165 lines - Direct scoring implementation130│ │ ├── pairwise-compare.ts # 255 lines - Pairwise with bias mitigation131│ │ └── generate-rubric.ts # 162 lines - Rubric generation132│ ├── agents/133│ │ └── evaluator.ts # 112 lines - EvaluatorAgent class134│ ├── config/135│ │ └── index.ts # Configuration and validation136│ └── index.ts # Main exports137│138├── tests/ # Test suite139│ ├── evaluation.test.ts # 9 tests for tools140│ ├── skills.test.ts # 10 tests for skills141│ └── setup.ts # Test configuration142│143└── examples/ # Usage examples144├── basic-evaluation.ts145├── pairwise-comparison.ts146├── generate-rubric.ts147└── full-evaluation-workflow.ts148```149150---151152## 🔧 Core Tools Implemented153154### 1. Direct Score Tool (`directScore`)155156**Purpose**: Evaluate a single response against defined criteria with numerical scores.157158**When to Use**:159- Factual accuracy checks160- Instruction following assessment161- Content quality grading162- Compliance verification163164**Implementation Highlights**:165166```typescript167// From src/tools/evaluation/direct-score.ts168169const systemPrompt = `You are an expert evaluator. Assess the response against each criterion.170For each criterion:1711. Find specific evidence in the response1722. Score according to the rubric (1-5 scale)1733. Justify your score1744. Suggest one improvement175176Be objective and consistent. Base scores on explicit evidence.`;177```178179**Key Features**:180- Weighted criteria support181- Chain-of-thought justification required182- Evidence extraction from response183- Improvement suggestions per criterion184- Configurable rubrics (1-3, 1-5, 1-10 scales)185186**Example Usage**:187188```typescript189const result = await executeDirectScore({190response: 'Quantum entanglement is like having two magical coins...',191prompt: 'Explain quantum entanglement to a high school student',192criteria: [193{ name: 'Accuracy', description: 'Scientific correctness', weight: 0.4 },194{ name: 'Clarity', description: 'Understandable for audience', weight: 0.3 },195{ name: 'Engagement', description: 'Interesting and memorable', weight: 0.3 }196],197rubric: { scale: '1-5' }198});199200// Output:201// {202// success: true,203// scores: [204// { criterion: 'Accuracy', score: 4, justification: '...', evidence: [...] },205// { criterion: 'Clarity', score: 5, justification: '...', evidence: [...] },206// { criterion: 'Engagement', score: 4, justification: '...', evidence: [...] }207// ],208// overallScore: 4.33,209// weightedScore: 4.3,210// summary: { assessment: '...', strengths: [...], weaknesses: [...] }211// }212```213214---215216### 2. Pairwise Compare Tool (`pairwiseCompare`)217218**Purpose**: Compare two responses and determine which is better, with position bias mitigation.219220**When to Use**:221- A/B testing responses222- Preference evaluation223- Style and tone assessment224- Ranking quality differences225226**Implementation Highlights**:227228```typescript229// Position bias mitigation: evaluate twice with swapped positions230if (input.swapPositions) {231// First pass: A first, B second232const pass1 = await evaluatePair(input.responseA, input.responseB, ...);233234// Second pass: B first, A second235const pass2 = await evaluatePair(input.responseB, input.responseA, ...);236237// Map pass2 result back and check consistency238const pass2WinnerMapped = pass2.winner === 'A' ? 'B' : pass2.winner === 'B' ? 'A' : 'TIE';239const consistent = pass1.winner === pass2WinnerMapped;240241// If inconsistent, return TIE with lower confidence242if (!consistent) {243finalWinner = 'TIE';244finalConfidence = 0.5;245}246}247```248249**Key Features**:250- **Position Swapping**: Automatically runs evaluation twice with swapped positions251- **Consistency Check**: Detects when position affects judgment252- **Confidence Scoring**: 0-1 confidence based on consistency253- **Per-criterion Comparison**: Detailed breakdown for each aspect254- **Bias-aware Prompting**: Explicit instructions to ignore length and position255256**Example Usage**:257258```typescript259const result = await executePairwiseCompare({260responseA: GOOD_RESPONSE,261responseB: POOR_RESPONSE,262prompt: 'Explain quantum entanglement',263criteria: ['accuracy', 'clarity', 'completeness', 'engagement'],264allowTie: true,265swapPositions: true // Enable position bias mitigation266});267268// Output:269// {270// success: true,271// winner: 'A',272// confidence: 0.85,273// positionConsistency: { consistent: true, firstPassWinner: 'A', secondPassWinner: 'A' },274// comparison: [275// { criterion: 'accuracy', winner: 'A', reasoning: '...' },276// { criterion: 'clarity', winner: 'A', reasoning: '...' },277// ...278// ]279// }280```281282---283284### 3. Generate Rubric Tool (`generateRubric`)285286**Purpose**: Create detailed scoring rubrics for consistent evaluation standards.287288**When to Use**:289- Establishing evaluation criteria290- Training human evaluators291- Ensuring consistency across evaluations292- Documenting quality standards293294**Implementation Highlights**:295296```typescript297// Strictness affects the generated rubric:298// - lenient: Lower bar for passing scores299// - balanced: Fair, typical expectations300// - strict: High standards, critical evaluation301302const userPrompt = `Create a scoring rubric for:303**Criterion**: ${input.criterionName}304**Description**: ${input.criterionDescription}305**Scale**: ${input.scale}306**Domain**: ${input.domain}307308Generate:3091. Clear descriptions for each score level3102. Specific characteristics that define each level3113. Brief example text for each level3124. General scoring guidelines3135. Edge cases with guidance`;314```315316**Key Features**:317- Domain-specific terminology318- Configurable strictness levels319- Example generation for each level320- Edge case guidance321- Scoring guidelines322323**Example Usage**:324325```typescript326const result = await executeGenerateRubric({327criterionName: 'Code Readability',328criterionDescription: 'How easy the code is to understand and maintain',329scale: '1-5',330domain: 'software engineering',331includeExamples: true,332strictness: 'balanced'333});334335// Output:336// {337// success: true,338// levels: [339// { score: 1, label: 'Poor', description: '...', characteristics: [...], example: '...' },340// { score: 2, label: 'Below Average', ... },341// { score: 3, label: 'Average', ... },342// { score: 4, label: 'Good', ... },343// { score: 5, label: 'Excellent', ... }344// ],345// scoringGuidelines: [...],346// edgeCases: [{ situation: '...', guidance: '...' }]347// }348```349350---351352### 4. Evaluator Agent353354**Purpose**: High-level agent that combines all evaluation tools with conversational capability.355356**Implementation**:357358```typescript359export class EvaluatorAgent {360private model: string;361private temperature: number;362363constructor(config?: EvaluatorAgentConfig) {364this.model = config?.model || 'gpt-5.2';365this.temperature = config?.temperature || 0.3;366}367368// Score a response369async score(input: DirectScoreInput) { ... }370371// Compare two responses372async compare(input: PairwiseCompareInput) { ... }373374// Generate a rubric375async generateRubric(input: GenerateRubricInput) { ... }376377// Full workflow: generate rubric then score378async evaluateWithGeneratedRubric(response, prompt, criteria) { ... }379380// Chat-based evaluation381async chat(userMessage: string) { ... }382}383```384385---386387## 📊 Test Results388389All 19 tests pass successfully. Here are the actual test logs from our test run:390391### Test Output392393```394> [email protected] test395> vitest run --testTimeout=120000396397RUN v2.1.9 /Users/muratcankoylan/app_readwren398399✓ tests/skills.test.ts (10 tests) 159317ms400✓ LLM Evaluator Skill Tests > Direct Scoring Skill > should use chain-of-thought in scoring 4439ms401✓ LLM Evaluator Skill Tests > Direct Scoring Skill > should handle multiple weighted criteria 7218ms402✓ LLM Evaluator Skill Tests > Pairwise Comparison Skill > should mitigate position bias with swap 13002ms403✓ LLM Evaluator Skill Tests > Pairwise Comparison Skill > should identify clear winner for quality difference 25914ms404✓ LLM Evaluator Skill Tests > Rubric Generation Skill > should generate domain-specific rubrics 37165ms405✓ LLM Evaluator Skill Tests > Rubric Generation Skill > should provide edge case guidance 29088ms406✓ LLM Evaluator Skill Tests > Context Fundamentals Skill Application > should utilize provided context in evaluation 11133ms407✓ Skill Input/Output Validation > should validate DirectScore input schema 4733ms408✓ Skill Input/Output Validation > should validate PairwiseCompare output structure 4123ms409✓ Skill Input/Output Validation > should validate GenerateRubric output structure 22500ms410411✓ tests/evaluation.test.ts (9 tests) 216353ms412✓ Direct Score Tool > should score a response against criteria 13219ms413✓ Direct Score Tool > should provide lower scores for poor responses 14834ms414✓ Pairwise Compare Tool > should correctly identify the better response 29254ms415✓ Pairwise Compare Tool > should handle similar responses appropriately 14418ms416✓ Pairwise Compare Tool > should provide comparison details for each criterion 9931ms417✓ Generate Rubric Tool > should generate a complete rubric 24106ms418✓ Generate Rubric Tool > should respect strictness setting 57919ms419✓ Evaluator Agent > should provide integrated evaluation workflow 48112ms420✓ Evaluator Agent > should support chat-based evaluation 4558ms421422Test Files 2 passed (2)423Tests 19 passed (19)424Start at 00:25:16425Duration 216.66s (transform 68ms, setup 32ms, collect 148ms, tests 375.67s, environment 0ms, prepare 105ms)426```427428### Test Coverage Summary429430| Test Category | Tests | Pass Rate | Avg Duration |431|--------------|-------|-----------|--------------|432| Direct Scoring | 4 | 100% | 9.9s |433| Pairwise Comparison | 4 | 100% | 17.9s |434| Rubric Generation | 4 | 100% | 33.2s |435| Context Integration | 1 | 100% | 11.1s |436| Agent Integration | 2 | 100% | 26.3s |437| Schema Validation | 4 | 100% | 8.8s |438439---440441## 📚 Key Learnings442443### 1. Position Bias is Real444445During testing, we confirmed Eugene Yan's research findings:446447```448Test: "should mitigate position bias with swap" - 13002ms449Result: Position consistency check correctly detected and mitigated bias450```451452When comparing identical responses, the system correctly returns `TIE`. When comparing clearly different quality responses, the winner is consistent across position swaps.453454### 2. Chain-of-Thought Improves Quality455456Tests confirm that requiring justification produces more reliable evaluations:457458```459Test: "should use chain-of-thought in scoring" - 4439ms460Result: All scores include justifications >20 characters with specific evidence461```462463### 3. Domain-Specific Rubrics Matter464465The rubric generator adapts to the specified domain:466467```468Test: "should generate domain-specific rubrics" - 37165ms469Result: Software engineering rubric included terms like "variable", "function", "comment"470```471472### 4. Weighted Criteria Enable Nuanced Evaluation473474```475Test: "should handle multiple weighted criteria" - 7218ms476Result: weightedScore differs from overallScore when weights are unequal477```478479### 5. Context Affects Evaluation480481The context fundamentals skill proves valuable:482483```484Test: "should utilize provided context in evaluation" - 11133ms485Result: Medical context allowed technical terminology to score well486```487488---489490## 🚀 Quick Start491492### Installation493494```bash495git clone https://github.com/muratcankoylan/llm-as-judge-skills.git496cd llm-as-judge-skills497npm install498```499500### Configuration501502Create a `.env` file:503504```bash505OPENAI_API_KEY=your_openai_api_key_here506OPENAI_MODEL=gpt-5.2507```508509### Run Tests510511```bash512npm test513```514515### Basic Usage516517```typescript518import { EvaluatorAgent } from './src/agents/evaluator';519520const agent = new EvaluatorAgent();521522// Score a response523const scoreResult = await agent.score({524response: 'Your AI-generated response',525prompt: 'The original prompt',526criteria: [527{ name: 'Accuracy', description: 'Factual correctness', weight: 1 }528]529});530531console.log(`Score: ${scoreResult.overallScore}/5`);532533// Compare two responses534const compareResult = await agent.compare({535responseA: 'First response',536responseB: 'Second response',537prompt: 'The prompt',538criteria: ['quality', 'completeness'],539allowTie: true,540swapPositions: true541});542543console.log(`Winner: ${compareResult.winner} (confidence: ${compareResult.confidence})`);544```545546---547548## 🔗 Integration with Agent Skills Repository549550This project is designed to be added to the examples section of the main repository:551552```553Agent-Skills-for-Context-Engineering/554├── skills/555│ ├── context-fundamentals/ # Foundation (referenced by this project)556│ └── tool-design/ # Foundation (referenced by this project)557├── examples/558│ └── llm-as-judge-skills/ # ← This project559│ ├── README.md560│ ├── skills/561│ ├── tools/562│ ├── agents/563│ └── src/564```565566### How This Example Demonstrates the Framework5675681. **Skills → Prompts → Tools**: Shows the progression from knowledge (MD files) to executable code5692. **Context Engineering**: Applies context fundamentals in evaluation prompts5703. **Tool Design Patterns**: Implements Zod schemas, error handling, and clear interfaces5714. **Agent Architecture**: Uses AI SDK patterns for agent abstraction572573---574575## 📋 API Reference576577### DirectScoreInput578579```typescript580interface DirectScoreInput {581response: string; // The response to evaluate582prompt: string; // Original prompt583context?: string; // Additional context584criteria: Array<{585name: string; // Criterion name586description: string; // What it measures587weight: number; // Relative importance (0-1)588}>;589rubric?: {590scale: '1-3' | '1-5' | '1-10';591levelDescriptions?: Record<string, string>;592};593}594```595596### PairwiseCompareInput597598```typescript599interface PairwiseCompareInput {600responseA: string; // First response601responseB: string; // Second response602prompt: string; // Original prompt603context?: string; // Additional context604criteria: string[]; // Comparison aspects605allowTie?: boolean; // Allow tie verdict (default: true)606swapPositions?: boolean; // Mitigate position bias (default: true)607}608```609610### GenerateRubricInput611612```typescript613interface GenerateRubricInput {614criterionName: string; // Name of criterion615criterionDescription: string; // What it measures616scale?: '1-3' | '1-5' | '1-10';617domain?: string; // Domain for terminology618includeExamples?: boolean; // Generate examples619strictness?: 'lenient' | 'balanced' | 'strict';620}621```622623---624625## 🛠️ Development626627### Scripts628629```bash630npm run build # Compile TypeScript631npm run dev # Watch mode632npm test # Run tests633npm run lint # ESLint634npm run format # Prettier635npm run typecheck # Type check636```637638### Adding New Tools6396401. Create `src/tools/<category>/<tool-name>.ts`6412. Define input/output Zod schemas6423. Implement execute function6434. Export from `src/tools/<category>/index.ts`6445. Add documentation in `tools/<category>/<tool-name>.md`6456. Write tests646647---648649## 📄 License650651MIT License - see [LICENSE](LICENSE) for details.652653---654655## 🙏 Acknowledgments656657- [Eugene Yan](https://eugeneyan.com/writing/llm-evaluators/) - LLM-as-a-Judge research658- [Vercel AI SDK](https://sdk.vercel.ai/) - Agent patterns and tooling659- [Agent Skills for Context Engineering](https://github.com/muratcankoylan/Agent-Skills-for-Context-Engineering) - Foundation framework660