Source from repo

Agent Skills for Context Engineering

A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems.

muratcankoylanGitHub muratcankoylanSource repo Original GitHub link

Files

241

Skill

n/a

Size

2.6 MB

Entrypoint

SKILL.md

Format

git-repo

Open file

examples/llm-as-judge-skills/README.md

Syntax-highlighted preview of this file as included in the skill package.

Rendered Source

markdown660 linesFree

examples/llm-as-judge-skills/README.md

1# LLM-as-a-Judge Skills
2 
3> A practical implementation of LLM evaluation skills built using insights from [Eugene Yan's LLM-Evaluators research](https://eugeneyan.com/writing/llm-evaluators/) and [Vercel AI SDK 6](https://vercel.com/blog/ai-sdk-6).
4 
5[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
6[![TypeScript](https://img.shields.io/badge/TypeScript-5.6-blue.svg)](https://www.typescriptlang.org/)
7[![AI SDK](https://img.shields.io/badge/AI%20SDK-4.1-green.svg)](https://sdk.vercel.ai/)
8[![Tests](https://img.shields.io/badge/Tests-19%20passed-brightgreen.svg)](#test-results)
9 
10## 🎯 Purpose
11 
12This repository demonstrates how to build **production-ready LLM evaluation skills** as part of the [Agent Skills for Context Engineering](https://github.com/muratcankoylan/Agent-Skills-for-Context-Engineering) project. It serves as a practical example of:
13 
141. **Skill Development**: How to transform research insights into executable agent skills
152. **Tool Design**: Best practices for building AI tools with proper schemas and error handling
163. **Evaluation Patterns**: Implementation of LLM-as-a-Judge patterns for quality assessment
17 
18### Part of the Context Engineering Ecosystem
19 
20This project is an example implementation to be added to:
21- 📁 [`Agent-Skills-for-Context-Engineering/examples/`](https://github.com/muratcankoylan/Agent-Skills-for-Context-Engineering/tree/main/examples)
22 
23It builds upon the foundational skills from:
24- 📚 [`skills/context-fundamentals`](https://github.com/muratcankoylan/Agent-Skills-for-Context-Engineering/tree/main/skills/context-fundamentals) - Context engineering principles
25- 🔧 [`skills/tool-design`](https://github.com/muratcankoylan/Agent-Skills-for-Context-Engineering/tree/main/skills/tool-design) - Tool design best practices
26 
27---
28 
29## 📖 Background & Research
30 
31### The LLM-as-a-Judge Problem
32 
33Evaluating AI-generated content is challenging. Traditional metrics (BLEU, ROUGE) often miss nuances that matter. Eugene Yan's research on [LLM-Evaluators](https://eugeneyan.com/writing/llm-evaluators/) identifies practical patterns for using LLMs to judge LLM outputs.
34 
35**Key insights we implemented:**
36 
37| Insight | Implementation |
38|---------|----------------|
39| Direct scoring works best for objective criteria | `directScore` tool with rubric support |
40| Pairwise comparison is more reliable for preferences | `pairwiseCompare` tool with position swapping |
41| Position bias affects pairwise judgments | Automatic position swapping in comparisons |
42| Chain-of-thought improves reliability | All evaluations require justification with evidence |
43| Clear rubrics reduce variance | `generateRubric` tool for consistent standards |
44 
45### Vercel AI SDK 6 Patterns
46 
47We leveraged AI SDK 6's new patterns:
48 
49- **Agent Abstraction**: Reusable `EvaluatorAgent` class with multiple capabilities
50- **Type-safe Tools**: Zod schemas for all inputs/outputs
51- **Structured Output**: JSON responses parsed and validated
52- **Error Handling**: Graceful degradation when API calls fail
53 
54---
55 
56## 🏗️ What We Built
57 
58### Architecture Overview
59 
60```
61┌─────────────────────────────────────────────────────────────────────┐
62│                        LLM-as-a-Judge Skills                         │
63├─────────────────────────────────────────────────────────────────────┤
64│                                                                       │
65│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────┐  │
66│  │   Skills    │    │   Prompts   │    │         Tools           │  │
67│  │  (MD docs)  │───▶│  (templates)│───▶│  (TypeScript impl)      │  │
68│  └─────────────┘    └─────────────┘    └─────────────────────────┘  │
69│         │                                         │                   │
70│         │                                         ▼                   │
71│         │                              ┌─────────────────────────┐  │
72│         └─────────────────────────────▶│    EvaluatorAgent       │  │
73│                                         │  ├── score()            │  │
74│                                         │  ├── compare()          │  │
75│                                         │  ├── generateRubric()   │  │
76│                                         │  └── chat()             │  │
77│                                         └─────────────────────────┘  │
78│                                                     │                 │
79│                                                     ▼                 │
80│                                         ┌─────────────────────────┐  │
81│                                         │   OpenAI GPT-5.2 API     │  │
82│                                         └─────────────────────────┘  │
83│                                                                       │
84└─────────────────────────────────────────────────────────────────────┘
85```
86 
87### Directory Structure
88 
89```
90llm-as-judge-skills/
91├── skills/                          # Foundational knowledge (MD docs)
92│   ├── llm-evaluator/               # LLM-as-a-Judge patterns
93│   │   └── llm-evaluator.md         # Evaluation methods, metrics, bias mitigation
94│   ├── context-fundamentals/        # Context engineering principles
95│   │   └── context-fundamentals.md  # Managing context effectively
96│   └── tool-design/                 # Tool design best practices
97│       └── tool-design.md           # Schema design, error handling
98│
99├── prompts/                         # Prompt templates
100│   ├── evaluation/
101│   │   ├── direct-scoring-prompt.md      # Scoring prompt template
102│   │   └── pairwise-comparison-prompt.md # Comparison prompt template
103│   ├── research/
104│   │   └── research-synthesis-prompt.md
105│   └── agent-system/
106│       └── orchestrator-prompt.md
107│
108├── tools/                           # Tool documentation (MD)
109│   ├── evaluation/
110│   │   ├── direct-score.md          # Direct scoring tool spec
111│   │   ├── pairwise-compare.md      # Pairwise comparison spec
112│   │   └── generate-rubric.md       # Rubric generation spec
113│   ├── research/
114│   │   ├── web-search.md
115│   │   └── read-url.md
116│   └── orchestration/
117│       └── delegate-to-agent.md
118│
119├── agents/                          # Agent documentation (MD)
120│   ├── evaluator-agent/
121│   │   └── evaluator-agent.md
122│   ├── research-agent/
123│   │   └── research-agent.md
124│   └── orchestrator-agent/
125│       └── orchestrator-agent.md
126│
127├── src/                             # TypeScript implementation
128│   ├── tools/evaluation/
129│   │   ├── direct-score.ts          # 165 lines - Direct scoring implementation
130│   │   ├── pairwise-compare.ts      # 255 lines - Pairwise with bias mitigation
131│   │   └── generate-rubric.ts       # 162 lines - Rubric generation
132│   ├── agents/
133│   │   └── evaluator.ts             # 112 lines - EvaluatorAgent class
134│   ├── config/
135│   │   └── index.ts                 # Configuration and validation
136│   └── index.ts                     # Main exports
137│
138├── tests/                           # Test suite
139│   ├── evaluation.test.ts           # 9 tests for tools
140│   ├── skills.test.ts               # 10 tests for skills
141│   └── setup.ts                     # Test configuration
142│
143└── examples/                        # Usage examples
144    ├── basic-evaluation.ts
145    ├── pairwise-comparison.ts
146    ├── generate-rubric.ts
147    └── full-evaluation-workflow.ts
148```
149 
150---
151 
152## 🔧 Core Tools Implemented
153 
154### 1. Direct Score Tool (`directScore`)
155 
156**Purpose**: Evaluate a single response against defined criteria with numerical scores.
157 
158**When to Use**:
159- Factual accuracy checks
160- Instruction following assessment
161- Content quality grading
162- Compliance verification
163 
164**Implementation Highlights**:
165 
166```typescript
167// From src/tools/evaluation/direct-score.ts
168 
169const systemPrompt = `You are an expert evaluator. Assess the response against each criterion.
170For each criterion:
1711. Find specific evidence in the response
1722. Score according to the rubric (1-5 scale)
1733. Justify your score
1744. Suggest one improvement
175 
176Be objective and consistent. Base scores on explicit evidence.`;
177```
178 
179**Key Features**:
180- Weighted criteria support
181- Chain-of-thought justification required
182- Evidence extraction from response
183- Improvement suggestions per criterion
184- Configurable rubrics (1-3, 1-5, 1-10 scales)
185 
186**Example Usage**:
187 
188```typescript
189const result = await executeDirectScore({
190  response: 'Quantum entanglement is like having two magical coins...',
191  prompt: 'Explain quantum entanglement to a high school student',
192  criteria: [
193    { name: 'Accuracy', description: 'Scientific correctness', weight: 0.4 },
194    { name: 'Clarity', description: 'Understandable for audience', weight: 0.3 },
195    { name: 'Engagement', description: 'Interesting and memorable', weight: 0.3 }
196  ],
197  rubric: { scale: '1-5' }
198});
199 
200// Output:
201// {
202//   success: true,
203//   scores: [
204//     { criterion: 'Accuracy', score: 4, justification: '...', evidence: [...] },
205//     { criterion: 'Clarity', score: 5, justification: '...', evidence: [...] },
206//     { criterion: 'Engagement', score: 4, justification: '...', evidence: [...] }
207//   ],
208//   overallScore: 4.33,
209//   weightedScore: 4.3,
210//   summary: { assessment: '...', strengths: [...], weaknesses: [...] }
211// }
212```
213 
214---
215 
216### 2. Pairwise Compare Tool (`pairwiseCompare`)
217 
218**Purpose**: Compare two responses and determine which is better, with position bias mitigation.
219 
220**When to Use**:
221- A/B testing responses
222- Preference evaluation
223- Style and tone assessment
224- Ranking quality differences
225 
226**Implementation Highlights**:
227 
228```typescript
229// Position bias mitigation: evaluate twice with swapped positions
230if (input.swapPositions) {
231  // First pass: A first, B second
232  const pass1 = await evaluatePair(input.responseA, input.responseB, ...);
233  
234  // Second pass: B first, A second
235  const pass2 = await evaluatePair(input.responseB, input.responseA, ...);
236  
237  // Map pass2 result back and check consistency
238  const pass2WinnerMapped = pass2.winner === 'A' ? 'B' : pass2.winner === 'B' ? 'A' : 'TIE';
239  const consistent = pass1.winner === pass2WinnerMapped;
240  
241  // If inconsistent, return TIE with lower confidence
242  if (!consistent) {
243    finalWinner = 'TIE';
244    finalConfidence = 0.5;
245  }
246}
247```
248 
249**Key Features**:
250- **Position Swapping**: Automatically runs evaluation twice with swapped positions
251- **Consistency Check**: Detects when position affects judgment
252- **Confidence Scoring**: 0-1 confidence based on consistency
253- **Per-criterion Comparison**: Detailed breakdown for each aspect
254- **Bias-aware Prompting**: Explicit instructions to ignore length and position
255 
256**Example Usage**:
257 
258```typescript
259const result = await executePairwiseCompare({
260  responseA: GOOD_RESPONSE,
261  responseB: POOR_RESPONSE,
262  prompt: 'Explain quantum entanglement',
263  criteria: ['accuracy', 'clarity', 'completeness', 'engagement'],
264  allowTie: true,
265  swapPositions: true  // Enable position bias mitigation
266});
267 
268// Output:
269// {
270//   success: true,
271//   winner: 'A',
272//   confidence: 0.85,
273//   positionConsistency: { consistent: true, firstPassWinner: 'A', secondPassWinner: 'A' },
274//   comparison: [
275//     { criterion: 'accuracy', winner: 'A', reasoning: '...' },
276//     { criterion: 'clarity', winner: 'A', reasoning: '...' },
277//     ...
278//   ]
279// }
280```
281 
282---
283 
284### 3. Generate Rubric Tool (`generateRubric`)
285 
286**Purpose**: Create detailed scoring rubrics for consistent evaluation standards.
287 
288**When to Use**:
289- Establishing evaluation criteria
290- Training human evaluators
291- Ensuring consistency across evaluations
292- Documenting quality standards
293 
294**Implementation Highlights**:
295 
296```typescript
297// Strictness affects the generated rubric:
298// - lenient: Lower bar for passing scores
299// - balanced: Fair, typical expectations
300// - strict: High standards, critical evaluation
301 
302const userPrompt = `Create a scoring rubric for:
303**Criterion**: ${input.criterionName}
304**Description**: ${input.criterionDescription}
305**Scale**: ${input.scale}
306**Domain**: ${input.domain}
307 
308Generate:
3091. Clear descriptions for each score level
3102. Specific characteristics that define each level
3113. Brief example text for each level
3124. General scoring guidelines
3135. Edge cases with guidance`;
314```
315 
316**Key Features**:
317- Domain-specific terminology
318- Configurable strictness levels
319- Example generation for each level
320- Edge case guidance
321- Scoring guidelines
322 
323**Example Usage**:
324 
325```typescript
326const result = await executeGenerateRubric({
327  criterionName: 'Code Readability',
328  criterionDescription: 'How easy the code is to understand and maintain',
329  scale: '1-5',
330  domain: 'software engineering',
331  includeExamples: true,
332  strictness: 'balanced'
333});
334 
335// Output:
336// {
337//   success: true,
338//   levels: [
339//     { score: 1, label: 'Poor', description: '...', characteristics: [...], example: '...' },
340//     { score: 2, label: 'Below Average', ... },
341//     { score: 3, label: 'Average', ... },
342//     { score: 4, label: 'Good', ... },
343//     { score: 5, label: 'Excellent', ... }
344//   ],
345//   scoringGuidelines: [...],
346//   edgeCases: [{ situation: '...', guidance: '...' }]
347// }
348```
349 
350---
351 
352### 4. Evaluator Agent
353 
354**Purpose**: High-level agent that combines all evaluation tools with conversational capability.
355 
356**Implementation**:
357 
358```typescript
359export class EvaluatorAgent {
360  private model: string;
361  private temperature: number;
362 
363  constructor(config?: EvaluatorAgentConfig) {
364    this.model = config?.model || 'gpt-5.2';
365    this.temperature = config?.temperature || 0.3;
366  }
367 
368  // Score a response
369  async score(input: DirectScoreInput) { ... }
370 
371  // Compare two responses
372  async compare(input: PairwiseCompareInput) { ... }
373 
374  // Generate a rubric
375  async generateRubric(input: GenerateRubricInput) { ... }
376 
377  // Full workflow: generate rubric then score
378  async evaluateWithGeneratedRubric(response, prompt, criteria) { ... }
379 
380  // Chat-based evaluation
381  async chat(userMessage: string) { ... }
382}
383```
384 
385---
386 
387## 📊 Test Results
388 
389All 19 tests pass successfully. Here are the actual test logs from our test run:
390 
391### Test Output
392 
393```
394> [email protected] test
395> vitest run --testTimeout=120000
396 
397 RUN  v2.1.9 /Users/muratcankoylan/app_readwren
398 
399 ✓ tests/skills.test.ts (10 tests) 159317ms
400   ✓ LLM Evaluator Skill Tests > Direct Scoring Skill > should use chain-of-thought in scoring 4439ms
401   ✓ LLM Evaluator Skill Tests > Direct Scoring Skill > should handle multiple weighted criteria 7218ms
402   ✓ LLM Evaluator Skill Tests > Pairwise Comparison Skill > should mitigate position bias with swap 13002ms
403   ✓ LLM Evaluator Skill Tests > Pairwise Comparison Skill > should identify clear winner for quality difference 25914ms
404   ✓ LLM Evaluator Skill Tests > Rubric Generation Skill > should generate domain-specific rubrics 37165ms
405   ✓ LLM Evaluator Skill Tests > Rubric Generation Skill > should provide edge case guidance 29088ms
406   ✓ LLM Evaluator Skill Tests > Context Fundamentals Skill Application > should utilize provided context in evaluation 11133ms
407   ✓ Skill Input/Output Validation > should validate DirectScore input schema 4733ms
408   ✓ Skill Input/Output Validation > should validate PairwiseCompare output structure 4123ms
409   ✓ Skill Input/Output Validation > should validate GenerateRubric output structure 22500ms
410 
411 ✓ tests/evaluation.test.ts (9 tests) 216353ms
412   ✓ Direct Score Tool > should score a response against criteria 13219ms
413   ✓ Direct Score Tool > should provide lower scores for poor responses 14834ms
414   ✓ Pairwise Compare Tool > should correctly identify the better response 29254ms
415   ✓ Pairwise Compare Tool > should handle similar responses appropriately 14418ms
416   ✓ Pairwise Compare Tool > should provide comparison details for each criterion 9931ms
417   ✓ Generate Rubric Tool > should generate a complete rubric 24106ms
418   ✓ Generate Rubric Tool > should respect strictness setting 57919ms
419   ✓ Evaluator Agent > should provide integrated evaluation workflow 48112ms
420   ✓ Evaluator Agent > should support chat-based evaluation 4558ms
421 
422 Test Files  2 passed (2)
423      Tests  19 passed (19)
424   Start at  00:25:16
425   Duration  216.66s (transform 68ms, setup 32ms, collect 148ms, tests 375.67s, environment 0ms, prepare 105ms)
426```
427 
428### Test Coverage Summary
429 
430| Test Category | Tests | Pass Rate | Avg Duration |
431|--------------|-------|-----------|--------------|
432| Direct Scoring | 4 | 100% | 9.9s |
433| Pairwise Comparison | 4 | 100% | 17.9s |
434| Rubric Generation | 4 | 100% | 33.2s |
435| Context Integration | 1 | 100% | 11.1s |
436| Agent Integration | 2 | 100% | 26.3s |
437| Schema Validation | 4 | 100% | 8.8s |
438 
439---
440 
441## 📚 Key Learnings
442 
443### 1. Position Bias is Real
444 
445During testing, we confirmed Eugene Yan's research findings:
446 
447```
448Test: "should mitigate position bias with swap" - 13002ms
449Result: Position consistency check correctly detected and mitigated bias
450```
451 
452When comparing identical responses, the system correctly returns `TIE`. When comparing clearly different quality responses, the winner is consistent across position swaps.
453 
454### 2. Chain-of-Thought Improves Quality
455 
456Tests confirm that requiring justification produces more reliable evaluations:
457 
458```
459Test: "should use chain-of-thought in scoring" - 4439ms
460Result: All scores include justifications >20 characters with specific evidence
461```
462 
463### 3. Domain-Specific Rubrics Matter
464 
465The rubric generator adapts to the specified domain:
466 
467```
468Test: "should generate domain-specific rubrics" - 37165ms
469Result: Software engineering rubric included terms like "variable", "function", "comment"
470```
471 
472### 4. Weighted Criteria Enable Nuanced Evaluation
473 
474```
475Test: "should handle multiple weighted criteria" - 7218ms
476Result: weightedScore differs from overallScore when weights are unequal
477```
478 
479### 5. Context Affects Evaluation
480 
481The context fundamentals skill proves valuable:
482 
483```
484Test: "should utilize provided context in evaluation" - 11133ms
485Result: Medical context allowed technical terminology to score well
486```
487 
488---
489 
490## 🚀 Quick Start
491 
492### Installation
493 
494```bash
495git clone https://github.com/muratcankoylan/llm-as-judge-skills.git
496cd llm-as-judge-skills
497npm install
498```
499 
500### Configuration
501 
502Create a `.env` file:
503 
504```bash
505OPENAI_API_KEY=your_openai_api_key_here
506OPENAI_MODEL=gpt-5.2  
507```
508 
509### Run Tests
510 
511```bash
512npm test
513```
514 
515### Basic Usage
516 
517```typescript
518import { EvaluatorAgent } from './src/agents/evaluator';
519 
520const agent = new EvaluatorAgent();
521 
522// Score a response
523const scoreResult = await agent.score({
524  response: 'Your AI-generated response',
525  prompt: 'The original prompt',
526  criteria: [
527    { name: 'Accuracy', description: 'Factual correctness', weight: 1 }
528  ]
529});
530 
531console.log(`Score: ${scoreResult.overallScore}/5`);
532 
533// Compare two responses
534const compareResult = await agent.compare({
535  responseA: 'First response',
536  responseB: 'Second response',
537  prompt: 'The prompt',
538  criteria: ['quality', 'completeness'],
539  allowTie: true,
540  swapPositions: true
541});
542 
543console.log(`Winner: ${compareResult.winner} (confidence: ${compareResult.confidence})`);
544```
545 
546---
547 
548## 🔗 Integration with Agent Skills Repository
549 
550This project is designed to be added to the examples section of the main repository:
551 
552```
553Agent-Skills-for-Context-Engineering/
554├── skills/
555│   ├── context-fundamentals/     # Foundation (referenced by this project)
556│   └── tool-design/              # Foundation (referenced by this project)
557├── examples/
558│   └── llm-as-judge-skills/      # ← This project
559│       ├── README.md
560│       ├── skills/
561│       ├── tools/
562│       ├── agents/
563│       └── src/
564```
565 
566### How This Example Demonstrates the Framework
567 
5681. **Skills → Prompts → Tools**: Shows the progression from knowledge (MD files) to executable code
5692. **Context Engineering**: Applies context fundamentals in evaluation prompts
5703. **Tool Design Patterns**: Implements Zod schemas, error handling, and clear interfaces
5714. **Agent Architecture**: Uses AI SDK patterns for agent abstraction
572 
573---
574 
575## 📋 API Reference
576 
577### DirectScoreInput
578 
579```typescript
580interface DirectScoreInput {
581  response: string;              // The response to evaluate
582  prompt: string;                // Original prompt
583  context?: string;              // Additional context
584  criteria: Array<{
585    name: string;                // Criterion name
586    description: string;         // What it measures
587    weight: number;              // Relative importance (0-1)
588  }>;
589  rubric?: {
590    scale: '1-3' | '1-5' | '1-10';
591    levelDescriptions?: Record<string, string>;
592  };
593}
594```
595 
596### PairwiseCompareInput
597 
598```typescript
599interface PairwiseCompareInput {
600  responseA: string;             // First response
601  responseB: string;             // Second response
602  prompt: string;                // Original prompt
603  context?: string;              // Additional context
604  criteria: string[];            // Comparison aspects
605  allowTie?: boolean;            // Allow tie verdict (default: true)
606  swapPositions?: boolean;       // Mitigate position bias (default: true)
607}
608```
609 
610### GenerateRubricInput
611 
612```typescript
613interface GenerateRubricInput {
614  criterionName: string;         // Name of criterion
615  criterionDescription: string;  // What it measures
616  scale?: '1-3' | '1-5' | '1-10';
617  domain?: string;               // Domain for terminology
618  includeExamples?: boolean;     // Generate examples
619  strictness?: 'lenient' | 'balanced' | 'strict';
620}
621```
622 
623---
624 
625## 🛠️ Development
626 
627### Scripts
628 
629```bash
630npm run build       # Compile TypeScript
631npm run dev         # Watch mode
632npm test            # Run tests
633npm run lint        # ESLint
634npm run format      # Prettier
635npm run typecheck   # Type check
636```
637 
638### Adding New Tools
639 
6401. Create `src/tools/<category>/<tool-name>.ts`
6412. Define input/output Zod schemas
6423. Implement execute function
6434. Export from `src/tools/<category>/index.ts`
6445. Add documentation in `tools/<category>/<tool-name>.md`
6456. Write tests
646 
647---
648 
649## 📄 License
650 
651MIT License - see [LICENSE](LICENSE) for details.
652 
653---
654 
655## 🙏 Acknowledgments
656 
657- [Eugene Yan](https://eugeneyan.com/writing/llm-evaluators/) - LLM-as-a-Judge research
658- [Vercel AI SDK](https://sdk.vercel.ai/) - Agent patterns and tooling
659- [Agent Skills for Context Engineering](https://github.com/muratcankoylan/Agent-Skills-for-Context-Engineering) - Foundation framework
660

Marketplace

Source from repo

Agent Skills for Context Engineering

A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems.

muratcankoylanGitHub muratcankoylanSource repo Original GitHub link

Files

241

Skill

n/a

Size

2.6 MB

Entrypoint

SKILL.md

Format

git-repo

Open file

examples/llm-as-judge-skills/README.md

Syntax-highlighted preview of this file as included in the skill package.

Rendered Source

markdown660 linesFree

examples/llm-as-judge-skills/README.md

1# LLM-as-a-Judge Skills
2 
3> A practical implementation of LLM evaluation skills built using insights from [Eugene Yan's LLM-Evaluators research](https://eugeneyan.com/writing/llm-evaluators/) and [Vercel AI SDK 6](https://vercel.com/blog/ai-sdk-6).
4 
5[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
6[![TypeScript](https://img.shields.io/badge/TypeScript-5.6-blue.svg)](https://www.typescriptlang.org/)
7[![AI SDK](https://img.shields.io/badge/AI%20SDK-4.1-green.svg)](https://sdk.vercel.ai/)
8[![Tests](https://img.shields.io/badge/Tests-19%20passed-brightgreen.svg)](#test-results)
9 
10## 🎯 Purpose
11 
12This repository demonstrates how to build **production-ready LLM evaluation skills** as part of the [Agent Skills for Context Engineering](https://github.com/muratcankoylan/Agent-Skills-for-Context-Engineering) project. It serves as a practical example of:
13 
141. **Skill Development**: How to transform research insights into executable agent skills
152. **Tool Design**: Best practices for building AI tools with proper schemas and error handling
163. **Evaluation Patterns**: Implementation of LLM-as-a-Judge patterns for quality assessment
17 
18### Part of the Context Engineering Ecosystem
19 
20This project is an example implementation to be added to:
21- 📁 [`Agent-Skills-for-Context-Engineering/examples/`](https://github.com/muratcankoylan/Agent-Skills-for-Context-Engineering/tree/main/examples)
22 
23It builds upon the foundational skills from:
24- 📚 [`skills/context-fundamentals`](https://github.com/muratcankoylan/Agent-Skills-for-Context-Engineering/tree/main/skills/context-fundamentals) - Context engineering principles
25- 🔧 [`skills/tool-design`](https://github.com/muratcankoylan/Agent-Skills-for-Context-Engineering/tree/main/skills/tool-design) - Tool design best practices
26 
27---
28 
29## 📖 Background & Research
30 
31### The LLM-as-a-Judge Problem
32 
33Evaluating AI-generated content is challenging. Traditional metrics (BLEU, ROUGE) often miss nuances that matter. Eugene Yan's research on [LLM-Evaluators](https://eugeneyan.com/writing/llm-evaluators/) identifies practical patterns for using LLMs to judge LLM outputs.
34 
35**Key insights we implemented:**
36 
37| Insight | Implementation |
38|---------|----------------|
39| Direct scoring works best for objective criteria | `directScore` tool with rubric support |
40| Pairwise comparison is more reliable for preferences | `pairwiseCompare` tool with position swapping |
41| Position bias affects pairwise judgments | Automatic position swapping in comparisons |
42| Chain-of-thought improves reliability | All evaluations require justification with evidence |
43| Clear rubrics reduce variance | `generateRubric` tool for consistent standards |
44 
45### Vercel AI SDK 6 Patterns
46 
47We leveraged AI SDK 6's new patterns:
48 
49- **Agent Abstraction**: Reusable `EvaluatorAgent` class with multiple capabilities
50- **Type-safe Tools**: Zod schemas for all inputs/outputs
51- **Structured Output**: JSON responses parsed and validated
52- **Error Handling**: Graceful degradation when API calls fail
53 
54---
55 
56## 🏗️ What We Built
57 
58### Architecture Overview
59 
60```
61┌─────────────────────────────────────────────────────────────────────┐
62│                        LLM-as-a-Judge Skills                         │
63├─────────────────────────────────────────────────────────────────────┤
64│                                                                       │
65│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────┐  │
66│  │   Skills    │    │   Prompts   │    │         Tools           │  │
67│  │  (MD docs)  │───▶│  (templates)│───▶│  (TypeScript impl)      │  │
68│  └─────────────┘    └─────────────┘    └─────────────────────────┘  │
69│         │                                         │                   │
70│         │                                         ▼                   │
71│         │                              ┌─────────────────────────┐  │
72│         └─────────────────────────────▶│    EvaluatorAgent       │  │
73│                                         │  ├── score()            │  │
74│                                         │  ├── compare()          │  │
75│                                         │  ├── generateRubric()   │  │
76│                                         │  └── chat()             │  │
77│                                         └─────────────────────────┘  │
78│                                                     │                 │
79│                                                     ▼                 │
80│                                         ┌─────────────────────────┐  │
81│                                         │   OpenAI GPT-5.2 API     │  │
82│                                         └─────────────────────────┘  │
83│                                                                       │
84└─────────────────────────────────────────────────────────────────────┘
85```
86 
87### Directory Structure
88 
89```
90llm-as-judge-skills/
91├── skills/                          # Foundational knowledge (MD docs)
92│   ├── llm-evaluator/               # LLM-as-a-Judge patterns
93│   │   └── llm-evaluator.md         # Evaluation methods, metrics, bias mitigation
94│   ├── context-fundamentals/        # Context engineering principles
95│   │   └── context-fundamentals.md  # Managing context effectively
96│   └── tool-design/                 # Tool design best practices
97│       └── tool-design.md           # Schema design, error handling
98│
99├── prompts/                         # Prompt templates
100│   ├── evaluation/
101│   │   ├── direct-scoring-prompt.md      # Scoring prompt template
102│   │   └── pairwise-comparison-prompt.md # Comparison prompt template
103│   ├── research/
104│   │   └── research-synthesis-prompt.md
105│   └── agent-system/
106│       └── orchestrator-prompt.md
107│
108├── tools/                           # Tool documentation (MD)
109│   ├── evaluation/
110│   │   ├── direct-score.md          # Direct scoring tool spec
111│   │   ├── pairwise-compare.md      # Pairwise comparison spec
112│   │   └── generate-rubric.md       # Rubric generation spec
113│   ├── research/
114│   │   ├── web-search.md
115│   │   └── read-url.md
116│   └── orchestration/
117│       └── delegate-to-agent.md
118│
119├── agents/                          # Agent documentation (MD)
120│   ├── evaluator-agent/
121│   │   └── evaluator-agent.md
122│   ├── research-agent/
123│   │   └── research-agent.md
124│   └── orchestrator-agent/
125│       └── orchestrator-agent.md
126│
127├── src/                             # TypeScript implementation
128│   ├── tools/evaluation/
129│   │   ├── direct-score.ts          # 165 lines - Direct scoring implementation
130│   │   ├── pairwise-compare.ts      # 255 lines - Pairwise with bias mitigation
131│   │   └── generate-rubric.ts       # 162 lines - Rubric generation
132│   ├── agents/
133│   │   └── evaluator.ts             # 112 lines - EvaluatorAgent class
134│   ├── config/
135│   │   └── index.ts                 # Configuration and validation
136│   └── index.ts                     # Main exports
137│
138├── tests/                           # Test suite
139│   ├── evaluation.test.ts           # 9 tests for tools
140│   ├── skills.test.ts               # 10 tests for skills
141│   └── setup.ts                     # Test configuration
142│
143└── examples/                        # Usage examples
144    ├── basic-evaluation.ts
145    ├── pairwise-comparison.ts
146    ├── generate-rubric.ts
147    └── full-evaluation-workflow.ts
148```
149 
150---
151 
152## 🔧 Core Tools Implemented
153 
154### 1. Direct Score Tool (`directScore`)
155 
156**Purpose**: Evaluate a single response against defined criteria with numerical scores.
157 
158**When to Use**:
159- Factual accuracy checks
160- Instruction following assessment
161- Content quality grading
162- Compliance verification
163 
164**Implementation Highlights**:
165 
166```typescript
167// From src/tools/evaluation/direct-score.ts
168 
169const systemPrompt = `You are an expert evaluator. Assess the response against each criterion.
170For each criterion:
1711. Find specific evidence in the response
1722. Score according to the rubric (1-5 scale)
1733. Justify your score
1744. Suggest one improvement
175 
176Be objective and consistent. Base scores on explicit evidence.`;
177```
178 
179**Key Features**:
180- Weighted criteria support
181- Chain-of-thought justification required
182- Evidence extraction from response
183- Improvement suggestions per criterion
184- Configurable rubrics (1-3, 1-5, 1-10 scales)
185 
186**Example Usage**:
187 
188```typescript
189const result = await executeDirectScore({
190  response: 'Quantum entanglement is like having two magical coins...',
191  prompt: 'Explain quantum entanglement to a high school student',
192  criteria: [
193    { name: 'Accuracy', description: 'Scientific correctness', weight: 0.4 },
194    { name: 'Clarity', description: 'Understandable for audience', weight: 0.3 },
195    { name: 'Engagement', description: 'Interesting and memorable', weight: 0.3 }
196  ],
197  rubric: { scale: '1-5' }
198});
199 
200// Output:
201// {
202//   success: true,
203//   scores: [
204//     { criterion: 'Accuracy', score: 4, justification: '...', evidence: [...] },
205//     { criterion: 'Clarity', score: 5, justification: '...', evidence: [...] },
206//     { criterion: 'Engagement', score: 4, justification: '...', evidence: [...] }
207//   ],
208//   overallScore: 4.33,
209//   weightedScore: 4.3,
210//   summary: { assessment: '...', strengths: [...], weaknesses: [...] }
211// }
212```
213 
214---
215 
216### 2. Pairwise Compare Tool (`pairwiseCompare`)
217 
218**Purpose**: Compare two responses and determine which is better, with position bias mitigation.
219 
220**When to Use**:
221- A/B testing responses
222- Preference evaluation
223- Style and tone assessment
224- Ranking quality differences
225 
226**Implementation Highlights**:
227 
228```typescript
229// Position bias mitigation: evaluate twice with swapped positions
230if (input.swapPositions) {
231  // First pass: A first, B second
232  const pass1 = await evaluatePair(input.responseA, input.responseB, ...);
233  
234  // Second pass: B first, A second
235  const pass2 = await evaluatePair(input.responseB, input.responseA, ...);
236  
237  // Map pass2 result back and check consistency
238  const pass2WinnerMapped = pass2.winner === 'A' ? 'B' : pass2.winner === 'B' ? 'A' : 'TIE';
239  const consistent = pass1.winner === pass2WinnerMapped;
240  
241  // If inconsistent, return TIE with lower confidence
242  if (!consistent) {
243    finalWinner = 'TIE';
244    finalConfidence = 0.5;
245  }
246}
247```
248 
249**Key Features**:
250- **Position Swapping**: Automatically runs evaluation twice with swapped positions
251- **Consistency Check**: Detects when position affects judgment
252- **Confidence Scoring**: 0-1 confidence based on consistency
253- **Per-criterion Comparison**: Detailed breakdown for each aspect
254- **Bias-aware Prompting**: Explicit instructions to ignore length and position
255 
256**Example Usage**:
257 
258```typescript
259const result = await executePairwiseCompare({
260  responseA: GOOD_RESPONSE,
261  responseB: POOR_RESPONSE,
262  prompt: 'Explain quantum entanglement',
263  criteria: ['accuracy', 'clarity', 'completeness', 'engagement'],
264  allowTie: true,
265  swapPositions: true  // Enable position bias mitigation
266});
267 
268// Output:
269// {
270//   success: true,
271//   winner: 'A',
272//   confidence: 0.85,
273//   positionConsistency: { consistent: true, firstPassWinner: 'A', secondPassWinner: 'A' },
274//   comparison: [
275//     { criterion: 'accuracy', winner: 'A', reasoning: '...' },
276//     { criterion: 'clarity', winner: 'A', reasoning: '...' },
277//     ...
278//   ]
279// }
280```
281 
282---
283 
284### 3. Generate Rubric Tool (`generateRubric`)
285 
286**Purpose**: Create detailed scoring rubrics for consistent evaluation standards.
287 
288**When to Use**:
289- Establishing evaluation criteria
290- Training human evaluators
291- Ensuring consistency across evaluations
292- Documenting quality standards
293 
294**Implementation Highlights**:
295 
296```typescript
297// Strictness affects the generated rubric:
298// - lenient: Lower bar for passing scores
299// - balanced: Fair, typical expectations
300// - strict: High standards, critical evaluation
301 
302const userPrompt = `Create a scoring rubric for:
303**Criterion**: ${input.criterionName}
304**Description**: ${input.criterionDescription}
305**Scale**: ${input.scale}
306**Domain**: ${input.domain}
307 
308Generate:
3091. Clear descriptions for each score level
3102. Specific characteristics that define each level
3113. Brief example text for each level
3124. General scoring guidelines
3135. Edge cases with guidance`;
314```
315 
316**Key Features**:
317- Domain-specific terminology
318- Configurable strictness levels
319- Example generation for each level
320- Edge case guidance
321- Scoring guidelines
322 
323**Example Usage**:
324 
325```typescript
326const result = await executeGenerateRubric({
327  criterionName: 'Code Readability',
328  criterionDescription: 'How easy the code is to understand and maintain',
329  scale: '1-5',
330  domain: 'software engineering',
331  includeExamples: true,
332  strictness: 'balanced'
333});
334 
335// Output:
336// {
337//   success: true,
338//   levels: [
339//     { score: 1, label: 'Poor', description: '...', characteristics: [...], example: '...' },
340//     { score: 2, label: 'Below Average', ... },
341//     { score: 3, label: 'Average', ... },
342//     { score: 4, label: 'Good', ... },
343//     { score: 5, label: 'Excellent', ... }
344//   ],
345//   scoringGuidelines: [...],
346//   edgeCases: [{ situation: '...', guidance: '...' }]
347// }
348```
349 
350---
351 
352### 4. Evaluator Agent
353 
354**Purpose**: High-level agent that combines all evaluation tools with conversational capability.
355 
356**Implementation**:
357 
358```typescript
359export class EvaluatorAgent {
360  private model: string;
361  private temperature: number;
362 
363  constructor(config?: EvaluatorAgentConfig) {
364    this.model = config?.model || 'gpt-5.2';
365    this.temperature = config?.temperature || 0.3;
366  }
367 
368  // Score a response
369  async score(input: DirectScoreInput) { ... }
370 
371  // Compare two responses
372  async compare(input: PairwiseCompareInput) { ... }
373 
374  // Generate a rubric
375  async generateRubric(input: GenerateRubricInput) { ... }
376 
377  // Full workflow: generate rubric then score
378  async evaluateWithGeneratedRubric(response, prompt, criteria) { ... }
379 
380  // Chat-based evaluation
381  async chat(userMessage: string) { ... }
382}
383```
384 
385---
386 
387## 📊 Test Results
388 
389All 19 tests pass successfully. Here are the actual test logs from our test run:
390 
391### Test Output
392 
393```
394> [email protected] test
395> vitest run --testTimeout=120000
396 
397 RUN  v2.1.9 /Users/muratcankoylan/app_readwren
398 
399 ✓ tests/skills.test.ts (10 tests) 159317ms
400   ✓ LLM Evaluator Skill Tests > Direct Scoring Skill > should use chain-of-thought in scoring 4439ms
401   ✓ LLM Evaluator Skill Tests > Direct Scoring Skill > should handle multiple weighted criteria 7218ms
402   ✓ LLM Evaluator Skill Tests > Pairwise Comparison Skill > should mitigate position bias with swap 13002ms
403   ✓ LLM Evaluator Skill Tests > Pairwise Comparison Skill > should identify clear winner for quality difference 25914ms
404   ✓ LLM Evaluator Skill Tests > Rubric Generation Skill > should generate domain-specific rubrics 37165ms
405   ✓ LLM Evaluator Skill Tests > Rubric Generation Skill > should provide edge case guidance 29088ms
406   ✓ LLM Evaluator Skill Tests > Context Fundamentals Skill Application > should utilize provided context in evaluation 11133ms
407   ✓ Skill Input/Output Validation > should validate DirectScore input schema 4733ms
408   ✓ Skill Input/Output Validation > should validate PairwiseCompare output structure 4123ms
409   ✓ Skill Input/Output Validation > should validate GenerateRubric output structure 22500ms
410 
411 ✓ tests/evaluation.test.ts (9 tests) 216353ms
412   ✓ Direct Score Tool > should score a response against criteria 13219ms
413   ✓ Direct Score Tool > should provide lower scores for poor responses 14834ms
414   ✓ Pairwise Compare Tool > should correctly identify the better response 29254ms
415   ✓ Pairwise Compare Tool > should handle similar responses appropriately 14418ms
416   ✓ Pairwise Compare Tool > should provide comparison details for each criterion 9931ms
417   ✓ Generate Rubric Tool > should generate a complete rubric 24106ms
418   ✓ Generate Rubric Tool > should respect strictness setting 57919ms
419   ✓ Evaluator Agent > should provide integrated evaluation workflow 48112ms
420   ✓ Evaluator Agent > should support chat-based evaluation 4558ms
421 
422 Test Files  2 passed (2)
423      Tests  19 passed (19)
424   Start at  00:25:16
425   Duration  216.66s (transform 68ms, setup 32ms, collect 148ms, tests 375.67s, environment 0ms, prepare 105ms)
426```
427 
428### Test Coverage Summary
429 
430| Test Category | Tests | Pass Rate | Avg Duration |
431|--------------|-------|-----------|--------------|
432| Direct Scoring | 4 | 100% | 9.9s |
433| Pairwise Comparison | 4 | 100% | 17.9s |
434| Rubric Generation | 4 | 100% | 33.2s |
435| Context Integration | 1 | 100% | 11.1s |
436| Agent Integration | 2 | 100% | 26.3s |
437| Schema Validation | 4 | 100% | 8.8s |
438 
439---
440 
441## 📚 Key Learnings
442 
443### 1. Position Bias is Real
444 
445During testing, we confirmed Eugene Yan's research findings:
446 
447```
448Test: "should mitigate position bias with swap" - 13002ms
449Result: Position consistency check correctly detected and mitigated bias
450```
451 
452When comparing identical responses, the system correctly returns `TIE`. When comparing clearly different quality responses, the winner is consistent across position swaps.
453 
454### 2. Chain-of-Thought Improves Quality
455 
456Tests confirm that requiring justification produces more reliable evaluations:
457 
458```
459Test: "should use chain-of-thought in scoring" - 4439ms
460Result: All scores include justifications >20 characters with specific evidence
461```
462 
463### 3. Domain-Specific Rubrics Matter
464 
465The rubric generator adapts to the specified domain:
466 
467```
468Test: "should generate domain-specific rubrics" - 37165ms
469Result: Software engineering rubric included terms like "variable", "function", "comment"
470```
471 
472### 4. Weighted Criteria Enable Nuanced Evaluation
473 
474```
475Test: "should handle multiple weighted criteria" - 7218ms
476Result: weightedScore differs from overallScore when weights are unequal
477```
478 
479### 5. Context Affects Evaluation
480 
481The context fundamentals skill proves valuable:
482 
483```
484Test: "should utilize provided context in evaluation" - 11133ms
485Result: Medical context allowed technical terminology to score well
486```
487 
488---
489 
490## 🚀 Quick Start
491 
492### Installation
493 
494```bash
495git clone https://github.com/muratcankoylan/llm-as-judge-skills.git
496cd llm-as-judge-skills
497npm install
498```
499 
500### Configuration
501 
502Create a `.env` file:
503 
504```bash
505OPENAI_API_KEY=your_openai_api_key_here
506OPENAI_MODEL=gpt-5.2  
507```
508 
509### Run Tests
510 
511```bash
512npm test
513```
514 
515### Basic Usage
516 
517```typescript
518import { EvaluatorAgent } from './src/agents/evaluator';
519 
520const agent = new EvaluatorAgent();
521 
522// Score a response
523const scoreResult = await agent.score({
524  response: 'Your AI-generated response',
525  prompt: 'The original prompt',
526  criteria: [
527    { name: 'Accuracy', description: 'Factual correctness', weight: 1 }
528  ]
529});
530 
531console.log(`Score: ${scoreResult.overallScore}/5`);
532 
533// Compare two responses
534const compareResult = await agent.compare({
535  responseA: 'First response',
536  responseB: 'Second response',
537  prompt: 'The prompt',
538  criteria: ['quality', 'completeness'],
539  allowTie: true,
540  swapPositions: true
541});
542 
543console.log(`Winner: ${compareResult.winner} (confidence: ${compareResult.confidence})`);
544```
545 
546---
547 
548## 🔗 Integration with Agent Skills Repository
549 
550This project is designed to be added to the examples section of the main repository:
551 
552```
553Agent-Skills-for-Context-Engineering/
554├── skills/
555│   ├── context-fundamentals/     # Foundation (referenced by this project)
556│   └── tool-design/              # Foundation (referenced by this project)
557├── examples/
558│   └── llm-as-judge-skills/      # ← This project
559│       ├── README.md
560│       ├── skills/
561│       ├── tools/
562│       ├── agents/
563│       └── src/
564```
565 
566### How This Example Demonstrates the Framework
567 
5681. **Skills → Prompts → Tools**: Shows the progression from knowledge (MD files) to executable code
5692. **Context Engineering**: Applies context fundamentals in evaluation prompts
5703. **Tool Design Patterns**: Implements Zod schemas, error handling, and clear interfaces
5714. **Agent Architecture**: Uses AI SDK patterns for agent abstraction
572 
573---
574 
575## 📋 API Reference
576 
577### DirectScoreInput
578 
579```typescript
580interface DirectScoreInput {
581  response: string;              // The response to evaluate
582  prompt: string;                // Original prompt
583  context?: string;              // Additional context
584  criteria: Array<{
585    name: string;                // Criterion name
586    description: string;         // What it measures
587    weight: number;              // Relative importance (0-1)
588  }>;
589  rubric?: {
590    scale: '1-3' | '1-5' | '1-10';
591    levelDescriptions?: Record<string, string>;
592  };
593}
594```
595 
596### PairwiseCompareInput
597 
598```typescript
599interface PairwiseCompareInput {
600  responseA: string;             // First response
601  responseB: string;             // Second response
602  prompt: string;                // Original prompt
603  context?: string;              // Additional context
604  criteria: string[];            // Comparison aspects
605  allowTie?: boolean;            // Allow tie verdict (default: true)
606  swapPositions?: boolean;       // Mitigate position bias (default: true)
607}
608```
609 
610### GenerateRubricInput
611 
612```typescript
613interface GenerateRubricInput {
614  criterionName: string;         // Name of criterion
615  criterionDescription: string;  // What it measures
616  scale?: '1-3' | '1-5' | '1-10';
617  domain?: string;               // Domain for terminology
618  includeExamples?: boolean;     // Generate examples
619  strictness?: 'lenient' | 'balanced' | 'strict';
620}
621```
622 
623---
624 
625## 🛠️ Development
626 
627### Scripts
628 
629```bash
630npm run build       # Compile TypeScript
631npm run dev         # Watch mode
632npm test            # Run tests
633npm run lint        # ESLint
634npm run format      # Prettier
635npm run typecheck   # Type check
636```
637 
638### Adding New Tools
639 
6401. Create `src/tools/<category>/<tool-name>.ts`
6412. Define input/output Zod schemas
6423. Implement execute function
6434. Export from `src/tools/<category>/index.ts`
6445. Add documentation in `tools/<category>/<tool-name>.md`
6456. Write tests
646 
647---
648 
649## 📄 License
650 
651MIT License - see [LICENSE](LICENSE) for details.
652 
653---
654 
655## 🙏 Acknowledgments
656 
657- [Eugene Yan](https://eugeneyan.com/writing/llm-evaluators/) - LLM-as-a-Judge research
658- [Vercel AI SDK](https://sdk.vercel.ai/) - Agent patterns and tooling
659- [Agent Skills for Context Engineering](https://github.com/muratcankoylan/Agent-Skills-for-Context-Engineering) - Foundation framework
660

Agent Skills for Context Engineering

examples/llm-as-judge-skills/README.md

Preparing the source view

Agent Skills for Context Engineering

examples/llm-as-judge-skills/README.md