`examples/llm-as-judge-skills/agents/evaluator-agent/evaluator-agent.md`
# Evaluator Agent

## Purpose

The Evaluator Agent assesses the quality of LLM-generated responses using configurable evaluation criteria. It implements the LLM-as-a-Judge pattern with support for both direct scoring and pairwise comparison.

## Agent Definition

```typescript
import { ToolLoopAgent } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { evaluationTools } from "../tools";

export const evaluatorAgent = new ToolLoopAgent({
  name: "evaluator",
  model: anthropic("claude-sonnet-4-20250514"),
  instructions: `You are an expert evaluator of LLM-generated content.

Your role is to:
1. Assess response quality against specific criteria
2. Provide structured scores with justifications
3. Identify specific issues and strengths
4. Compare responses when asked for pairwise evaluation

Evaluation Guidelines:
- Be objective and consistent in your assessments
- Ground evaluations in specific evidence from the response
- Consider the context and requirements of the original task
- Avoid position bias: evaluate content, not placement
- Do not favor verbose responses unless verbosity adds value

Always provide:
- Numerical scores for each criterion
- Specific examples supporting your assessment
- Actionable feedback for improvement`,

  tools: {
    directScore: evaluationTools.directScore,
    pairwiseCompare: evaluationTools.pairwiseCompare,
    extractCriteria: evaluationTools.extractCriteria,
    generateRubric: evaluationTools.generateRubric
  }
});
```

## Capabilities

### Direct Scoring

Evaluate a single response against defined criteria and a scoring rubric.

**Input:**
- Response to evaluate
- Original prompt/context
- Evaluation criteria
- Scoring rubric

**Output:**
- Score per criterion (1-5)
- Overall score
- Detailed justification
- Identified issues and strengths

### Pairwise Comparison

Compare two responses and select the better one.

**Input:**
- Response A
- Response B
- Original prompt/context
- Comparison criteria

**Output:**
- Winner selection (A, B, or Tie)
- Confidence score
- Comparative analysis
- Specific differentiators

### Criteria Extraction

Automatically extract evaluation criteria from a task description.

**Input:**
- Task description
- Domain context
- Quality expectations

**Output:**
- List of relevant criteria
- Criterion descriptions
- Suggested weights

### Rubric Generation

Generate a scoring rubric for specific criteria.

**Input:**
- Criterion name
- Quality dimensions
- Scale (default 1-5)

**Output:**
- Rubric with score descriptions
- Examples for each level
- Edge case guidance

## Configuration

```typescript
interface EvaluatorConfig {
  // Scoring configuration
  scoringMode: "direct" | "pairwise";
  useChainOfThought: boolean;
  nShotExamples: number;

  // Bias mitigation
  swapPositionsForPairwise: boolean;
  normalizeForLength: boolean;

  // Output configuration
  includeJustification: boolean;
  includeExamples: boolean;
  outputFormat: "structured" | "prose";
}

const defaultConfig: EvaluatorConfig = {
  scoringMode: "direct",
  useChainOfThought: true,
  nShotExamples: 2,
  swapPositionsForPairwise: true,
  normalizeForLength: false,
  includeJustification: true,
  includeExamples: true,
  outputFormat: "structured"
};
```
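The `swapPositionsForPairwise` flag is worth a concrete illustration. Below is a minimal sketch of how a calling harness might apply it: run the comparison twice with the candidate order reversed, and only accept a winner when both passes agree. The `runPairwise` helper, the assumption that the generate result exposes the final text as `result.text`, and the regex-based verdict parsing are all illustrative placeholders, not part of this skill's API.

```typescript
import { evaluatorAgent } from "./agents/evaluator-agent";

type Verdict = "A" | "B" | "Tie";

// Hypothetical helper: compares two responses in the given order and pulls a
// one-word verdict out of the agent's reply. Parsing is simplified for illustration.
async function runPairwise(question: string, first: string, second: string): Promise<Verdict> {
  const result = await evaluatorAgent.generate({
    prompt: `Question: "${question}"

Response A: "${first}"

Response B: "${second}"

Which response is better? Answer with exactly one word: A, B, or Tie.`
  });
  const match = result.text.match(/\b(A|B|Tie)\b/);
  return (match?.[0] as Verdict) ?? "Tie";
}

// Position-swapped comparison: accept a winner only if it survives reordering.
export async function compareUnbiased(question: string, a: string, b: string): Promise<Verdict> {
  const pass1 = await runPairwise(question, a, b); // a presented first
  const pass2 = await runPairwise(question, b, a); // order swapped
  // Map the second verdict back to the original labels before comparing.
  const pass2Mapped: Verdict = pass2 === "A" ? "B" : pass2 === "B" ? "A" : "Tie";
  return pass1 === pass2Mapped ? pass1 : "Tie"; // disagreement counts as a tie
}
```

Treating disagreement between the two passes as a tie is a simple way to avoid rewarding whichever answer happened to be listed first; a stricter harness could instead re-run or flag the pair for review.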
## Usage Example

```typescript
import { evaluatorAgent } from "./agents/evaluator-agent";

// generatedResponse, responseA, and responseB come from an upstream generation
// step; they are interpolated into the evaluation prompts below.

// Direct scoring
const evaluation = await evaluatorAgent.generate({
  prompt: `Evaluate the following response:

Original Question: "Explain quantum entanglement to a high school student"

Response: "${generatedResponse}"

Criteria:
1. Accuracy - Scientific correctness
2. Clarity - Understandable for target audience
3. Engagement - Interesting and memorable
4. Completeness - Covers key concepts

Provide scores and detailed feedback.`
});

// Pairwise comparison
const comparison = await evaluatorAgent.generate({
  prompt: `Compare these two responses to the same question.

Question: "What are the benefits of exercise?"

Response A: "${responseA}"

Response B: "${responseB}"

Which response is better? Explain your reasoning.`
});
```

## Integration Points

- **Content Generation Pipeline**: Evaluate outputs before delivery
- **Model Comparison**: Compare responses from different models
- **Quality Monitoring**: Track response quality over time
- **Fine-tuning Data**: Generate preference data for RLHF (see the sketch below)
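To make the fine-tuning integration point concrete, here is a minimal sketch of turning pairwise verdicts into chosen/rejected preference pairs of the kind commonly used for RLHF or DPO-style training. It reuses the `compareUnbiased` helper sketched above (imported here from a hypothetical `./compare-unbiased` module); the record shape is an assumption, not a format this skill prescribes.

```typescript
import { compareUnbiased } from "./compare-unbiased";

// Assumed record shape: one preference pair per prompt, in the
// chosen/rejected layout most preference-tuning pipelines expect.
interface PreferencePair {
  prompt: string;
  chosen: string;
  rejected: string;
}

export async function buildPreferenceData(
  samples: { prompt: string; responseA: string; responseB: string }[]
): Promise<PreferencePair[]> {
  const pairs: PreferencePair[] = [];
  for (const { prompt, responseA, responseB } of samples) {
    const verdict = await compareUnbiased(prompt, responseA, responseB);
    if (verdict === "Tie") continue; // ties carry no preference signal
    pairs.push({
      prompt,
      chosen: verdict === "A" ? responseA : responseB,
      rejected: verdict === "A" ? responseB : responseA
    });
  }
  return pairs;
}
```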