Evaluator Agent
Purpose
The Evaluator Agent assesses the quality of LLM-generated responses using configurable evaluation criteria. It implements the LLM-as-a-Judge pattern with support for both direct scoring and pairwise comparison.
Agent Definition
```ts
import { ToolLoopAgent } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { evaluationTools } from "../tools";

export const evaluatorAgent = new ToolLoopAgent({
  name: "evaluator",
  model: anthropic("claude-sonnet-4-20250514"),
  instructions: `You are an expert evaluator of LLM-generated content.

Your role is to:
1. Assess response quality against specific criteria
2. Provide structured scores with justifications
3. Identify specific issues and strengths
4. Compare responses when asked for pairwise evaluation

Evaluation Guidelines:
- Be objective and consistent in your assessments
- Ground evaluations in specific evidence from the response
- Consider the context and requirements of the original task
- Avoid position bias - evaluate content, not placement
- Do not favor verbose responses unless verbosity adds value

Always provide:
- Numerical scores for each criterion
- Specific examples supporting your assessment
- Actionable feedback for improvement`,
  tools: {
    directScore: evaluationTools.directScore,
    pairwiseCompare: evaluationTools.pairwiseCompare,
    extractCriteria: evaluationTools.extractCriteria,
    generateRubric: evaluationTools.generateRubric,
  },
});
```

Capabilities
Direct Scoring
Evaluate a single response against defined criteria and a scoring rubric.
Input:
- Response to evaluate
- Original prompt/context
- Evaluation criteria
- Scoring rubric
Output:
- Score per criterion (1-5)
- Overall score
- Detailed justification
- Identified issues and strengths
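As a rough sketch, this contract could be expressed as TypeScript types; the names below are illustrative assumptions, not the actual exports of `../tools`:

```ts
// Illustrative input/output shapes for direct scoring (assumed names).
interface DirectScoreInput {
  response: string;       // response to evaluate
  originalPrompt: string; // original prompt/context
  criteria: string[];     // evaluation criteria
  rubric: string;         // scoring rubric
}

interface DirectScoreOutput {
  criterionScores: Record<string, number>; // 1-5 per criterion
  overallScore: number;                    // e.g. a weighted mean
  justification: string;                   // detailed justification
  issues: string[];
  strengths: string[];
}
```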
Pairwise Comparison
Compare two responses and select the better one.
Input:
- Response A
- Response B
- Original prompt/context
- Comparison criteria
Output:
- Winner selection (A, B, or Tie)
- Confidence score
- Comparative analysis
- Specific differentiators
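The same contract-as-types sketch for pairwise comparison (again, assumed names rather than the actual tool definitions):

```ts
// Illustrative input/output shapes for pairwise comparison (assumed names).
interface PairwiseCompareInput {
  responseA: string;
  responseB: string;
  originalPrompt: string;
  criteria: string[];
}

interface PairwiseCompareOutput {
  winner: "A" | "B" | "tie";
  confidence: number;        // e.g. 0-1
  analysis: string;          // comparative analysis
  differentiators: string[]; // specific points that decided the call
}
```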
Criteria Extraction
Automatically extract evaluation criteria from a task description.
Input:
- Task description
- Domain context
- Quality expectations
Output:
- List of relevant criteria
- Criterion descriptions
- Suggested weights
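A sketch of how extracted criteria might be shaped (assumed names):

```ts
// Illustrative shapes for criteria extraction (assumed names).
interface ExtractCriteriaInput {
  taskDescription: string;
  domain?: string;
  qualityExpectations?: string;
}

interface ExtractedCriterion {
  name: string;
  description: string;
  weight: number; // suggested relative weight
}

type ExtractCriteriaOutput = ExtractedCriterion[];
```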
Rubric Generation
Generate a scoring rubric for specific criteria.
Input:
- Criterion name
- Quality dimensions
- Scale (default 1-5)
Output:
- Rubric with score descriptions
- Examples for each level
- Edge case guidance
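And a possible shape for generated rubrics (assumed names):

```ts
// Illustrative shapes for rubric generation (assumed names).
interface GenerateRubricInput {
  criterion: string;
  dimensions: string[];
  scale?: number; // defaults to 5, i.e. a 1-5 scale
}

interface RubricLevel {
  score: number;       // position on the scale
  description: string; // what a response at this level looks like
  example: string;     // example for this level
}

interface GenerateRubricOutput {
  levels: RubricLevel[];
  edgeCaseGuidance: string;
}
```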
Configuration
```ts
interface EvaluatorConfig {
  // Scoring configuration
  scoringMode: "direct" | "pairwise";
  useChainOfThought: boolean;
  nShotExamples: number;

  // Bias mitigation
  swapPositionsForPairwise: boolean;
  normalizeForLength: boolean;

  // Output configuration
  includeJustification: boolean;
  includeExamples: boolean;
  outputFormat: "structured" | "prose";
}

const defaultConfig: EvaluatorConfig = {
  scoringMode: "direct",
  useChainOfThought: true,
  nShotExamples: 2,
  swapPositionsForPairwise: true,
  normalizeForLength: false,
  includeJustification: true,
  includeExamples: true,
  outputFormat: "structured",
};
```
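The `swapPositionsForPairwise` flag addresses the position bias called out in the agent instructions. A minimal sketch of the mitigation it enables, assuming a hypothetical `judgePair` wrapper around the `pairwiseCompare` tool:

```ts
type Verdict = "A" | "B" | "tie";

// Judge twice with candidate order reversed; disagreement is reported as a tie.
// `judgePair` is a hypothetical wrapper around the pairwiseCompare tool.
async function debiasedCompare(
  judgePair: (first: string, second: string) => Promise<Verdict>,
  responseA: string,
  responseB: string
): Promise<Verdict> {
  const pass1 = await judgePair(responseA, responseB);
  const swapped = await judgePair(responseB, responseA);
  // Map the swapped-order verdict back onto the original labels.
  const pass2: Verdict = swapped === "A" ? "B" : swapped === "B" ? "A" : "tie";
  return pass1 === pass2 ? pass1 : "tie";
}
```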
Usage Example

```ts
import { evaluatorAgent } from "./agents/evaluator-agent";

// Direct scoring
const evaluation = await evaluatorAgent.generate({
  prompt: `Evaluate the following response:

Original Question: "Explain quantum entanglement to a high school student"

Response: "${generatedResponse}"

Criteria:
1. Accuracy - Scientific correctness
2. Clarity - Understandable for target audience
3. Engagement - Interesting and memorable
4. Completeness - Covers key concepts

Provide scores and detailed feedback.`,
});

// Pairwise comparison
const comparison = await evaluatorAgent.generate({
  prompt: `Compare these two responses to the same question.

Question: "What are the benefits of exercise?"

Response A: "${responseA}"

Response B: "${responseB}"

Which response is better? Explain your reasoning.`,
});
```

Integration Points
- Content Generation Pipeline: Evaluate outputs before delivery
- Model Comparison: Compare responses from different models
- Quality Monitoring: Track response quality over time
- Fine-tuning Data: Generate preference data for RLHF (see the sketch below)
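For the fine-tuning use case, a pairwise verdict maps naturally onto a preference record. A hedged sketch; the record shape and helper are assumptions, not part of the agent:

```ts
// Hypothetical conversion of a pairwise verdict into an RLHF preference pair.
interface PreferencePair {
  prompt: string;
  chosen: string;
  rejected: string;
}

function toPreferencePair(
  prompt: string,
  responseA: string,
  responseB: string,
  winner: "A" | "B" | "tie"
): PreferencePair | null {
  if (winner === "tie") return null; // ties carry no preference signal
  return {
    prompt,
    chosen: winner === "A" ? responseA : responseB,
    rejected: winner === "A" ? responseB : responseA,
  };
}
```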