Evaluator Agent
Purpose
The Evaluator Agent assesses the quality of LLM-generated responses using configurable evaluation criteria. It implements the LLM-as-a-Judge pattern with support for both direct scoring and pairwise comparison.
Agent Definition
```ts
import { ToolLoopAgent } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { evaluationTools } from "../tools";

export const evaluatorAgent = new ToolLoopAgent({
  name: "evaluator",
  model: anthropic("claude-sonnet-4-20250514"),
  instructions: `You are an expert evaluator of LLM-generated content.

Your role is to:
1. Assess response quality against specific criteria
2. Provide structured scores with justifications
3. Identify specific issues and strengths
4. Compare responses when asked for pairwise evaluation

Evaluation Guidelines:
- Be objective and consistent in your assessments
- Ground evaluations in specific evidence from the response
- Consider the context and requirements of the original task
- Avoid position bias - evaluate content, not placement
- Do not favor verbose responses unless verbosity adds value

Always provide:
- Numerical scores for each criterion
- Specific examples supporting your assessment
- Actionable feedback for improvement`,
  tools: {
    directScore: evaluationTools.directScore,
    pairwiseCompare: evaluationTools.pairwiseCompare,
    extractCriteria: evaluationTools.extractCriteria,
    generateRubric: evaluationTools.generateRubric,
  },
});
```

Capabilities
Direct Scoring
Evaluate a single response against defined criteria and a scoring rubric.
Input:
- Response to evaluate
- Original prompt/context
- Evaluation criteria
- Scoring rubric
Output:
- Score per criterion (1-5)
- Overall score
- Detailed justification
- Identified issues and strengths
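As a rough sketch, this contract could be expressed as TypeScript types; the names below are illustrative assumptions, not the actual exports of `../tools`:

```ts
// Illustrative input/output shapes for direct scoring (assumed names).
interface DirectScoreInput {
  response: string;       // response to evaluate
  originalPrompt: string; // original prompt/context
  criteria: string[];     // evaluation criteria
  rubric: string;         // scoring rubric
}

interface DirectScoreOutput {
  criterionScores: Record<string, number>; // 1-5 per criterion
  overallScore: number;                    // e.g. a weighted mean
  justification: string;                   // detailed justification
  issues: string[];
  strengths: string[];
}
```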
Pairwise Comparison
Compare two responses and select the better one.
Input:
- Response A
- Response B
- Original prompt/context
- Comparison criteria
Output:
- Winner selection (A, B, or Tie)
- Confidence score
- Comparative analysis
- Specific differentiators
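The same contract-as-types sketch for pairwise comparison (again, assumed names rather than the actual tool definitions):

```ts
// Illustrative input/output shapes for pairwise comparison (assumed names).
interface PairwiseCompareInput {
  responseA: string;
  responseB: string;
  originalPrompt: string;
  criteria: string[];
}

interface PairwiseCompareOutput {
  winner: "A" | "B" | "tie";
  confidence: number;        // e.g. 0-1
  analysis: string;          // comparative analysis
  differentiators: string[]; // specific points that decided the call
}
```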
Criteria Extraction
Automatically extract evaluation criteria from a task description.
Input:
- Task description
- Domain context
- Quality expectations
Output:
- List of relevant criteria
- Criterion descriptions
- Suggested weights
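A sketch of how extracted criteria might be shaped (assumed names):

```ts
// Illustrative shapes for criteria extraction (assumed names).
interface ExtractCriteriaInput {
  taskDescription: string;
  domain?: string;
  qualityExpectations?: string;
}

interface ExtractedCriterion {
  name: string;
  description: string;
  weight: number; // suggested relative weight
}

type ExtractCriteriaOutput = ExtractedCriterion[];
```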
Rubric Generation
Generate a scoring rubric for specific criteria.
Input:
- Criterion name
- Quality dimensions
- Scale (default 1-5)
Output:
- Rubric with score descriptions
- Examples for each level
- Edge case guidance
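And a possible shape for generated rubrics (assumed names):

```ts
// Illustrative shapes for rubric generation (assumed names).
interface GenerateRubricInput {
  criterion: string;
  dimensions: string[];
  scale?: number; // defaults to 5, i.e. a 1-5 scale
}

interface RubricLevel {
  score: number;       // position on the scale
  description: string; // what a response at this level looks like
  example: string;     // example for this level
}

interface GenerateRubricOutput {
  levels: RubricLevel[];
  edgeCaseGuidance: string;
}
```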
Configuration
```ts
interface EvaluatorConfig {
  // Scoring configuration
  scoringMode: "direct" | "pairwise";
  useChainOfThought: boolean;
  nShotExamples: number;

  // Bias mitigation
  swapPositionsForPairwise: boolean;
  normalizeForLength: boolean;

  // Output configuration
  includeJustification: boolean;
  includeExamples: boolean;
  outputFormat: "structured" | "prose";
}

const defaultConfig: EvaluatorConfig = {
  scoringMode: "direct",
  useChainOfThought: true,
  nShotExamples: 2,
  swapPositionsForPairwise: true,
  normalizeForLength: false,
  includeJustification: true,
  includeExamples: true,
  outputFormat: "structured",
};
```
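The `swapPositionsForPairwise` flag addresses the position bias called out in the agent instructions. A minimal sketch of the mitigation it enables, assuming a hypothetical `judgePair` wrapper around the `pairwiseCompare` tool:

```ts
type Verdict = "A" | "B" | "tie";

// Judge twice with candidate order reversed; disagreement is reported as a tie.
// `judgePair` is a hypothetical wrapper around the pairwiseCompare tool.
async function debiasedCompare(
  judgePair: (first: string, second: string) => Promise<Verdict>,
  responseA: string,
  responseB: string
): Promise<Verdict> {
  const pass1 = await judgePair(responseA, responseB);
  const swapped = await judgePair(responseB, responseA);
  // Map the swapped-order verdict back onto the original labels.
  const pass2: Verdict = swapped === "A" ? "B" : swapped === "B" ? "A" : "tie";
  return pass1 === pass2 ? pass1 : "tie";
}
```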
Usage Example

```ts
import { evaluatorAgent } from "./agents/evaluator-agent";

// Direct scoring
const evaluation = await evaluatorAgent.generate({
  prompt: `Evaluate the following response:

Original Question: "Explain quantum entanglement to a high school student"

Response: "${generatedResponse}"

Criteria:
1. Accuracy - Scientific correctness
2. Clarity - Understandable for target audience
3. Engagement - Interesting and memorable
4. Completeness - Covers key concepts

Provide scores and detailed feedback.`,
});

// Pairwise comparison
const comparison = await evaluatorAgent.generate({
  prompt: `Compare these two responses to the same question.

Question: "What are the benefits of exercise?"

Response A: "${responseA}"

Response B: "${responseB}"

Which response is better? Explain your reasoning.`,
});
```

Integration Points
- Content Generation Pipeline: Evaluate outputs before delivery
- Model Comparison: Compare responses from different models
- Quality Monitoring: Track response quality over time
- Fine-tuning Data: Generate preference data for RLHF (see the sketch below)
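For the fine-tuning use case, a pairwise verdict maps naturally onto a preference record. A hedged sketch; the record shape and helper are assumptions, not part of the agent:

```ts
// Hypothetical conversion of a pairwise verdict into an RLHF preference pair.
interface PreferencePair {
  prompt: string;
  chosen: string;
  rejected: string;
}

function toPreferencePair(
  prompt: string,
  responseA: string,
  responseB: string,
  winner: "A" | "B" | "tie"
): PreferencePair | null {
  if (winner === "tie") return null; // ties carry no preference signal
  return {
    prompt,
    chosen: winner === "A" ? responseA : responseB,
    rejected: winner === "A" ? responseB : responseA,
  };
}
```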