Pairwise Compare Tool

Purpose

Compare two LLM responses and determine which one better satisfies the given criteria. More reliable for subjective evaluations than direct scoring.

Tool Definition

import { tool } from "ai";
import { z } from "zod";

export const pairwiseCompare = tool({
  description: `Compare two responses and select the better one.
Use for subjective evaluations like tone, persuasiveness, or writing style.
More reliable than direct scoring for preferences.
Returns winner selection with detailed comparison.`,

  parameters: z.object({
    responseA: z.string()
      .describe("First response to compare"),
    
    responseB: z.string()
      .describe("Second response to compare"),
    
    prompt: z.string()
      .describe("The original prompt both responses address"),
    
    context: z.string().optional()
      .describe("Additional context if relevant"),
    
    criteria: z.array(z.string())
      .describe("Aspects to compare on, e.g., ['clarity', 'engagement', 'accuracy']"),
    
    allowTie: z.boolean().default(true)
      .describe("Whether to allow a tie verdict"),
    
    swapPositions: z.boolean().default(true)
      .describe("Evaluate twice with positions swapped to reduce position bias")
  }),

  execute: async (input) => {
    if (input.swapPositions) {
      return evaluateWithPositionSwap(input);
    }
    return evaluatePairwise(input);
  }
});

Input Schema

Field	Type	Required	Description
responseA	string	Yes	First response
responseB	string	Yes	Second response
prompt	string	Yes	Original prompt
context	string	No	Additional context
criteria	string[]	Yes	Comparison dimensions
allowTie	boolean	No	Allow tie verdict (default: true)
swapPositions	boolean	No	Swap positions to reduce bias (default: true)

Output Schema

interface PairwiseCompareResult {
  success: boolean;
  
  winner: "A" | "B" | "TIE";
  confidence: number; // 0-1
  
  comparison: {
    criterion: string;
    winner: "A" | "B" | "TIE";
    reasoning: string;
    aStrength: string;
    bStrength: string;
  }[];
  
  overallReasoning: string;
  
  differentiators: {
    aAdvantages: string[];
    bAdvantages: string[];
  };
  
  // If swapPositions was true
  positionConsistency?: {
    firstPassWinner: "A" | "B" | "TIE";
    secondPassWinner: "A" | "B" | "TIE";
    consistent: boolean;
  };
  
  metadata: {
    evaluationTimeMs: number;
    positionsSwapped: boolean;
  };
}

Usage Example

const result = await pairwiseCompare.execute({
  responseA: "Exercise improves cardiovascular health, builds muscle, and boosts mental wellbeing...",
  
  responseB: "Working out regularly has many benefits. You'll feel better and look better...",
  
  prompt: "Explain the benefits of regular exercise",
  
  criteria: ["accuracy", "specificity", "engagement", "completeness"],
  
  allowTie: true,
  swapPositions: true
});

// Result:
// {
//   winner: "A",
//   confidence: 0.85,
//   comparison: [
//     {
//       criterion: "accuracy",
//       winner: "A",
//       reasoning: "Response A uses specific medical terminology...",
//       aStrength: "Mentions cardiovascular, muscle, mental health",
//       bStrength: "General but not incorrect"
//     },
//     ...
//   ],
//   ...
// }

Position Swap Algorithm

To mitigate position bias:

async function evaluateWithPositionSwap(input) {
  // First pass: Original order
  const pass1 = await evaluate({
    first: input.responseA,
    second: input.responseB,
    ...input
  });
  
  // Second pass: Swapped order
  const pass2 = await evaluate({
    first: input.responseB,
    second: input.responseA,
    ...input
  });
  
  // Reconcile results
  const pass2Adjusted = pass2.winner === "A" ? "B" : pass2.winner === "B" ? "A" : "TIE";
  
  if (pass1.winner === pass2Adjusted) {
    return {
      ...pass1,
      positionConsistency: { consistent: true, ... }
    };
  } else {
    // Inconsistent - return tie or lower confidence
    return {
      winner: "TIE",
      confidence: 0.5,
      positionConsistency: { consistent: false, ... },
      ...
    };
  }
}

Implementation Notes

Position Bias Mitigation: Always use swapPositions: true for production
Criteria Order: Order criteria by importance for better focus
Tie Handling: Consider domain - some tasks should rarely tie
Confidence Calibration: Lower confidence when evaluations are close
Length Considerations: Note if one response is significantly longer

Preparing the source view

Agent Skills for Context Engineering

examples/llm-as-judge-skills/tools/evaluation/pairwise-compare.md

Pairwise Compare Tool

Purpose

Tool Definition

Input Schema

Output Schema

Usage Example

Position Swap Algorithm

Implementation Notes