Source from repo
A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems.
examples/llm-as-judge-skills/tools/evaluation/pairwise-compare.md
# Pairwise Compare Tool

## Purpose

Compare two LLM responses and determine which one better satisfies the given criteria. More reliable for subjective evaluations than direct scoring.

## Tool Definition

```typescript
import { tool } from "ai";
import { z } from "zod";

export const pairwiseCompare = tool({
  description: `Compare two responses and select the better one.
Use for subjective evaluations like tone, persuasiveness, or writing style.
More reliable than direct scoring for preferences.
Returns winner selection with detailed comparison.`,

  parameters: z.object({
    responseA: z.string()
      .describe("First response to compare"),

    responseB: z.string()
      .describe("Second response to compare"),

    prompt: z.string()
      .describe("The original prompt both responses address"),

    context: z.string().optional()
      .describe("Additional context if relevant"),

    criteria: z.array(z.string())
      .describe("Aspects to compare on, e.g., ['clarity', 'engagement', 'accuracy']"),

    allowTie: z.boolean().default(true)
      .describe("Whether to allow a tie verdict"),

    swapPositions: z.boolean().default(true)
      .describe("Evaluate twice with positions swapped to reduce position bias")
  }),

  execute: async (input) => {
    if (input.swapPositions) {
      return evaluateWithPositionSwap(input);
    }
    return evaluatePairwise(input);
  }
});
```

## Input Schema

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| responseA | string | Yes | First response |
| responseB | string | Yes | Second response |
| prompt | string | Yes | Original prompt |
| context | string | No | Additional context |
| criteria | string[] | Yes | Comparison dimensions |
| allowTie | boolean | No | Allow tie verdict (default: true) |
| swapPositions | boolean | No | Swap positions to reduce bias (default: true) |

## Output Schema

```typescript
interface PairwiseCompareResult {
  success: boolean;

  winner: "A" | "B" | "TIE";
  confidence: number; // 0-1

  comparison: {
    criterion: string;
    winner: "A" | "B" | "TIE";
    reasoning: string;
    aStrength: string;
    bStrength: string;
  }[];

  overallReasoning: string;

  differentiators: {
    aAdvantages: string[];
    bAdvantages: string[];
  };

  // If swapPositions was true
  positionConsistency?: {
    firstPassWinner: "A" | "B" | "TIE";
    secondPassWinner: "A" | "B" | "TIE";
    consistent: boolean;
  };

  metadata: {
    evaluationTimeMs: number;
    positionsSwapped: boolean;
  };
}
```

## Usage Example

```typescript
const result = await pairwiseCompare.execute({
  responseA: "Exercise improves cardiovascular health, builds muscle, and boosts mental wellbeing...",

  responseB: "Working out regularly has many benefits. You'll feel better and look better...",

  prompt: "Explain the benefits of regular exercise",

  criteria: ["accuracy", "specificity", "engagement", "completeness"],

  allowTie: true,
  swapPositions: true
});

// Result:
// {
//   winner: "A",
//   confidence: 0.85,
//   comparison: [
//     {
//       criterion: "accuracy",
//       winner: "A",
//       reasoning: "Response A uses specific medical terminology...",
//       aStrength: "Mentions cardiovascular, muscle, mental health",
//       bStrength: "General but not incorrect"
//     },
//     ...
//   ],
//   ...
// }
```

## Position Swap Algorithm

To mitigate position bias:

```typescript
async function evaluateWithPositionSwap(input) {
  // First pass: Original order
  const pass1 = await evaluate({
    first: input.responseA,
    second: input.responseB,
    ...input
  });

  // Second pass: Swapped order
  const pass2 = await evaluate({
    first: input.responseB,
    second: input.responseA,
    ...input
  });

  // Reconcile results
  const pass2Adjusted = pass2.winner === "A" ? "B" : pass2.winner === "B" ? "A" : "TIE";

  if (pass1.winner === pass2Adjusted) {
    return {
      ...pass1,
      positionConsistency: { consistent: true, ... }
    };
  } else {
    // Inconsistent - return tie or lower confidence
    return {
      winner: "TIE",
      confidence: 0.5,
      positionConsistency: { consistent: false, ... },
      ...
    };
  }
}
```

## Implementation Notes

1. **Position Bias Mitigation**: Always use `swapPositions: true` for production
2. **Criteria Order**: Order criteria by importance for better focus
3. **Tie Handling**: Consider domain - some tasks should rarely tie
4. **Confidence Calibration**: Lower confidence when evaluations are close
5. **Length Considerations**: Note if one response is significantly longer
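The reconciliation step of the Position Swap Algorithm can be pulled out into a pure function, which makes it testable without calling a model. The following is a minimal sketch, not the skill's actual implementation: `PassResult`, `ReconciledResult`, `unswap`, `reconcilePasses`, `compareWithSwap`, and the `Judge` callback are hypothetical names. Averaging the two confidences on agreement is an assumption (the pseudocode above reuses the first pass's result), as is running both passes concurrently.

```typescript
type Verdict = "A" | "B" | "TIE";

// Result of one judge pass. "A" means the response shown in the first slot won.
interface PassResult {
  winner: Verdict;
  confidence: number; // 0-1
}

interface ReconciledResult {
  winner: Verdict;
  confidence: number;
  positionConsistency: {
    firstPassWinner: Verdict;
    secondPassWinner: Verdict; // already mapped back to the original labels
    consistent: boolean;
  };
}

// Map a verdict from the swapped pass back to the original A/B labels.
function unswap(winner: Verdict): Verdict {
  if (winner === "A") return "B";
  if (winner === "B") return "A";
  return "TIE";
}

// Pure reconciliation: agree -> keep the verdict, disagree -> fall back to a tie.
function reconcilePasses(pass1: PassResult, pass2: PassResult): ReconciledResult {
  const secondPassWinner = unswap(pass2.winner);
  const consistent = pass1.winner === secondPassWinner;
  return {
    winner: consistent ? pass1.winner : "TIE",
    // Assumption: average the confidences on agreement, 0.5 on disagreement.
    confidence: consistent ? (pass1.confidence + pass2.confidence) / 2 : 0.5,
    positionConsistency: {
      firstPassWinner: pass1.winner,
      secondPassWinner,
      consistent,
    },
  };
}

// Hypothetical judge callback: evaluates two responses in the given order.
type Judge = (first: string, second: string) => Promise<PassResult>;

// Runs both orderings (they are independent, so they can run concurrently)
// and reconciles the two verdicts.
async function compareWithSwap(
  responseA: string,
  responseB: string,
  judge: Judge
): Promise<ReconciledResult> {
  const [pass1, pass2] = await Promise.all([
    judge(responseA, responseB),
    judge(responseB, responseA),
  ]);
  return reconcilePasses(pass1, pass2);
}
```

Falling back to `"TIE"` on disagreement conflicts with `allowTie: false`. A caller that forbids ties would need a different fallback, for example a further evaluation pass or keeping the higher-confidence verdict with reduced confidence.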
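The Confidence Calibration note can be illustrated with a simple margin-based heuristic over the per-criterion verdicts in `comparison`. This is a sketch under stated assumptions, not the skill's method: `CriterionVerdict` and `aggregateVerdicts` are hypothetical names, every criterion is weighted equally, and the formula `0.5 + 0.5 * margin` is an illustrative choice that yields 0.5 for an even split and 1.0 for a clean sweep.

```typescript
type Verdict = "A" | "B" | "TIE";

// Minimal subset of one `comparison` entry from the output schema.
interface CriterionVerdict {
  criterion: string;
  winner: Verdict;
}

// Derive an overall winner and a confidence that shrinks as the
// per-criterion results get closer (hypothetical heuristic).
function aggregateVerdicts(
  comparison: CriterionVerdict[]
): { winner: Verdict; confidence: number } {
  const aWins = comparison.filter((c) => c.winner === "A").length;
  const bWins = comparison.filter((c) => c.winner === "B").length;
  const total = comparison.length;

  // No criteria or an even split: report a tie at the lowest confidence.
  if (total === 0 || aWins === bWins) {
    return { winner: "TIE", confidence: 0.5 };
  }

  // Margin is the win difference as a fraction of all criteria (0 to 1).
  const margin = Math.abs(aWins - bWins) / total;
  return {
    winner: aWins > bWins ? "A" : "B",
    confidence: 0.5 + 0.5 * margin,
  };
}

const example = aggregateVerdicts([
  { criterion: "accuracy", winner: "A" },
  { criterion: "specificity", winner: "A" },
  { criterion: "engagement", winner: "B" },
  { criterion: "completeness", winner: "A" },
]);
console.log(example); // { winner: "A", confidence: 0.75 }
```

Since the notes recommend ordering criteria by importance, a weighted variant that gives earlier criteria more influence would be a natural extension; the equal weighting here just keeps the sketch short.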