Agent Skills for Context Engineering

A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems.

Source repo: muratcankoylan (GitHub)

examples/llm-as-judge-skills/tools/evaluation/pairwise-compare.md

# Pairwise Compare Tool

## Purpose

Compare two LLM responses and determine which one better satisfies the given criteria. Pairwise comparison is more reliable than direct scoring for subjective evaluations.

## Tool Definition

```typescript
import { tool } from "ai";
import { z } from "zod";

// evaluatePairwise and evaluateWithPositionSwap are assumed to be defined
// elsewhere in the module (see "Position Swap Algorithm" below).
export const pairwiseCompare = tool({
  description: `Compare two responses and select the better one.
Use for subjective evaluations like tone, persuasiveness, or writing style.
More reliable than direct scoring for preferences.
Returns winner selection with detailed comparison.`,

  parameters: z.object({
    responseA: z.string()
      .describe("First response to compare"),

    responseB: z.string()
      .describe("Second response to compare"),

    prompt: z.string()
      .describe("The original prompt both responses address"),

    context: z.string().optional()
      .describe("Additional context if relevant"),

    criteria: z.array(z.string())
      .describe("Aspects to compare on, e.g., ['clarity', 'engagement', 'accuracy']"),

    allowTie: z.boolean().default(true)
      .describe("Whether to allow a tie verdict"),

    swapPositions: z.boolean().default(true)
      .describe("Evaluate twice with positions swapped to reduce position bias")
  }),

  // Run the two-pass evaluation when position swapping is enabled,
  // otherwise fall back to a single pass.
  execute: async (input) => {
    if (input.swapPositions) {
      return evaluateWithPositionSwap(input);
    }
    return evaluatePairwise(input);
  }
});
```

## Input Schema

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| responseA | string | Yes | First response |
| responseB | string | Yes | Second response |
| prompt | string | Yes | Original prompt |
| context | string | No | Additional context |
| criteria | string[] | Yes | Comparison dimensions |
| allowTie | boolean | No | Allow tie verdict (default: true) |
| swapPositions | boolean | No | Swap positions to reduce bias (default: true) |

## Output Schema

```typescript
interface PairwiseCompareResult {
  success: boolean;

  winner: "A" | "B" | "TIE";
  confidence: number; // 0-1

  comparison: {
    criterion: string;
    winner: "A" | "B" | "TIE";
    reasoning: string;
    aStrength: string;
    bStrength: string;
  }[];

  overallReasoning: string;

  differentiators: {
    aAdvantages: string[];
    bAdvantages: string[];
  };

  // If swapPositions was true
  positionConsistency?: {
    firstPassWinner: "A" | "B" | "TIE";
    secondPassWinner: "A" | "B" | "TIE";
    consistent: boolean;
  };

  metadata: {
    evaluationTimeMs: number;
    positionsSwapped: boolean;
  };
}
```
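
The schema does not state how the top-level `winner` relates to the per-criterion verdicts in `comparison`. As a rough sanity check, a consumer could tally the per-criterion winners and compare the majority with the reported `winner`. The sketch below assumes an unweighted majority; `tallyCriteria` and `CriterionResult` are hypothetical names, not part of the tool.

```typescript
type Verdict = "A" | "B" | "TIE";

// Only the fields needed for the tally; the full schema has more.
interface CriterionResult {
  criterion: string;
  winner: Verdict;
}

// Hypothetical helper: counts per-criterion verdicts and picks the side
// with more criterion wins. Equal counts yield "TIE".
function tallyCriteria(comparison: CriterionResult[]): {
  counts: Record<Verdict, number>;
  winner: Verdict;
} {
  const counts: Record<Verdict, number> = { A: 0, B: 0, TIE: 0 };
  for (const item of comparison) {
    counts[item.winner] += 1;
  }
  const winner: Verdict =
    counts.A > counts.B ? "A" : counts.B > counts.A ? "B" : "TIE";
  return { counts, winner };
}

const summary = tallyCriteria([
  { criterion: "accuracy", winner: "A" },
  { criterion: "specificity", winner: "A" },
  { criterion: "engagement", winner: "B" },
  { criterion: "completeness", winner: "TIE" }
]);
console.log(summary.winner, summary.counts);
```

With these four verdicts the tally favors A (two criterion wins against one). A mismatch between this majority and the reported `winner` is not necessarily an error, since the judge may weight criteria unequally, but it can be worth logging.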

## Usage Example

```typescript
// Calling execute directly, e.g. in a test. Depending on the AI SDK version,
// execute may also expect a second options argument.
const result = await pairwiseCompare.execute({
  responseA: "Exercise improves cardiovascular health, builds muscle, and boosts mental wellbeing...",

  responseB: "Working out regularly has many benefits. You'll feel better and look better...",

  prompt: "Explain the benefits of regular exercise",

  criteria: ["accuracy", "specificity", "engagement", "completeness"],

  allowTie: true,
  swapPositions: true
});

// Result:
// {
//   winner: "A",
//   confidence: 0.85,
//   comparison: [
//     {
//       criterion: "accuracy",
//       winner: "A",
//       reasoning: "Response A uses specific medical terminology...",
//       aStrength: "Mentions cardiovascular, muscle, mental health",
//       bStrength: "General but not incorrect"
//     },
//     ...
//   ],
//   ...
// }
```

## Position Swap Algorithm

To mitigate position bias, evaluate the pair twice with the order swapped, then reconcile the two verdicts:

```typescript
// `evaluate` is assumed to be a single-pass judge call defined elsewhere.
// It sees the responses as `first` and `second` and reports the winner
// positionally ("A" = first, "B" = second).
async function evaluateWithPositionSwap(input) {
  // First pass: original order
  const pass1 = await evaluate({
    first: input.responseA,
    second: input.responseB,
    ...input
  });

  // Second pass: swapped order
  const pass2 = await evaluate({
    first: input.responseB,
    second: input.responseA,
    ...input
  });

  // Reconcile results: map the second-pass verdict back to the original labels
  const pass2Adjusted = pass2.winner === "A" ? "B" : pass2.winner === "B" ? "A" : "TIE";

  // Assumption: secondPassWinner stores the adjusted (un-swapped) verdict
  const positionConsistency = {
    firstPassWinner: pass1.winner,
    secondPassWinner: pass2Adjusted,
    consistent: pass1.winner === pass2Adjusted
  };

  if (positionConsistency.consistent) {
    return { ...pass1, positionConsistency };
  }

  // Inconsistent - return tie or lower confidence.
  // The source elides the remaining fields; carrying them over from pass1
  // is a placeholder choice.
  return {
    ...pass1,
    winner: "TIE",
    confidence: 0.5,
    positionConsistency
  };
}
```
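
To see the reconciliation step in isolation, here is a minimal self-contained sketch that replaces the LLM call with stub judges. The `Judge` type, `compareWithSwap`, `unswap`, and both stubs are hypothetical names introduced for illustration. The only behavior taken from the algorithm above is the two-pass swap, the label mapping, and the fallback to a tie with confidence 0.5.

```typescript
type Verdict = "A" | "B" | "TIE";

interface PassResult {
  winner: Verdict; // positional: "A" = shown first, "B" = shown second
  confidence: number;
}

// Hypothetical single-pass judge signature (stands in for `evaluate`).
type Judge = (first: string, second: string) => Promise<PassResult>;

// Map a verdict from the swapped pass back to the original labels.
function unswap(winner: Verdict): Verdict {
  return winner === "A" ? "B" : winner === "B" ? "A" : "TIE";
}

async function compareWithSwap(judge: Judge, responseA: string, responseB: string) {
  const pass1 = await judge(responseA, responseB);
  const pass2 = await judge(responseB, responseA);
  const secondPassWinner = unswap(pass2.winner);
  const consistent = pass1.winner === secondPassWinner;
  return {
    // Disagreement between passes falls back to a tie at confidence 0.5.
    winner: consistent ? pass1.winner : ("TIE" as Verdict),
    confidence: consistent ? pass1.confidence : 0.5,
    positionConsistency: { firstPassWinner: pass1.winner, secondPassWinner, consistent }
  };
}

// Stub judges for illustration only.
// Fully position-biased: always picks whatever is shown first.
const alwaysFirst: Judge = async () => ({ winner: "A", confidence: 0.9 });
// Content-based: picks the longer text regardless of position.
const prefersLonger: Judge = async (first, second) => ({
  winner: first.length > second.length ? "A" : first.length < second.length ? "B" : "TIE",
  confidence: 0.8
});

compareWithSwap(alwaysFirst, "x", "y").then((r) => console.log("biased judge:", r.winner));
compareWithSwap(prefersLonger, "long answer", "short").then((r) => console.log("content judge:", r.winner));
```

The position-biased stub picks the first slot in both passes, so the adjusted verdicts disagree and the result degrades to a tie. The content-based stub picks the same underlying response in both orders, so its verdict survives the swap.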

## Implementation Notes

1. **Position Bias Mitigation**: Keep `swapPositions: true` in production
2. **Criteria Order**: List criteria in order of importance to focus the comparison
3. **Tie Handling**: Consider the domain; some tasks should rarely end in a tie, and `allowTie` can be set to `false` where a decision is required
4. **Confidence Calibration**: Report lower confidence when the two responses are close
5. **Length Considerations**: Note when one response is significantly longer, since length can sway the verdict
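
Notes 4 and 5 can be approximated with simple heuristics. The sketch below is one possible reading, not the repository's implementation: `marginConfidence` and `lengthNote` are hypothetical helpers, and both the linear margin formula and the 1.5x length threshold are arbitrary illustrative choices.

```typescript
// Hypothetical calibration: confidence grows with the margin of criterion wins.
// An even split gives 0.5; a clean sweep gives 1.
function marginConfidence(aWins: number, bWins: number, total: number): number {
  if (total <= 0) return 0.5;
  const margin = Math.abs(aWins - bWins) / total;
  return 0.5 + 0.5 * margin;
}

// Hypothetical helper: returns a note when one response is much longer than
// the other (by character count), or null when the lengths are comparable.
function lengthNote(responseA: string, responseB: string, threshold = 1.5): string | null {
  const a = responseA.trim().length;
  const b = responseB.trim().length;
  if (a === 0 || b === 0) return "One response is empty";
  const ratio = Math.max(a, b) / Math.min(a, b);
  if (ratio < threshold) return null;
  const longer = a > b ? "A" : "B";
  return `Response ${longer} is ${ratio.toFixed(1)}x longer; check that the verdict is not driven by length alone`;
}

console.log(marginConfidence(3, 1, 4));
console.log(lengthNote("a".repeat(10), "a".repeat(30)));
```

A note like this could be attached to `overallReasoning` or logged alongside the result so that reviewers can spot verdicts that track length rather than quality.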