`examples/llm-as-judge-skills/agents/evaluator-agent/evaluator-agent.md`
# Evaluator Agent

## Purpose

The Evaluator Agent assesses the quality of LLM-generated responses using configurable evaluation criteria. It implements the LLM-as-a-Judge pattern with support for both direct scoring and pairwise comparison.

## Agent Definition

```typescript
import { ToolLoopAgent } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { evaluationTools } from "../tools";

export const evaluatorAgent = new ToolLoopAgent({
  name: "evaluator",
  model: anthropic("claude-sonnet-4-20250514"),
  instructions: `You are an expert evaluator of LLM-generated content.

Your role is to:
1. Assess response quality against specific criteria
2. Provide structured scores with justifications
3. Identify specific issues and strengths
4. Compare responses when asked for pairwise evaluation

Evaluation Guidelines:
- Be objective and consistent in your assessments
- Ground evaluations in specific evidence from the response
- Consider the context and requirements of the original task
- Avoid position bias: evaluate content, not placement
- Do not favor verbose responses unless verbosity adds value

Always provide:
- Numerical scores for each criterion
- Specific examples supporting your assessment
- Actionable feedback for improvement`,

  tools: {
    directScore: evaluationTools.directScore,
    pairwiseCompare: evaluationTools.pairwiseCompare,
    extractCriteria: evaluationTools.extractCriteria,
    generateRubric: evaluationTools.generateRubric
  }
});
```

## Capabilities

### Direct Scoring

Evaluate a single response against defined criteria and a scoring rubric.

**Input:**
- Response to evaluate
- Original prompt/context
- Evaluation criteria
- Scoring rubric

**Output:**
- Score per criterion (1-5)
- Overall score
- Detailed justification
- Identified issues and strengths

### Pairwise Comparison

Compare two responses and select the better one.

**Input:**
- Response A
- Response B
- Original prompt/context
- Comparison criteria

**Output:**
- Winner selection (A, B, or Tie)
- Confidence score
- Comparative analysis
- Specific differentiators

### Criteria Extraction

Automatically extract evaluation criteria from a task description.

**Input:**
- Task description
- Domain context
- Quality expectations

**Output:**
- List of relevant criteria
- Criterion descriptions
- Suggested weights

### Rubric Generation

Generate a scoring rubric for specific criteria.

**Input:**
- Criterion name
- Quality dimensions
- Scale (default 1-5)

**Output:**
- Rubric with score descriptions
- Examples for each level
- Edge case guidance

## Configuration

```typescript
interface EvaluatorConfig {
  // Scoring configuration
  scoringMode: "direct" | "pairwise";
  useChainOfThought: boolean;
  nShotExamples: number;

  // Bias mitigation
  swapPositionsForPairwise: boolean;
  normalizeForLength: boolean;

  // Output configuration
  includeJustification: boolean;
  includeExamples: boolean;
  outputFormat: "structured" | "prose";
}

const defaultConfig: EvaluatorConfig = {
  scoringMode: "direct",
  useChainOfThought: true,
  nShotExamples: 2,
  swapPositionsForPairwise: true,
  normalizeForLength: false,
  includeJustification: true,
  includeExamples: true,
  outputFormat: "structured"
};
```
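The `swapPositionsForPairwise` flag is worth a concrete illustration. Below is a minimal sketch of how a calling harness might apply it: run the comparison twice with the candidate order reversed, and only accept a winner when both passes agree. The `runPairwise` helper, the assumption that the generate result exposes the final text as `result.text`, and the regex-based verdict parsing are all illustrative placeholders, not part of this skill's API.

```typescript
import { evaluatorAgent } from "./agents/evaluator-agent";

type Verdict = "A" | "B" | "Tie";

// Hypothetical helper: compares two responses in the given order and pulls a
// one-word verdict out of the agent's reply. Parsing is simplified for illustration.
async function runPairwise(question: string, first: string, second: string): Promise<Verdict> {
  const result = await evaluatorAgent.generate({
    prompt: `Question: "${question}"

Response A: "${first}"

Response B: "${second}"

Which response is better? Answer with exactly one word: A, B, or Tie.`
  });
  const match = result.text.match(/\b(A|B|Tie)\b/);
  return (match?.[0] as Verdict) ?? "Tie";
}

// Position-swapped comparison: accept a winner only if it survives reordering.
export async function compareUnbiased(question: string, a: string, b: string): Promise<Verdict> {
  const pass1 = await runPairwise(question, a, b); // a presented first
  const pass2 = await runPairwise(question, b, a); // order swapped
  // Map the second verdict back to the original labels before comparing.
  const pass2Mapped: Verdict = pass2 === "A" ? "B" : pass2 === "B" ? "A" : "Tie";
  return pass1 === pass2Mapped ? pass1 : "Tie"; // disagreement counts as a tie
}
```

Treating disagreement between the two passes as a tie is a simple way to avoid rewarding whichever answer happened to be listed first; a stricter harness could instead re-run or flag the pair for review.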
## Usage Example

```typescript
import { evaluatorAgent } from "./agents/evaluator-agent";

// generatedResponse, responseA, and responseB come from an upstream generation
// step; they are interpolated into the evaluation prompts below.

// Direct scoring
const evaluation = await evaluatorAgent.generate({
  prompt: `Evaluate the following response:

Original Question: "Explain quantum entanglement to a high school student"

Response: "${generatedResponse}"

Criteria:
1. Accuracy - Scientific correctness
2. Clarity - Understandable for target audience
3. Engagement - Interesting and memorable
4. Completeness - Covers key concepts

Provide scores and detailed feedback.`
});

// Pairwise comparison
const comparison = await evaluatorAgent.generate({
  prompt: `Compare these two responses to the same question.

Question: "What are the benefits of exercise?"

Response A: "${responseA}"

Response B: "${responseB}"

Which response is better? Explain your reasoning.`
});
```

## Integration Points

- **Content Generation Pipeline**: Evaluate outputs before delivery
- **Model Comparison**: Compare responses from different models
- **Quality Monitoring**: Track response quality over time
- **Fine-tuning Data**: Generate preference data for RLHF (see the sketch below)
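To make the fine-tuning integration point concrete, here is a minimal sketch of turning pairwise verdicts into chosen/rejected preference pairs of the kind commonly used for RLHF or DPO-style training. It reuses the `compareUnbiased` helper sketched above (imported here from a hypothetical `./compare-unbiased` module); the record shape is an assumption, not a format this skill prescribes.

```typescript
import { compareUnbiased } from "./compare-unbiased";

// Assumed record shape: one preference pair per prompt, in the
// chosen/rejected layout most preference-tuning pipelines expect.
interface PreferencePair {
  prompt: string;
  chosen: string;
  rejected: string;
}

export async function buildPreferenceData(
  samples: { prompt: string; responseA: string; responseB: string }[]
): Promise<PreferencePair[]> {
  const pairs: PreferencePair[] = [];
  for (const { prompt, responseA, responseB } of samples) {
    const verdict = await compareUnbiased(prompt, responseA, responseB);
    if (verdict === "Tie") continue; // ties carry no preference signal
    pairs.push({
      prompt,
      chosen: verdict === "A" ? responseA : responseB,
      rejected: verdict === "A" ? responseB : responseA
    });
  }
  return pairs;
}
```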