A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems.
`examples/llm-as-judge-skills/skills/llm-evaluator/llm-evaluator.md`
# LLM-Evaluator Skill

## Overview

LLM-Evaluators (LLM-as-a-Judge) are large language models used to evaluate the quality of another LLM's response to an instruction or query. This skill provides the foundational knowledge for building evaluation systems.

## Key Considerations

### Baseline Selection

- **Human Annotators**: Aim for LLM-human correlation that matches human-human correlation. LLM-evaluators are orders of magnitude faster and cheaper than human annotation.
- **Finetuned Classifiers**: The goal is to match the recall and precision of a finetuned classifier. This is a more challenging baseline, since such classifiers are optimized for specific tasks.

### Scoring Approaches

| Approach | Use Case | Reliability |
|----------|----------|-------------|
| **Direct Scoring** | Objective tasks (factuality, toxicity, instruction-following) | More suitable for binary classification |
| **Pairwise Comparison** | Subjective evaluations (tone, persuasiveness, coherence) | More reliable for preference tasks |
| **Reference-Based** | Comparing against a gold standard | Requires a ground-truth reference |

### Evaluation Metrics

**Classification Metrics** (preferred for binary tasks):

- Recall and Precision
- F1 Score
- Cohen's κ (Kappa)

**Correlation Metrics** (for Likert-scale tasks):

- Spearman's ρ (rho)
- Kendall's τ (tau)

## Known Biases

1. **Position Bias**: LLM-evaluators tend to prefer responses in certain positions during pairwise comparison (usually the first position)
2. **Verbosity Bias**: They favor longer, more verbose responses even when those are not higher quality
3. **Self-Enhancement Bias**: LLM-evaluators prefer answers generated by themselves

## Mitigation Strategies

- Swap response positions and average the results (see the pairwise sketch at the end of this document)
- Normalize for length when evaluating
- Use a Panel of LLMs (PoLL) instead of a single judge
- Include "don't overthink" instructions
- Use CoT + n-shot prompts for reliability

## Implementation Pattern

```typescript
// Metric identifiers, drawn from the Evaluation Metrics section above.
type MetricType =
  | 'recall' | 'precision' | 'f1' | 'cohens-kappa'
  | 'spearman-rho' | 'kendall-tau';

interface EvaluatorConfig {
  scoringApproach: 'direct' | 'pairwise' | 'reference-based';
  criteria: EvaluationCriteria[];
  metrics: MetricType[];
  useCoT: boolean;
  nShot: number;
}

interface EvaluationCriteria {
  name: string;
  description: string;
  rubric: RubricLevel[];
}

interface RubricLevel {
  score: number;
  description: string;
}
```

## References

Key papers reviewed:

- Constitutional AI (Anthropic)
- G-Eval: NLG Evaluation using GPT-4
- SelfCheckGPT: Zero-Resource Hallucination Detection
- Prometheus: Fine-grained Evaluation Capability
- MT-Bench and Chatbot Arena
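
## Example: Evaluator Configuration (Sketch)

To make the implementation pattern concrete, here is a hypothetical configuration built from the interfaces above. The criterion name, rubric wording, metric choices, and `nShot` value are illustrative assumptions rather than prescribed defaults.

```typescript
// Hypothetical direct-scoring configuration for an instruction-following check.
const instructionFollowingEvaluator: EvaluatorConfig = {
  scoringApproach: 'direct',
  criteria: [
    {
      name: 'instruction-following',
      description: 'Does the response follow all explicit instructions in the query?',
      rubric: [
        { score: 0, description: 'Ignores or contradicts the instructions' },
        { score: 1, description: 'Follows some instructions but misses others' },
        { score: 2, description: 'Follows all instructions' },
      ],
    },
  ],
  metrics: ['recall', 'precision', 'cohens-kappa'],
  useCoT: true, // ask the judge to reason before scoring, per the mitigation list
  nShot: 3,     // number of labeled examples included in the judge prompt
};
```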
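
## Example: Position-Swap for Pairwise Comparison (Sketch)

A minimal sketch of the position-swap mitigation, assuming a hypothetical `judgePair` call that asks the LLM-evaluator which of two responses better answers the query. Judging the pair in both orders and discounting inconsistent verdicts is one simple way to average out position bias.

```typescript
type Verdict = 'A' | 'B' | 'tie';

// Hypothetical judge call: returns which of the two responses (in the order shown)
// the LLM-evaluator prefers for the given query.
declare function judgePair(query: string, first: string, second: string): Promise<Verdict>;

// Judge the pair in both orders; accept a preference only when both passes agree,
// otherwise treat the result as a tie (i.e., position-dependent and unreliable).
async function comparePair(query: string, a: string, b: string): Promise<Verdict> {
  const forward = await judgePair(query, a, b);  // a shown first
  const backward = await judgePair(query, b, a); // b shown first
  const backwardMapped: Verdict =
    backward === 'A' ? 'B' : backward === 'B' ? 'A' : 'tie'; // map back to a/b labels
  return forward === backwardMapped ? forward : 'tie';
}
```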
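
## Example: Agreement Metrics for a Binary Task (Sketch)

To ground the classification metrics, a small sketch that compares LLM-evaluator verdicts against human labels on a binary task (1 = pass, 0 = fail) and reports recall, precision, F1, and Cohen's κ. The function name and the 0/1 label encoding are assumptions made for illustration.

```typescript
// Compare evaluator verdicts against human labels; both arrays hold 0/1 values
// of equal length. Returns the classification metrics listed above.
function binaryAgreement(evaluator: number[], human: number[]) {
  let tp = 0, fp = 0, fn = 0, tn = 0;
  for (let i = 0; i < human.length; i++) {
    if (evaluator[i] === 1 && human[i] === 1) tp++;
    else if (evaluator[i] === 1 && human[i] === 0) fp++;
    else if (evaluator[i] === 0 && human[i] === 1) fn++;
    else tn++;
  }
  const n = tp + fp + fn + tn;
  const recall = tp / (tp + fn);
  const precision = tp / (tp + fp);
  const f1 = (2 * precision * recall) / (precision + recall);
  // Cohen's kappa: observed agreement corrected for agreement expected by chance.
  const observed = (tp + tn) / n;
  const expected =
    ((tp + fp) / n) * ((tp + fn) / n) + ((fn + tn) / n) * ((fp + tn) / n);
  const kappa = (observed - expected) / (1 - expected);
  return { recall, precision, f1, kappa };
}
```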