`skills/advanced-evaluation/references/implementation-patterns.md`
# LLM-as-Judge Implementation Patterns

This reference provides detailed implementation patterns for building production-grade LLM evaluation systems.

## Pattern 1: Structured Evaluation Pipeline

The most reliable evaluation systems follow a structured pipeline that separates concerns:

```
Input Validation → Criteria Loading → Scoring → Bias Mitigation → Output Formatting
```

### Input Validation Layer

Before evaluation begins, validate:

1. **Response presence**: Non-empty response to evaluate
2. **Prompt presence**: Original prompt for context
3. **Criteria validity**: At least one criterion with a name and description
4. **Weight normalization**: Weights sum to 1.0 (or normalize them)

```python
def validate_input(response, prompt, criteria):
    if not response or not response.strip():
        raise ValueError("Response cannot be empty")
    if not prompt or not prompt.strip():
        raise ValueError("Prompt cannot be empty")
    if not criteria or len(criteria) == 0:
        raise ValueError("At least one criterion required")

    # Normalize weights so they sum to 1.0
    total_weight = sum(c.get('weight', 1) for c in criteria)
    for c in criteria:
        c['weight'] = c.get('weight', 1) / total_weight
```

### Criteria Loading Layer

Criteria should be loaded from configuration, not hardcoded:

```python
class CriteriaLoader:
    def __init__(self, rubric_path=None):
        self.rubrics = self._load_rubrics(rubric_path)
        self.default_criteria = []  # fallback when no task-specific rubric exists

    def get_criteria(self, task_type):
        return self.rubrics.get(task_type, self.default_criteria)

    def get_rubric(self, criterion_name):
        return self.rubrics.get(criterion_name, {}).get('levels', [])
```

### Scoring Layer

The scoring layer handles the actual LLM call:

```python
async def score_response(response, prompt, criteria, rubric, model):
    system_prompt = build_system_prompt(criteria, rubric)
    user_prompt = build_user_prompt(response, prompt, criteria)

    result = await generate_text(
        model=model,
        system=system_prompt,
        prompt=user_prompt,
        temperature=0.3  # Lower temperature for consistency
    )

    return parse_scores(result.text)
```

### Bias Mitigation Layer

For pairwise comparison, always include position swapping:

```python
async def compare_with_bias_mitigation(response_a, response_b, prompt, criteria, model):
    # First pass: A first
    pass1 = await compare_pair(response_a, response_b, prompt, criteria, model)

    # Second pass: B first
    pass2 = await compare_pair(response_b, response_a, prompt, criteria, model)

    # Map the pass-2 winner back to the original labels
    pass2_mapped = map_winner(pass2.winner)  # A→B, B→A, TIE→TIE

    # Check consistency across the two orderings
    if pass1.winner == pass2_mapped:
        return {
            'winner': pass1.winner,
            'confidence': (pass1.confidence + pass2.confidence) / 2,
            'consistent': True
        }
    else:
        return {
            'winner': 'TIE',
            'confidence': 0.5,
            'consistent': False
        }
```

## Pattern 2: Hierarchical Evaluation

For complex evaluations, use a hierarchical approach:

```
Quick Screen (cheap model) → Detailed Evaluation (expensive model) → Human Review (edge cases)
```

### Quick Screen Implementation

```python
async def quick_screen(response, prompt, threshold=0.7):
    """Fast, cheap screening for obvious passes/fails."""
    result = await generate_text(
        model='gpt-5.2',  # substitute a cheap screening model here
        prompt=f"Rate from 0 to 1 how adequately this response addresses the prompt:\n\nPrompt: {prompt}\n\nResponse: {response}",
        temperature=0
    )
    score = float(result.text.strip())
    return score, score > threshold
```

### Detailed Evaluation

```python
async def detailed_evaluation(response, prompt, criteria):
    """Full evaluation for borderline or important cases."""
    result = await generate_text(
        model='gpt-5.2',  # substitute a more capable judge model here
        system=DETAILED_EVALUATION_PROMPT,
        prompt=build_detailed_prompt(response, prompt, criteria),
        temperature=0.3
    )
    return parse_detailed_scores(result.text)
```
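Putting the tiers together: the sketch below is one minimal way to wire the screen and the detailed pass, assuming the `quick_screen` and `detailed_evaluation` functions above; `hierarchical_evaluation` and the `review_band` parameter are illustrative names, not part of an established API.

```python
async def hierarchical_evaluation(response, prompt, criteria,
                                  screen_threshold=0.7, review_band=0.1):
    """Route a response through the evaluation tiers.

    Clear passes and fails stop at the cheap screen; only scores
    near the threshold pay for the detailed evaluation.
    """
    score, passed = await quick_screen(response, prompt, threshold=screen_threshold)

    # Far from the threshold: trust the cheap screen and stop here
    if abs(score - screen_threshold) > review_band:
        return {'tier': 'screen', 'score': score, 'passed': passed}

    # Borderline: run the full evaluation. Per the diagram above,
    # genuinely ambiguous detailed results would then be routed on
    # to human review.
    detailed = await detailed_evaluation(response, prompt, criteria)
    return {'tier': 'detailed', 'result': detailed}
```

The routing rule is the design decision worth tuning: a wider `review_band` sends more traffic to the expensive model, while a narrower one trusts the screen more.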
## Pattern 3: Panel of LLM Judges (PoLL)

For high-stakes evaluation, use multiple models:

```python
import asyncio
import statistics

async def poll_evaluation(response, prompt, criteria, models):
    """Aggregate judgments from multiple LLM judges."""
    results = await asyncio.gather(*[
        score_with_model(response, prompt, criteria, model)
        for model in models
    ])

    # Aggregate scores across judges
    aggregated = aggregate_scores(results)

    # Calculate inter-judge agreement
    agreement = calculate_agreement(results)

    return {
        'scores': aggregated,
        'agreement': agreement,
        'individual_results': results
    }

def aggregate_scores(results):
    """Aggregate scores using the median (robust to outliers)."""
    scores = {}
    for criterion in results[0]['scores'].keys():
        criterion_scores = [r['scores'][criterion] for r in results]
        scores[criterion] = {
            'score': statistics.median(criterion_scores),
            'std': statistics.stdev(criterion_scores) if len(criterion_scores) > 1 else 0
        }
    return scores
```
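`poll_evaluation` calls a `calculate_agreement` helper that is not shown above. A minimal sketch, assuming each judge's result exposes numeric per-criterion scores (as `aggregate_scores` already expects) and a known score scale; the `max_score` parameter is an assumption:

```python
import itertools

def calculate_agreement(results, max_score=5):
    """Mean pairwise agreement between judges, normalized to [0, 1].

    1.0 means every judge gave identical scores on every criterion;
    values near 0 mean judges disagree by most of the scale.
    """
    gaps = []
    for a, b in itertools.combinations(results, 2):
        for criterion, score_a in a['scores'].items():
            # Normalize each gap by the scale so criteria are comparable
            gaps.append(1 - abs(score_a - b['scores'][criterion]) / max_score)
    return sum(gaps) / len(gaps) if gaps else 1.0
```

Low agreement is a useful signal in its own right: it can trigger the human-review tier from Pattern 2.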
## Pattern 4: Confidence Calibration

Confidence scores should be calibrated to actual reliability:

```python
def calibrate_confidence(raw_confidence, position_consistent, evidence_count):
    """Calibrate confidence based on multiple signals."""

    # Base confidence from the model output
    calibrated = raw_confidence

    # Position consistency is a strong signal
    if not position_consistent:
        calibrated *= 0.6  # Significant reduction

    # More evidence = higher confidence
    evidence_factor = min(evidence_count / 3, 1.0)  # Cap at 3 pieces
    calibrated *= (0.7 + 0.3 * evidence_factor)

    return min(calibrated, 0.99)  # Never 100% confident
```

## Pattern 5: Output Formatting

Always return structured outputs with consistent schemas:

```python
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class ScoreResult:
    criterion: str
    score: float
    max_score: float
    justification: str
    evidence: List[str]
    improvement: str

@dataclass
class EvaluationResult:
    success: bool
    scores: List[ScoreResult]
    overall_score: float
    weighted_score: float
    summary: Dict[str, Any]
    metadata: Dict[str, Any]

def format_output(scores, metadata) -> EvaluationResult:
    """Format evaluation results consistently."""
    return EvaluationResult(
        success=True,
        scores=scores,
        overall_score=sum(s.score for s in scores) / len(scores),
        weighted_score=calculate_weighted_score(scores),
        summary=generate_summary(scores),
        metadata=metadata
    )
```

## Error Handling Patterns

### Graceful Degradation

```python
async def evaluate_with_fallback(response, prompt, criteria):
    try:
        return await full_evaluation(response, prompt, criteria)
    except RateLimitError:
        # Fall back to a simpler evaluation
        return await simple_evaluation(response, prompt, criteria)
    except ParseError as e:
        # Return partial results with an error flag
        return {
            'success': False,
            'partial_results': e.partial_data,
            'error': str(e)
        }
```

### Retry Logic

```python
async def evaluate_with_retry(response, prompt, criteria, max_retries=3):
    for attempt in range(max_retries):
        try:
            result = await evaluate(response, prompt, criteria)
            if is_valid_result(result):
                return result
        except TransientError:
            await asyncio.sleep(2 ** attempt)  # Exponential backoff

    raise EvaluationError("Max retries exceeded")
```

## Testing Patterns

### Unit Tests for Parsing

```python
import pytest

def test_score_parsing():
    raw_output = '{"scores": [{"criterion": "Accuracy", "score": 4}]}'
    result = parse_scores(raw_output)
    assert result.scores[0].criterion == "Accuracy"
    assert result.scores[0].score == 4

def test_malformed_output():
    raw_output = 'Invalid JSON'
    with pytest.raises(ParseError):
        parse_scores(raw_output)
```

### Integration Tests with Real API

```python
@pytest.mark.integration
async def test_full_evaluation_pipeline():
    result = await evaluate(
        response="Water boils at 100°C at sea level.",
        prompt="At what temperature does water boil?",
        criteria=[{"name": "Accuracy", "description": "Factual correctness", "weight": 1}]
    )

    assert result.success
    assert len(result.scores) == 1
    assert result.scores[0].score >= 4  # Should score high for an accurate response
```

### Bias Detection Tests

```python
async def test_position_bias_mitigation():
    # The same response in both positions should tie
    result = await compare(
        response_a="Same response",
        response_b="Same response",
        prompt="Test prompt",
        criteria=["quality"],
        swap_positions=True
    )

    assert result.winner == "TIE"
    assert result.consistent is True
```
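### Calibration Tests

Since `calibrate_confidence` from Pattern 4 is pure arithmetic, it is cheap to pin down with unit tests. A sketch; the expected values follow directly from the multipliers in that function:

```python
def test_confidence_calibration():
    # Consistent positions and full evidence leave the raw value intact:
    # evidence_factor = min(3 / 3, 1.0) = 1.0, so the multiplier is 0.7 + 0.3 = 1.0
    assert calibrate_confidence(0.9, True, 3) == pytest.approx(0.9)

    # Position inconsistency applies the 0.6 penalty: 0.9 * 0.6 = 0.54
    assert calibrate_confidence(0.9, False, 3) == pytest.approx(0.54)

    # Output is always capped below full certainty
    assert calibrate_confidence(1.5, True, 3) == 0.99
```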