`skills/advanced-evaluation/references/implementation-patterns.md`
# LLM-as-Judge Implementation Patterns

This reference provides detailed implementation patterns for building production-grade LLM evaluation systems.

## Pattern 1: Structured Evaluation Pipeline

The most reliable evaluation systems follow a structured pipeline that separates concerns:

```
Input Validation → Criteria Loading → Scoring → Bias Mitigation → Output Formatting
```

### Input Validation Layer

Before evaluation begins, validate:

1. **Response presence**: Non-empty response to evaluate
2. **Prompt presence**: Original prompt for context
3. **Criteria validity**: At least one criterion with a name and description
4. **Weight normalization**: Weights sum to 1.0 (or normalize them)

```python
def validate_input(response, prompt, criteria):
    if not response or not response.strip():
        raise ValueError("Response cannot be empty")
    if not prompt or not prompt.strip():
        raise ValueError("Prompt cannot be empty")
    if not criteria or len(criteria) == 0:
        raise ValueError("At least one criterion required")

    # Normalize weights so they sum to 1.0
    total_weight = sum(c.get('weight', 1) for c in criteria)
    for c in criteria:
        c['weight'] = c.get('weight', 1) / total_weight
```

### Criteria Loading Layer

Criteria should be loaded from configuration, not hardcoded:

```python
class CriteriaLoader:
    def __init__(self, rubric_path=None):
        self.rubrics = self._load_rubrics(rubric_path)
        self.default_criteria = []  # fallback when no task-specific rubric exists

    def get_criteria(self, task_type):
        return self.rubrics.get(task_type, self.default_criteria)

    def get_rubric(self, criterion_name):
        return self.rubrics.get(criterion_name, {}).get('levels', [])
```

### Scoring Layer

The scoring layer handles the actual LLM call:

```python
async def score_response(response, prompt, criteria, rubric, model):
    system_prompt = build_system_prompt(criteria, rubric)
    user_prompt = build_user_prompt(response, prompt, criteria)

    result = await generate_text(
        model=model,
        system=system_prompt,
        prompt=user_prompt,
        temperature=0.3  # Lower temperature for consistency
    )

    return parse_scores(result.text)
```

### Bias Mitigation Layer

For pairwise comparison, always include position swapping:

```python
async def compare_with_bias_mitigation(response_a, response_b, prompt, criteria, model):
    # First pass: A first
    pass1 = await compare_pair(response_a, response_b, prompt, criteria, model)

    # Second pass: B first
    pass2 = await compare_pair(response_b, response_a, prompt, criteria, model)

    # Map the pass-2 winner back to the original labels
    pass2_mapped = map_winner(pass2.winner)  # A→B, B→A, TIE→TIE

    # Check consistency across the two orderings
    if pass1.winner == pass2_mapped:
        return {
            'winner': pass1.winner,
            'confidence': (pass1.confidence + pass2.confidence) / 2,
            'consistent': True
        }
    else:
        return {
            'winner': 'TIE',
            'confidence': 0.5,
            'consistent': False
        }
```

## Pattern 2: Hierarchical Evaluation

For complex evaluations, use a hierarchical approach:

```
Quick Screen (cheap model) → Detailed Evaluation (expensive model) → Human Review (edge cases)
```

### Quick Screen Implementation

```python
async def quick_screen(response, prompt, threshold=0.7):
    """Fast, cheap screening for obvious passes/fails."""
    result = await generate_text(
        model='gpt-5.2',  # substitute a cheap screening model here
        prompt=f"Rate from 0 to 1 how adequately this response addresses the prompt:\n\nPrompt: {prompt}\n\nResponse: {response}",
        temperature=0
    )
    score = float(result.text.strip())
    return score, score > threshold
```

### Detailed Evaluation

```python
async def detailed_evaluation(response, prompt, criteria):
    """Full evaluation for borderline or important cases."""
    result = await generate_text(
        model='gpt-5.2',  # substitute a more capable judge model here
        system=DETAILED_EVALUATION_PROMPT,
        prompt=build_detailed_prompt(response, prompt, criteria),
        temperature=0.3
    )
    return parse_detailed_scores(result.text)
```
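Putting the tiers together: the sketch below is one minimal way to wire the screen and the detailed pass, assuming the `quick_screen` and `detailed_evaluation` functions above; `hierarchical_evaluation` and the `review_band` parameter are illustrative names, not part of an established API.

```python
async def hierarchical_evaluation(response, prompt, criteria,
                                  screen_threshold=0.7, review_band=0.1):
    """Route a response through the evaluation tiers.

    Clear passes and fails stop at the cheap screen; only scores
    near the threshold pay for the detailed evaluation.
    """
    score, passed = await quick_screen(response, prompt, threshold=screen_threshold)

    # Far from the threshold: trust the cheap screen and stop here
    if abs(score - screen_threshold) > review_band:
        return {'tier': 'screen', 'score': score, 'passed': passed}

    # Borderline: run the full evaluation. Per the diagram above,
    # genuinely ambiguous detailed results would then be routed on
    # to human review.
    detailed = await detailed_evaluation(response, prompt, criteria)
    return {'tier': 'detailed', 'result': detailed}
```

The routing rule is the design decision worth tuning: a wider `review_band` sends more traffic to the expensive model, while a narrower one trusts the screen more.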
## Pattern 3: Panel of LLM Judges (PoLL)

For high-stakes evaluation, use multiple models:

```python
import asyncio
import statistics

async def poll_evaluation(response, prompt, criteria, models):
    """Aggregate judgments from multiple LLM judges."""
    results = await asyncio.gather(*[
        score_with_model(response, prompt, criteria, model)
        for model in models
    ])

    # Aggregate scores across judges
    aggregated = aggregate_scores(results)

    # Calculate inter-judge agreement
    agreement = calculate_agreement(results)

    return {
        'scores': aggregated,
        'agreement': agreement,
        'individual_results': results
    }

def aggregate_scores(results):
    """Aggregate scores using the median (robust to outliers)."""
    scores = {}
    for criterion in results[0]['scores'].keys():
        criterion_scores = [r['scores'][criterion] for r in results]
        scores[criterion] = {
            'score': statistics.median(criterion_scores),
            'std': statistics.stdev(criterion_scores) if len(criterion_scores) > 1 else 0
        }
    return scores
```
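`poll_evaluation` calls a `calculate_agreement` helper that is not shown above. A minimal sketch, assuming each judge's result exposes numeric per-criterion scores (as `aggregate_scores` already expects) and a known score scale; the `max_score` parameter is an assumption:

```python
import itertools

def calculate_agreement(results, max_score=5):
    """Mean pairwise agreement between judges, normalized to [0, 1].

    1.0 means every judge gave identical scores on every criterion;
    values near 0 mean judges disagree by most of the scale.
    """
    gaps = []
    for a, b in itertools.combinations(results, 2):
        for criterion, score_a in a['scores'].items():
            # Normalize each gap by the scale so criteria are comparable
            gaps.append(1 - abs(score_a - b['scores'][criterion]) / max_score)
    return sum(gaps) / len(gaps) if gaps else 1.0
```

Low agreement is a useful signal in its own right: it can trigger the human-review tier from Pattern 2.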
## Pattern 4: Confidence Calibration

Confidence scores should be calibrated to actual reliability:

```python
def calibrate_confidence(raw_confidence, position_consistent, evidence_count):
    """Calibrate confidence based on multiple signals."""

    # Base confidence from the model output
    calibrated = raw_confidence

    # Position consistency is a strong signal
    if not position_consistent:
        calibrated *= 0.6  # Significant reduction

    # More evidence = higher confidence
    evidence_factor = min(evidence_count / 3, 1.0)  # Cap at 3 pieces
    calibrated *= (0.7 + 0.3 * evidence_factor)

    return min(calibrated, 0.99)  # Never 100% confident
```

## Pattern 5: Output Formatting

Always return structured outputs with consistent schemas:

```python
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class ScoreResult:
    criterion: str
    score: float
    max_score: float
    justification: str
    evidence: List[str]
    improvement: str

@dataclass
class EvaluationResult:
    success: bool
    scores: List[ScoreResult]
    overall_score: float
    weighted_score: float
    summary: Dict[str, Any]
    metadata: Dict[str, Any]

def format_output(scores, metadata) -> EvaluationResult:
    """Format evaluation results consistently."""
    return EvaluationResult(
        success=True,
        scores=scores,
        overall_score=sum(s.score for s in scores) / len(scores),
        weighted_score=calculate_weighted_score(scores),
        summary=generate_summary(scores),
        metadata=metadata
    )
```

## Error Handling Patterns

### Graceful Degradation

```python
async def evaluate_with_fallback(response, prompt, criteria):
    try:
        return await full_evaluation(response, prompt, criteria)
    except RateLimitError:
        # Fall back to a simpler evaluation
        return await simple_evaluation(response, prompt, criteria)
    except ParseError as e:
        # Return partial results with an error flag
        return {
            'success': False,
            'partial_results': e.partial_data,
            'error': str(e)
        }
```

### Retry Logic

```python
async def evaluate_with_retry(response, prompt, criteria, max_retries=3):
    for attempt in range(max_retries):
        try:
            result = await evaluate(response, prompt, criteria)
            if is_valid_result(result):
                return result
        except TransientError:
            await asyncio.sleep(2 ** attempt)  # Exponential backoff

    raise EvaluationError("Max retries exceeded")
```

## Testing Patterns

### Unit Tests for Parsing

```python
import pytest

def test_score_parsing():
    raw_output = '{"scores": [{"criterion": "Accuracy", "score": 4}]}'
    result = parse_scores(raw_output)
    assert result.scores[0].criterion == "Accuracy"
    assert result.scores[0].score == 4

def test_malformed_output():
    raw_output = 'Invalid JSON'
    with pytest.raises(ParseError):
        parse_scores(raw_output)
```

### Integration Tests with Real API

```python
@pytest.mark.integration
async def test_full_evaluation_pipeline():
    result = await evaluate(
        response="Water boils at 100°C at sea level.",
        prompt="At what temperature does water boil?",
        criteria=[{"name": "Accuracy", "description": "Factual correctness", "weight": 1}]
    )

    assert result.success
    assert len(result.scores) == 1
    assert result.scores[0].score >= 4  # Should score high for an accurate response
```

### Bias Detection Tests

```python
async def test_position_bias_mitigation():
    # The same response in both positions should tie
    result = await compare(
        response_a="Same response",
        response_b="Same response",
        prompt="Test prompt",
        criteria=["quality"],
        swap_positions=True
    )

    assert result.winner == "TIE"
    assert result.consistent is True
```
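### Calibration Tests

Since `calibrate_confidence` from Pattern 4 is pure arithmetic, it is cheap to pin down with unit tests. A sketch; the expected values follow directly from the multipliers in that function:

```python
def test_confidence_calibration():
    # Consistent positions and full evidence leave the raw value intact:
    # evidence_factor = min(3 / 3, 1.0) = 1.0, so the multiplier is 0.7 + 0.3 = 1.0
    assert calibrate_confidence(0.9, True, 3) == pytest.approx(0.9)

    # Position inconsistency applies the 0.6 penalty: 0.9 * 0.6 = 0.54
    assert calibrate_confidence(0.9, False, 3) == pytest.approx(0.54)

    # Output is always capped below full certainty
    assert calibrate_confidence(1.5, True, 3) == 0.99
```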