# Bias Mitigation Techniques for LLM Evaluation

This reference details specific techniques for mitigating known biases in LLM-as-a-Judge systems.

## Position Bias

### The Problem

In pairwise comparison, LLMs systematically prefer responses in certain positions. Research shows:

- GPT has a mild first-position bias (~55% preference for the first position in ties)
- Claude shows similar patterns
- Smaller models often show stronger bias

### Mitigation: Position Swapping Protocol

```python
async def position_swap_comparison(response_a, response_b, prompt, criteria):
    # Pass 1: Original order
    result_ab = await compare(response_a, response_b, prompt, criteria)

    # Pass 2: Swapped order
    result_ba = await compare(response_b, response_a, prompt, criteria)

    # Map pass-2 labels back to the original responses:
    # pass 2 showed response_b first, so its 'A' means our 'B'
    result_ba_mapped = {
        'winner': {'A': 'B', 'B': 'A', 'TIE': 'TIE'}[result_ba['winner']],
        'confidence': result_ba['confidence']
    }

    # Consistency check
    if result_ab['winner'] == result_ba_mapped['winner']:
        return {
            'winner': result_ab['winner'],
            'confidence': (result_ab['confidence'] + result_ba_mapped['confidence']) / 2,
            'position_consistent': True
        }
    else:
        # Disagreement indicates position bias was a factor
        return {
            'winner': 'TIE',
            'confidence': 0.5,
            'position_consistent': False,
            'bias_detected': True
        }
```

### Alternative: Multiple Shuffles

For higher reliability, use multiple position orderings:

```python
async def multi_shuffle_comparison(response_a, response_b, prompt, criteria, n_shuffles=3):
    results = []
    for i in range(n_shuffles):
        if i % 2 == 0:
            r = await compare(response_a, response_b, prompt, criteria)
        else:
            r = await compare(response_b, response_a, prompt, criteria)
            # Map swapped labels back to the original responses
            r['winner'] = {'A': 'B', 'B': 'A', 'TIE': 'TIE'}[r['winner']]
        results.append(r)

    # Majority vote
    winners = [r['winner'] for r in results]
    final_winner = max(set(winners), key=winners.count)
    agreement = winners.count(final_winner) / len(winners)

    return {
        'winner': final_winner,
        'confidence': agreement,
        'n_shuffles': n_shuffles
    }
```
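Both protocols above call a `compare` coroutine that is not defined in this reference. A minimal sketch of what it might look like, assuming a hypothetical `judge_client` with an async `complete(prompt) -> str` method and a judge that replies in JSON (both are assumptions, not part of this skill's API):

```python
import json

async def compare(response_a, response_b, prompt, criteria):
    """Ask a judge model which response better satisfies the criteria."""
    judge_prompt = f"""Compare the two responses to the prompt below.

Prompt: {prompt}
Criteria: {criteria}

Response A:
{response_a}

Response B:
{response_b}

Reply with JSON only: {{"winner": "A" | "B" | "TIE", "confidence": <0.0-1.0>}}"""

    raw = await judge_client.complete(judge_prompt)  # hypothetical client; swap in your provider's SDK
    result = json.loads(raw)
    # Normalize to the shape the protocols above expect
    return {
        'winner': result.get('winner', 'TIE'),
        'confidence': float(result.get('confidence', 0.5)),
    }
```

Whatever the transport, the returned dict must keep the `winner`/`confidence` shape that both protocols read.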
## Length Bias

### The Problem

LLMs tend to rate longer responses higher, regardless of quality. This manifests as:

- Verbose responses receiving inflated scores
- Concise but complete responses being penalized
- Padding and repetition being rewarded

### Mitigation: Explicit Prompting

Include anti-length-bias instructions in the prompt:

```
CRITICAL EVALUATION GUIDELINES:
- Do NOT prefer responses because they are longer
- Concise, complete answers are as valuable as detailed ones
- Penalize unnecessary verbosity or repetition
- Focus on information density, not word count
```

### Mitigation: Length-Normalized Scoring

```python
def length_normalized_score(score, response_length, target_length=500):
    """Adjust score based on response length."""
    length_ratio = response_length / target_length

    if length_ratio > 2.0:
        # Penalize excessively long responses
        penalty = (length_ratio - 2.0) * 0.1
        return max(score - penalty, 1)
    elif length_ratio < 0.3:
        # Penalize excessively short responses
        penalty = (0.3 - length_ratio) * 0.5
        return max(score - penalty, 1)
    else:
        return score
```

### Mitigation: Separate Length Criterion

Make length a separate, explicit criterion so it's not implicitly rewarded:

```python
criteria = [
    {"name": "Accuracy", "description": "Factual correctness", "weight": 0.4},
    {"name": "Completeness", "description": "Covers key points", "weight": 0.3},
    {"name": "Conciseness", "description": "No unnecessary content", "weight": 0.3},  # Explicit
]
```

## Self-Enhancement Bias

### The Problem

Models rate outputs generated by themselves (or similar models) higher than outputs from different models.

### Mitigation: Cross-Model Evaluation

Use a different model family for evaluation than for generation:

```python
def get_evaluator_model(generator_model):
    """Select an evaluator to avoid self-enhancement bias."""
    if 'gpt' in generator_model.lower():
        return 'claude-4-5-sonnet'
    elif 'claude' in generator_model.lower():
        return 'gpt-5.2'
    else:
        return 'gpt-5.2'  # Default
```

### Mitigation: Blind Evaluation

Remove model attribution from responses before evaluation:

```python
def anonymize_response(response, model_name):
    """Remove model-identifying patterns."""
    patterns = [
        f"As {model_name}",
        "I am an AI",
        "I don't have personal opinions",
        # Add model-specific patterns here
    ]
    anonymized = response
    for pattern in patterns:
        anonymized = anonymized.replace(pattern, "[REDACTED]")
    return anonymized
```

## Verbosity Bias

### The Problem

Detailed explanations receive higher scores even when the extra detail is irrelevant or incorrect.

### Mitigation: Relevance-Weighted Scoring

```python
async def relevance_weighted_evaluation(response, prompt, criteria):
    # First, assess the relevance of each segment
    relevance_scores = await assess_relevance(response, prompt)

    # Weight evaluation by relevance
    segments = split_into_segments(response)
    weighted_scores = []
    for segment, relevance in zip(segments, relevance_scores):
        if relevance > 0.5:  # Only count relevant segments
            score = await evaluate_segment(segment, prompt, criteria)
            weighted_scores.append(score * relevance)

    if not weighted_scores:
        return 0.0  # No relevant segments found
    return sum(weighted_scores) / len(weighted_scores)
```
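`assess_relevance` and `split_into_segments` are left undefined above. A throwaway sketch, using paragraph splitting and lexical overlap as a cheap stand-in for an embedding- or judge-based relevance call (the names and the 0-1 scale are assumptions):

```python
def split_into_segments(response):
    """Split a response into paragraph-level segments."""
    return [s.strip() for s in response.split("\n\n") if s.strip()]

async def assess_relevance(response, prompt):
    """Score each segment's relevance to the prompt on a 0-1 scale.

    Lexical-overlap heuristic only; a production version would more
    likely embed the segments or ask a judge model for the scores.
    """
    prompt_words = set(prompt.lower().split())
    scores = []
    for segment in split_into_segments(response):
        segment_words = set(segment.lower().split())
        overlap = len(segment_words & prompt_words)
        scores.append(min(1.0, overlap / max(len(prompt_words), 1)))
    return scores
```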
### Mitigation: Rubric with Verbosity Penalty

Include explicit verbosity penalties in rubrics:

```python
rubric_levels = [
    {
        "score": 5,
        "description": "Complete and concise. All necessary information, nothing extraneous.",
        "characteristics": ["Every sentence adds value", "No repetition", "Appropriately scoped"]
    },
    {
        "score": 3,
        "description": "Complete but verbose. Contains unnecessary detail or repetition.",
        "characteristics": ["Main points covered", "Some tangents", "Could be more concise"]
    },
    # ... etc
]
```

## Authority Bias

### The Problem

A confident, authoritative tone is rated higher regardless of accuracy.

### Mitigation: Evidence Requirement

Require explicit evidence for claims:

```
For each claim in the response:
1. Identify whether it's a factual claim
2. Note if evidence or sources are provided
3. Score based on verifiability, not confidence

IMPORTANT: Confident claims without evidence should NOT receive higher scores than
hedged claims with evidence.
```

### Mitigation: Fact-Checking Layer

Add a fact-checking step before scoring:

```python
import asyncio

async def fact_checked_evaluation(response, prompt, criteria):
    # Extract claims
    claims = await extract_claims(response)
    if not claims:
        # Nothing to verify; fall back to the base evaluation
        return await evaluate(response, prompt, criteria)

    # Fact-check each claim
    fact_check_results = await asyncio.gather(*[
        verify_claim(claim) for claim in claims
    ])

    # Adjust the score based on fact-check results
    accuracy_factor = sum(r['verified'] for r in fact_check_results) / len(fact_check_results)

    base_score = await evaluate(response, prompt, criteria)
    return base_score * (0.7 + 0.3 * accuracy_factor)  # At least 70% of the base score
```

## Aggregate Bias Detection

Monitor for systematic biases in production:

```python
class BiasMonitor:
    def __init__(self):
        self.evaluations = []

    def record(self, evaluation):
        self.evaluations.append(evaluation)

    def detect_position_bias(self):
        """Detect if the first position wins more often than expected."""
        first_wins = sum(1 for e in self.evaluations if e['first_position_winner'])
        expected = len(self.evaluations) * 0.5
        # Binomial z-score against the null hypothesis of p = 0.5
        z_score = (first_wins - expected) / (expected * 0.5) ** 0.5
        return {'bias_detected': abs(z_score) > 2, 'z_score': z_score}

    def detect_length_bias(self):
        """Detect if longer responses score higher."""
        from scipy.stats import spearmanr
        lengths = [e['response_length'] for e in self.evaluations]
        scores = [e['score'] for e in self.evaluations]
        corr, p_value = spearmanr(lengths, scores)
        return {'bias_detected': corr > 0.3 and p_value < 0.05, 'correlation': corr}
```

## Summary Table

| Bias | Primary Mitigation | Secondary Mitigation | Detection Method |
|------|--------------------|----------------------|------------------|
| Position | Position swapping | Multiple shuffles | Consistency check |
| Length | Explicit prompting | Length normalization | Length-score correlation |
| Self-enhancement | Cross-model evaluation | Anonymization | Model comparison study |
| Verbosity | Relevance weighting | Rubric penalties | Relevance scoring |
| Authority | Evidence requirement | Fact-checking layer | Confidence-accuracy correlation |
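A usage sketch tying the monitor to the pairwise judge above (`eval_pairs` is a hypothetical dataset of prompt/response triples; note that `score` here reuses judge confidence as a stand-in, where a pointwise quality score would be the better fit):

```python
import asyncio

async def run_monitoring(eval_pairs, criteria):
    monitor = BiasMonitor()

    for prompt, response_a, response_b in eval_pairs:
        # Single-pass comparison so position effects stay visible to the monitor
        result = await compare(response_a, response_b, prompt, criteria)
        monitor.record({
            'first_position_winner': result['winner'] == 'A',  # A was shown first
            'response_length': len(response_a),
            'score': result['confidence'],  # stand-in; use a pointwise score if available
        })

    print(monitor.detect_position_bias())
    print(monitor.detect_length_bias())

# asyncio.run(run_monitoring(eval_pairs, criteria))
```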