# Bias Mitigation Techniques for LLM Evaluation

This reference details specific techniques for mitigating known biases in LLM-as-a-Judge systems.

## Position Bias

### The Problem

In pairwise comparison, LLMs systematically prefer responses in certain positions. Research shows:

- GPT has a mild first-position bias (~55% preference for the first position in ties)
- Claude shows similar patterns
- Smaller models often show stronger bias

### Mitigation: Position Swapping Protocol

```python
async def position_swap_comparison(response_a, response_b, prompt, criteria):
    # Pass 1: Original order
    result_ab = await compare(response_a, response_b, prompt, criteria)

    # Pass 2: Swapped order
    result_ba = await compare(response_b, response_a, prompt, criteria)

    # Map pass-2 labels back to the original responses:
    # pass 2 showed response_b first, so its 'A' means our 'B'
    result_ba_mapped = {
        'winner': {'A': 'B', 'B': 'A', 'TIE': 'TIE'}[result_ba['winner']],
        'confidence': result_ba['confidence']
    }

    # Consistency check
    if result_ab['winner'] == result_ba_mapped['winner']:
        return {
            'winner': result_ab['winner'],
            'confidence': (result_ab['confidence'] + result_ba_mapped['confidence']) / 2,
            'position_consistent': True
        }
    else:
        # Disagreement indicates position bias was a factor
        return {
            'winner': 'TIE',
            'confidence': 0.5,
            'position_consistent': False,
            'bias_detected': True
        }
```

### Alternative: Multiple Shuffles

For higher reliability, use multiple position orderings:

```python
async def multi_shuffle_comparison(response_a, response_b, prompt, criteria, n_shuffles=3):
    results = []
    for i in range(n_shuffles):
        if i % 2 == 0:
            r = await compare(response_a, response_b, prompt, criteria)
        else:
            r = await compare(response_b, response_a, prompt, criteria)
            # Map swapped labels back to the original responses
            r['winner'] = {'A': 'B', 'B': 'A', 'TIE': 'TIE'}[r['winner']]
        results.append(r)

    # Majority vote
    winners = [r['winner'] for r in results]
    final_winner = max(set(winners), key=winners.count)
    agreement = winners.count(final_winner) / len(winners)

    return {
        'winner': final_winner,
        'confidence': agreement,
        'n_shuffles': n_shuffles
    }
```
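Both protocols above call a `compare` coroutine that is not defined in this reference. A minimal sketch of what it might look like, assuming a hypothetical `judge_client` with an async `complete(prompt) -> str` method and a judge that replies in JSON (both are assumptions, not part of this skill's API):

```python
import json

async def compare(response_a, response_b, prompt, criteria):
    """Ask a judge model which response better satisfies the criteria."""
    judge_prompt = f"""Compare the two responses to the prompt below.

Prompt: {prompt}
Criteria: {criteria}

Response A:
{response_a}

Response B:
{response_b}

Reply with JSON only: {{"winner": "A" | "B" | "TIE", "confidence": <0.0-1.0>}}"""

    raw = await judge_client.complete(judge_prompt)  # hypothetical client; swap in your provider's SDK
    result = json.loads(raw)
    # Normalize to the shape the protocols above expect
    return {
        'winner': result.get('winner', 'TIE'),
        'confidence': float(result.get('confidence', 0.5)),
    }
```

Whatever the transport, the returned dict must keep the `winner`/`confidence` shape that both protocols read.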
## Length Bias

### The Problem

LLMs tend to rate longer responses higher, regardless of quality. This manifests as:

- Verbose responses receiving inflated scores
- Concise but complete responses being penalized
- Padding and repetition being rewarded

### Mitigation: Explicit Prompting

Include anti-length-bias instructions in the prompt:

```
CRITICAL EVALUATION GUIDELINES:
- Do NOT prefer responses because they are longer
- Concise, complete answers are as valuable as detailed ones
- Penalize unnecessary verbosity or repetition
- Focus on information density, not word count
```

### Mitigation: Length-Normalized Scoring

```python
def length_normalized_score(score, response_length, target_length=500):
    """Adjust score based on response length."""
    length_ratio = response_length / target_length

    if length_ratio > 2.0:
        # Penalize excessively long responses
        penalty = (length_ratio - 2.0) * 0.1
        return max(score - penalty, 1)
    elif length_ratio < 0.3:
        # Penalize excessively short responses
        penalty = (0.3 - length_ratio) * 0.5
        return max(score - penalty, 1)
    else:
        return score
```

### Mitigation: Separate Length Criterion

Make length a separate, explicit criterion so it's not implicitly rewarded:

```python
criteria = [
    {"name": "Accuracy", "description": "Factual correctness", "weight": 0.4},
    {"name": "Completeness", "description": "Covers key points", "weight": 0.3},
    {"name": "Conciseness", "description": "No unnecessary content", "weight": 0.3},  # Explicit
]
```

## Self-Enhancement Bias

### The Problem

Models rate outputs generated by themselves (or similar models) higher than outputs from different models.

### Mitigation: Cross-Model Evaluation

Use a different model family for evaluation than for generation:

```python
def get_evaluator_model(generator_model):
    """Select an evaluator to avoid self-enhancement bias."""
    if 'gpt' in generator_model.lower():
        return 'claude-4-5-sonnet'
    elif 'claude' in generator_model.lower():
        return 'gpt-5.2'
    else:
        return 'gpt-5.2'  # Default
```

### Mitigation: Blind Evaluation

Remove model attribution from responses before evaluation:

```python
def anonymize_response(response, model_name):
    """Remove model-identifying patterns."""
    patterns = [
        f"As {model_name}",
        "I am an AI",
        "I don't have personal opinions",
        # Add model-specific patterns here
    ]
    anonymized = response
    for pattern in patterns:
        anonymized = anonymized.replace(pattern, "[REDACTED]")
    return anonymized
```

## Verbosity Bias

### The Problem

Detailed explanations receive higher scores even when the extra detail is irrelevant or incorrect.

### Mitigation: Relevance-Weighted Scoring

```python
async def relevance_weighted_evaluation(response, prompt, criteria):
    # First, assess the relevance of each segment
    relevance_scores = await assess_relevance(response, prompt)

    # Weight evaluation by relevance
    segments = split_into_segments(response)
    weighted_scores = []
    for segment, relevance in zip(segments, relevance_scores):
        if relevance > 0.5:  # Only count relevant segments
            score = await evaluate_segment(segment, prompt, criteria)
            weighted_scores.append(score * relevance)

    if not weighted_scores:
        return 0.0  # No relevant segments found
    return sum(weighted_scores) / len(weighted_scores)
```
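`assess_relevance` and `split_into_segments` are left undefined above. A throwaway sketch, using paragraph splitting and lexical overlap as a cheap stand-in for an embedding- or judge-based relevance call (the names and the 0-1 scale are assumptions):

```python
def split_into_segments(response):
    """Split a response into paragraph-level segments."""
    return [s.strip() for s in response.split("\n\n") if s.strip()]

async def assess_relevance(response, prompt):
    """Score each segment's relevance to the prompt on a 0-1 scale.

    Lexical-overlap heuristic only; a production version would more
    likely embed the segments or ask a judge model for the scores.
    """
    prompt_words = set(prompt.lower().split())
    scores = []
    for segment in split_into_segments(response):
        segment_words = set(segment.lower().split())
        overlap = len(segment_words & prompt_words)
        scores.append(min(1.0, overlap / max(len(prompt_words), 1)))
    return scores
```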
### Mitigation: Rubric with Verbosity Penalty

Include explicit verbosity penalties in rubrics:

```python
rubric_levels = [
    {
        "score": 5,
        "description": "Complete and concise. All necessary information, nothing extraneous.",
        "characteristics": ["Every sentence adds value", "No repetition", "Appropriately scoped"]
    },
    {
        "score": 3,
        "description": "Complete but verbose. Contains unnecessary detail or repetition.",
        "characteristics": ["Main points covered", "Some tangents", "Could be more concise"]
    },
    # ... etc
]
```

## Authority Bias

### The Problem

A confident, authoritative tone is rated higher regardless of accuracy.

### Mitigation: Evidence Requirement

Require explicit evidence for claims:

```
For each claim in the response:
1. Identify whether it's a factual claim
2. Note if evidence or sources are provided
3. Score based on verifiability, not confidence

IMPORTANT: Confident claims without evidence should NOT receive higher scores than
hedged claims with evidence.
```

### Mitigation: Fact-Checking Layer

Add a fact-checking step before scoring:

```python
import asyncio

async def fact_checked_evaluation(response, prompt, criteria):
    # Extract claims
    claims = await extract_claims(response)
    if not claims:
        # Nothing to verify; fall back to the base evaluation
        return await evaluate(response, prompt, criteria)

    # Fact-check each claim
    fact_check_results = await asyncio.gather(*[
        verify_claim(claim) for claim in claims
    ])

    # Adjust the score based on fact-check results
    accuracy_factor = sum(r['verified'] for r in fact_check_results) / len(fact_check_results)

    base_score = await evaluate(response, prompt, criteria)
    return base_score * (0.7 + 0.3 * accuracy_factor)  # At least 70% of the base score
```

## Aggregate Bias Detection

Monitor for systematic biases in production:

```python
class BiasMonitor:
    def __init__(self):
        self.evaluations = []

    def record(self, evaluation):
        self.evaluations.append(evaluation)

    def detect_position_bias(self):
        """Detect if the first position wins more often than expected."""
        first_wins = sum(1 for e in self.evaluations if e['first_position_winner'])
        expected = len(self.evaluations) * 0.5
        # Binomial z-score against the null hypothesis of p = 0.5
        z_score = (first_wins - expected) / (expected * 0.5) ** 0.5
        return {'bias_detected': abs(z_score) > 2, 'z_score': z_score}

    def detect_length_bias(self):
        """Detect if longer responses score higher."""
        from scipy.stats import spearmanr
        lengths = [e['response_length'] for e in self.evaluations]
        scores = [e['score'] for e in self.evaluations]
        corr, p_value = spearmanr(lengths, scores)
        return {'bias_detected': corr > 0.3 and p_value < 0.05, 'correlation': corr}
```

## Summary Table

| Bias | Primary Mitigation | Secondary Mitigation | Detection Method |
|------|--------------------|----------------------|------------------|
| Position | Position swapping | Multiple shuffles | Consistency check |
| Length | Explicit prompting | Length normalization | Length-score correlation |
| Self-enhancement | Cross-model evaluation | Anonymization | Model comparison study |
| Verbosity | Relevance weighting | Rubric penalties | Relevance scoring |
| Authority | Evidence requirement | Fact-checking layer | Confidence-accuracy correlation |
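A usage sketch tying the monitor to the pairwise judge above (`eval_pairs` is a hypothetical dataset of prompt/response triples; note that `score` here reuses judge confidence as a stand-in, where a pointwise quality score would be the better fit):

```python
import asyncio

async def run_monitoring(eval_pairs, criteria):
    monitor = BiasMonitor()

    for prompt, response_a, response_b in eval_pairs:
        # Single-pass comparison so position effects stay visible to the monitor
        result = await compare(response_a, response_b, prompt, criteria)
        monitor.record({
            'first_position_winner': result['winner'] == 'A',  # A was shown first
            'response_length': len(response_a),
            'score': result['confidence'],  # stand-in; use a pointwise score if available
        })

    print(monitor.detect_position_bias())
    print(monitor.detect_length_bias())

# asyncio.run(run_monitoring(eval_pairs, criteria))
```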