skills/advanced-evaluation/references/metrics-guide.md
# Metric Selection Guide for LLM Evaluation

This reference provides guidance on selecting appropriate metrics for different evaluation scenarios.

## Metric Categories

### Classification Metrics

Use for binary or multi-class evaluation tasks (pass/fail, correct/incorrect).

#### Precision

```
Precision = True Positives / (True Positives + False Positives)
```

**Interpretation**: Of all responses the judge said were good, what fraction were actually good?

**Use when**: False positives are costly (e.g., approving unsafe content)

```python
def precision(predictions, ground_truth):
    true_positives = sum(1 for p, g in zip(predictions, ground_truth) if p == 1 and g == 1)
    predicted_positives = sum(predictions)
    return true_positives / predicted_positives if predicted_positives > 0 else 0
```

#### Recall

```
Recall = True Positives / (True Positives + False Negatives)
```

**Interpretation**: Of all actually good responses, what fraction did the judge identify?

**Use when**: False negatives are costly (e.g., missing good content in filtering)

```python
def recall(predictions, ground_truth):
    true_positives = sum(1 for p, g in zip(predictions, ground_truth) if p == 1 and g == 1)
    actual_positives = sum(ground_truth)
    return true_positives / actual_positives if actual_positives > 0 else 0
```

#### F1 Score

```
F1 = 2 * (Precision * Recall) / (Precision + Recall)
```

**Interpretation**: Harmonic mean of precision and recall

**Use when**: You need a single number balancing both concerns

```python
def f1_score(predictions, ground_truth):
    p = precision(predictions, ground_truth)
    r = recall(predictions, ground_truth)
    return 2 * p * r / (p + r) if (p + r) > 0 else 0
```

### Agreement Metrics

Use for comparing automated evaluation with human judgment.

#### Cohen's Kappa (κ)

```
κ = (Observed Agreement - Expected Agreement) / (1 - Expected Agreement)
```

**Interpretation**: Agreement adjusted for chance
- κ > 0.8: Almost perfect agreement
- κ 0.6-0.8: Substantial agreement
- κ 0.4-0.6: Moderate agreement
- κ < 0.4: Fair to poor agreement

**Use for**: Binary or categorical judgments

```python
def cohens_kappa(judge1, judge2):
    from sklearn.metrics import cohen_kappa_score
    return cohen_kappa_score(judge1, judge2)
```

#### Weighted Kappa

For ordinal scales where disagreement severity matters:

```python
def weighted_kappa(judge1, judge2):
    from sklearn.metrics import cohen_kappa_score
    return cohen_kappa_score(judge1, judge2, weights='quadratic')
```

**Interpretation**: Penalizes large disagreements more than small ones

### Correlation Metrics

Use for ordinal/continuous scores.

#### Spearman's Rank Correlation (ρ)

**Interpretation**: Correlation between rankings, not absolute values
- ρ > 0.9: Very strong correlation
- ρ 0.7-0.9: Strong correlation
- ρ 0.5-0.7: Moderate correlation
- ρ < 0.5: Weak correlation

**Use when**: Order matters more than exact values

```python
def spearmans_rho(scores1, scores2):
    from scipy.stats import spearmanr
    rho, p_value = spearmanr(scores1, scores2)
    return {'rho': rho, 'p_value': p_value}
```

#### Kendall's Tau (τ)

**Interpretation**: Similar to Spearman but based on pairwise concordance

**Use when**: You have many tied values

```python
def kendalls_tau(scores1, scores2):
    from scipy.stats import kendalltau
    tau, p_value = kendalltau(scores1, scores2)
    return {'tau': tau, 'p_value': p_value}
```

#### Pearson Correlation (r)

**Interpretation**: Linear correlation between scores

**Use when**: Exact score values matter, not just order

```python
def pearsons_r(scores1, scores2):
    from scipy.stats import pearsonr
    r, p_value = pearsonr(scores1, scores2)
    return {'r': r, 'p_value': p_value}
```

### Pairwise Comparison Metrics

#### Agreement Rate

```
Agreement = (Matching Decisions) / (Total Comparisons)
```

**Interpretation**: Simple percentage of agreement

```python
def pairwise_agreement(decisions1, decisions2):
    matches = sum(1 for d1, d2 in zip(decisions1, decisions2) if d1 == d2)
    return matches / len(decisions1)
```

#### Position Consistency

```
Consistency = (Consistent across position swaps) / (Total comparisons)
```

**Interpretation**: How often the judge reaches the same decision after the two responses swap positions

```python
def position_consistency(results):
    consistent = sum(1 for r in results if r['position_consistent'])
    return consistent / len(results)
```

## Selection Decision Tree

```
What type of evaluation task?
│
├── Binary classification (pass/fail)
│   └── Use: Precision, Recall, F1, Cohen's κ
│
├── Ordinal scale (1-5 rating)
│   ├── Comparing to human judgments?
│   │   └── Use: Spearman's ρ, Weighted κ
│   └── Comparing two automated judges?
│       └── Use: Kendall's τ, Spearman's ρ
│
├── Pairwise preference
│   └── Use: Agreement rate, Position consistency
│
└── Multi-label classification
    └── Use: Macro-F1, Micro-F1, Per-label metrics
```
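The multi-label branch of the tree points to macro- and micro-averaged F1, which are not implemented above. A minimal sketch, reusing the `f1_score` helper defined earlier and assuming each prediction and ground-truth entry is a 0/1 vector with one slot per label (the function name `macro_micro_f1` is illustrative):

```python
def macro_micro_f1(predictions, ground_truth, num_labels):
    # predictions / ground_truth: lists of per-example 0/1 vectors, one slot per label
    per_label_f1 = []
    tp = fp = fn = 0

    for label in range(num_labels):
        pred_l = [p[label] for p in predictions]
        true_l = [g[label] for g in ground_truth]
        per_label_f1.append(f1_score(pred_l, true_l))

        # Pool counts across labels for micro-averaging
        tp += sum(1 for p, g in zip(pred_l, true_l) if p == 1 and g == 1)
        fp += sum(1 for p, g in zip(pred_l, true_l) if p == 1 and g == 0)
        fn += sum(1 for p, g in zip(pred_l, true_l) if p == 0 and g == 1)

    macro_f1 = sum(per_label_f1) / num_labels
    micro_p = tp / (tp + fp) if (tp + fp) > 0 else 0
    micro_r = tp / (tp + fn) if (tp + fn) > 0 else 0
    micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r) if (micro_p + micro_r) > 0 else 0

    return {'macro_f1': macro_f1, 'micro_f1': micro_f1, 'per_label_f1': per_label_f1}
```

Macro-F1 weights every label equally, so rare labels count as much as frequent ones; micro-F1 pools the counts, so frequent labels dominate. Reporting both gives a fuller picture.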
## Metric Selection by Use Case

### Use Case 1: Validating Automated Evaluation

**Goal**: Ensure automated evaluation correlates with human judgment

**Recommended Metrics**:
1. Primary: Spearman's ρ (for ordinal scales) or Cohen's κ (for categorical)
2. Secondary: Per-criterion agreement
3. Diagnostic: Confusion matrix for systematic errors

```python
def validate_automated_eval(automated_scores, human_scores, criteria):
    # Each score is a dict mapping criterion name -> numeric score
    results = {}

    # Overall correlation, computed on per-example mean scores across criteria
    auto_overall = [sum(s[c] for c in criteria) / len(criteria) for s in automated_scores]
    human_overall = [sum(s[c] for c in criteria) / len(criteria) for s in human_scores]
    results['overall_spearman'] = spearmans_rho(auto_overall, human_overall)

    # Per-criterion agreement
    for criterion in criteria:
        auto_crit = [s[criterion] for s in automated_scores]
        human_crit = [s[criterion] for s in human_scores]
        results[f'{criterion}_spearman'] = spearmans_rho(auto_crit, human_crit)

    return results
```

### Use Case 2: Comparing Two Models

**Goal**: Determine which model produces better outputs

**Recommended Metrics**:
1. Primary: Win rate (from pairwise comparison)
2. Secondary: Position consistency (bias check)
3. Diagnostic: Per-criterion breakdown

```python
async def compare_models(model_a_outputs, model_b_outputs, prompts):
    results = []
    for a, b, p in zip(model_a_outputs, model_b_outputs, prompts):
        comparison = await compare_with_position_swap(a, b, p)
        results.append(comparison)

    return {
        'a_wins': sum(1 for r in results if r['winner'] == 'A'),
        'b_wins': sum(1 for r in results if r['winner'] == 'B'),
        'ties': sum(1 for r in results if r['winner'] == 'TIE'),
        'position_consistency': position_consistency(results)
    }
```
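`compare_with_position_swap` is referenced above but not defined in this guide. A minimal sketch of one way it could work, assuming a hypothetical `judge_preference(prompt, first, second)` coroutine that calls your LLM judge and returns `'FIRST'`, `'SECOND'`, or `'TIE'`:

```python
async def judge_preference(prompt, first, second):
    # Hypothetical judge call: send the prompt and both responses to an LLM judge
    # and parse its verdict into 'FIRST', 'SECOND', or 'TIE'. Stubbed here.
    raise NotImplementedError("call your LLM judge here")


async def compare_with_position_swap(response_a, response_b, prompt):
    # Judge once with A in the first position, then again with positions swapped
    pass_1 = await judge_preference(prompt, response_a, response_b)
    pass_2 = await judge_preference(prompt, response_b, response_a)

    # Map both verdicts back to the underlying models
    verdict_1 = {'FIRST': 'A', 'SECOND': 'B', 'TIE': 'TIE'}[pass_1]
    verdict_2 = {'FIRST': 'B', 'SECOND': 'A', 'TIE': 'TIE'}[pass_2]

    consistent = verdict_1 == verdict_2
    return {
        # If the two passes disagree, treat the comparison as a tie
        'winner': verdict_1 if consistent else 'TIE',
        'position_consistent': consistent,
    }
```

The per-item results feed directly into the `position_consistency` helper defined earlier, which is the bias check recommended for this use case.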
### Use Case 3: Quality Monitoring

**Goal**: Track evaluation quality over time

**Recommended Metrics**:
1. Primary: Rolling agreement with human spot-checks
2. Secondary: Score distribution stability
3. Diagnostic: Bias indicators (position, length)

```python
from collections import deque

import numpy as np


class QualityMonitor:
    def __init__(self, window_size=100):
        self.window = deque(maxlen=window_size)

    def add_evaluation(self, automated, human_spot_check=None):
        self.window.append({
            'automated': automated,
            'human': human_spot_check,
            'length': len(automated['response'])
        })

    def get_metrics(self):
        # Filter to evaluations with human spot-checks
        with_human = [e for e in self.window if e['human'] is not None]

        if len(with_human) < 10:
            return {'insufficient_data': True}

        auto_scores = [e['automated']['score'] for e in with_human]
        human_scores = [e['human']['score'] for e in with_human]

        return {
            'correlation': spearmans_rho(auto_scores, human_scores),
            'mean_difference': np.mean([a - h for a, h in zip(auto_scores, human_scores)]),
            'length_correlation': spearmans_rho(
                [e['length'] for e in self.window],
                [e['automated']['score'] for e in self.window]
            )
        }
```

## Interpreting Metric Results

### Good Evaluation System Indicators

| Metric | Good | Acceptable | Concerning |
|--------|------|------------|------------|
| Spearman's ρ | > 0.8 | 0.6-0.8 | < 0.6 |
| Cohen's κ | > 0.7 | 0.5-0.7 | < 0.5 |
| Position consistency | > 0.9 | 0.8-0.9 | < 0.8 |
| Length correlation | < 0.2 | 0.2-0.4 | > 0.4 |

### Warning Signs

1. **High agreement but low correlation**: May indicate calibration issues
2. **Low position consistency**: Position bias affecting results
3. **High length correlation**: Length bias inflating scores
4. **Per-criterion variance**: Some criteria may be poorly defined

## Reporting Template

```markdown
## Evaluation System Metrics Report

### Human Agreement
- Spearman's ρ: 0.82 (p < 0.001)
- Cohen's κ: 0.74
- Sample size: 500 evaluations

### Bias Indicators
- Position consistency: 91%
- Length-score correlation: 0.12

### Per-Criterion Performance
| Criterion | Spearman's ρ | κ |
|-----------|--------------|---|
| Accuracy | 0.88 | 0.79 |
| Clarity | 0.76 | 0.68 |
| Completeness | 0.81 | 0.72 |

### Recommendations
- All metrics within acceptable ranges
- Monitor "Clarity" criterion - lower agreement may indicate need for rubric refinement
```
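To fill in the Recommendations section consistently, the cutoffs from the indicators table can be applied programmatically. A minimal sketch; the thresholds are the ones listed in this guide and the metric names are illustrative:

```python
# Cutoffs from the "Good Evaluation System Indicators" table above.
# For length correlation, lower values are better.
THRESHOLDS = {
    'spearman_rho':         (0.8, 0.6, True),
    'cohens_kappa':         (0.7, 0.5, True),
    'position_consistency': (0.9, 0.8, True),
    'length_correlation':   (0.2, 0.4, False),
}


def classify_metric(value, good, acceptable, higher_is_better=True):
    # Flip signs so a single "higher is better" comparison covers both cases
    if not higher_is_better:
        value, good, acceptable = -value, -good, -acceptable
    if value > good:
        return 'good'
    if value >= acceptable:
        return 'acceptable'
    return 'concerning'


def classify_metrics(metrics):
    return {name: classify_metric(metrics[name], *THRESHOLDS[name])
            for name in metrics if name in THRESHOLDS}


# Example, matching the sample report above:
# classify_metrics({'spearman_rho': 0.82, 'cohens_kappa': 0.74,
#                   'position_consistency': 0.91, 'length_correlation': 0.12})
# -> every metric classified as 'good'
```

Anything classified as "concerning" warrants a note in the report, alongside the relevant warning sign from the list above.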