skills/advanced-evaluation/references/metrics-guide.md
# Metric Selection Guide for LLM Evaluation

This reference provides guidance on selecting appropriate metrics for different evaluation scenarios.

## Metric Categories

### Classification Metrics

Use for binary or multi-class evaluation tasks (pass/fail, correct/incorrect).

#### Precision

```
Precision = True Positives / (True Positives + False Positives)
```

**Interpretation**: Of all responses the judge said were good, what fraction were actually good?

**Use when**: False positives are costly (e.g., approving unsafe content)

```python
def precision(predictions, ground_truth):
    true_positives = sum(1 for p, g in zip(predictions, ground_truth) if p == 1 and g == 1)
    predicted_positives = sum(predictions)
    return true_positives / predicted_positives if predicted_positives > 0 else 0
```

#### Recall

```
Recall = True Positives / (True Positives + False Negatives)
```

**Interpretation**: Of all actually good responses, what fraction did the judge identify?

**Use when**: False negatives are costly (e.g., missing good content in filtering)

```python
def recall(predictions, ground_truth):
    true_positives = sum(1 for p, g in zip(predictions, ground_truth) if p == 1 and g == 1)
    actual_positives = sum(ground_truth)
    return true_positives / actual_positives if actual_positives > 0 else 0
```

#### F1 Score

```
F1 = 2 * (Precision * Recall) / (Precision + Recall)
```

**Interpretation**: Harmonic mean of precision and recall

**Use when**: You need a single number balancing both concerns

```python
def f1_score(predictions, ground_truth):
    p = precision(predictions, ground_truth)
    r = recall(predictions, ground_truth)
    return 2 * p * r / (p + r) if (p + r) > 0 else 0
```

### Agreement Metrics

Use for comparing automated evaluation with human judgment.

#### Cohen's Kappa (κ)

```
κ = (Observed Agreement - Expected Agreement) / (1 - Expected Agreement)
```

**Interpretation**: Agreement adjusted for chance
- κ > 0.8: Almost perfect agreement
- κ 0.6-0.8: Substantial agreement
- κ 0.4-0.6: Moderate agreement
- κ < 0.4: Fair to poor agreement

**Use for**: Binary or categorical judgments

```python
def cohens_kappa(judge1, judge2):
    from sklearn.metrics import cohen_kappa_score
    return cohen_kappa_score(judge1, judge2)
```

#### Weighted Kappa

For ordinal scales where disagreement severity matters:

```python
def weighted_kappa(judge1, judge2):
    from sklearn.metrics import cohen_kappa_score
    return cohen_kappa_score(judge1, judge2, weights='quadratic')
```

**Interpretation**: Penalizes large disagreements more than small ones

### Correlation Metrics

Use for ordinal/continuous scores.

#### Spearman's Rank Correlation (ρ)

**Interpretation**: Correlation between rankings, not absolute values
- ρ > 0.9: Very strong correlation
- ρ 0.7-0.9: Strong correlation
- ρ 0.5-0.7: Moderate correlation
- ρ < 0.5: Weak correlation

**Use when**: Order matters more than exact values

```python
def spearmans_rho(scores1, scores2):
    from scipy.stats import spearmanr
    rho, p_value = spearmanr(scores1, scores2)
    return {'rho': rho, 'p_value': p_value}
```

#### Kendall's Tau (τ)

**Interpretation**: Similar to Spearman but based on pairwise concordance

**Use when**: You have many tied values

```python
def kendalls_tau(scores1, scores2):
    from scipy.stats import kendalltau
    tau, p_value = kendalltau(scores1, scores2)
    return {'tau': tau, 'p_value': p_value}
```

#### Pearson Correlation (r)

**Interpretation**: Linear correlation between scores

**Use when**: Exact score values matter, not just order

```python
def pearsons_r(scores1, scores2):
    from scipy.stats import pearsonr
    r, p_value = pearsonr(scores1, scores2)
    return {'r': r, 'p_value': p_value}
```

### Pairwise Comparison Metrics

#### Agreement Rate

```
Agreement = (Matching Decisions) / (Total Comparisons)
```

**Interpretation**: Simple percentage of agreement

```python
def pairwise_agreement(decisions1, decisions2):
    matches = sum(1 for d1, d2 in zip(decisions1, decisions2) if d1 == d2)
    return matches / len(decisions1)
```

#### Position Consistency

```
Consistency = (Consistent across position swaps) / (Total comparisons)
```

**Interpretation**: How often the judge reaches the same decision after the two responses swap positions

```python
def position_consistency(results):
    consistent = sum(1 for r in results if r['position_consistent'])
    return consistent / len(results)
```

## Selection Decision Tree

```
What type of evaluation task?
│
├── Binary classification (pass/fail)
│   └── Use: Precision, Recall, F1, Cohen's κ
│
├── Ordinal scale (1-5 rating)
│   ├── Comparing to human judgments?
│   │   └── Use: Spearman's ρ, Weighted κ
│   └── Comparing two automated judges?
│       └── Use: Kendall's τ, Spearman's ρ
│
├── Pairwise preference
│   └── Use: Agreement rate, Position consistency
│
└── Multi-label classification
    └── Use: Macro-F1, Micro-F1, Per-label metrics
```
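The multi-label branch of the tree points to macro- and micro-averaged F1, which are not implemented above. A minimal sketch, reusing the `f1_score` helper defined earlier and assuming each prediction and ground-truth entry is a 0/1 vector with one slot per label (the function name `macro_micro_f1` is illustrative):

```python
def macro_micro_f1(predictions, ground_truth, num_labels):
    # predictions / ground_truth: lists of per-example 0/1 vectors, one slot per label
    per_label_f1 = []
    tp = fp = fn = 0

    for label in range(num_labels):
        pred_l = [p[label] for p in predictions]
        true_l = [g[label] for g in ground_truth]
        per_label_f1.append(f1_score(pred_l, true_l))

        # Pool counts across labels for micro-averaging
        tp += sum(1 for p, g in zip(pred_l, true_l) if p == 1 and g == 1)
        fp += sum(1 for p, g in zip(pred_l, true_l) if p == 1 and g == 0)
        fn += sum(1 for p, g in zip(pred_l, true_l) if p == 0 and g == 1)

    macro_f1 = sum(per_label_f1) / num_labels
    micro_p = tp / (tp + fp) if (tp + fp) > 0 else 0
    micro_r = tp / (tp + fn) if (tp + fn) > 0 else 0
    micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r) if (micro_p + micro_r) > 0 else 0

    return {'macro_f1': macro_f1, 'micro_f1': micro_f1, 'per_label_f1': per_label_f1}
```

Macro-F1 weights every label equally, so rare labels count as much as frequent ones; micro-F1 pools the counts, so frequent labels dominate. Reporting both gives a fuller picture.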
## Metric Selection by Use Case

### Use Case 1: Validating Automated Evaluation

**Goal**: Ensure automated evaluation correlates with human judgment

**Recommended Metrics**:
1. Primary: Spearman's ρ (for ordinal scales) or Cohen's κ (for categorical)
2. Secondary: Per-criterion agreement
3. Diagnostic: Confusion matrix for systematic errors

```python
def validate_automated_eval(automated_scores, human_scores, criteria):
    # Each score is a dict mapping criterion name -> numeric score
    results = {}

    # Overall correlation, computed on per-example mean scores across criteria
    auto_overall = [sum(s[c] for c in criteria) / len(criteria) for s in automated_scores]
    human_overall = [sum(s[c] for c in criteria) / len(criteria) for s in human_scores]
    results['overall_spearman'] = spearmans_rho(auto_overall, human_overall)

    # Per-criterion agreement
    for criterion in criteria:
        auto_crit = [s[criterion] for s in automated_scores]
        human_crit = [s[criterion] for s in human_scores]
        results[f'{criterion}_spearman'] = spearmans_rho(auto_crit, human_crit)

    return results
```

### Use Case 2: Comparing Two Models

**Goal**: Determine which model produces better outputs

**Recommended Metrics**:
1. Primary: Win rate (from pairwise comparison)
2. Secondary: Position consistency (bias check)
3. Diagnostic: Per-criterion breakdown

```python
async def compare_models(model_a_outputs, model_b_outputs, prompts):
    results = []
    for a, b, p in zip(model_a_outputs, model_b_outputs, prompts):
        comparison = await compare_with_position_swap(a, b, p)
        results.append(comparison)

    return {
        'a_wins': sum(1 for r in results if r['winner'] == 'A'),
        'b_wins': sum(1 for r in results if r['winner'] == 'B'),
        'ties': sum(1 for r in results if r['winner'] == 'TIE'),
        'position_consistency': position_consistency(results)
    }
```
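`compare_with_position_swap` is referenced above but not defined in this guide. A minimal sketch of one way it could work, assuming a hypothetical `judge_preference(prompt, first, second)` coroutine that calls your LLM judge and returns `'FIRST'`, `'SECOND'`, or `'TIE'`:

```python
async def judge_preference(prompt, first, second):
    # Hypothetical judge call: send the prompt and both responses to an LLM judge
    # and parse its verdict into 'FIRST', 'SECOND', or 'TIE'. Stubbed here.
    raise NotImplementedError("call your LLM judge here")


async def compare_with_position_swap(response_a, response_b, prompt):
    # Judge once with A in the first position, then again with positions swapped
    pass_1 = await judge_preference(prompt, response_a, response_b)
    pass_2 = await judge_preference(prompt, response_b, response_a)

    # Map both verdicts back to the underlying models
    verdict_1 = {'FIRST': 'A', 'SECOND': 'B', 'TIE': 'TIE'}[pass_1]
    verdict_2 = {'FIRST': 'B', 'SECOND': 'A', 'TIE': 'TIE'}[pass_2]

    consistent = verdict_1 == verdict_2
    return {
        # If the two passes disagree, treat the comparison as a tie
        'winner': verdict_1 if consistent else 'TIE',
        'position_consistent': consistent,
    }
```

The per-item results feed directly into the `position_consistency` helper defined earlier, which is the bias check recommended for this use case.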
### Use Case 3: Quality Monitoring

**Goal**: Track evaluation quality over time

**Recommended Metrics**:
1. Primary: Rolling agreement with human spot-checks
2. Secondary: Score distribution stability
3. Diagnostic: Bias indicators (position, length)

```python
from collections import deque

import numpy as np


class QualityMonitor:
    def __init__(self, window_size=100):
        self.window = deque(maxlen=window_size)

    def add_evaluation(self, automated, human_spot_check=None):
        self.window.append({
            'automated': automated,
            'human': human_spot_check,
            'length': len(automated['response'])
        })

    def get_metrics(self):
        # Filter to evaluations with human spot-checks
        with_human = [e for e in self.window if e['human'] is not None]

        if len(with_human) < 10:
            return {'insufficient_data': True}

        auto_scores = [e['automated']['score'] for e in with_human]
        human_scores = [e['human']['score'] for e in with_human]

        return {
            'correlation': spearmans_rho(auto_scores, human_scores),
            'mean_difference': np.mean([a - h for a, h in zip(auto_scores, human_scores)]),
            'length_correlation': spearmans_rho(
                [e['length'] for e in self.window],
                [e['automated']['score'] for e in self.window]
            )
        }
```

## Interpreting Metric Results

### Good Evaluation System Indicators

| Metric | Good | Acceptable | Concerning |
|--------|------|------------|------------|
| Spearman's ρ | > 0.8 | 0.6-0.8 | < 0.6 |
| Cohen's κ | > 0.7 | 0.5-0.7 | < 0.5 |
| Position consistency | > 0.9 | 0.8-0.9 | < 0.8 |
| Length correlation | < 0.2 | 0.2-0.4 | > 0.4 |

### Warning Signs

1. **High agreement but low correlation**: May indicate calibration issues
2. **Low position consistency**: Position bias affecting results
3. **High length correlation**: Length bias inflating scores
4. **Per-criterion variance**: Some criteria may be poorly defined

## Reporting Template

```markdown
## Evaluation System Metrics Report

### Human Agreement
- Spearman's ρ: 0.82 (p < 0.001)
- Cohen's κ: 0.74
- Sample size: 500 evaluations

### Bias Indicators
- Position consistency: 91%
- Length-score correlation: 0.12

### Per-Criterion Performance
| Criterion | Spearman's ρ | κ |
|-----------|--------------|---|
| Accuracy | 0.88 | 0.79 |
| Clarity | 0.76 | 0.68 |
| Completeness | 0.81 | 0.72 |

### Recommendations
- All metrics within acceptable ranges
- Monitor "Clarity" criterion - lower agreement may indicate need for rubric refinement
```
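To fill in the Recommendations section consistently, the cutoffs from the indicators table can be applied programmatically. A minimal sketch; the thresholds are the ones listed in this guide and the metric names are illustrative:

```python
# Cutoffs from the "Good Evaluation System Indicators" table above.
# For length correlation, lower values are better.
THRESHOLDS = {
    'spearman_rho':         (0.8, 0.6, True),
    'cohens_kappa':         (0.7, 0.5, True),
    'position_consistency': (0.9, 0.8, True),
    'length_correlation':   (0.2, 0.4, False),
}


def classify_metric(value, good, acceptable, higher_is_better=True):
    # Flip signs so a single "higher is better" comparison covers both cases
    if not higher_is_better:
        value, good, acceptable = -value, -good, -acceptable
    if value > good:
        return 'good'
    if value >= acceptable:
        return 'acceptable'
    return 'concerning'


def classify_metrics(metrics):
    return {name: classify_metric(metrics[name], *THRESHOLDS[name])
            for name in metrics if name in THRESHOLDS}


# Example, matching the sample report above:
# classify_metrics({'spearman_rho': 0.82, 'cohens_kappa': 0.74,
#                   'position_consistency': 0.91, 'length_correlation': 0.12})
# -> every metric classified as 'good'
```

Anything classified as "concerning" warrants a note in the report, alongside the relevant warning sign from the list above.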