A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems.
skills/evaluation/references/metrics.md
# Evaluation Reference: Metrics and Implementation

This document provides implementation details for evaluation metrics and evaluation systems.

## Core Metric Definitions

### Factual Accuracy

Factual accuracy measures whether claims in agent output match ground truth.

```
Excellent (1.0): All claims verified against ground truth, no errors
Good (0.8): Minor errors that do not affect main conclusions
Acceptable (0.6): Major claims correct, minor inaccuracies present
Poor (0.3): Significant factual errors in key claims
Failed (0.0): Fundamental factual errors that invalidate output
```

Calculation approach:
- Extract claims from output
- Verify each claim against ground truth
- Weight claims by importance (major claims carry more weight)
- Calculate the weighted average of claim accuracy

### Completeness

Completeness measures whether output covers all requested aspects.

```
Excellent (1.0): All requested aspects thoroughly covered
Good (0.8): Most aspects covered with minor gaps
Acceptable (0.6): Key aspects covered, some gaps
Poor (0.3): Major aspects missing from output
Failed (0.0): Fundamental aspects not addressed
```

### Citation Accuracy

Citation accuracy measures whether cited sources match claimed sources.

```
Excellent (1.0): All citations accurate and complete
Good (0.8): Minor citation formatting issues
Acceptable (0.6): Major citations accurate
Poor (0.3): Significant citation problems
Failed (0.0): Citations missing or completely incorrect
```

### Source Quality

Source quality measures whether appropriate primary sources were used.

```
Excellent (1.0): Primary authoritative sources
Good (0.8): Mostly primary sources with some secondary
Acceptable (0.6): Mix of primary and secondary sources
Poor (0.3): Mostly secondary or unreliable sources
Failed (0.0): No credible sources cited
```

### Tool Efficiency

Tool efficiency measures whether the agent used appropriate tools a reasonable number of times.

```
Excellent (1.0): Optimal tool selection and call count
Good (0.8): Good tool selection with minor inefficiencies
Acceptable (0.6): Appropriate tools with some redundancy
Poor (0.3): Wrong tools or excessive call counts
Failed (0.0): Severe tool misuse or extremely excessive calls
```

## Rubric Implementation

```python
EVALUATION_DIMENSIONS = {
    "factual_accuracy": {
        "weight": 0.30,
        "description": "Claims match ground truth",
        "levels": {
            "excellent": 1.0,
            "good": 0.8,
            "acceptable": 0.6,
            "poor": 0.3,
            "failed": 0.0
        }
    },
    "completeness": {
        "weight": 0.25,
        "description": "All requested aspects covered",
        "levels": {
            "excellent": 1.0,
            "good": 0.8,
            "acceptable": 0.6,
            "poor": 0.3,
            "failed": 0.0
        }
    },
    "citation_accuracy": {
        "weight": 0.15,
        "description": "Citations match sources",
        "levels": {
            "excellent": 1.0,
            "good": 0.8,
            "acceptable": 0.6,
            "poor": 0.3,
            "failed": 0.0
        }
    },
    "source_quality": {
        "weight": 0.10,
        "description": "Appropriate primary sources used",
        "levels": {
            "excellent": 1.0,
            "good": 0.8,
            "acceptable": 0.6,
            "poor": 0.3,
            "failed": 0.0
        }
    },
    "tool_efficiency": {
        "weight": 0.20,
        "description": "Right tools used reasonably",
        "levels": {
            "excellent": 1.0,
            "good": 0.8,
            "acceptable": 0.6,
            "poor": 0.3,
            "failed": 0.0
        }
    }
}

def calculate_overall_score(dimension_scores, rubric):
    """Calculate weighted overall score from dimension scores."""
    total_weight = 0
    weighted_sum = 0

    for dimension, score in dimension_scores.items():
        if dimension in rubric:
            weight = rubric[dimension]["weight"]
            weighted_sum += score * weight
            total_weight += weight

    return weighted_sum / total_weight if total_weight > 0 else 0
```
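As a quick illustration of how the weights combine, here is a small usage sketch of `calculate_overall_score` with hypothetical dimension scores (the numbers are illustrative, not taken from any real evaluation run):

```python
# Hypothetical per-dimension scores for one test case (illustrative only).
example_scores = {
    "factual_accuracy": 1.0,
    "completeness": 0.8,
    "citation_accuracy": 0.6,
    "source_quality": 0.8,
    "tool_efficiency": 1.0,
}

# 1.0*0.30 + 0.8*0.25 + 0.6*0.15 + 0.8*0.10 + 1.0*0.20 = 0.87
overall = calculate_overall_score(example_scores, EVALUATION_DIMENSIONS)
print(round(overall, 2))  # 0.87
```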
## Test Set Management

```python
class TestSet:
    def __init__(self, name):
        self.name = name
        self.tests = []
        self.tags = {}

    def add_test(self, test_case):
        """Add test case to test set."""
        self.tests.append(test_case)

        # Index by tags
        for tag in test_case.get("tags", []):
            if tag not in self.tags:
                self.tags[tag] = []
            self.tags[tag].append(len(self.tests) - 1)

    def filter(self, **criteria):
        """Filter tests by criteria."""
        filtered = []
        for test in self.tests:
            match = True
            for key, value in criteria.items():
                if test.get(key) != value:
                    match = False
                    break
            if match:
                filtered.append(test)
        return filtered

    def get_complexity_distribution(self):
        """Get distribution of tests by complexity."""
        distribution = {}
        for test in self.tests:
            complexity = test.get("complexity", "medium")
            distribution[complexity] = distribution.get(complexity, 0) + 1
        return distribution
```

## Evaluation Runner

```python
class EvaluationRunner:
    def __init__(self, test_set, rubric, agent):
        self.test_set = test_set
        self.rubric = rubric
        self.agent = agent
        self.results = []

    def run_all(self, verbose=False):
        """Run evaluation on all tests."""
        self.results = []

        for i, test in enumerate(self.test_set.tests):
            if verbose:
                print(f"Running test {i+1}/{len(self.test_set.tests)}")

            result = self.run_test(test)
            self.results.append(result)

        return self.summarize()

    def run_test(self, test):
        """Run single evaluation test."""
        # Get agent output
        output = self.agent.run(test["input"])

        # Evaluate
        evaluation = self.evaluate_output(output, test)

        return {
            "test": test,
            "output": output,
            "evaluation": evaluation
        }

    def evaluate_output(self, output, test):
        """Evaluate agent output against test."""
        ground_truth = test.get("expected", {})

        dimension_scores = {}
        for dimension, config in self.rubric.items():
            score = self.evaluate_dimension(
                output, ground_truth, dimension, config
            )
            dimension_scores[dimension] = score

        overall = calculate_overall_score(dimension_scores, self.rubric)

        return {
            "overall_score": overall,
            "dimension_scores": dimension_scores,
            "passed": overall >= 0.7
        }

    def summarize(self):
        """Summarize evaluation results."""
        if not self.results:
            return {"error": "No results"}

        passed = sum(1 for r in self.results if r["evaluation"]["passed"])

        dimension_totals = {}
        for dimension in self.rubric.keys():
            dimension_totals[dimension] = {
                "total": 0,
                "count": 0
            }

        for result in self.results:
            for dimension, score in result["evaluation"]["dimension_scores"].items():
                if dimension in dimension_totals:
                    dimension_totals[dimension]["total"] += score
                    dimension_totals[dimension]["count"] += 1

        dimension_averages = {}
        for dimension, data in dimension_totals.items():
            if data["count"] > 0:
                dimension_averages[dimension] = data["total"] / data["count"]

        return {
            "total_tests": len(self.results),
            "passed": passed,
            "failed": len(self.results) - passed,
            "pass_rate": passed / len(self.results) if self.results else 0,
            "dimension_averages": dimension_averages,
            "failures": [
                r for r in self.results
                if not r["evaluation"]["passed"]
            ]
        }
```
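`EvaluationRunner` delegates per-dimension scoring to an `evaluate_dimension` method that this reference does not define. A minimal sketch of one way to supply it, assuming each test's `expected` dict maps a dimension name to a target level such as `"good"` (a production setup would more likely use an LLM judge or dimension-specific checks):

```python
class SimpleEvaluationRunner(EvaluationRunner):
    """Illustrative runner; the scoring rule below is an assumption,
    not part of the original reference."""

    def evaluate_dimension(self, output, ground_truth, dimension, config):
        # Look up the expected level for this dimension in the test's
        # ground truth and convert it to a numeric score via the rubric.
        expected_level = ground_truth.get(dimension, "acceptable")
        return config["levels"].get(expected_level, 0.0)
```

With that hook in place, a run looks like `SimpleEvaluationRunner(test_set, EVALUATION_DIMENSIONS, agent).run_all(verbose=True)`, which returns the summary dict produced by `summarize`.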
## Production Monitoring

```python
import random

class ProductionMonitor:
    def __init__(self, sample_rate=0.01):
        self.sample_rate = sample_rate
        self.samples = []
        self.alert_thresholds = {
            "pass_rate_warning": 0.85,
            "pass_rate_critical": 0.70
        }

    def sample_and_evaluate(self, query, output):
        """Sample production interaction for evaluation."""
        if random.random() > self.sample_rate:
            return None

        evaluation = evaluate_output(output, {}, EVALUATION_RUBRIC)

        sample = {
            "query": query[:200],
            "output_preview": output[:200],
            "score": evaluation["overall_score"],
            "passed": evaluation["passed"],
            "timestamp": current_timestamp()
        }

        self.samples.append(sample)
        return sample

    def get_metrics(self):
        """Calculate current metrics from samples."""
        if not self.samples:
            return {"status": "insufficient_data"}

        passed = sum(1 for s in self.samples if s["passed"])
        pass_rate = passed / len(self.samples)

        avg_score = sum(s["score"] for s in self.samples) / len(self.samples)

        return {
            "sample_count": len(self.samples),
            "pass_rate": pass_rate,
            "average_score": avg_score,
            "status": self._get_status(pass_rate)
        }

    def _get_status(self, pass_rate):
        """Get status based on pass rate."""
        if pass_rate < self.alert_thresholds["pass_rate_critical"]:
            return "critical"
        elif pass_rate < self.alert_thresholds["pass_rate_warning"]:
            return "warning"
        else:
            return "healthy"
```
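`sample_and_evaluate` relies on a module-level `evaluate_output`, an `EVALUATION_RUBRIC`, and a `current_timestamp` helper that are not defined in this reference. A minimal wiring sketch that stubs those dependencies to show the monitoring flow end to end (the stub scorer is an assumption, not the real evaluator):

```python
import time

# Assumed stand-ins for the dependencies left undefined above.
EVALUATION_RUBRIC = EVALUATION_DIMENSIONS

def current_timestamp():
    return time.time()

def evaluate_output(output, ground_truth, rubric):
    # Placeholder scorer: fixed "good" scores per dimension. A real
    # deployment would call an LLM judge or heuristic checks here.
    dimension_scores = {dim: 0.8 for dim in rubric}
    overall = calculate_overall_score(dimension_scores, rubric)
    return {
        "overall_score": overall,
        "dimension_scores": dimension_scores,
        "passed": overall >= 0.7,
    }

monitor = ProductionMonitor(sample_rate=1.0)  # sample everything for the demo
monitor.sample_and_evaluate(
    "Summarize the Q3 report",
    "The Q3 report shows revenue growth driven by the new product line.",
)
print(monitor.get_metrics())  # {'sample_count': 1, 'pass_rate': 1.0, 'average_score': 0.8, 'status': 'healthy'}
```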