A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems.
skills/context-compression/references/evaluation-framework.md
# Context Compression Evaluation Framework

This document provides the complete evaluation framework for measuring context compression quality, including probe types, scoring rubrics, and LLM judge configuration.

## Probe Types

### Recall Probes

Test factual retention of specific details from conversation history.

**Structure:**
```
Question: [Ask for specific fact from truncated history]
Expected: [Exact detail that should be preserved]
Scoring: Match accuracy of technical details
```

**Examples:**
- "What was the original error message that started this debugging session?"
- "What version of the dependency did we decide to use?"
- "What was the exact command that failed?"

### Artifact Probes

Test file tracking and modification awareness.

**Structure:**
```
Question: [Ask about files created, modified, or examined]
Expected: [Complete list with change descriptions]
Scoring: Completeness of file list and accuracy of change descriptions
```

**Examples:**
- "Which files have we modified? Describe what changed in each."
- "What new files did we create in this session?"
- "Which configuration files did we examine but not change?"

### Continuation Probes

Test ability to continue work without re-fetching context.

**Structure:**
```
Question: [Ask about next steps or current state]
Expected: [Actionable next steps based on session history]
Scoring: Ability to continue without requesting re-read of files
```

**Examples:**
- "What should we do next?"
- "What tests are still failing and why?"
- "What was left incomplete from our last step?"

### Decision Probes

Test retention of reasoning chains and decision rationale.

**Structure:**
```
Question: [Ask about why a decision was made]
Expected: [Reasoning that led to the decision]
Scoring: Preservation of decision context and alternatives considered
```

**Examples:**
- "We discussed options for the Redis issue. What did we decide and why?"
- "Why did we choose connection pooling over per-request connections?"
- "What alternatives did we consider for the authentication fix?"

## Scoring Rubrics

### Accuracy Dimension

| Criterion | Question | Score 0 | Score 3 | Score 5 |
|-----------|----------|---------|---------|---------|
| accuracy_factual | Are facts, file paths, and technical details correct? | Completely incorrect or fabricated | Mostly accurate with minor errors | Perfectly accurate |
| accuracy_technical | Are code references and technical concepts correct? | Major technical errors | Generally correct with minor issues | Technically precise |

### Context Awareness Dimension

| Criterion | Question | Score 0 | Score 3 | Score 5 |
|-----------|----------|---------|---------|---------|
| context_conversation_state | Does the response reflect current conversation state? | No awareness of prior context | General awareness with gaps | Full awareness of conversation history |
| context_artifact_state | Does the response reflect which files/artifacts were accessed? | No awareness of artifacts | Partial artifact awareness | Complete artifact state awareness |

### Artifact Trail Dimension

| Criterion | Question | Score 0 | Score 3 | Score 5 |
|-----------|----------|---------|---------|---------|
| artifact_files_created | Does the agent know which files were created? | No knowledge | Knows most files | Perfect knowledge |
| artifact_files_modified | Does the agent know which files were modified and what changed? | No knowledge | Good knowledge of most modifications | Perfect knowledge of all modifications |
| artifact_key_details | Does the agent remember function names, variable names, error messages? | No recall | Recalls most key details | Perfect recall |

### Completeness Dimension

| Criterion | Question | Score 0 | Score 3 | Score 5 |
|-----------|----------|---------|---------|---------|
| completeness_coverage | Does the response address all parts of the question? | Ignores most parts | Addresses most parts | Addresses all parts thoroughly |
| completeness_depth | Is sufficient detail provided? | Superficial or missing detail | Adequate detail | Comprehensive detail |

### Continuity Dimension

| Criterion | Question | Score 0 | Score 3 | Score 5 |
|-----------|----------|---------|---------|---------|
| continuity_work_state | Can the agent continue without re-fetching previously accessed information? | Cannot continue without re-fetching all context | Can continue with minimal re-fetching | Can continue seamlessly |
| continuity_todo_state | Does the agent maintain awareness of pending tasks? | Lost track of all TODOs | Good awareness with some gaps | Perfect task awareness |
| continuity_reasoning | Does the agent retain rationale behind previous decisions? | No memory of reasoning | Generally remembers reasoning | Excellent retention |

### Instruction Following Dimension

| Criterion | Question | Score 0 | Score 3 | Score 5 |
|-----------|----------|---------|---------|---------|
| instruction_format | Does the response follow the requested format? | Ignores format | Generally follows format | Perfectly follows format |
| instruction_constraints | Does the response respect stated constraints? | Ignores constraints | Mostly respects constraints | Fully respects all constraints |

## LLM Judge Configuration

### System Prompt

```
You are an expert evaluator assessing AI assistant responses in software development conversations.

Your task is to grade responses against specific rubric criteria. For each criterion:
1. Read the criterion question carefully
2. Examine the response for evidence
3. Assign a score from 0-5 based on the scoring guide
4. Provide brief reasoning for your score

Be objective and consistent. Focus on what is present in the response, not what could have been included.
```

### Judge Input Format

```json
{
  "probe_question": "What was the original error message?",
  "model_response": "[Response to evaluate]",
  "compacted_context": "[The compressed context that was provided]",
  "ground_truth": "[Optional: known correct answer]",
  "rubric_criteria": ["accuracy_factual", "accuracy_technical", "context_conversation_state"]
}
```

### Judge Output Format

```json
{
  "criterionResults": [
    {
      "criterionId": "accuracy_factual",
      "score": 5,
      "reasoning": "Response correctly identifies the 401 error, specific endpoint, and root cause."
    }
  ],
  "aggregateScore": 4.8,
  "dimensionScores": {
    "accuracy": 4.9,
    "context_awareness": 4.5,
    "artifact_trail": 3.2,
    "completeness": 5.0,
    "continuity": 4.8,
    "instruction_following": 5.0
  }
}
```

## Benchmark Results Reference

Performance across compression methods (based on 36,000+ messages):

| Method | Overall | Accuracy | Context | Artifact | Complete | Continuity | Instruction |
|--------|---------|----------|---------|----------|----------|------------|-------------|
| Anchored Iterative | 3.70 | 4.04 | 4.01 | 2.45 | 4.44 | 3.80 | 4.99 |
| Regenerative | 3.44 | 3.74 | 3.56 | 2.33 | 4.37 | 3.67 | 4.95 |
| Opaque | 3.35 | 3.43 | 3.64 | 2.19 | 4.37 | 3.77 | 4.92 |

**Key Findings:**

1. **Accuracy gap**: 0.61 points between best and worst methods
2. **Context awareness gap**: 0.45 points, favoring anchored iterative
3. **Artifact trail**: Universally weak (2.19-2.45), needs specialized handling
4. **Completeness and instruction following**: Minimal differentiation

## Statistical Considerations

- Differences of 0.26-0.35 points are consistent across task types and session lengths
- Pattern holds for both short and long sessions
- Pattern holds across debugging, feature implementation, and code review tasks
- Sample size: 36,611 messages across hundreds of compression points

## Implementation Notes

### Probe Generation

Generate probes at each compression point based on truncated history:

1. Extract factual claims for recall probes
2. Extract file operations for artifact probes
3. Extract incomplete tasks for continuation probes
4. Extract decision points for decision probes

### Grading Process

1. Feed probe question + model response + compressed context to judge
2. Evaluate against each criterion in rubric
3. Output structured JSON with scores and reasoning
4. Compute dimension scores as weighted averages
5. Compute overall score as unweighted average of dimensions

### Blinding

The judge should not know which compression method produced the response being evaluated. This prevents bias toward known methods.
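The two aggregation steps of the grading process (dimension scores, then the overall score) can be sketched as follows. This is a minimal sketch that assumes equal criterion weights within each dimension; the `DIMENSION_PREFIXES` mapping and function name are illustrative, not part of the framework.

```python
from statistics import mean

# Map each criterion-id prefix to its rubric dimension
# (e.g. "accuracy_factual" -> "accuracy").
DIMENSION_PREFIXES = {
    "accuracy": "accuracy",
    "context": "context_awareness",
    "artifact": "artifact_trail",
    "completeness": "completeness",
    "continuity": "continuity",
    "instruction": "instruction_following",
}

def aggregate_scores(criterion_results):
    """Compute per-dimension scores and the overall score.

    criterion_results: list of {"criterionId": str, "score": int} dicts,
    as emitted by the judge. Criteria are equally weighted here; the
    framework allows weighted averages per dimension.
    """
    by_dimension = {}
    for result in criterion_results:
        prefix = result["criterionId"].split("_")[0]
        dimension = DIMENSION_PREFIXES[prefix]
        by_dimension.setdefault(dimension, []).append(result["score"])

    dimension_scores = {d: mean(s) for d, s in by_dimension.items()}
    # Overall score is the unweighted mean of the dimension scores.
    return {
        "aggregateScore": mean(dimension_scores.values()),
        "dimensionScores": dimension_scores,
    }

result = aggregate_scores([
    {"criterionId": "accuracy_factual", "score": 5},
    {"criterionId": "accuracy_technical", "score": 4},
    {"criterionId": "artifact_files_created", "score": 2},
])
# dimensionScores: accuracy 4.5, artifact_trail 2.0; aggregateScore 3.25
```

Note that because the overall score averages dimensions rather than criteria, a dimension with many criteria does not dominate one with few.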
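The blinding requirement can be enforced mechanically by stripping method labels before responses reach the judge and restoring them only after all grades are collected. A sketch, where the function name and anonymous-id scheme are assumptions:

```python
import random

def blind_responses(labeled_responses, seed=None):
    """Anonymize responses before judging.

    labeled_responses: dict mapping compression-method name -> response text.
    Returns (blinded, key): blinded is a list of {"id", "response"} entries
    in random order, so the judge never sees method names or a fixed
    ordering; key maps each anonymous id back to its method for unblinding
    after grading.
    """
    rng = random.Random(seed)
    items = list(labeled_responses.items())
    rng.shuffle(items)
    blinded, key = [], {}
    for i, (method, response) in enumerate(items):
        anon_id = f"response_{i}"
        blinded.append({"id": anon_id, "response": response})
        key[anon_id] = method
    return blinded, key
```

Shuffling matters as well as relabeling: a fixed presentation order would leak the method identity across compression points even with labels removed.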