---
name: context-compression-evaluation
description: Evaluation framework for measuring how much context different compression strategies preserve in AI agents, comparing structured summarization with alternatives from OpenAI and Anthropic.
doc_type: research
source_url: No
---

# Evaluating Context Compression for AI Agents

By Factory Research - December 16, 2025 - 10 minute read

We built an evaluation framework to measure how much context different compression strategies preserve. After testing three approaches on real-world, long-running agent sessions spanning debugging, code review, and feature implementation, we found that structured summarization retains more useful information than alternatives from OpenAI and Anthropic.

## Table of Contents

1. The problem
2. Measuring context quality
3. Three approaches to compression
4. A concrete example
5. How the LLM judge works
6. Results
7. What we learned
8. Methodology details
9. Appendix: LLM Judge Prompts and Rubrics

*Tasteful abstract illustration evocative of memory and blurriness*

When an AI agent helps you work through a complex task across hundreds of messages, what happens when it runs out of memory? The answer determines whether your agent continues productively or starts asking "wait, what were we trying to do again?"

We built an evaluation framework to measure how much context different compression strategies preserve.
After testing three approaches on real-world, long-running agent sessions (debugging, PR review, feature implementation, CI troubleshooting, data science, ML research), we found that structured summarization retains more useful information than alternative methods from OpenAI and Anthropic, without sacrificing compression efficiency.

*Bar chart comparing quality scores by dimension across Factory, OpenAI, and Anthropic*

This post walks through the problem, our methodology, concrete examples of how different approaches perform, and what the results mean for building reliable AI agents.

## The problem

Long-running agent sessions can generate millions of tokens of conversation history. That far exceeds what any model can hold in working memory.

The naive solution is aggressive compression: squeeze everything into the smallest possible summary. But this increases the chance your agent forgets which files it modified or what approach it already tried. It is likely to waste tokens re-reading files and re-exploring dead ends.

The right optimization target is not tokens per request. It is tokens per task.

## Measuring context quality

Traditional metrics like ROUGE or embedding similarity do not tell you whether an agent can continue working effectively after compression. A summary might score high on lexical overlap while missing the one file path the agent needs to continue.

We designed a probe-based evaluation that directly measures functional quality. The idea is simple: after compression, ask the agent questions that require remembering specific details from the truncated history. If the compression preserved the right information, the agent answers correctly. If not, it guesses or hallucinates.

We use four probe types:

| Probe type | What it tests | Example question |
| --- | --- | --- |
| Recall | Factual retention | "What was the original error message?" |
| Artifact | File tracking | "Which files have we modified? Describe what changed in each." |
| Continuation | Task planning | "What should we do next?" |
| Decision | Reasoning chain | "We discussed options for the Redis issue. What did we decide?" |

Recall probes test whether specific facts survive compression. Artifact probes test whether the agent knows what files it touched. Continuation probes test whether the agent can pick up where it left off. Decision probes test whether the reasoning behind past choices is preserved.

We grade responses using an LLM judge (GPT-5.2) across six dimensions:

| Dimension | What it measures |
| --- | --- |
| Accuracy | Are technical details correct? File paths, function names, errors |
| Context awareness | Does the response reflect current conversation state? |
| Artifact trail | Does the agent know which files were read or modified? |
| Completeness | Does the response address all parts of the question? |
| Continuity | Can work continue without re-fetching information? |
| Instruction following | Does the response follow the probe format? |

Each dimension is scored 0-5 using detailed rubrics. The rubrics specify what constitutes a 0 ("Completely fails"), 3 ("Adequately meets with minor issues"), and 5 ("Excellently meets with no issues") for each criterion.

### Why these dimensions matter for software development

These dimensions were chosen specifically because they capture what goes wrong when coding agents lose context:

**Artifact trail** is critical because coding agents need to know which files they have touched. Without this, an agent might re-read files it already examined, make conflicting edits, or lose track of test results. A ChatGPT conversation can afford to forget earlier topics; a coding agent that forgets it modified `auth.controller.ts` will produce inconsistent work.

**Continuity** directly impacts token efficiency. When an agent cannot continue from where it left off, it re-fetches files and re-explores approaches it already tried.
This wastes tokens and time, turning a single-pass task into an expensive multi-pass one.

**Context awareness** matters because coding sessions have state. The agent needs to know not just facts from the past, but the current state of the task: what has been tried, what failed, what is left to do. Generic summarization often captures "what happened" while losing "where we are."

**Accuracy** is non-negotiable for code. A wrong file path or misremembered function name leads to failed edits or hallucinated solutions. Unlike conversational AI, where approximate recall is acceptable, coding agents need precise technical details.

**Completeness** ensures the agent addresses all parts of a multi-part request. When a user asks to "fix the bug and add tests," a complete response handles both. Incomplete responses force follow-up prompts and waste tokens on re-establishing context.

**Instruction following** verifies the agent respects constraints and formats. When asked to "only modify the auth module" or "output as JSON," the agent must comply. This dimension catches cases where compression preserved facts but lost the user's requirements.

## Three approaches to compression

We compared three production-ready compression strategies.

**Factory** maintains a structured, persistent summary with explicit sections for different information types: session intent, file modifications, decisions made, and next steps. When compression triggers, only the newly truncated span is summarized and merged with the existing summary. We call this anchored iterative summarization.

The key insight is that structure forces preservation. By dedicating sections to specific information types, the summary cannot silently drop file paths or skip over decisions. Each section acts as a checklist: the summarizer must populate it or explicitly leave it empty.
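
To make the checklist idea concrete, here is a minimal sketch of what a section-based summary and its anchored merge step could look like. The field names and merge rules are hypothetical; the post does not publish Factory's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class StructuredSummary:
    """Every section is always present, acting as a checklist the
    summarizer must populate or explicitly leave empty."""
    session_intent: str = ""
    file_modifications: list[str] = field(default_factory=list)
    decisions: list[str] = field(default_factory=list)
    next_steps: list[str] = field(default_factory=list)

def merge(existing: StructuredSummary, new_span: StructuredSummary) -> StructuredSummary:
    """Anchored iterative update: fold the summary of the newly truncated
    span into the persistent summary instead of regenerating it from scratch."""
    return StructuredSummary(
        # Intent rarely changes; keep the existing one unless restated.
        session_intent=new_span.session_intent or existing.session_intent,
        # File history accumulates; deduplicate rather than overwrite.
        file_modifications=existing.file_modifications + [
            f for f in new_span.file_modifications
            if f not in existing.file_modifications
        ],
        # Decisions are append-only so rationale survives repeated cycles.
        decisions=existing.decisions + new_span.decisions,
        # Next steps reflect current state; the newest plan wins.
        next_steps=new_span.next_steps or existing.next_steps,
    )
```

Because the merge consumes only the newly truncated span, earlier file paths and decisions persist verbatim instead of being re-summarized on every compression cycle.
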
This prevents the gradual information loss that occurs with freeform summarization.

**OpenAI** offers the `/responses/compact` endpoint, which produces opaque, compressed representations optimized for reconstruction fidelity. It achieves the highest compression ratios (99.3%) but sacrifices interpretability. You cannot read the compressed output to verify what was preserved.

**Anthropic** provides built-in context compression through the Claude SDK. It produces detailed, structured summaries (typically 7-12k characters) with sections for analysis, files, pending tasks, and current state. The key difference from Factory is the update mechanism: Anthropic regenerates the full summary on each compression, while Factory's anchored approach incrementally merges new information into a persistent summary. This affects both consistency across compressions and which details survive repeated compression cycles.

## A concrete example

Consider a debugging session where a user reported a 401 error on the `/api/auth/login` endpoint despite valid credentials. Over 178 messages and 89,000 tokens, the assistant:

1. Read the authentication controller and identified JWT token generation
2. Examined the middleware chain and found a misconfigured CORS policy
3. Discovered the session store was using an expired Redis connection
4. Fixed the Redis connection pooling configuration
5. Added retry logic for transient connection failures
6. Ran the test suite (14 tests passing, 2 failing)
7. Fixed the remaining test failures related to mock setup

Files examined: `auth.controller.ts`, `middleware/cors.ts`, `config/redis.ts`, `services/session.service.ts`, `tests/auth.test.ts`

After compression, we ask: "Before we continue, can you remind me what the original error was that started this debugging session?
Please be specific about the error code and endpoint."

Here is how each method responded:

**Factory response (scored 4.8/5):**

> "The original error was a 401 Unauthorized response from the `/api/auth/login` endpoint. Users were receiving this error even when providing valid username/password combinations. The root cause turned out to be a stale Redis connection in the session store - the JWT was being generated correctly, but the session could not be persisted, causing subsequent authenticated requests to fail."

**Anthropic response (scored 3.9/5):**

> "The session started with debugging a login failure - specifically a 401 error on the authentication endpoint. We traced the issue through the auth flow and found it was related to session storage rather than the credential validation itself."

**OpenAI response (scored 3.2/5):**

> "We were debugging an authentication issue. The login was failing for users. We looked at several files and found some configuration problems that needed to be fixed."

The Factory response names the exact endpoint (`/api/auth/login`), the error code (401), and the root cause (Redis session store). The Anthropic response gets the error code and general cause but loses the endpoint path. The OpenAI response loses almost all technical detail.

This pattern repeated across probe types. On artifact probes ("Which files have we modified?"), Factory scored 3.6 while OpenAI scored 2.8. Factory's summary explicitly lists files in a dedicated section. OpenAI's compression discards file paths as low-entropy content.

## How the LLM judge works

We use GPT-5.2 as an LLM judge, following the methodology established by Zheng et al. (2023) in their MT-Bench paper. Their work showed that GPT-4 achieves over 80% agreement with human preferences, matching the agreement level among humans themselves.

The judge receives the probe question, the model's response, the compacted conversation context, and (when available) ground truth. It then scores each rubric criterion with explicit reasoning.

Here is an abbreviated example of judge output for the Factory response above:

```json
{
  "criterionResults": [
    {
      "criterionId": "accuracy_factual",
      "score": 5,
      "reasoning": "Response correctly identifies the 401 error, the specific endpoint (/api/auth/login), and the root cause (Redis connection issue)."
    },
    {
      "criterionId": "accuracy_technical",
      "score": 5,
      "reasoning": "Technical details are accurate - JWT generation, session persistence, and the causal chain are correctly described."
    },
    {
      "criterionId": "context_artifact_state",
      "score": 4,
      "reasoning": "Response demonstrates awareness of the debugging journey but does not enumerate all files examined."
    },
    {
      "criterionId": "completeness_coverage",
      "score": 5,
      "reasoning": "Fully addresses the probe question with the error code, endpoint, symptom, and root cause."
    }
  ],
  "aggregateScore": 4.8
}
```

The judge does not know which compression method produced the response. It evaluates purely on response quality against the rubric.

## Results

We evaluated all three methods on over 36,000 messages from production sessions spanning PR review, testing, bug fixes, feature implementation, and refactoring. For each compression point, we generated four probe responses per method and graded them across six dimensions.

| Method | Overall | Accuracy | Context | Artifact | Complete | Continuity | Instruction |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Factory | 3.70 | 4.04 | 4.01 | 2.45 | 4.44 | 3.80 | 4.99 |
| Anthropic | 3.44 | 3.74 | 3.56 | 2.33 | 4.37 | 3.67 | 4.95 |
| OpenAI | 3.35 | 3.43 | 3.64 | 2.19 | 4.37 | 3.77 | 4.92 |

Factory scores 0.35 points higher than OpenAI and 0.26 higher than Anthropic overall.

*Radar chart showing quality profile comparison across all three methods*

Breaking down by dimension:

**Accuracy** shows the largest gap. Factory scores 4.04, Anthropic 3.74, OpenAI 3.43.
The 0.61 point difference between Factory and OpenAI reflects how often technical details like file paths and error messages survive compression.

**Context awareness** favors Factory (4.01) over Anthropic (3.56), a 0.45 point gap. Both approaches include structured sections for current state. Factory's advantage comes from the anchored iterative approach: by merging new summaries into a persistent state rather than regenerating from scratch, key details are less likely to drift or disappear across multiple compression cycles.

**Artifact trail** is the weakest dimension for all methods, ranging from 2.19 to 2.45. Even Factory's structured approach struggles to maintain complete file tracking across long sessions. This suggests artifact preservation needs specialized handling beyond general summarization.

**Completeness and instruction following** show small differences. All methods produce responses that address the question and follow the format. The differentiation happens in the quality of the content, not its structure.

*Horizontal bar chart showing Factory quality advantage by dimension*

*Side-by-side comparison of token reduction efficiency and summary quality*

Compression ratios tell an interesting story. OpenAI achieves a 99.3% compression ratio (removing 99.3% of tokens), Anthropic 98.7%, Factory 98.6%. Factory retains about 0.7% more tokens than OpenAI, but gains 0.35 quality points. That tradeoff favors Factory for any task where re-fetching costs matter.

## What we learned

The biggest surprise was how much structure matters. Generic summarization treats all content as equally compressible. A file path might be "low entropy" from an information-theoretic perspective, but it is exactly what the agent needs to continue working. By forcing the summarizer to fill explicit sections for files, decisions, and next steps, Factory's format prevents the silent drift that happens when you regenerate summaries from scratch.

Compression ratio turned out to be the wrong metric entirely. OpenAI achieves 99.3% compression but scores 0.35 points lower on quality. Those lost details eventually require re-fetching, which can exceed the token savings. What matters is total tokens to complete a task, not tokens per request.

Artifact tracking remains an unsolved problem. All methods scored between 2.19 and 2.45 out of 5.0 on knowing which files were created, modified, or examined. Even with explicit file sections, Factory only reaches 2.45. This probably requires specialized handling beyond summarization: a separate artifact index, or explicit file-state tracking in the agent scaffolding.

Finally, probe-based evaluation captures something that traditional metrics miss. ROUGE measures lexical similarity between summaries. Our approach measures whether the summary actually enables task continuation. For agentic workflows, that distinction matters.

## Methodology details

**Dataset:** Hundreds of compression points over 36,611 messages. Sessions were collected from production software engineering work across real codebases, from users who opted into a special research program.

**Probe generation:** For each compression point, we generated four probes (recall, artifact, continuation, decision) based on the truncated conversation history. Probes reference specific facts, files, and decisions from the pre-compression context.

**Compression:** We applied all three methods to identical conversation prefixes at each compression point. Factory summaries came from production. OpenAI and Anthropic summaries were generated by feeding the same prefix to their respective APIs.

**Grading:** GPT-5.2 scored each probe response against six rubric dimensions.
Each dimension has 2-3 criteria with explicit scoring guides. We computed dimension scores as weighted averages of criteria, and overall scores as unweighted averages of dimensions.

**Statistical note:** The differences we report (0.26-0.35 points) are consistent across task types and session lengths. The pattern holds whether we look at short sessions or long ones, debugging tasks or feature implementation.

## Appendix: LLM Judge Prompts and Rubrics

Since the LLM judge is core to this evaluation, we provide the full prompts and rubrics here.

### System Prompt

The judge receives this system prompt:

```
You are an expert evaluator assessing AI assistant responses in software development conversations.

Your task is to grade responses against specific rubric criteria. For each criterion:
1. Read the criterion question carefully
2. Examine the response for evidence
3. Assign a score from 0-5 based on the scoring guide
4. Provide brief reasoning for your score

Be objective and consistent. Focus on what is present in the response, not what could have been included.
```

### Rubric Criteria

Each dimension contains 2-3 criteria. Here are the key criteria with their scoring guides:

#### Accuracy

| Criterion | Question | 0 | 3 | 5 |
| --- | --- | --- | --- | --- |
| `accuracy_factual` | Are facts, file paths, and technical details correct? | Completely incorrect or fabricated | Mostly accurate with minor errors | Perfectly accurate |
| `accuracy_technical` | Are code references and technical concepts correct? | Major technical errors | Generally correct with minor issues | Technically precise |

#### Context Awareness

| Criterion | Question | 0 | 3 | 5 |
| --- | --- | --- | --- | --- |
| `context_conversation_state` | Does the response reflect current conversation state? | No awareness of prior context | General awareness with gaps | Full awareness of conversation history |
| `context_artifact_state` | Does the response reflect which files/artifacts were accessed? | No awareness of artifacts | Partial artifact awareness | Complete artifact state awareness |

#### Artifact Trail Integrity

| Criterion | Question | 0 | 3 | 5 |
| --- | --- | --- | --- | --- |
| `artifact_files_created` | Does the agent know which files were created? | No knowledge | Knows most files | Perfect knowledge |
| `artifact_files_modified` | Does the agent know which files were modified and what changed? | No knowledge | Good knowledge of most modifications | Perfect knowledge of all modifications |
| `artifact_key_details` | Does the agent remember function names, variable names, error messages? | No recall | Recalls most key details | Perfect recall |

#### Continuity Preservation

| Criterion | Question | 0 | 3 | 5 |
| --- | --- | --- | --- | --- |
| `continuity_work_state` | Can the agent continue without re-fetching previously accessed information? | Cannot continue without re-fetching all context | Can continue with minimal re-fetching | Can continue seamlessly |
| `continuity_todo_state` | Does the agent maintain awareness of pending tasks? | Lost track of all TODOs | Good awareness with some gaps | Perfect task awareness |
| `continuity_reasoning` | Does the agent retain rationale behind previous decisions? | No memory of reasoning | Generally remembers reasoning | Excellent retention |

#### Completeness

| Criterion | Question | 0 | 3 | 5 |
| --- | --- | --- | --- | --- |
| `completeness_coverage` | Does the response address all parts of the question? | Ignores most parts | Addresses most parts | Addresses all parts thoroughly |
| `completeness_depth` | Is sufficient detail provided? | Superficial or missing detail | Adequate detail | Comprehensive detail |

#### Instruction Following

| Criterion | Question | 0 | 3 | 5 |
| --- | --- | --- | --- | --- |
| `instruction_format` | Does the response follow the requested format? | Ignores format | Generally follows format | Perfectly follows format |
| `instruction_constraints` | Does the response respect stated constraints? | Ignores constraints | Mostly respects constraints | Fully respects all constraints |

### Grading Process

For each probe response, the judge:

1. Receives the probe question, the model's response, and the compacted context
2. Evaluates against each criterion in the rubric for that probe type
3. Outputs structured JSON with scores and reasoning per criterion
4. Computes dimension scores as weighted averages of criteria
5. Computes overall score as unweighted average of dimensions

The judge does not know which compression method produced the response being evaluated.