---
name: context-compression-evaluation
description: Evaluation framework for measuring how much context different compression strategies preserve in AI agents, comparing structured summarization with alternatives from OpenAI and Anthropic.
doc_type: research
source_url: No
---

# Evaluating Context Compression for AI Agents

By Factory Research - December 16, 2025 - 10 minute read

We built an evaluation framework to measure how much context different compression strategies preserve. After testing three approaches on real-world, long-running agent sessions spanning debugging, code review, and feature implementation, we found that structured summarization retains more useful information than alternatives from OpenAI and Anthropic.

## Table of Contents

1. The problem
2. Measuring context quality
3. Three approaches to compression
4. A concrete example
5. How the LLM judge works
6. Results
7. What we learned
8. Methodology details
9. Appendix: LLM Judge Prompts and Rubrics

*Tasteful abstract illustration evocative of memory and blurriness*

When an AI agent helps you work through a complex task across hundreds of messages, what happens when it runs out of memory? The answer determines whether your agent continues productively or starts asking "wait, what were we trying to do again?"

We built an evaluation framework to measure how much context different compression strategies preserve.
After testing three approaches on real-world, long-running agent sessions (debugging, PR review, feature implementation, CI troubleshooting, data science, ML research), we found that structured summarization retains more useful information than alternative methods from OpenAI and Anthropic, without sacrificing compression efficiency.

*Bar chart comparing quality scores by dimension across Factory, OpenAI, and Anthropic*

This post walks through the problem, our methodology, concrete examples of how different approaches perform, and what the results mean for building reliable AI agents.

## The problem

Long-running agent sessions can generate millions of tokens of conversation history. That far exceeds what any model can hold in working memory.

The naive solution is aggressive compression: squeeze everything into the smallest possible summary. But this increases the chance your agent forgets which files it modified or what approach it already tried. It is likely to waste tokens re-reading files and re-exploring dead ends.

The right optimization target is not tokens per request. It is tokens per task.

## Measuring context quality

Traditional metrics like ROUGE or embedding similarity do not tell you whether an agent can continue working effectively after compression. A summary might score high on lexical overlap while missing the one file path the agent needs to continue.

We designed a probe-based evaluation that directly measures functional quality. The idea is simple: after compression, ask the agent questions that require remembering specific details from the truncated history. If the compression preserved the right information, the agent answers correctly. If not, it guesses or hallucinates.

We use four probe types:

| Probe type | What it tests | Example question |
| --- | --- | --- |
| Recall | Factual retention | "What was the original error message?" |
| Artifact | File tracking | "Which files have we modified? Describe what changed in each." |
| Continuation | Task planning | "What should we do next?" |
| Decision | Reasoning chain | "We discussed options for the Redis issue. What did we decide?" |

Recall probes test whether specific facts survive compression. Artifact probes test whether the agent knows what files it touched. Continuation probes test whether the agent can pick up where it left off. Decision probes test whether the reasoning behind past choices is preserved.

We grade responses using an LLM judge (GPT-5.2) across six dimensions:

| Dimension | What it measures |
| --- | --- |
| Accuracy | Are technical details correct? File paths, function names, errors |
| Context awareness | Does the response reflect current conversation state? |
| Artifact trail | Does the agent know which files were read or modified? |
| Completeness | Does the response address all parts of the question? |
| Continuity | Can work continue without re-fetching information? |
| Instruction following | Does the response follow the probe format? |

Each dimension is scored 0-5 using detailed rubrics. The rubrics specify what constitutes a 0 ("Completely fails"), 3 ("Adequately meets with minor issues"), and 5 ("Excellently meets with no issues") for each criterion.

### Why these dimensions matter for software development

These dimensions were chosen specifically because they capture what goes wrong when coding agents lose context:

**Artifact trail** is critical because coding agents need to know which files they have touched. Without this, an agent might re-read files it already examined, make conflicting edits, or lose track of test results. A ChatGPT conversation can afford to forget earlier topics; a coding agent that forgets it modified `auth.controller.ts` will produce inconsistent work.

**Continuity** directly impacts token efficiency. When an agent cannot continue from where it left off, it re-fetches files and re-explores approaches it already tried.
This wastes tokens and time, turning a single-pass task into an expensive multi-pass one.

**Context awareness** matters because coding sessions have state. The agent needs to know not just facts from the past, but the current state of the task: what has been tried, what failed, what is left to do. Generic summarization often captures "what happened" while losing "where we are."

**Accuracy** is non-negotiable for code. A wrong file path or misremembered function name leads to failed edits or hallucinated solutions. Unlike conversational AI, where approximate recall is acceptable, coding agents need precise technical details.

**Completeness** ensures the agent addresses all parts of a multi-part request. When a user asks to "fix the bug and add tests," a complete response handles both. Incomplete responses force follow-up prompts and waste tokens on re-establishing context.

**Instruction following** verifies the agent respects constraints and formats. When asked to "only modify the auth module" or "output as JSON," the agent must comply. This dimension catches cases where compression preserved facts but lost the user's requirements.

## Three approaches to compression

We compared three production-ready compression strategies.

**Factory** maintains a structured, persistent summary with explicit sections for different information types: session intent, file modifications, decisions made, and next steps. When compression triggers, only the newly truncated span is summarized and merged with the existing summary. We call this anchored iterative summarization.

The key insight is that structure forces preservation. By dedicating sections to specific information types, the summary cannot silently drop file paths or skip over decisions. Each section acts as a checklist: the summarizer must populate it or explicitly leave it empty.
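
To make the checklist idea concrete, here is a minimal sketch of what a section-based summary and its anchored merge step could look like. The field names and merge rules are hypothetical; the post does not publish Factory's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class StructuredSummary:
    """Every section is always present, acting as a checklist the
    summarizer must populate or explicitly leave empty."""
    session_intent: str = ""
    file_modifications: list[str] = field(default_factory=list)
    decisions: list[str] = field(default_factory=list)
    next_steps: list[str] = field(default_factory=list)

def merge(existing: StructuredSummary, new_span: StructuredSummary) -> StructuredSummary:
    """Anchored iterative update: fold the summary of the newly truncated
    span into the persistent summary instead of regenerating it from scratch."""
    return StructuredSummary(
        # Intent rarely changes; keep the existing one unless restated.
        session_intent=new_span.session_intent or existing.session_intent,
        # File history accumulates; deduplicate rather than overwrite.
        file_modifications=existing.file_modifications + [
            f for f in new_span.file_modifications
            if f not in existing.file_modifications
        ],
        # Decisions are append-only so rationale survives repeated cycles.
        decisions=existing.decisions + new_span.decisions,
        # Next steps reflect current state; the newest plan wins.
        next_steps=new_span.next_steps or existing.next_steps,
    )
```

Because the merge consumes only the newly truncated span, earlier file paths and decisions persist verbatim instead of being re-summarized on every compression cycle.
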
This prevents the gradual information loss that occurs with freeform summarization.

**OpenAI** offers the `/responses/compact` endpoint, which produces opaque, compressed representations optimized for reconstruction fidelity. It achieves the highest compression ratios (99.3%) but sacrifices interpretability. You cannot read the compressed output to verify what was preserved.

**Anthropic** provides built-in context compression through the Claude SDK. It produces detailed, structured summaries (typically 7-12k characters) with sections for analysis, files, pending tasks, and current state. The key difference from Factory is the update mechanism: Anthropic regenerates the full summary on each compression, while Factory's anchored approach incrementally merges new information into a persistent summary. This affects both consistency across compressions and which details survive repeated compression cycles.

## A concrete example

Consider a debugging session where a user reported a 401 error on the `/api/auth/login` endpoint despite valid credentials. Over 178 messages and 89,000 tokens, the assistant:

1. Read the authentication controller and identified JWT token generation
2. Examined the middleware chain and found a misconfigured CORS policy
3. Discovered the session store was using an expired Redis connection
4. Fixed the Redis connection pooling configuration
5. Added retry logic for transient connection failures
6. Ran the test suite (14 tests passing, 2 failing)
7. Fixed the remaining test failures related to mock setup

Files examined: `auth.controller.ts`, `middleware/cors.ts`, `config/redis.ts`, `services/session.service.ts`, `tests/auth.test.ts`

After compression, we ask: "Before we continue, can you remind me what the original error was that started this debugging session?
Please be specific about the error code and endpoint."

Here is how each method responded:

**Factory response (scored 4.8/5):**

> "The original error was a 401 Unauthorized response from the `/api/auth/login` endpoint. Users were receiving this error even when providing valid username/password combinations. The root cause turned out to be a stale Redis connection in the session store - the JWT was being generated correctly, but the session could not be persisted, causing subsequent authenticated requests to fail."

**Anthropic response (scored 3.9/5):**

> "The session started with debugging a login failure - specifically a 401 error on the authentication endpoint. We traced the issue through the auth flow and found it was related to session storage rather than the credential validation itself."

**OpenAI response (scored 3.2/5):**

> "We were debugging an authentication issue. The login was failing for users. We looked at several files and found some configuration problems that needed to be fixed."

The Factory response names the exact endpoint (`/api/auth/login`), the error code (401), and the root cause (Redis session store). The Anthropic response gets the error code and general cause but loses the endpoint path. The OpenAI response loses almost all technical detail.

This pattern repeated across probe types. On artifact probes ("Which files have we modified?"), Factory scored 3.6 while OpenAI scored 2.8. Factory's summary explicitly lists files in a dedicated section. OpenAI's compression discards file paths as low-entropy content.

## How the LLM judge works

We use GPT-5.2 as an LLM judge, following the methodology established by Zheng et al. (2023) in their MT-Bench paper. Their work showed that GPT-4 achieves over 80% agreement with human preferences, matching the agreement level among humans themselves.

The judge receives the probe question, the model's response, the compacted conversation context, and (when available) ground truth. It then scores each rubric criterion with explicit reasoning.

Here is an abbreviated example of judge output for the Factory response above:

```json
{
  "criterionResults": [
    {
      "criterionId": "accuracy_factual",
      "score": 5,
      "reasoning": "Response correctly identifies the 401 error, the specific endpoint (/api/auth/login), and the root cause (Redis connection issue)."
    },
    {
      "criterionId": "accuracy_technical",
      "score": 5,
      "reasoning": "Technical details are accurate - JWT generation, session persistence, and the causal chain are correctly described."
    },
    {
      "criterionId": "context_artifact_state",
      "score": 4,
      "reasoning": "Response demonstrates awareness of the debugging journey but does not enumerate all files examined."
    },
    {
      "criterionId": "completeness_coverage",
      "score": 5,
      "reasoning": "Fully addresses the probe question with the error code, endpoint, symptom, and root cause."
    }
  ],
  "aggregateScore": 4.8
}
```

The judge does not know which compression method produced the response. It evaluates purely on response quality against the rubric.

## Results

We evaluated all three methods on over 36,000 messages from production sessions spanning PR review, testing, bug fixes, feature implementation, and refactoring. For each compression point, we generated four probe responses per method and graded them across six dimensions.

| Method | Overall | Accuracy | Context | Artifact | Complete | Continuity | Instruction |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Factory | 3.70 | 4.04 | 4.01 | 2.45 | 4.44 | 3.80 | 4.99 |
| Anthropic | 3.44 | 3.74 | 3.56 | 2.33 | 4.37 | 3.67 | 4.95 |
| OpenAI | 3.35 | 3.43 | 3.64 | 2.19 | 4.37 | 3.77 | 4.92 |

Factory scores 0.35 points higher than OpenAI and 0.26 higher than Anthropic overall.

*Radar chart showing quality profile comparison across all three methods*

Breaking down by dimension:

**Accuracy** shows the largest gap. Factory scores 4.04, Anthropic 3.74, OpenAI 3.43.
The 0.61 point difference between Factory and OpenAI reflects how often technical details like file paths and error messages survive compression.

**Context awareness** favors Factory (4.01) over Anthropic (3.56), a 0.45 point gap. Both approaches include structured sections for current state. Factory's advantage comes from the anchored iterative approach: by merging new summaries into a persistent state rather than regenerating from scratch, key details are less likely to drift or disappear across multiple compression cycles.

**Artifact trail** is the weakest dimension for all methods, ranging from 2.19 to 2.45. Even Factory's structured approach struggles to maintain complete file tracking across long sessions. This suggests artifact preservation needs specialized handling beyond general summarization.

**Completeness and instruction following** show small differences. All methods produce responses that address the question and follow the format. The differentiation happens in the quality of the content, not its structure.

*Horizontal bar chart showing Factory quality advantage by dimension*

*Side-by-side comparison of token reduction efficiency and summary quality*

Compression ratios tell an interesting story. OpenAI achieves a 99.3% compression ratio (removing 99.3% of tokens), Anthropic 98.7%, Factory 98.6%. Factory retains about 0.7% more tokens than OpenAI, but gains 0.35 quality points. That tradeoff favors Factory for any task where re-fetching costs matter.

## What we learned

The biggest surprise was how much structure matters. Generic summarization treats all content as equally compressible. A file path might be "low entropy" from an information-theoretic perspective, but it is exactly what the agent needs to continue working. By forcing the summarizer to fill explicit sections for files, decisions, and next steps, Factory's format prevents the silent drift that happens when you regenerate summaries from scratch.

Compression ratio turned out to be the wrong metric entirely. OpenAI achieves 99.3% compression but scores 0.35 points lower on quality. Those lost details eventually require re-fetching, which can exceed the token savings. What matters is total tokens to complete a task, not tokens per request.

Artifact tracking remains an unsolved problem. All methods scored between 2.19 and 2.45 out of 5.0 on knowing which files were created, modified, or examined. Even with explicit file sections, Factory only reaches 2.45. This probably requires specialized handling beyond summarization: a separate artifact index, or explicit file-state tracking in the agent scaffolding.

Finally, probe-based evaluation captures something that traditional metrics miss. ROUGE measures lexical similarity between summaries. Our approach measures whether the summary actually enables task continuation. For agentic workflows, that distinction matters.

## Methodology details

**Dataset:** Hundreds of compression points over 36,611 messages. Sessions were collected from production software engineering work across real codebases, from users who opted into a special research program.

**Probe generation:** For each compression point, we generated four probes (recall, artifact, continuation, decision) based on the truncated conversation history. Probes reference specific facts, files, and decisions from the pre-compression context.

**Compression:** We applied all three methods to identical conversation prefixes at each compression point. Factory summaries came from production. OpenAI and Anthropic summaries were generated by feeding the same prefix to their respective APIs.

**Grading:** GPT-5.2 scored each probe response against six rubric dimensions.
Each dimension has 2-3 criteria with explicit scoring guides. We computed dimension scores as weighted averages of criteria, and overall scores as unweighted averages of dimensions.

**Statistical note:** The differences we report (0.26-0.35 points) are consistent across task types and session lengths. The pattern holds whether we look at short sessions or long ones, debugging tasks or feature implementation.

## Appendix: LLM Judge Prompts and Rubrics

Since the LLM judge is core to this evaluation, we provide the full prompts and rubrics here.

### System Prompt

The judge receives this system prompt:

```
You are an expert evaluator assessing AI assistant responses in software development conversations.

Your task is to grade responses against specific rubric criteria. For each criterion:
1. Read the criterion question carefully
2. Examine the response for evidence
3. Assign a score from 0-5 based on the scoring guide
4. Provide brief reasoning for your score

Be objective and consistent. Focus on what is present in the response, not what could have been included.
```

### Rubric Criteria

Each dimension contains 2-3 criteria. Here are the key criteria with their scoring guides:

#### Accuracy

| Criterion | Question | 0 | 3 | 5 |
| --- | --- | --- | --- | --- |
| `accuracy_factual` | Are facts, file paths, and technical details correct? | Completely incorrect or fabricated | Mostly accurate with minor errors | Perfectly accurate |
| `accuracy_technical` | Are code references and technical concepts correct? | Major technical errors | Generally correct with minor issues | Technically precise |

#### Context Awareness

| Criterion | Question | 0 | 3 | 5 |
| --- | --- | --- | --- | --- |
| `context_conversation_state` | Does the response reflect current conversation state? | No awareness of prior context | General awareness with gaps | Full awareness of conversation history |
| `context_artifact_state` | Does the response reflect which files/artifacts were accessed? | No awareness of artifacts | Partial artifact awareness | Complete artifact state awareness |

#### Artifact Trail Integrity

| Criterion | Question | 0 | 3 | 5 |
| --- | --- | --- | --- | --- |
| `artifact_files_created` | Does the agent know which files were created? | No knowledge | Knows most files | Perfect knowledge |
| `artifact_files_modified` | Does the agent know which files were modified and what changed? | No knowledge | Good knowledge of most modifications | Perfect knowledge of all modifications |
| `artifact_key_details` | Does the agent remember function names, variable names, error messages? | No recall | Recalls most key details | Perfect recall |

#### Continuity Preservation

| Criterion | Question | 0 | 3 | 5 |
| --- | --- | --- | --- | --- |
| `continuity_work_state` | Can the agent continue without re-fetching previously accessed information? | Cannot continue without re-fetching all context | Can continue with minimal re-fetching | Can continue seamlessly |
| `continuity_todo_state` | Does the agent maintain awareness of pending tasks? | Lost track of all TODOs | Good awareness with some gaps | Perfect task awareness |
| `continuity_reasoning` | Does the agent retain rationale behind previous decisions? | No memory of reasoning | Generally remembers reasoning | Excellent retention |

#### Completeness

| Criterion | Question | 0 | 3 | 5 |
| --- | --- | --- | --- | --- |
| `completeness_coverage` | Does the response address all parts of the question? | Ignores most parts | Addresses most parts | Addresses all parts thoroughly |
| `completeness_depth` | Is sufficient detail provided? | Superficial or missing detail | Adequate detail | Comprehensive detail |

#### Instruction Following

| Criterion | Question | 0 | 3 | 5 |
| --- | --- | --- | --- | --- |
| `instruction_format` | Does the response follow the requested format? | Ignores format | Generally follows format | Perfectly follows format |
| `instruction_constraints` | Does the response respect stated constraints? | Ignores constraints | Mostly respects constraints | Fully respects all constraints |

### Grading Process

For each probe response, the judge:

1. Receives the probe question, the model's response, and the compacted context
2. Evaluates against each criterion in the rubric for that probe type
3. Outputs structured JSON with scores and reasoning per criterion
4. Computes dimension scores as weighted averages of criteria
5. Computes overall score as unweighted average of dimensions

The judge does not know which compression method produced the response being evaluated.