A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems.
skills/context-compression/references/evaluation-framework.md
# Context Compression Evaluation Framework

This document provides the complete evaluation framework for measuring context compression quality, including probe types, scoring rubrics, and LLM judge configuration.

## Probe Types

### Recall Probes

Test factual retention of specific details from conversation history.

**Structure:**
```
Question: [Ask for specific fact from truncated history]
Expected: [Exact detail that should be preserved]
Scoring: Match accuracy of technical details
```

**Examples:**
- "What was the original error message that started this debugging session?"
- "What version of the dependency did we decide to use?"
- "What was the exact command that failed?"

### Artifact Probes

Test file tracking and modification awareness.

**Structure:**
```
Question: [Ask about files created, modified, or examined]
Expected: [Complete list with change descriptions]
Scoring: Completeness of file list and accuracy of change descriptions
```

**Examples:**
- "Which files have we modified? Describe what changed in each."
- "What new files did we create in this session?"
- "Which configuration files did we examine but not change?"

### Continuation Probes

Test ability to continue work without re-fetching context.

**Structure:**
```
Question: [Ask about next steps or current state]
Expected: [Actionable next steps based on session history]
Scoring: Ability to continue without requesting re-read of files
```

**Examples:**
- "What should we do next?"
- "What tests are still failing and why?"
- "What was left incomplete from our last step?"

### Decision Probes

Test retention of reasoning chains and decision rationale.

**Structure:**
```
Question: [Ask about why a decision was made]
Expected: [Reasoning that led to the decision]
Scoring: Preservation of decision context and alternatives considered
```

**Examples:**
- "We discussed options for the Redis issue. What did we decide and why?"
- "Why did we choose connection pooling over per-request connections?"
- "What alternatives did we consider for the authentication fix?"

## Scoring Rubrics

### Accuracy Dimension

| Criterion | Question | Score 0 | Score 3 | Score 5 |
|-----------|----------|---------|---------|---------|
| accuracy_factual | Are facts, file paths, and technical details correct? | Completely incorrect or fabricated | Mostly accurate with minor errors | Perfectly accurate |
| accuracy_technical | Are code references and technical concepts correct? | Major technical errors | Generally correct with minor issues | Technically precise |

### Context Awareness Dimension

| Criterion | Question | Score 0 | Score 3 | Score 5 |
|-----------|----------|---------|---------|---------|
| context_conversation_state | Does the response reflect current conversation state? | No awareness of prior context | General awareness with gaps | Full awareness of conversation history |
| context_artifact_state | Does the response reflect which files/artifacts were accessed? | No awareness of artifacts | Partial artifact awareness | Complete artifact state awareness |

### Artifact Trail Dimension

| Criterion | Question | Score 0 | Score 3 | Score 5 |
|-----------|----------|---------|---------|---------|
| artifact_files_created | Does the agent know which files were created? | No knowledge | Knows most files | Perfect knowledge |
| artifact_files_modified | Does the agent know which files were modified and what changed? | No knowledge | Good knowledge of most modifications | Perfect knowledge of all modifications |
| artifact_key_details | Does the agent remember function names, variable names, error messages? | No recall | Recalls most key details | Perfect recall |

### Completeness Dimension

| Criterion | Question | Score 0 | Score 3 | Score 5 |
|-----------|----------|---------|---------|---------|
| completeness_coverage | Does the response address all parts of the question? | Ignores most parts | Addresses most parts | Addresses all parts thoroughly |
| completeness_depth | Is sufficient detail provided? | Superficial or missing detail | Adequate detail | Comprehensive detail |

### Continuity Dimension

| Criterion | Question | Score 0 | Score 3 | Score 5 |
|-----------|----------|---------|---------|---------|
| continuity_work_state | Can the agent continue without re-fetching previously accessed information? | Cannot continue without re-fetching all context | Can continue with minimal re-fetching | Can continue seamlessly |
| continuity_todo_state | Does the agent maintain awareness of pending tasks? | Lost track of all TODOs | Good awareness with some gaps | Perfect task awareness |
| continuity_reasoning | Does the agent retain rationale behind previous decisions? | No memory of reasoning | Generally remembers reasoning | Excellent retention |

### Instruction Following Dimension

| Criterion | Question | Score 0 | Score 3 | Score 5 |
|-----------|----------|---------|---------|---------|
| instruction_format | Does the response follow the requested format? | Ignores format | Generally follows format | Perfectly follows format |
| instruction_constraints | Does the response respect stated constraints? | Ignores constraints | Mostly respects constraints | Fully respects all constraints |

## LLM Judge Configuration

### System Prompt

```
You are an expert evaluator assessing AI assistant responses in software development conversations.

Your task is to grade responses against specific rubric criteria. For each criterion:
1. Read the criterion question carefully
2. Examine the response for evidence
3. Assign a score from 0-5 based on the scoring guide
4. Provide brief reasoning for your score

Be objective and consistent. Focus on what is present in the response, not what could have been included.
```

### Judge Input Format

```json
{
  "probe_question": "What was the original error message?",
  "model_response": "[Response to evaluate]",
  "compacted_context": "[The compressed context that was provided]",
  "ground_truth": "[Optional: known correct answer]",
  "rubric_criteria": ["accuracy_factual", "accuracy_technical", "context_conversation_state"]
}
```

### Judge Output Format

```json
{
  "criterionResults": [
    {
      "criterionId": "accuracy_factual",
      "score": 5,
      "reasoning": "Response correctly identifies the 401 error, specific endpoint, and root cause."
    }
  ],
  "aggregateScore": 4.8,
  "dimensionScores": {
    "accuracy": 4.9,
    "context_awareness": 4.5,
    "artifact_trail": 3.2,
    "completeness": 5.0,
    "continuity": 4.8,
    "instruction_following": 5.0
  }
}
```

## Benchmark Results Reference

Performance across compression methods (based on 36,000+ messages):

| Method | Overall | Accuracy | Context | Artifact | Complete | Continuity | Instruction |
|--------|---------|----------|---------|----------|----------|------------|-------------|
| Anchored Iterative | 3.70 | 4.04 | 4.01 | 2.45 | 4.44 | 3.80 | 4.99 |
| Regenerative | 3.44 | 3.74 | 3.56 | 2.33 | 4.37 | 3.67 | 4.95 |
| Opaque | 3.35 | 3.43 | 3.64 | 2.19 | 4.37 | 3.77 | 4.92 |

**Key Findings:**

1. **Accuracy gap**: 0.61 points between best and worst methods
2. **Context awareness gap**: 0.45 points, favoring anchored iterative
3. **Artifact trail**: Universally weak (2.19-2.45), needs specialized handling
4. **Completeness and instruction following**: Minimal differentiation

## Statistical Considerations

- Differences of 0.26-0.35 points are consistent across task types and session lengths
- Pattern holds for both short and long sessions
- Pattern holds across debugging, feature implementation, and code review tasks
- Sample size: 36,611 messages across hundreds of compression points

## Implementation Notes

### Probe Generation

Generate probes at each compression point based on truncated history:

1. Extract factual claims for recall probes
2. Extract file operations for artifact probes
3. Extract incomplete tasks for continuation probes
4. Extract decision points for decision probes

### Grading Process

1. Feed probe question + model response + compressed context to judge
2. Evaluate against each criterion in rubric
3. Output structured JSON with scores and reasoning
4. Compute dimension scores as weighted averages
5. Compute overall score as unweighted average of dimensions

### Blinding

The judge should not know which compression method produced the response being evaluated. This prevents bias toward known methods.
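The two aggregation steps of the grading process (dimension scores, then the overall score) can be sketched as follows. This is a minimal sketch that assumes equal criterion weights within each dimension; the `DIMENSION_PREFIXES` mapping and function name are illustrative, not part of the framework.

```python
from statistics import mean

# Map each criterion-id prefix to its rubric dimension
# (e.g. "accuracy_factual" -> "accuracy").
DIMENSION_PREFIXES = {
    "accuracy": "accuracy",
    "context": "context_awareness",
    "artifact": "artifact_trail",
    "completeness": "completeness",
    "continuity": "continuity",
    "instruction": "instruction_following",
}

def aggregate_scores(criterion_results):
    """Compute per-dimension scores and the overall score.

    criterion_results: list of {"criterionId": str, "score": int} dicts,
    as emitted by the judge. Criteria are equally weighted here; the
    framework allows weighted averages per dimension.
    """
    by_dimension = {}
    for result in criterion_results:
        prefix = result["criterionId"].split("_")[0]
        dimension = DIMENSION_PREFIXES[prefix]
        by_dimension.setdefault(dimension, []).append(result["score"])

    dimension_scores = {d: mean(s) for d, s in by_dimension.items()}
    # Overall score is the unweighted mean of the dimension scores.
    return {
        "aggregateScore": mean(dimension_scores.values()),
        "dimensionScores": dimension_scores,
    }

result = aggregate_scores([
    {"criterionId": "accuracy_factual", "score": 5},
    {"criterionId": "accuracy_technical", "score": 4},
    {"criterionId": "artifact_files_created", "score": 2},
])
# dimensionScores: accuracy 4.5, artifact_trail 2.0; aggregateScore 3.25
```

Note that because the overall score averages dimensions rather than criteria, a dimension with many criteria does not dominate one with few.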
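The blinding requirement can be enforced mechanically by stripping method labels before responses reach the judge and restoring them only after all grades are collected. A sketch, where the function name and anonymous-id scheme are assumptions:

```python
import random

def blind_responses(labeled_responses, seed=None):
    """Anonymize responses before judging.

    labeled_responses: dict mapping compression-method name -> response text.
    Returns (blinded, key): blinded is a list of {"id", "response"} entries
    in random order, so the judge never sees method names or a fixed
    ordering; key maps each anonymous id back to its method for unblinding
    after grading.
    """
    rng = random.Random(seed)
    items = list(labeled_responses.items())
    rng.shuffle(items)
    blinded, key = [], {}
    for i, (method, response) in enumerate(items):
        anon_id = f"response_{i}"
        blinded.append({"id": anon_id, "response": response})
        key[anon_id] = method
    return blinded, key
```

Shuffling matters as well as relabeling: a fixed presentation order would leak the method identity across compression points even with labels removed.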