skills/context-optimization/references/optimization_techniques.md
# Context Optimization Reference

This document provides a detailed technical reference for context optimization techniques and strategies.

## Compaction Strategies

### Summary-Based Compaction

Summary-based compaction replaces verbose content with concise summaries while preserving key information. The approach works by identifying sections that can be compressed, generating summaries that capture the essential points, and replacing the full content with those summaries.

The effectiveness of compaction depends on what information is preserved. Critical decisions, user preferences, and current task state should never be compacted. Intermediate results and supporting evidence can be summarized more aggressively. Boilerplate, repeated information, and exploratory reasoning can often be removed entirely.

### Token Budget Allocation

Effective context budgeting requires understanding how different context components consume tokens and allocating budget strategically:

| Component | Typical Range | Notes |
|-----------|---------------|-------|
| System prompt | 500-2000 tokens | Stable across session |
| Tool definitions | 100-500 per tool | Grows with tool count |
| Retrieved documents | Variable | Often largest consumer |
| Message history | Variable | Grows with conversation |
| Tool outputs | Variable | Can dominate context |

### Compaction Thresholds

Trigger compaction at appropriate thresholds to maintain performance:

- Warning threshold at 70% of the effective context limit
- Compaction trigger at 80% of the effective context limit
- Aggressive compaction at 90% of the effective context limit

The exact thresholds depend on model behavior and task characteristics. Some models show graceful degradation while others exhibit sharp performance cliffs.

## Observation Masking Patterns

### Selective Masking

Not all observations should be masked equally.
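A simple way to decide is to score each observation's relevance, combining recency with overlap against the current task. The sketch below is illustrative only: the weights, the `"turn"` and `"content"` field names, and the keyword-overlap heuristic are assumptions, not a prescribed implementation.

```python
def observation_relevance(obs: dict, current_task_keywords: set,
                          current_turn: int) -> float:
    """Score an observation's relevance for masking decisions (0.0-1.0).

    Heuristic sketch: blends recency with keyword overlap against the
    current task. Field names ("turn", "content") are illustrative.
    """
    # Observations from the most recent turn are always kept.
    if obs["turn"] == current_turn:
        return 1.0

    # Recency decays as the observation ages.
    age = current_turn - obs["turn"]
    recency = 1.0 / (1.0 + age)

    # Overlap between observation content and current task keywords.
    words = set(obs["content"].lower().split())
    overlap = len(words & current_task_keywords) / max(len(current_task_keywords), 1)

    # Equal weights are an arbitrary starting point to tune.
    return 0.5 * recency + 0.5 * overlap
```

A score near 1.0 keeps the observation verbatim; low scores mark it as a candidate for masking.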
Consider masking observations that have served their purpose and are no longer needed for active reasoning. Keep observations that are central to the current task, observations from the most recent turn, and observations that may be referenced again.

### Masking Implementation

```python
from typing import Dict, List

# calculate_relevance, store_for_reference, and summarize_content are
# assumed helpers provided elsewhere in the codebase.

def selective_mask(observations: List[Dict], current_task: Dict) -> List[Dict]:
    """
    Selectively mask observations based on relevance.

    Returns observations with a "masked" field indicating masked content.
    """
    masked = []

    for obs in observations:
        relevance = calculate_relevance(obs, current_task)

        if relevance < 0.3 and obs["age"] > 3:
            # Low relevance and old - mask
            masked.append({
                **obs,
                "masked": True,
                "reference": store_for_reference(obs["content"]),
                "summary": summarize_content(obs["content"])
            })
        else:
            masked.append({
                **obs,
                "masked": False
            })

    return masked
```

## KV-Cache Optimization

### Prefix Stability

KV-cache hit rates depend on prefix stability. Stable prefixes enable cache reuse across requests; dynamic prefixes invalidate the cache and force recomputation.

Elements that should remain stable include system prompts, tool definitions, and frequently used templates. Elements that may vary include timestamps, session identifiers, and query-specific content.

### Cache-Friendly Design

Design prompts to maximize cache hit rates:

1. Place stable content at the beginning
2. Use consistent formatting across requests
3. Avoid dynamic content in prompts when possible
4. Use placeholders for dynamic content

```python
from datetime import datetime

# Cache-unfriendly: dynamic timestamp in the prompt
system_prompt = f"""
Current time: {datetime.now().isoformat()}
You are a helpful assistant.
"""

# Cache-friendly: stable prompt with the dynamic time supplied separately
system_prompt = """
You are a helpful assistant.
Current time is provided separately when relevant.
"""
```

## Context Partitioning Strategies

### Sub-Agent Isolation

Partition work across sub-agents to prevent any single context from growing too large. Each sub-agent operates with a clean context focused on its subtask.

### Partition Planning

```python
from typing import Dict

# estimate_task_context and decompose_task are assumed helpers
# provided elsewhere in the codebase.

def plan_partitioning(task: Dict, context_limit: int) -> Dict:
    """
    Plan how to partition a task based on context limits.

    Returns the partitioning strategy and subtask definitions.
    """
    estimated_context = estimate_task_context(task)

    if estimated_context <= context_limit:
        return {
            "strategy": "single_agent",
            "subtasks": [task]
        }

    # Plan a multi-agent approach
    subtasks = decompose_task(task)

    return {
        "strategy": "multi_agent",
        "subtasks": subtasks,
        "coordination": "hierarchical"
    }
```

## Optimization Decision Framework

### When to Optimize

Consider context optimization when context utilization exceeds 70%, when response quality degrades as conversations extend, when costs increase due to long contexts, or when latency increases with conversation length.

### What Optimization to Apply

Choose optimization strategies based on context composition:

If tool outputs dominate the context, apply observation masking. If retrieved documents dominate, apply summarization or partitioning. If message history dominates, apply compaction with summarization.
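These rules can be sketched as a small dispatcher over per-component token counts, with a fallback that combines strategies for the largest contributors when no single component dominates. The component keys and the 50% dominance threshold are illustrative assumptions:

```python
def choose_optimizations(component_tokens: dict, dominance: float = 0.5) -> list:
    """Map context composition to optimization strategies.

    component_tokens: token counts keyed by "tool_outputs",
    "retrieved_documents", and "message_history" (illustrative keys).
    A component "dominates" when it exceeds the dominance fraction of
    total tokens; otherwise strategies are combined for the top contributors.
    """
    total = sum(component_tokens.values()) or 1
    strategy_for = {
        "tool_outputs": "observation_masking",
        "retrieved_documents": "summarization_or_partitioning",
        "message_history": "compaction_with_summarization",
    }
    chosen = [
        strategy_for[name]
        for name, tokens in component_tokens.items()
        if name in strategy_for and tokens / total >= dominance
    ]
    if chosen:
        return chosen
    # No single dominant component: combine strategies for the two
    # largest contributors.
    ranked = sorted(component_tokens, key=component_tokens.get, reverse=True)
    return [strategy_for[n] for n in ranked[:2] if n in strategy_for]
```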
If multiple components contribute, combine strategies.

### Evaluation of Optimization

After applying an optimization, evaluate its effectiveness:

- Measure the token reduction achieved
- Measure quality preservation (output quality should not degrade)
- Measure latency improvement
- Measure cost reduction

Iterate on optimization strategies based on the evaluation results.

## Common Pitfalls

### Over-Aggressive Compaction

Compacting too aggressively can remove critical information. Always preserve task goals, user preferences, and recent conversation context. Test compaction at increasing aggressiveness levels to find the optimal balance.

### Masking Critical Observations

Masking observations that are still needed can cause errors. Track observation usage and only mask content that is no longer referenced. Consider keeping references to masked content so it can be retrieved if needed.

### Ignoring Attention Distribution

The lost-in-the-middle phenomenon means that information placement matters. Place critical information at attention-favored positions (the beginning and end of the context). Use explicit markers to highlight important content.

### Premature Optimization

Not all contexts require optimization. Adding optimization machinery has overhead.
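One way to keep this discipline is to gate all optimization behind the utilization thresholds given earlier (warn at 70%, compact at 80%, compact aggressively at 90%). A minimal sketch; the tier names are illustrative:

```python
def optimization_action(context_tokens: int, context_limit: int) -> str:
    """Map context utilization to an action tier.

    Below the 70% warning threshold, no optimization runs at all,
    avoiding the machinery's overhead on short conversations.
    """
    utilization = context_tokens / context_limit
    if utilization >= 0.9:
        return "aggressive_compaction"
    if utilization >= 0.8:
        return "compaction"
    if utilization >= 0.7:
        return "warning"
    return "none"
```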
Optimize only when context limits actually constrain agent performance.

## Monitoring and Alerting

### Key Metrics

Track these metrics to understand optimization needs:

- Context token count over time
- Cache hit rates for repeated patterns
- Response quality metrics by context size
- Cost per conversation by context length
- Latency by context size

### Alert Thresholds

Set alerts for:

- Context utilization above 80%
- Cache hit rate below 50%
- Quality score drop of more than 10%
- Cost increase above baseline

## Integration Patterns

### Integration with Agent Framework

Integrate optimization into the agent workflow:

```python
from typing import Dict

# ContextOptimizer and the _call_model method are assumed to be
# provided elsewhere in the codebase.

class OptimizingAgent:
    def __init__(self, context_limit: int = 80000):
        self.context_limit = context_limit
        self.optimizer = ContextOptimizer()

    def process(self, user_input: str, context: Dict) -> Dict:
        # Check whether optimization is needed
        if self.optimizer.should_compact(context):
            context = self.optimizer.compact(context)

        # Process with the optimized context
        response = self._call_model(user_input, context)

        # Track metrics
        self.optimizer.record_metrics(context, response)

        return response
```

### Integration with Memory Systems

Connect optimization with memory systems:

```python
from typing import Dict

# remove_from_context is an assumed helper provided elsewhere.

class MemoryAwareOptimizer:
    def __init__(self, memory_system, context_limit: int,
                 importance_threshold: float = 0.5):
        self.memory = memory_system
        self.limit = context_limit
        # Importance cutoff below which content is offloaded to memory;
        # the default is an illustrative starting point.
        self.importance_threshold = importance_threshold

    def optimize_context(self, current_context: Dict, task: str) -> Dict:
        # Check whether the information is already in memory
        relevant_memories = self.memory.retrieve(task)

        # Move information to memory if it is not needed in context
        for mem in relevant_memories:
            if mem["importance"] < self.importance_threshold:
                current_context = remove_from_context(current_context, mem)
                # Keep a reference so the memory can be retrieved later

        return current_context
```

## Performance Benchmarks

### Compaction Performance

Compaction should reduce token count while preserving quality. Targets:

- 50-70% token reduction for aggressive compaction
- Less than 5% quality degradation from compaction
- Less than 10% latency increase from compaction overhead

### Masking Performance

Observation masking should reduce token count significantly:

- 60-80% reduction in masked observations
- Less than 2% quality impact from masking
- Near-zero latency overhead

### Cache Performance

KV-cache optimization should improve cost and latency:

- 70%+ cache hit rate for stable workloads
- 50%+ cost reduction from cache hits
- 40%+ latency reduction from cache hits
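The compaction targets above can be checked mechanically after each compaction pass. A sketch, with metric names chosen for illustration:

```python
def evaluate_compaction(tokens_before: int, tokens_after: int,
                        quality_before: float, quality_after: float) -> dict:
    """Compare one compaction pass against the benchmark targets:
    50-70% token reduction, under 5% quality degradation.
    Quality scores are assumed to be on a 0.0-1.0 scale.
    """
    token_reduction = 1.0 - tokens_after / tokens_before
    quality_drop = max(0.0, (quality_before - quality_after) / quality_before)
    return {
        "token_reduction": token_reduction,
        "quality_drop": quality_drop,
        "meets_reduction_target": 0.5 <= token_reduction <= 0.7,
        "meets_quality_target": quality_drop < 0.05,
    }
```

Feeding these results back into threshold tuning closes the evaluation loop described in the Optimization Decision Framework.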