examples/interleaved-thinking/README.md
# Reasoning Trace Optimizer

<p align="center">
<strong>Debug and optimize AI agents by analyzing reasoning traces with MiniMax M2.1's interleaved thinking</strong>
</p>

<p align="center">
<a href="#key-features">Features</a> |
<a href="#quick-start">Quick Start</a> |
<a href="#how-it-works">How It Works</a> |
<a href="#examples">Examples</a> |
<a href="#api-reference">API Reference</a>
</p>

---

## The Problem

Traditional AI agents fail in opaque ways. You see the final output, but not **why** decisions were made. When an agent:

- Calls the wrong tool
- Loses track of the goal
- Makes up information

...you're left guessing where things went wrong.

## The Solution

**Reasoning Trace Optimizer** uses MiniMax M2.1's unique **interleaved thinking** capability to expose the agent's reasoning process between every tool call. This enables:

1. **Deep Debugging** - See exactly where reasoning diverged from expected behavior
2. **Pattern Detection** - Automatically identify failure modes (context degradation, tool confusion, etc.)
3. **Automated Optimization** - Generate improved prompts based on detected issues
4. **Shareable Skills** - Convert learnings into reusable Agent Skills for team sharing

## Why MiniMax M2.1?

M2.1's **interleaved thinking** is fundamentally different from traditional reasoning models:

```
Traditional:  Think → Act → Act → Act → Done
                ↑
        (reasoning only at start)

M2.1:  Think → Act → Think → Act → Think → Act → Done
         ↑             ↑             ↑
       (continuous reasoning between each tool call)
```

This matters for agents because:

- **Long tasks** require maintaining focus across many turns
- **Tool outputs** introduce unexpected information requiring adaptation
- **Debugging** needs visibility into decision-making, not just outputs

The `thinking` block (Anthropic SDK) or `reasoning_details` field (OpenAI SDK) exposes this reasoning for analysis.

---

## Key Features

| Component | Description |
|-----------|-------------|
| **TraceCapture** | Wrap M2.1 API to capture all thinking blocks with full context |
| **TraceAnalyzer** | Detect patterns like context degradation, tool confusion, instruction drift |
| **PromptOptimizer** | Generate improved prompts based on analysis using M2.1 |
| **OptimizationLoop** | Automated capture → analyze → improve → re-run cycle |
| **SkillGenerator** | Convert learnings into shareable Agent Skills |

### Pattern Detection

The analyzer automatically identifies these failure patterns:

| Pattern | Description | Severity |
|---------|-------------|----------|
| `context_degradation` | Model loses information over long contexts | High |
| `tool_confusion` | Model misunderstands tool capabilities | High |
| `instruction_drift` | Model deviates from original instructions | Medium |
| `hallucination` | Model generates unsupported information | Critical |
| `goal_abandonment` | Model stops pursuing the original goal | High |
| `circular_reasoning` | Model repeats similar actions without progress | Medium |
| `premature_conclusion` | Model concludes before completing task | Medium |
| `missing_validation` | Model doesn't verify results | High |
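To give a feel for the kind of signal behind these labels, here is a deliberately naive `circular_reasoning` heuristic. This is an illustrative sketch only, not the library's actual detector (which, per the analyzer description, reasons over thinking blocks with M2.1); the function name and threshold are hypothetical.

```python
from collections import Counter

def naive_circular_reasoning_check(tool_calls, threshold=3):
    """Flag when the same tool is invoked with identical inputs repeatedly.

    tool_calls: list of (tool_name, serialized_input) tuples.
    Returns the repeated calls and their counts; a non-empty result
    suggests the agent may be looping without making progress.
    """
    counts = Counter(tool_calls)
    return {call: n for call, n in counts.items() if n >= threshold}

calls = [
    ("web_search", "context engineering"),
    ("web_search", "context engineering"),
    ("web_search", "context engineering"),
    ("read_url", "example.com"),
]
print(naive_circular_reasoning_check(calls))
```

A real detector also needs the reasoning text, since repeating a call can be legitimate (e.g., polling); that is why the analyzer inspects thinking blocks rather than call counts alone.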
Each detected pattern includes:

- **Evidence** - Specific excerpts from thinking blocks
- **Severity** - Critical/High/Medium/Low
- **Suggestion** - Concrete improvement for the prompt
- **Confidence** - How certain the detection is

---

## Quick Start

### Installation

```bash
cd examples/interleaved-thinking
pip install -e .
```

### Configuration

Set your MiniMax API key:

```bash
export ANTHROPIC_API_KEY=your_minimax_api_key
export ANTHROPIC_BASE_URL=https://api.minimax.io/anthropic
```

Or create a `.env` file:

```env
ANTHROPIC_API_KEY=your_minimax_api_key
ANTHROPIC_BASE_URL=https://api.minimax.io/anthropic
```

### Basic Usage

```python
from reasoning_trace_optimizer import TraceCapture, TraceAnalyzer

# Capture reasoning trace
capture = TraceCapture()
trace = capture.run(
    task="Explain quantum computing",
    system_prompt="You are a science educator."
)

print(f"Captured {len(trace.thinking_blocks)} thinking blocks")

# Analyze the reasoning
analyzer = TraceAnalyzer()
analysis = analyzer.analyze(trace)

print(f"Overall Score: {analysis.overall_score}/100")
for pattern in analysis.patterns:
    print(f"  [{pattern.severity.value}] {pattern.type.value}")
    print(f"  Suggestion: {pattern.suggestion}")
```

---

## How It Works

### The Optimization Loop

```
┌────────────────────────────────────────────────────────────────┐
│                       OPTIMIZATION LOOP                        │
│                                                                │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐  │
│  │  Agent   │───▶│ Capture  │───▶│ Analyze  │───▶│ Optimize │  │
│  │ Execute  │    │  Traces  │    │ Patterns │    │  Prompt  │  │
│  └──────────┘    └──────────┘    └──────────┘    └──────────┘  │
│       ▲                                               │        │
│       └───────────────────────────────────────────────┘        │
│           (loop until converged or max iterations)             │
│                                                                │
│  Convergence: Score improvement < threshold OR score > target  │
└────────────────────────────────────────────────────────────────┘
```
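The convergence rule in the loop diagram reduces to a small check. This is a hedged sketch under assumed semantics (stop when the score clears the target, or when iteration-over-iteration improvement falls below the threshold); the function name is illustrative, not the library's internals, though the defaults mirror `LoopConfig`:

```python
def converged(scores, improvement_threshold=3.0, target_score=75.0):
    """Decide whether the optimization loop should stop.

    scores: per-iteration scores so far (0-100).
    Stops when the latest score reaches the target, or when the
    last improvement is smaller than the threshold.
    """
    if not scores:
        return False
    if scores[-1] >= target_score:
        return True
    if len(scores) >= 2 and (scores[-1] - scores[-2]) < improvement_threshold:
        return True
    return False

print(converged([69.0]))        # single data point below target: keep looping
print(converged([69.0, 70.0]))  # +1 improvement is below the 3.0 threshold: stop
print(converged([60.0, 80.0]))  # above target: stop
```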
### What Gets Captured

For each agent execution, we capture:

1. **Thinking Blocks** - M2.1's reasoning before each action
2. **Tool Calls** - What tools were called with what inputs
3. **Tool Results** - What each tool returned
4. **Final Response** - The agent's output
5. **Metadata** - Tokens used, turns taken, success/failure

### What Gets Analyzed

The analyzer examines thinking blocks to understand:

- **Current Understanding** - What does the agent believe about the task?
- **Tool Interpretation** - How did it interpret each tool result?
- **Alternatives Considered** - What options did it evaluate?
- **Goal Awareness** - Is it still pursuing the original objective?

---

## Examples

### Example 1: Basic Trace Capture

```python
# examples/01_basic_capture.py
from reasoning_trace_optimizer import TraceCapture

capture = TraceCapture()
trace = capture.run(
    task="Explain what interleaved thinking is and why it matters for AI agents.",
    system_prompt="You are an AI researcher explaining concepts clearly."
)

# Output:
# Captured 1 thinking block
# Turn 0: "The user is asking me to explain 'interleaved thinking'..."
```

### Example 2: Tool Usage with Analysis

```python
# examples/02_tool_usage.py
from reasoning_trace_optimizer import TraceCapture, TraceAnalyzer

# Define tools
tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "input_schema": {...}
    }
]

capture = TraceCapture()
trace = capture.run(
    task="Compare the weather in San Francisco and New York",
    tools=tools,
    tool_executor=execute_tool
)

# Analyze
analyzer = TraceAnalyzer()
analysis = analyzer.analyze(trace)

# Output:
# Score: 85/100
# Thinking Blocks: 3
# Tool Calls: 4 (get_weather x2, get_forecast x2)
# Patterns: None detected
```
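Example 2 leaves `execute_tool` undefined: the `tool_executor` is whatever callable you supply. A minimal hypothetical executor with canned weather data might look like the following (the dispatch-by-name shape and the stub data are assumptions for illustration, not the example's actual implementation):

```python
def execute_tool(name, tool_input):
    """Hypothetical tool executor: dispatch on the tool name and
    return a string result for the model to read on the next turn."""
    fake_weather = {"San Francisco": "62°F, foggy", "New York": "45°F, clear"}
    if name == "get_weather":
        city = tool_input.get("city", "")
        return fake_weather.get(city, f"No data for {city}")
    return f"Unknown tool: {name}"

print(execute_tool("get_weather", {"city": "San Francisco"}))  # 62°F, foggy
```

In a real run the executor would call an actual API; returning plain strings keeps the capture layer agnostic to what the tools do.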
### Example 3: Full Optimization Loop

This example demonstrates a complex research task with 7 tools (web search, file operations, note-taking):

```python
# examples/03_full_optimization.py
from reasoning_trace_optimizer import OptimizationLoop, LoopConfig, SkillGenerator

config = LoopConfig(
    max_iterations=3,
    min_score_threshold=85.0,
    convergence_threshold=5.0,
    save_artifacts=True,
)

loop = OptimizationLoop(config=config)
result = loop.run(
    task="""Research "context engineering for AI agents" and create a summary...""",
    initial_prompt="You are a research assistant.",
    tools=TOOLS,
    tool_executor=execute_tool,
)

# Generate shareable skill
generator = SkillGenerator()
skill_path = generator.generate(result, skill_name="research-agent")
```

**Actual Output from Example 3:**

```
======================================================================
OPTIMIZATION RESULTS
======================================================================

Total Iterations: 3
Converged: Yes

ITERATION 1 (Score: 69/100)
├── Task Completed: Yes
├── Thinking Blocks: 6
├── Tool Calls: 16
├── Patterns Found: 2
│   ├── [LOW] missing_validation
│   └── [LOW] incomplete_reasoning
├── Strengths: Excellent goal adherence, thorough source diversity
└── Warning: Prompt grew too large (2979 chars), limiting growth

ITERATION 2 (Score: 60/100) ← Regression detected!
├── Task Completed: Yes
├── Thinking Blocks: 8
├── Tool Calls: 16
├── Patterns Found: 3
│   ├── [MEDIUM] incomplete_reasoning
│   ├── [MEDIUM] missing_validation
│   └── [LOW] tool_misuse

ITERATION 3 (Score: 66/100)
├── Task Completed: Yes
├── Thinking Blocks: 8
├── Tool Calls: 16
└── Patterns Found: 3

→ Using best prompt from iteration 1 (score: 67.6)

TOOL USAGE ACROSS ALL ITERATIONS:
├── read_url: 20 calls
├── web_search: 12 calls
├── list_directory: 7 calls
├── save_note: 6 calls
└── write_file: 3 calls

NOTES SAVED: 6 research notes with tagged findings
FILES WRITTEN: ./output/research_summary.md (11,357 chars)

GENERATED SKILL: ./generated_skills/comprehensive-research-agent/SKILL.md
```

**Key Features Demonstrated:**

1. **Prompt Growth Limiting** - Prevents prompt bloat by limiting expansion to 3x original size
2. **Best Score Tracking** - Automatically uses the best-performing prompt, even if later iterations regress
3. **Regression Detection** - Warns when scores drop and can stop after consecutive regressions

---

## Generated Artifacts

### Optimization Artifacts

Each optimization run creates artifacts for inspection:

```
optimization_artifacts/
├── summary.json              # Overall results
├── final_prompt.txt          # The optimized prompt
├── iteration_1/
│   ├── trace.json            # Full reasoning trace
│   ├── analysis.json         # Pattern detection results
│   └── optimization.json     # Prompt changes made
├── iteration_2/
│   └── ...
└── iteration_3/
    └── ...
```

### Generated Skills

The SkillGenerator converts optimization learnings into shareable Agent Skills:

```
generated_skills/
└── comprehensive-research-agent/
    ├── SKILL.md              # The shareable skill
    └── references/
        ├── optimization_summary.json
        ├── optimized_prompt.txt
        └── patterns_found.json
```

**Example Generated Skill Content:**

```markdown
## Patterns to Avoid

- **Missing Validation**: Accepting tool responses at face value without
  verifying the actual state change occurred.
- **Hallucinating Sources**: Citing sources that failed to load.
- **Ignoring Contradictions**: Proceeding when tool results conflict.

## Recommended Practices

- After every tool call, state the outcome explicitly
- Track sources separately: 'attempted' vs 'successful'
- Implement error recovery with alternative approaches
- Cross-reference key claims against multiple sources
```
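A "Patterns to Avoid" section like the one above can be produced mechanically from detected patterns. This toy renderer is a sketch of the idea only; the real SkillGenerator's internals and output format are not shown here, and the function and field names are hypothetical:

```python
def render_patterns_section(patterns):
    """Render (pattern_name, advice) pairs as a markdown
    'Patterns to Avoid' section, in the spirit of the generated skill."""
    lines = ["## Patterns to Avoid", ""]
    for name, advice in patterns:
        title = name.replace("_", " ").title()  # missing_validation -> Missing Validation
        lines.append(f"- **{title}**: {advice}")
    return "\n".join(lines)

md = render_patterns_section([
    ("missing_validation", "Verify tool results before relying on them."),
    ("hallucination", "Cite only sources that actually loaded."),
])
print(md)
```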
---

## API Reference

### TraceCapture

```python
capture = TraceCapture(
    api_key="...",                                # MiniMax API key
    base_url="https://api.minimax.io/anthropic",  # API endpoint
    model="MiniMax-M2.1"                          # Model to use
)

trace = capture.run(
    task="...",           # The task to execute
    system_prompt="...",  # System prompt
    tools=[...],          # Tool definitions (Anthropic format)
    tool_executor=fn,     # Function to execute tools
    max_turns=10,         # Maximum conversation turns
    max_tokens=4096       # Max tokens per response
)
```

### TraceAnalyzer

```python
analyzer = TraceAnalyzer(
    api_key="...",
    base_url="https://api.minimax.io/anthropic",
    model="MiniMax-M2.1"
)

analysis = analyzer.analyze(trace)
# Returns: AnalysisResult with patterns, scores, recommendations

quick_score = analyzer.quick_score(trace)
# Returns: float (0-100) for fast feedback
```

### OptimizationLoop

```python
config = LoopConfig(
    # Iteration control
    max_iterations=5,           # Maximum optimization iterations
    convergence_threshold=3.0,  # Stop if improvement < this %
    min_score_threshold=75.0,   # Stop if score exceeds this
    regression_threshold=8.0,   # Warn if score drops by this much

    # Optimization behavior
    use_best_prompt=True,       # Use best-performing prompt, not final
    max_prompt_growth=5.0,      # Limit prompt expansion to 5x original

    # Output options
    save_artifacts=True,        # Save traces and analyses
    artifacts_dir="./artifacts" # Where to save
)

loop = OptimizationLoop(config=config)
result = loop.run(task, initial_prompt, tools, tool_executor)
# Returns: LoopResult with iterations, final_prompt, scores
```

**Optimization Safeguards:**

- **Best Prompt Tracking**: Keeps the prompt that produced the highest score
- **Prompt Growth Limiting**: Prevents prompt bloat by limiting size expansion
- **Regression Detection**: Warns on score drops, stops after consecutive regressions
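Best-prompt tracking and regression detection compose naturally into one loop. The following is an illustrative sketch under assumed semantics, not the library's implementation; `iterate` stands in for one capture → analyze → optimize pass, and all names are hypothetical:

```python
def run_with_safeguards(iterate, initial_prompt, max_iterations=5, max_regressions=2):
    """Run optimization passes, keeping the best-scoring prompt and
    stopping early after consecutive score regressions.

    iterate: callable(prompt) -> (score, improved_prompt).
    """
    prompt = initial_prompt
    best_prompt, best_score = prompt, float("-inf")
    prev_score, regressions = float("-inf"), 0
    for _ in range(max_iterations):
        score, prompt = iterate(prompt)
        if score > best_score:
            best_prompt, best_score = prompt, score  # best-prompt tracking
        if score < prev_score:
            regressions += 1
            if regressions >= max_regressions:
                break  # consecutive regressions: stop early
        else:
            regressions = 0
        prev_score = score
    return best_prompt, best_score

# Simulated per-iteration results echoing Example 3's score shape (hypothetical)
results = iter([(69, "p1"), (60, "p2"), (66, "p3")])
best_prompt, best_score = run_with_safeguards(
    lambda prompt: next(results), "You are a research assistant.", max_iterations=3
)
```

Note how the returned prompt comes from the highest-scoring iteration, not the last one, which is exactly what `use_best_prompt=True` protects against when later iterations regress.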
Expectations:**443444| Task Complexity | Typical Score Range | Notes |445|-----------------|---------------------|-------|446| Simple (1-2 tools) | 80-95 | Straightforward tasks converge quickly |447| Medium (3-5 tools) | 70-85 | Multiple tool coordination adds variability |448| Complex (6+ tools, multi-step) | 60-75 | Inherent variance in long reasoning chains |449450Complex research tasks with many tools and steps typically plateau around **65-75** due to:451- Tool output variability affecting reasoning paths452- Multiple valid approaches leading to different scoring453- The stochastic nature of multi-step agent execution454455The optimizer focuses on **relative improvement** and **pattern elimination** rather than achieving a specific absolute score.456457### SkillGenerator458459```python460generator = SkillGenerator()461skill_path = generator.generate(462result=loop_result, # From OptimizationLoop463skill_name="my-skill", # Lowercase with hyphens464output_dir="./generated_skills",465title="Human Readable Title"466)467```468469---470471## CLI Usage472473```bash474# Capture a reasoning trace475rto capture "Explain interleaved thinking" -s "You are an AI researcher."476477# Analyze a task and output results478rto analyze "Debug this code snippet" -o analysis.txt479480# Run full optimization loop481rto optimize "Research AI papers" --max-iterations 5 --generate-skill482483# Generate skill from previous optimization484rto generate-skill my-skill-name --artifacts-dir ./optimization_artifacts485```486487---488489## Real-World Sources Used490491Example 3 uses real documentation URLs for realistic simulation:492493| Source | URL |494|--------|-----|495| Anthropic Docs | `docs.anthropic.com/en/docs/build-with-claude/*` |496| Anthropic Research | `anthropic.com/research/building-effective-agents` |497| OpenAI Docs | `platform.openai.com/docs/guides/*` |498| MiniMax M2.1 | `minimax.io/platform/docs/M2.1` |499| DAIR.AI | `promptingguide.ai/techniques` |500| LangChain | 
| LangChain | `python.langchain.com/docs/how_to/debugging` |
| arXiv Papers | `arxiv.org/abs/2307.03172` (Lost in the Middle) |

---

## Robustness Features

The optimizer includes several safeguards to handle real-world variability:

### Parsing Resilience

LLM responses don't always produce valid JSON. The system handles this gracefully:

| Component | Fallback Behavior |
|-----------|-------------------|
| **Analyzer** | Extracts scores via regex patterns when JSON fails; defaults to 50/100 (not 0) |
| **Optimizer** | Multi-strategy prompt extraction: JSON → regex → marker detection → code blocks |
| **Loop** | Warns when final prompt is unchanged; tracks best-performing iteration |

### Extended Test Results (10 iterations)

Real-world testing revealed important insights:

```
Iteration  Score    Patterns  Tool Calls  Notes
───────────────────────────────────────────────────────────────
1          69/100   4         22          Baseline
2          66/100   3         14          -
3          61/100   3         17          -
4          72/100   3         20          ← Best score
5          59/100   4         16          -
6          50/100*  0         15          *Parser fallback activated
7          70/100   3         12          Recovery
8          64/100   3         14          -
9          64/100   3         18          -
10         70/100   3         19          Final

* Iteration 6: JSON parsing failed, fallback returned neutral score
```

**Key Learnings:**

- Scores fluctuate ±15 points between iterations due to stochastic model behavior
- Best score (72) was achieved mid-run, not at the end
- `use_best_prompt=True` correctly selected iteration 4's prompt
- Parsing failures now handled gracefully instead of returning 0 scores

---

## Architecture

```
reasoning_trace_optimizer/
├── __init__.py            # Public API exports
├── models.py              # Data models (Pydantic)
│   ├── ThinkingBlock      # Single reasoning segment
│   ├── ToolCall           # Tool invocation record
│   ├── ReasoningTrace     # Complete execution trace
│   ├── Pattern            # Detected failure pattern
│   ├── AnalysisResult     # Full analysis output
│   └── LoopResult         # Optimization loop result
├── capture.py             # TraceCapture - M2.1 API wrapper
├── analyzer.py            # TraceAnalyzer - Pattern detection (with fallback parsing)
├── optimizer.py           # PromptOptimizer - Prompt improvement (with fallback extraction)
├── loop.py                # OptimizationLoop - Full cycle (with best-score tracking)
├── skill_generator.py     # SkillGenerator - Create skills
└── cli.py                 # Command-line interface
```

---

## Integration

### Claude Code Skill

This project includes a Claude Code skill (`SKILL.md`) enabling:

- **Auto-trigger on failure** - Analyze when agent tasks fail
- **On-demand analysis** - Use `/reasoning-trace-optimizer` command
- **Session analysis** - Analyze thinking from current conversation

### Python Library

```python
from reasoning_trace_optimizer import (
    TraceCapture,
    TraceAnalyzer,
    PromptOptimizer,
    OptimizationLoop,
    LoopConfig,
    SkillGenerator,
)
```

---

## Contributing

This project is part of the [Agent Skills for Context Engineering](https://github.com/muratcankoylan/Agent-Skills-for-Context-Engineering) collection.

---

## License

MIT License

---

## References

- [MiniMax M2.1 Documentation](https://www.minimax.io/platform/docs)
- [MiniMax API Reference](https://www.minimax.io/platform/docs/M2.1)
- [Interleaved Thinking Guide](./docs/interleavedthinking.md)
- [Agent Generalization Research](./docs/agentthinking.md)
- [Anthropic API Compatibility](./docs/m2-1.md)

---

<p align="center">
<strong>Built in partnership with MiniMax AI</strong><br>
Showcasing the power of interleaved thinking for agent debugging
</p>