Reasoning Trace Optimizer

Debug and optimize AI agents by analyzing reasoning traces with MiniMax M2.1's interleaved thinking

<a href="#key-features">Features</a> | <a href="#quick-start">Quick Start</a> | <a href="#how-it-works">How It Works</a> | <a href="#examples">Examples</a> | <a href="#api-reference">API Reference</a>

The Problem

Traditional AI agents fail in opaque ways. You see the final output, but not why decisions were made. When an agent:

Calls the wrong tool
Loses track of the goal
Makes up information

...you're left guessing where things went wrong.

The Solution

Reasoning Trace Optimizer uses MiniMax M2.1's unique interleaved thinking capability to expose the agent's reasoning process between every tool call. This enables:

Deep Debugging - See exactly where reasoning diverged from expected behavior
Pattern Detection - Automatically identify failure modes (context degradation, tool confusion, etc.)
Automated Optimization - Generate improved prompts based on detected issues
Shareable Skills - Convert learnings into reusable Agent Skills for team sharing

Why MiniMax M2.1?

M2.1's interleaved thinking is fundamentally different from traditional reasoning models:

Traditional:  Think → Act → Act → Act → Done
              ↑
              (reasoning only at start)

M2.1:         Think → Act → Think → Act → Think → Act → Done
              ↑            ↑              ↑
              (continuous reasoning between each tool call)

This matters for agents because:

Long tasks require maintaining focus across many turns
Tool outputs introduce unexpected information requiring adaptation
Debugging needs visibility into decision-making, not just outputs

The thinking block (Anthropic SDK) or reasoning_details field (OpenAI SDK) exposes this reasoning for analysis.

Key Features

Component	Description
TraceCapture	Wrap M2.1 API to capture all thinking blocks with full context
TraceAnalyzer	Detect patterns like context degradation, tool confusion, instruction drift
PromptOptimizer	Generate improved prompts based on analysis using M2.1
OptimizationLoop	Automated capture → analyze → improve → re-run cycle
SkillGenerator	Convert learnings into shareable Agent Skills

Pattern Detection

The analyzer automatically identifies these failure patterns:

Pattern	Description	Severity
`context_degradation`	Model loses information over long contexts	High
`tool_confusion`	Model misunderstands tool capabilities	High
`instruction_drift`	Model deviates from original instructions	Medium
`hallucination`	Model generates unsupported information	Critical
`goal_abandonment`	Model stops pursuing the original goal	High
`circular_reasoning`	Model repeats similar actions without progress	Medium
`premature_conclusion`	Model concludes before completing task	Medium
`missing_validation`	Model doesn't verify results	High

Each detected pattern includes:

Evidence - Specific excerpts from thinking blocks
Severity - Critical/High/Medium/Low
Suggestion - Concrete improvement for the prompt
Confidence - How certain the detection is

Quick Start

Installation

cd examples/interleaved-thinking
pip install -e .

Configuration

Set your MiniMax API key:

export ANTHROPIC_API_KEY=your_minimax_api_key
export ANTHROPIC_BASE_URL=https://api.minimax.io/anthropic

Or create a .env file:

ANTHROPIC_API_KEY=your_minimax_api_key
ANTHROPIC_BASE_URL=https://api.minimax.io/anthropic

Basic Usage

from reasoning_trace_optimizer import TraceCapture, TraceAnalyzer

# Capture reasoning trace
capture = TraceCapture()
trace = capture.run(
    task="Explain quantum computing",
    system_prompt="You are a science educator."
)

print(f"Captured {len(trace.thinking_blocks)} thinking blocks")

# Analyze the reasoning
analyzer = TraceAnalyzer()
analysis = analyzer.analyze(trace)

print(f"Overall Score: {analysis.overall_score}/100")
for pattern in analysis.patterns:
    print(f"  [{pattern.severity.value}] {pattern.type.value}")
    print(f"    Suggestion: {pattern.suggestion}")

How It Works

The Optimization Loop

┌─────────────────────────────────────────────────────────────────────────┐
│                       OPTIMIZATION LOOP                                 │
│                                                                         │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐          │
│   │  Agent   │───▶│ Capture  │───▶│ Analyze  │───▶│ Optimize │          │
│   │ Execute  │    │ Traces   │    │ Patterns │    │  Prompt  │          │
│   └──────────┘    └──────────┘    └──────────┘    └──────────┘          │
│        ▲                                               │                │
│        └───────────────────────────────────────────────┘                │
│                       (loop until converged or max iterations)          │
│                                                                         │
│   Convergence: Score improvement < threshold OR score > target          │
└─────────────────────────────────────────────────────────────────────────┘

What Gets Captured

For each agent execution, we capture:

Thinking Blocks - M2.1's reasoning before each action
Tool Calls - What tools were called with what inputs
Tool Results - What each tool returned
Final Response - The agent's output
Metadata - Tokens used, turns taken, success/failure

What Gets Analyzed

The analyzer examines thinking blocks to understand:

Current Understanding - What does the agent believe about the task?
Tool Interpretation - How did it interpret each tool result?
Alternatives Considered - What options did it evaluate?
Goal Awareness - Is it still pursuing the original objective?

Examples

Example 1: Basic Trace Capture

# examples/01_basic_capture.py
from reasoning_trace_optimizer import TraceCapture

capture = TraceCapture()
trace = capture.run(
    task="Explain what interleaved thinking is and why it matters for AI agents.",
    system_prompt="You are an AI researcher explaining concepts clearly."
)

# Output:
# Captured 1 thinking block
# Turn 0: "The user is asking me to explain 'interleaved thinking'..."

Example 2: Tool Usage with Analysis

# examples/02_tool_usage.py
from reasoning_trace_optimizer import TraceCapture, TraceAnalyzer

# Define tools
tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "input_schema": {...}
    }
]

capture = TraceCapture()
trace = capture.run(
    task="Compare the weather in San Francisco and New York",
    tools=tools,
    tool_executor=execute_tool
)

# Analyze
analyzer = TraceAnalyzer()
analysis = analyzer.analyze(trace)

# Output:
# Score: 85/100
# Thinking Blocks: 3
# Tool Calls: 4 (get_weather x2, get_forecast x2)
# Patterns: None detected

Example 3: Full Optimization Loop

This example demonstrates a complex research task with 7 tools (web search, file operations, note-taking):

# examples/03_full_optimization.py
from reasoning_trace_optimizer import OptimizationLoop, LoopConfig, SkillGenerator

config = LoopConfig(
    max_iterations=3,
    min_score_threshold=85.0,
    convergence_threshold=5.0,
    save_artifacts=True,
)

loop = OptimizationLoop(config=config)
result = loop.run(
    task="""Research "context engineering for AI agents" and create a summary...""",
    initial_prompt="You are a research assistant.",
    tools=TOOLS,
    tool_executor=execute_tool,
)

# Generate shareable skill
generator = SkillGenerator()
skill_path = generator.generate(result, skill_name="research-agent")

Actual Output from Example 3:

======================================================================
OPTIMIZATION RESULTS
======================================================================

Total Iterations: 3
Converged: Yes

ITERATION 1 (Score: 69/100)
├── Task Completed: Yes
├── Thinking Blocks: 6
├── Tool Calls: 16
├── Patterns Found: 2
│   ├── [LOW] missing_validation
│   └── [LOW] incomplete_reasoning
├── Strengths: Excellent goal adherence, thorough source diversity
└── Warning: Prompt grew too large (2979 chars), limiting growth

ITERATION 2 (Score: 60/100)  ← Regression detected!
├── Task Completed: Yes
├── Thinking Blocks: 8
├── Tool Calls: 16
├── Patterns Found: 3
│   ├── [MEDIUM] incomplete_reasoning
│   ├── [MEDIUM] missing_validation
│   └── [LOW] tool_misuse

ITERATION 3 (Score: 66/100)
├── Task Completed: Yes
├── Thinking Blocks: 8
├── Tool Calls: 16
└── Patterns Found: 3

→ Using best prompt from iteration 1 (score: 67.6)

TOOL USAGE ACROSS ALL ITERATIONS:
├── read_url: 20 calls
├── web_search: 12 calls
├── list_directory: 7 calls
├── save_note: 6 calls
└── write_file: 3 calls

NOTES SAVED: 6 research notes with tagged findings
FILES WRITTEN: ./output/research_summary.md (11,357 chars)

GENERATED SKILL: ./generated_skills/comprehensive-research-agent/SKILL.md

Key Features Demonstrated:

Prompt Growth Limiting - Prevents prompt bloat by limiting expansion to 3x original size
Best Score Tracking - Automatically uses the best-performing prompt, even if later iterations regress
Regression Detection - Warns when scores drop and can stop after consecutive regressions

Generated Artifacts

Optimization Artifacts

Each optimization run creates artifacts for inspection:

optimization_artifacts/
├── summary.json              # Overall results
├── final_prompt.txt          # The optimized prompt
├── iteration_1/
│   ├── trace.json            # Full reasoning trace
│   ├── analysis.json         # Pattern detection results
│   └── optimization.json     # Prompt changes made
├── iteration_2/
│   └── ...
└── iteration_3/
    └── ...

Generated Skills

The SkillGenerator converts optimization learnings into shareable Agent Skills:

generated_skills/
└── comprehensive-research-agent/
    ├── SKILL.md              # The shareable skill
    └── references/
        ├── optimization_summary.json
        ├── optimized_prompt.txt
        └── patterns_found.json

Example Generated Skill Content:

## Patterns to Avoid

- **Missing Validation**: Accepting tool responses at face value without
  verifying the actual state change occurred.
- **Hallucinating Sources**: Citing sources that failed to load.
- **Ignoring Contradictions**: Proceeding when tool results conflict.

## Recommended Practices

- After every tool call, state the outcome explicitly
- Track sources separately: 'attempted' vs 'successful'
- Implement error recovery with alternative approaches
- Cross-reference key claims against multiple sources

API Reference

TraceCapture

capture = TraceCapture(
    api_key="...",                              # MiniMax API key
    base_url="https://api.minimax.io/anthropic", # API endpoint
    model="MiniMax-M2.1"                        # Model to use
)

trace = capture.run(
    task="...",                    # The task to execute
    system_prompt="...",           # System prompt
    tools=[...],                   # Tool definitions (Anthropic format)
    tool_executor=fn,              # Function to execute tools
    max_turns=10,                  # Maximum conversation turns
    max_tokens=4096                # Max tokens per response
)

TraceAnalyzer

analyzer = TraceAnalyzer(
    api_key="...",
    base_url="https://api.minimax.io/anthropic",
    model="MiniMax-M2.1"
)

analysis = analyzer.analyze(trace)
# Returns: AnalysisResult with patterns, scores, recommendations

quick_score = analyzer.quick_score(trace)
# Returns: float (0-100) for fast feedback

OptimizationLoop

config = LoopConfig(
    # Iteration control
    max_iterations=5,           # Maximum optimization iterations
    convergence_threshold=3.0,  # Stop if improvement < this %
    min_score_threshold=75.0,   # Stop if score exceeds this
    regression_threshold=8.0,   # Warn if score drops by this much

    # Optimization behavior
    use_best_prompt=True,       # Use best-performing prompt, not final
    max_prompt_growth=5.0,      # Limit prompt expansion to 5x original

    # Output options
    save_artifacts=True,        # Save traces and analyses
    artifacts_dir="./artifacts" # Where to save
)

loop = OptimizationLoop(config=config)
result = loop.run(task, initial_prompt, tools, tool_executor)
# Returns: LoopResult with iterations, final_prompt, scores

Optimization Safeguards:

Best Prompt Tracking: Keeps the prompt that produced the highest score
Prompt Growth Limiting: Prevents prompt bloat by limiting size expansion
Regression Detection: Warns on score drops, stops after consecutive regressions

Score Expectations:

Task Complexity	Typical Score Range	Notes
Simple (1-2 tools)	80-95	Straightforward tasks converge quickly
Medium (3-5 tools)	70-85	Multiple tool coordination adds variability
Complex (6+ tools, multi-step)	60-75	Inherent variance in long reasoning chains

Complex research tasks with many tools and steps typically plateau around 65-75 due to:

Tool output variability affecting reasoning paths
Multiple valid approaches leading to different scoring
The stochastic nature of multi-step agent execution

The optimizer focuses on relative improvement and pattern elimination rather than achieving a specific absolute score.

SkillGenerator

generator = SkillGenerator()
skill_path = generator.generate(
    result=loop_result,           # From OptimizationLoop
    skill_name="my-skill",        # Lowercase with hyphens
    output_dir="./generated_skills",
    title="Human Readable Title"
)

CLI Usage

# Capture a reasoning trace
rto capture "Explain interleaved thinking" -s "You are an AI researcher."

# Analyze a task and output results
rto analyze "Debug this code snippet" -o analysis.txt

# Run full optimization loop
rto optimize "Research AI papers" --max-iterations 5 --generate-skill

# Generate skill from previous optimization
rto generate-skill my-skill-name --artifacts-dir ./optimization_artifacts

Real-World Sources Used

Example 3 uses real documentation URLs for realistic simulation:

Source	URL
Anthropic Docs	`docs.anthropic.com/en/docs/build-with-claude/*`
Anthropic Research	`anthropic.com/research/building-effective-agents`
OpenAI Docs	`platform.openai.com/docs/guides/*`
MiniMax M2.1	`minimax.io/platform/docs/M2.1`
DAIR.AI	`promptingguide.ai/techniques`
LangChain	`python.langchain.com/docs/how_to/debugging`
arXiv Papers	`arxiv.org/abs/2307.03172` (Lost in the Middle)

Robustness Features

The optimizer includes several safeguards to handle real-world variability:

Parsing Resilience

LLM responses don't always produce valid JSON. The system handles this gracefully:

Component	Fallback Behavior
Analyzer	Extracts scores via regex patterns when JSON fails; defaults to 50/100 (not 0)
Optimizer	Multi-strategy prompt extraction: JSON → regex → marker detection → code blocks
Loop	Warns when final prompt is unchanged; tracks best-performing iteration

Extended Test Results (10 iterations)

Real-world testing revealed important insights:

Iteration  Score   Patterns  Tool Calls  Notes
────────────────────────────────────────────────
1          69/100    4         22        Baseline
2          66/100    3         14        -
3          61/100    3         17        -
4          72/100    3         20        ← Best score
5          59/100    4         16        -
6          50/100*   0         15        *Parser fallback activated
7          70/100    3         12        Recovery
8          64/100    3         14        -
9          64/100    3         18        -
10         70/100    3         19        Final

* Iteration 6: JSON parsing failed, fallback returned neutral score

Key Learnings:

Scores fluctuate ±15 points between iterations due to stochastic model behavior
Best score (72) was achieved mid-run, not at the end
use_best_prompt=True correctly selected iteration 4's prompt
Parsing failures now handled gracefully instead of returning 0 scores

Architecture

reasoning_trace_optimizer/
├── __init__.py          # Public API exports
├── models.py            # Data models (Pydantic)
│   ├── ThinkingBlock    # Single reasoning segment
│   ├── ToolCall         # Tool invocation record
│   ├── ReasoningTrace   # Complete execution trace
│   ├── Pattern          # Detected failure pattern
│   ├── AnalysisResult   # Full analysis output
│   └── LoopResult       # Optimization loop result
├── capture.py           # TraceCapture - M2.1 API wrapper
├── analyzer.py          # TraceAnalyzer - Pattern detection (with fallback parsing)
├── optimizer.py         # PromptOptimizer - Prompt improvement (with fallback extraction)
├── loop.py              # OptimizationLoop - Full cycle (with best-score tracking)
├── skill_generator.py   # SkillGenerator - Create skills
└── cli.py               # Command-line interface

Integration

Claude Code Skill

This project includes a Claude Code skill (SKILL.md) enabling:

Auto-trigger on failure - Analyze when agent tasks fail
On-demand analysis - Use /reasoning-trace-optimizer command
Session analysis - Analyze thinking from current conversation

Python Library

from reasoning_trace_optimizer import (
    TraceCapture,
    TraceAnalyzer,
    PromptOptimizer,
    OptimizationLoop,
    LoopConfig,
    SkillGenerator,
)

Contributing

This project is part of the Agent Skills for Context Engineering collection.

License

MIT License

References

Built in partnership with MiniMax AI Showcasing the power of interleaved thinking for agent debugging

Preparing the source view

Agent Skills for Context Engineering

examples/interleaved-thinking/README.md