examples/interleaved-thinking/README.md
# Reasoning Trace Optimizer

<p align="center">
<strong>Debug and optimize AI agents by analyzing reasoning traces with MiniMax M2.1's interleaved thinking</strong>
</p>

<p align="center">
<a href="#key-features">Features</a> |
<a href="#quick-start">Quick Start</a> |
<a href="#how-it-works">How It Works</a> |
<a href="#examples">Examples</a> |
<a href="#api-reference">API Reference</a>
</p>

---

## The Problem

Traditional AI agents fail in opaque ways. You see the final output, but not **why** decisions were made. When an agent:

- Calls the wrong tool
- Loses track of the goal
- Makes up information

...you're left guessing where things went wrong.

## The Solution

**Reasoning Trace Optimizer** uses MiniMax M2.1's unique **interleaved thinking** capability to expose the agent's reasoning process between every tool call. This enables:

1. **Deep Debugging** - See exactly where reasoning diverged from expected behavior
2. **Pattern Detection** - Automatically identify failure modes (context degradation, tool confusion, etc.)
3. **Automated Optimization** - Generate improved prompts based on detected issues
4. **Shareable Skills** - Convert learnings into reusable Agent Skills for team sharing

## Why MiniMax M2.1?

M2.1's **interleaved thinking** is fundamentally different from traditional reasoning models:

```
Traditional:  Think → Act → Act → Act → Done
                ↑
        (reasoning only at start)

M2.1:  Think → Act → Think → Act → Think → Act → Done
         ↑             ↑             ↑
       (continuous reasoning between each tool call)
```

This matters for agents because:

- **Long tasks** require maintaining focus across many turns
- **Tool outputs** introduce unexpected information requiring adaptation
- **Debugging** needs visibility into decision-making, not just outputs

The `thinking` block (Anthropic SDK) or `reasoning_details` field (OpenAI SDK) exposes this reasoning for analysis.

---

## Key Features

| Component | Description |
|-----------|-------------|
| **TraceCapture** | Wrap M2.1 API to capture all thinking blocks with full context |
| **TraceAnalyzer** | Detect patterns like context degradation, tool confusion, instruction drift |
| **PromptOptimizer** | Generate improved prompts based on analysis using M2.1 |
| **OptimizationLoop** | Automated capture → analyze → improve → re-run cycle |
| **SkillGenerator** | Convert learnings into shareable Agent Skills |

### Pattern Detection

The analyzer automatically identifies these failure patterns:

| Pattern | Description | Severity |
|---------|-------------|----------|
| `context_degradation` | Model loses information over long contexts | High |
| `tool_confusion` | Model misunderstands tool capabilities | High |
| `instruction_drift` | Model deviates from original instructions | Medium |
| `hallucination` | Model generates unsupported information | Critical |
| `goal_abandonment` | Model stops pursuing the original goal | High |
| `circular_reasoning` | Model repeats similar actions without progress | Medium |
| `premature_conclusion` | Model concludes before completing task | Medium |
| `missing_validation` | Model doesn't verify results | High |
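To give a feel for the kind of signal behind these labels, here is a deliberately naive `circular_reasoning` heuristic. This is an illustrative sketch only, not the library's actual detector (which, per the analyzer description, reasons over thinking blocks with M2.1); the function name and threshold are hypothetical.

```python
from collections import Counter

def naive_circular_reasoning_check(tool_calls, threshold=3):
    """Flag when the same tool is invoked with identical inputs repeatedly.

    tool_calls: list of (tool_name, serialized_input) tuples.
    Returns the repeated calls and their counts; a non-empty result
    suggests the agent may be looping without making progress.
    """
    counts = Counter(tool_calls)
    return {call: n for call, n in counts.items() if n >= threshold}

calls = [
    ("web_search", "context engineering"),
    ("web_search", "context engineering"),
    ("web_search", "context engineering"),
    ("read_url", "example.com"),
]
print(naive_circular_reasoning_check(calls))
```

A real detector also needs the reasoning text, since repeating a call can be legitimate (e.g., polling); that is why the analyzer inspects thinking blocks rather than call counts alone.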
Each detected pattern includes:

- **Evidence** - Specific excerpts from thinking blocks
- **Severity** - Critical/High/Medium/Low
- **Suggestion** - Concrete improvement for the prompt
- **Confidence** - How certain the detection is

---

## Quick Start

### Installation

```bash
cd examples/interleaved-thinking
pip install -e .
```

### Configuration

Set your MiniMax API key:

```bash
export ANTHROPIC_API_KEY=your_minimax_api_key
export ANTHROPIC_BASE_URL=https://api.minimax.io/anthropic
```

Or create a `.env` file:

```env
ANTHROPIC_API_KEY=your_minimax_api_key
ANTHROPIC_BASE_URL=https://api.minimax.io/anthropic
```

### Basic Usage

```python
from reasoning_trace_optimizer import TraceCapture, TraceAnalyzer

# Capture reasoning trace
capture = TraceCapture()
trace = capture.run(
    task="Explain quantum computing",
    system_prompt="You are a science educator."
)

print(f"Captured {len(trace.thinking_blocks)} thinking blocks")

# Analyze the reasoning
analyzer = TraceAnalyzer()
analysis = analyzer.analyze(trace)

print(f"Overall Score: {analysis.overall_score}/100")
for pattern in analysis.patterns:
    print(f"  [{pattern.severity.value}] {pattern.type.value}")
    print(f"  Suggestion: {pattern.suggestion}")
```

---

## How It Works

### The Optimization Loop

```
┌────────────────────────────────────────────────────────────────┐
│                       OPTIMIZATION LOOP                        │
│                                                                │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐  │
│  │  Agent   │───▶│ Capture  │───▶│ Analyze  │───▶│ Optimize │  │
│  │ Execute  │    │  Traces  │    │ Patterns │    │  Prompt  │  │
│  └──────────┘    └──────────┘    └──────────┘    └──────────┘  │
│       ▲                                               │        │
│       └───────────────────────────────────────────────┘        │
│           (loop until converged or max iterations)             │
│                                                                │
│  Convergence: Score improvement < threshold OR score > target  │
└────────────────────────────────────────────────────────────────┘
```
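The convergence rule in the loop diagram reduces to a small check. This is a hedged sketch under assumed semantics (stop when the score clears the target, or when iteration-over-iteration improvement falls below the threshold); the function name is illustrative, not the library's internals, though the defaults mirror `LoopConfig`:

```python
def converged(scores, improvement_threshold=3.0, target_score=75.0):
    """Decide whether the optimization loop should stop.

    scores: per-iteration scores so far (0-100).
    Stops when the latest score reaches the target, or when the
    last improvement is smaller than the threshold.
    """
    if not scores:
        return False
    if scores[-1] >= target_score:
        return True
    if len(scores) >= 2 and (scores[-1] - scores[-2]) < improvement_threshold:
        return True
    return False

print(converged([69.0]))        # single data point below target: keep looping
print(converged([69.0, 70.0]))  # +1 improvement is below the 3.0 threshold: stop
print(converged([60.0, 80.0]))  # above target: stop
```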
### What Gets Captured

For each agent execution, we capture:

1. **Thinking Blocks** - M2.1's reasoning before each action
2. **Tool Calls** - What tools were called with what inputs
3. **Tool Results** - What each tool returned
4. **Final Response** - The agent's output
5. **Metadata** - Tokens used, turns taken, success/failure

### What Gets Analyzed

The analyzer examines thinking blocks to understand:

- **Current Understanding** - What does the agent believe about the task?
- **Tool Interpretation** - How did it interpret each tool result?
- **Alternatives Considered** - What options did it evaluate?
- **Goal Awareness** - Is it still pursuing the original objective?

---

## Examples

### Example 1: Basic Trace Capture

```python
# examples/01_basic_capture.py
from reasoning_trace_optimizer import TraceCapture

capture = TraceCapture()
trace = capture.run(
    task="Explain what interleaved thinking is and why it matters for AI agents.",
    system_prompt="You are an AI researcher explaining concepts clearly."
)

# Output:
# Captured 1 thinking block
# Turn 0: "The user is asking me to explain 'interleaved thinking'..."
```

### Example 2: Tool Usage with Analysis

```python
# examples/02_tool_usage.py
from reasoning_trace_optimizer import TraceCapture, TraceAnalyzer

# Define tools
tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "input_schema": {...}
    }
]

capture = TraceCapture()
trace = capture.run(
    task="Compare the weather in San Francisco and New York",
    tools=tools,
    tool_executor=execute_tool
)

# Analyze
analyzer = TraceAnalyzer()
analysis = analyzer.analyze(trace)

# Output:
# Score: 85/100
# Thinking Blocks: 3
# Tool Calls: 4 (get_weather x2, get_forecast x2)
# Patterns: None detected
```
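Example 2 leaves `execute_tool` undefined: the `tool_executor` is whatever callable you supply. A minimal hypothetical executor with canned weather data might look like the following (the dispatch-by-name shape and the stub data are assumptions for illustration, not the example's actual implementation):

```python
def execute_tool(name, tool_input):
    """Hypothetical tool executor: dispatch on the tool name and
    return a string result for the model to read on the next turn."""
    fake_weather = {"San Francisco": "62°F, foggy", "New York": "45°F, clear"}
    if name == "get_weather":
        city = tool_input.get("city", "")
        return fake_weather.get(city, f"No data for {city}")
    return f"Unknown tool: {name}"

print(execute_tool("get_weather", {"city": "San Francisco"}))  # 62°F, foggy
```

In a real run the executor would call an actual API; returning plain strings keeps the capture layer agnostic to what the tools do.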
### Example 3: Full Optimization Loop

This example demonstrates a complex research task with 7 tools (web search, file operations, note-taking):

```python
# examples/03_full_optimization.py
from reasoning_trace_optimizer import OptimizationLoop, LoopConfig, SkillGenerator

config = LoopConfig(
    max_iterations=3,
    min_score_threshold=85.0,
    convergence_threshold=5.0,
    save_artifacts=True,
)

loop = OptimizationLoop(config=config)
result = loop.run(
    task="""Research "context engineering for AI agents" and create a summary...""",
    initial_prompt="You are a research assistant.",
    tools=TOOLS,
    tool_executor=execute_tool,
)

# Generate shareable skill
generator = SkillGenerator()
skill_path = generator.generate(result, skill_name="research-agent")
```

**Actual Output from Example 3:**

```
======================================================================
OPTIMIZATION RESULTS
======================================================================

Total Iterations: 3
Converged: Yes

ITERATION 1 (Score: 69/100)
├── Task Completed: Yes
├── Thinking Blocks: 6
├── Tool Calls: 16
├── Patterns Found: 2
│   ├── [LOW] missing_validation
│   └── [LOW] incomplete_reasoning
├── Strengths: Excellent goal adherence, thorough source diversity
└── Warning: Prompt grew too large (2979 chars), limiting growth

ITERATION 2 (Score: 60/100) ← Regression detected!
├── Task Completed: Yes
├── Thinking Blocks: 8
├── Tool Calls: 16
├── Patterns Found: 3
│   ├── [MEDIUM] incomplete_reasoning
│   ├── [MEDIUM] missing_validation
│   └── [LOW] tool_misuse

ITERATION 3 (Score: 66/100)
├── Task Completed: Yes
├── Thinking Blocks: 8
├── Tool Calls: 16
└── Patterns Found: 3

→ Using best prompt from iteration 1 (score: 67.6)

TOOL USAGE ACROSS ALL ITERATIONS:
├── read_url: 20 calls
├── web_search: 12 calls
├── list_directory: 7 calls
├── save_note: 6 calls
└── write_file: 3 calls

NOTES SAVED: 6 research notes with tagged findings
FILES WRITTEN: ./output/research_summary.md (11,357 chars)

GENERATED SKILL: ./generated_skills/comprehensive-research-agent/SKILL.md
```

**Key Features Demonstrated:**

1. **Prompt Growth Limiting** - Prevents prompt bloat by limiting expansion to 3x original size
2. **Best Score Tracking** - Automatically uses the best-performing prompt, even if later iterations regress
3. **Regression Detection** - Warns when scores drop and can stop after consecutive regressions

---

## Generated Artifacts

### Optimization Artifacts

Each optimization run creates artifacts for inspection:

```
optimization_artifacts/
├── summary.json              # Overall results
├── final_prompt.txt          # The optimized prompt
├── iteration_1/
│   ├── trace.json            # Full reasoning trace
│   ├── analysis.json         # Pattern detection results
│   └── optimization.json     # Prompt changes made
├── iteration_2/
│   └── ...
└── iteration_3/
    └── ...
```

### Generated Skills

The SkillGenerator converts optimization learnings into shareable Agent Skills:

```
generated_skills/
└── comprehensive-research-agent/
    ├── SKILL.md              # The shareable skill
    └── references/
        ├── optimization_summary.json
        ├── optimized_prompt.txt
        └── patterns_found.json
```

**Example Generated Skill Content:**

```markdown
## Patterns to Avoid

- **Missing Validation**: Accepting tool responses at face value without
  verifying the actual state change occurred.
- **Hallucinating Sources**: Citing sources that failed to load.
- **Ignoring Contradictions**: Proceeding when tool results conflict.

## Recommended Practices

- After every tool call, state the outcome explicitly
- Track sources separately: 'attempted' vs 'successful'
- Implement error recovery with alternative approaches
- Cross-reference key claims against multiple sources
```
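A "Patterns to Avoid" section like the one above can be produced mechanically from detected patterns. This toy renderer is a sketch of the idea only; the real SkillGenerator's internals and output format are not shown here, and the function and field names are hypothetical:

```python
def render_patterns_section(patterns):
    """Render (pattern_name, advice) pairs as a markdown
    'Patterns to Avoid' section, in the spirit of the generated skill."""
    lines = ["## Patterns to Avoid", ""]
    for name, advice in patterns:
        title = name.replace("_", " ").title()  # missing_validation -> Missing Validation
        lines.append(f"- **{title}**: {advice}")
    return "\n".join(lines)

md = render_patterns_section([
    ("missing_validation", "Verify tool results before relying on them."),
    ("hallucination", "Cite only sources that actually loaded."),
])
print(md)
```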
---

## API Reference

### TraceCapture

```python
capture = TraceCapture(
    api_key="...",                                # MiniMax API key
    base_url="https://api.minimax.io/anthropic",  # API endpoint
    model="MiniMax-M2.1"                          # Model to use
)

trace = capture.run(
    task="...",           # The task to execute
    system_prompt="...",  # System prompt
    tools=[...],          # Tool definitions (Anthropic format)
    tool_executor=fn,     # Function to execute tools
    max_turns=10,         # Maximum conversation turns
    max_tokens=4096       # Max tokens per response
)
```

### TraceAnalyzer

```python
analyzer = TraceAnalyzer(
    api_key="...",
    base_url="https://api.minimax.io/anthropic",
    model="MiniMax-M2.1"
)

analysis = analyzer.analyze(trace)
# Returns: AnalysisResult with patterns, scores, recommendations

quick_score = analyzer.quick_score(trace)
# Returns: float (0-100) for fast feedback
```

### OptimizationLoop

```python
config = LoopConfig(
    # Iteration control
    max_iterations=5,           # Maximum optimization iterations
    convergence_threshold=3.0,  # Stop if improvement < this %
    min_score_threshold=75.0,   # Stop if score exceeds this
    regression_threshold=8.0,   # Warn if score drops by this much

    # Optimization behavior
    use_best_prompt=True,       # Use best-performing prompt, not final
    max_prompt_growth=5.0,      # Limit prompt expansion to 5x original

    # Output options
    save_artifacts=True,        # Save traces and analyses
    artifacts_dir="./artifacts" # Where to save
)

loop = OptimizationLoop(config=config)
result = loop.run(task, initial_prompt, tools, tool_executor)
# Returns: LoopResult with iterations, final_prompt, scores
```

**Optimization Safeguards:**

- **Best Prompt Tracking**: Keeps the prompt that produced the highest score
- **Prompt Growth Limiting**: Prevents prompt bloat by limiting size expansion
- **Regression Detection**: Warns on score drops, stops after consecutive regressions
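Best-prompt tracking and regression detection compose naturally into one loop. The following is an illustrative sketch under assumed semantics, not the library's implementation; `iterate` stands in for one capture → analyze → optimize pass, and all names are hypothetical:

```python
def run_with_safeguards(iterate, initial_prompt, max_iterations=5, max_regressions=2):
    """Run optimization passes, keeping the best-scoring prompt and
    stopping early after consecutive score regressions.

    iterate: callable(prompt) -> (score, improved_prompt).
    """
    prompt = initial_prompt
    best_prompt, best_score = prompt, float("-inf")
    prev_score, regressions = float("-inf"), 0
    for _ in range(max_iterations):
        score, prompt = iterate(prompt)
        if score > best_score:
            best_prompt, best_score = prompt, score  # best-prompt tracking
        if score < prev_score:
            regressions += 1
            if regressions >= max_regressions:
                break  # consecutive regressions: stop early
        else:
            regressions = 0
        prev_score = score
    return best_prompt, best_score

# Simulated per-iteration results echoing Example 3's score shape (hypothetical)
results = iter([(69, "p1"), (60, "p2"), (66, "p3")])
best_prompt, best_score = run_with_safeguards(
    lambda prompt: next(results), "You are a research assistant.", max_iterations=3
)
```

Note how the returned prompt comes from the highest-scoring iteration, not the last one, which is exactly what `use_best_prompt=True` protects against when later iterations regress.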
Expectations:**443444| Task Complexity | Typical Score Range | Notes |445|-----------------|---------------------|-------|446| Simple (1-2 tools) | 80-95 | Straightforward tasks converge quickly |447| Medium (3-5 tools) | 70-85 | Multiple tool coordination adds variability |448| Complex (6+ tools, multi-step) | 60-75 | Inherent variance in long reasoning chains |449450Complex research tasks with many tools and steps typically plateau around **65-75** due to:451- Tool output variability affecting reasoning paths452- Multiple valid approaches leading to different scoring453- The stochastic nature of multi-step agent execution454455The optimizer focuses on **relative improvement** and **pattern elimination** rather than achieving a specific absolute score.456457### SkillGenerator458459```python460generator = SkillGenerator()461skill_path = generator.generate(462result=loop_result, # From OptimizationLoop463skill_name="my-skill", # Lowercase with hyphens464output_dir="./generated_skills",465title="Human Readable Title"466)467```468469---470471## CLI Usage472473```bash474# Capture a reasoning trace475rto capture "Explain interleaved thinking" -s "You are an AI researcher."476477# Analyze a task and output results478rto analyze "Debug this code snippet" -o analysis.txt479480# Run full optimization loop481rto optimize "Research AI papers" --max-iterations 5 --generate-skill482483# Generate skill from previous optimization484rto generate-skill my-skill-name --artifacts-dir ./optimization_artifacts485```486487---488489## Real-World Sources Used490491Example 3 uses real documentation URLs for realistic simulation:492493| Source | URL |494|--------|-----|495| Anthropic Docs | `docs.anthropic.com/en/docs/build-with-claude/*` |496| Anthropic Research | `anthropic.com/research/building-effective-agents` |497| OpenAI Docs | `platform.openai.com/docs/guides/*` |498| MiniMax M2.1 | `minimax.io/platform/docs/M2.1` |499| DAIR.AI | `promptingguide.ai/techniques` |500| LangChain | 
| LangChain | `python.langchain.com/docs/how_to/debugging` |
| arXiv Papers | `arxiv.org/abs/2307.03172` (Lost in the Middle) |

---

## Robustness Features

The optimizer includes several safeguards to handle real-world variability:

### Parsing Resilience

LLM responses don't always produce valid JSON. The system handles this gracefully:

| Component | Fallback Behavior |
|-----------|-------------------|
| **Analyzer** | Extracts scores via regex patterns when JSON fails; defaults to 50/100 (not 0) |
| **Optimizer** | Multi-strategy prompt extraction: JSON → regex → marker detection → code blocks |
| **Loop** | Warns when final prompt is unchanged; tracks best-performing iteration |

### Extended Test Results (10 iterations)

Real-world testing revealed important insights:

```
Iteration  Score    Patterns  Tool Calls  Notes
───────────────────────────────────────────────────────────────
1          69/100   4         22          Baseline
2          66/100   3         14          -
3          61/100   3         17          -
4          72/100   3         20          ← Best score
5          59/100   4         16          -
6          50/100*  0         15          *Parser fallback activated
7          70/100   3         12          Recovery
8          64/100   3         14          -
9          64/100   3         18          -
10         70/100   3         19          Final

* Iteration 6: JSON parsing failed, fallback returned neutral score
```

**Key Learnings:**

- Scores fluctuate ±15 points between iterations due to stochastic model behavior
- Best score (72) was achieved mid-run, not at the end
- `use_best_prompt=True` correctly selected iteration 4's prompt
- Parsing failures now handled gracefully instead of returning 0 scores

---

## Architecture

```
reasoning_trace_optimizer/
├── __init__.py            # Public API exports
├── models.py              # Data models (Pydantic)
│   ├── ThinkingBlock      # Single reasoning segment
│   ├── ToolCall           # Tool invocation record
│   ├── ReasoningTrace     # Complete execution trace
│   ├── Pattern            # Detected failure pattern
│   ├── AnalysisResult     # Full analysis output
│   └── LoopResult         # Optimization loop result
├── capture.py             # TraceCapture - M2.1 API wrapper
├── analyzer.py            # TraceAnalyzer - Pattern detection (with fallback parsing)
├── optimizer.py           # PromptOptimizer - Prompt improvement (with fallback extraction)
├── loop.py                # OptimizationLoop - Full cycle (with best-score tracking)
├── skill_generator.py     # SkillGenerator - Create skills
└── cli.py                 # Command-line interface
```

---

## Integration

### Claude Code Skill

This project includes a Claude Code skill (`SKILL.md`) enabling:

- **Auto-trigger on failure** - Analyze when agent tasks fail
- **On-demand analysis** - Use `/reasoning-trace-optimizer` command
- **Session analysis** - Analyze thinking from current conversation

### Python Library

```python
from reasoning_trace_optimizer import (
    TraceCapture,
    TraceAnalyzer,
    PromptOptimizer,
    OptimizationLoop,
    LoopConfig,
    SkillGenerator,
)
```

---

## Contributing

This project is part of the [Agent Skills for Context Engineering](https://github.com/muratcankoylan/Agent-Skills-for-Context-Engineering) collection.

---

## License

MIT License

---

## References

- [MiniMax M2.1 Documentation](https://www.minimax.io/platform/docs)
- [MiniMax API Reference](https://www.minimax.io/platform/docs/M2.1)
- [Interleaved Thinking Guide](./docs/interleavedthinking.md)
- [Agent Generalization Research](./docs/agentthinking.md)
- [Anthropic API Compatibility](./docs/m2-1.md)

---

<p align="center">
<strong>Built in partnership with MiniMax AI</strong><br>
Showcasing the power of interleaved thinking for agent debugging
</p>