Agent Skills for Context Engineering

A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems.
Source repo: muratcankoylan (GitHub)
Files: 339
Skill: n/a
Size: 4.3 MB
Entrypoint: SKILL.md
Format: git-repo
examples/interleaved-thinking/reasoning_trace_optimizer/analyzer.py

"""
TraceAnalyzer: Analyzes reasoning traces to detect patterns and issues.

Uses M2.1's own interleaved thinking to analyze agent reasoning traces,
detecting patterns like context degradation, tool confusion, and instruction drift.
"""

import json
import os
from typing import Any

import anthropic

from reasoning_trace_optimizer.models import (
    AnalysisResult,
    Pattern,
    PatternType,
    ReasoningTrace,
    Severity,
)


ANALYSIS_SYSTEM_PROMPT = """You are an expert AI agent debugger specializing in analyzing reasoning traces.

Your task is to analyze an agent's interleaved thinking trace and identify:
1. **Patterns of failure** - detect specific failure modes with evidence
2. **Quality scores** - rate the agent's reasoning on multiple dimensions
3. **Actionable recommendations** - specific improvements for prompts/instructions

## Pattern Definitions

Detect these patterns with specific evidence from thinking blocks:

- **context_degradation**: Agent loses or forgets information from earlier in the conversation
  - Look for: Repeated questions, contradicting earlier statements, missing key details
- **tool_confusion**: Agent misunderstands what a tool does or how to use it
  - Look for: Wrong tool selection, incorrect parameters, misinterpreting results
- **instruction_drift**: Agent gradually deviates from original instructions/persona
  - Look for: Changing behavior, ignoring constraints, different tone over time
- **hallucination**: Agent generates information not supported by context or tools
  - Look for: Made-up facts, fabricated tool results, unsourced claims
- **incomplete_reasoning**: Agent reaches conclusions without thorough analysis
  - Look for: Skipped steps, missing validation, superficial exploration
- **tool_misuse**: Agent uses tools incorrectly or inefficiently
  - Look for: Redundant calls, wrong parameters, unused results
- **goal_abandonment**: Agent stops pursuing the original objective
  - Look for: Topic drift, giving up, switching goals without reason
- **circular_reasoning**: Agent repeats similar actions without progress
  - Look for: Same queries repeated, looping behavior, no new information
- **premature_conclusion**: Agent concludes before completing the task
  - Look for: Early stops, incomplete answers, skipped requirements
- **missing_validation**: Agent doesn't verify results or assumptions
  - Look for: No cross-checking, accepting first result, no error handling

## Analysis Focus

You have access to the FULL reasoning trace including all thinking blocks between tool calls.
This gives you unique insight into HOW the agent reasons, not just what it outputs.

For each thinking block, examine:
- What is the agent's current understanding?
- How does it interpret tool results?
- What alternatives does it consider?
- Does it maintain awareness of the original goal?

Provide your analysis in the specified JSON format with concrete evidence."""


ANALYSIS_PROMPT_TEMPLATE = """Analyze the following agent reasoning trace:

## Task
{task}

## System Prompt Given to Agent
{system_prompt}

## Reasoning Trace
{trace}

## Tool Calls Made
{tool_calls}

## Final Outcome
Success: {success}
Final Response: {final_response}
Error (if any): {error}

---

Provide your analysis as JSON with this exact structure:
```json
{{
    "patterns": [
        {{
            "type": "<one of: context_degradation, tool_confusion, instruction_drift, hallucination, incomplete_reasoning, tool_misuse, goal_abandonment, circular_reasoning, premature_conclusion, missing_validation>",
            "severity": "<one of: low, medium, high, critical>",
            "description": "<what the pattern is>",
            "evidence": ["<excerpt from thinking>", "<another excerpt>"],
            "turn_indices": [0, 2],
            "suggestion": "<how to fix this>",
            "confidence": 0.85
        }}
    ],
    "scores": {{
        "reasoning_clarity": 75,
        "goal_adherence": 80,
        "tool_usage_quality": 60,
        "error_recovery": 50,
        "overall": 66
    }},
    "strengths": ["<strength 1>", "<strength 2>"],
    "weaknesses": ["<weakness 1>", "<weakness 2>"],
    "recommendations": [
        "<specific actionable recommendation>",
        "<another recommendation>"
    ]
}}
```

Think carefully about each aspect before providing your analysis."""


class TraceAnalyzer:
    """
    Analyzes reasoning traces using M2.1 to detect patterns and score quality.

    The analyzer uses M2.1's interleaved thinking to deeply understand
    the agent's reasoning process and identify issues that wouldn't be
    visible from outputs alone.

    Example:
        ```python
        analyzer = TraceAnalyzer()
        result = analyzer.analyze(trace)

        print(f"Overall score: {result.overall_score}")
        for pattern in result.patterns:
            print(f"Found: {pattern.type.value} ({pattern.severity.value})")
        ```
    """

    def __init__(
        self,
        api_key: str | None = None,
        base_url: str = "https://api.minimax.io/anthropic",
        model: str = "MiniMax-M2.1",
    ):
        """
        Initialize TraceAnalyzer with M2.1 configuration.

        Args:
            api_key: MiniMax API key; falls back to the ANTHROPIC_API_KEY
                environment variable when not provided
            base_url: API endpoint
            model: Model for analysis (M2.1 recommended for best results)
        """
        self.model = model
        self.client = anthropic.Anthropic(
            api_key=api_key or os.environ.get("ANTHROPIC_API_KEY"),
            base_url=base_url,
        )

    def analyze(
        self,
        trace: ReasoningTrace,
        max_tokens: int = 8192,
    ) -> AnalysisResult:
        """
        Analyze a reasoning trace and return detailed analysis.

        Args:
            trace: The reasoning trace to analyze
            max_tokens: Maximum tokens for analysis response

        Returns:
            AnalysisResult with patterns, scores, and recommendations
        """
        # Format trace for analysis
        trace_text = self._format_trace_for_analysis(trace)
        tool_calls_text = self._format_tool_calls(trace)

        prompt = ANALYSIS_PROMPT_TEMPLATE.format(
            task=trace.task,
            system_prompt=trace.system_prompt,
            trace=trace_text,
            tool_calls=tool_calls_text,
            success=trace.success,
            final_response=trace.final_response or "None",
            error=trace.error or "None",
        )

        # Call M2.1 for analysis
        response = self.client.messages.create(
            model=self.model,
            max_tokens=max_tokens,
            system=ANALYSIS_SYSTEM_PROMPT,
            messages=[{"role": "user", "content": prompt}],
        )

        # Extract thinking and text from response
        # (if several blocks of one type are returned, the last one wins)
        analyzer_thinking = ""
        analysis_text = ""

        for block in response.content:
            if block.type == "thinking":
                analyzer_thinking = block.thinking
            elif block.type == "text":
                analysis_text = block.text

        # Parse the JSON response
        result = self._parse_analysis_response(analysis_text, trace.session_id)
        result.analyzer_thinking = analyzer_thinking
        result.analyzer_model = self.model

        return result

    def analyze_batch(
        self,
        traces: list[ReasoningTrace],
    ) -> list[AnalysisResult]:
        """Analyze multiple traces and return results."""
        return [self.analyze(trace) for trace in traces]

    def quick_score(
        self,
        trace: ReasoningTrace,
    ) -> float:
        """
        Get a quick overall score without full pattern analysis.

        Useful for optimization loops where you need fast feedback.

        Args:
            trace: The reasoning trace to score

        Returns:
            Overall score from 0-100 (50.0 if the reply cannot be parsed)
        """
        quick_prompt = f"""Rate this agent's performance from 0-100 based on its reasoning trace.

Task: {trace.task}
Success: {trace.success}
Turns: {trace.total_turns}

Thinking excerpts:
{self._get_thinking_excerpts(trace, max_chars=2000)}

Respond with ONLY a number from 0-100."""

        response = self.client.messages.create(
            model=self.model,
            max_tokens=100,
            messages=[{"role": "user", "content": quick_prompt}],
        )

        # Extract score from response
        for block in response.content:
            if block.type == "text":
                try:
                    score = float(block.text.strip())
                    return min(100, max(0, score))
                except ValueError:
                    pass

        return 50.0  # Default middle score if parsing fails

    def _format_trace_for_analysis(self, trace: ReasoningTrace) -> str:
        """Format thinking blocks for analysis."""
        parts = []
        for thinking in trace.thinking_blocks:
            parts.append(f"[Turn {thinking.turn_index}] Thinking:")
            parts.append(thinking.content)
            parts.append("")

        return "\n".join(parts)

    def _format_tool_calls(self, trace: ReasoningTrace) -> str:
        """Format tool calls for analysis."""
        if not trace.tool_calls:
            return "No tool calls made."

        parts = []
        for tc in trace.tool_calls:
            status = "Success" if tc.success else f"Failed: {tc.error}"
            parts.append(
                f"- {tc.name}({json.dumps(tc.input)}) -> {status}\n"
                f"  Result: {tc.result[:200] if tc.result else 'None'}..."
            )

        return "\n".join(parts)

    def _get_thinking_excerpts(self, trace: ReasoningTrace, max_chars: int = 2000) -> str:
        """Get excerpts from thinking blocks."""
        excerpts = []
        remaining = max_chars

        for thinking in trace.thinking_blocks:
            if remaining <= 0:
                break
            excerpt = thinking.content[:remaining]
            excerpts.append(f"[Turn {thinking.turn_index}]: {excerpt}")
            # Charge the excerpt plus a flat 20 characters for the label
            remaining -= len(excerpt) + 20

        return "\n\n".join(excerpts)

    def _parse_analysis_response(
        self,
        response_text: str,
        trace_id: str,
    ) -> AnalysisResult:
        """Parse the JSON analysis response from M2.1."""
        result = AnalysisResult(trace_id=trace_id)

        try:
            # Extract JSON from response (may have markdown code blocks)
            json_text = response_text
            if "```json" in response_text:
                json_text = response_text.split("```json")[1].split("```")[0]
            elif "```" in response_text:
                json_text = response_text.split("```")[1].split("```")[0]

            data = json.loads(json_text)

            # Parse patterns
            for p in data.get("patterns", []):
                try:
                    pattern = Pattern(
                        type=PatternType(p["type"]),
                        severity=Severity(p["severity"]),
                        description=p["description"],
                        evidence=p.get("evidence", []),
                        turn_indices=p.get("turn_indices", []),
                        suggestion=p.get("suggestion", ""),
                        confidence=p.get("confidence", 0.5),
                    )
                    result.patterns.append(pattern)
                except (KeyError, ValueError, TypeError):
                    # Skip malformed entries: missing keys, unknown enum
                    # values, or items that are not JSON objects
                    continue

            # Parse scores
            scores = data.get("scores", {})
            result.reasoning_clarity = scores.get("reasoning_clarity", 0)
            result.goal_adherence = scores.get("goal_adherence", 0)
            result.tool_usage_quality = scores.get("tool_usage_quality", 0)
            result.error_recovery = scores.get("error_recovery", 0)
            result.overall_score = scores.get("overall", 0)

            # Parse feedback
            result.strengths = data.get("strengths", [])
            result.weaknesses = data.get("weaknesses", [])
            result.recommendations = data.get("recommendations", [])

        except (json.JSONDecodeError, KeyError, TypeError, AttributeError) as e:
            # If parsing fails (invalid JSON, or a top-level value that is not
            # an object), try fallback extraction and set reasonable defaults
            result = self._fallback_parse_analysis(response_text, trace_id, str(e))

        # Warn if score is suspiciously low (likely parsing failure)
        if result.overall_score == 0 and not result.patterns:
            result.weaknesses.append("WARNING: Analysis may have failed - score is 0 with no patterns detected")
            # Try to extract a score from the response text as fallback
            fallback_score = self._extract_fallback_score(response_text)
            if fallback_score > 0:
                result.overall_score = fallback_score
                result.recommendations.append(f"Score extracted via fallback: {fallback_score}")

        return result

    def _fallback_parse_analysis(
        self,
        response_text: str,
        trace_id: str,
        error_msg: str,
    ) -> AnalysisResult:
        """Fallback parsing when JSON extraction fails."""
        import re

        result = AnalysisResult(trace_id=trace_id)

        # Try to extract score from text patterns like "Overall Score: 75" or "overall": 75
        score_patterns = [
            r'overall["\s:]+(\d+)',
            r'Overall Score[:\s]+(\d+)',
            r'"overall"[:\s]+(\d+)',
            r'Score[:\s]+(\d+)/100',
        ]

        for pattern in score_patterns:
            match = re.search(pattern, response_text, re.IGNORECASE)
            if match:
                result.overall_score = min(100, max(0, int(match.group(1))))
                break

        # If still no score, use a neutral default (not 0)
        if result.overall_score == 0:
            result.overall_score = 50  # Neutral default instead of 0

        result.recommendations = [
            f"Analysis parsing failed ({error_msg}). Using fallback extraction.",
            "Consider re-running analysis if results seem inconsistent."
        ]
        result.weaknesses = ["JSON parsing failed - analysis may be incomplete"]

        return result

    def _extract_fallback_score(self, response_text: str) -> float:
        """Extract a score from response text when JSON parsing fails."""
        import re

        patterns = [
            r'overall["\s:]+(\d+)',
            r'Overall Score[:\s]+(\d+)',
            r'"overall"[:\s]+(\d+)',
            r'(\d+)/100',
            r'score[:\s]+(\d+)',
        ]

        for pattern in patterns:
            match = re.search(pattern, response_text, re.IGNORECASE)
            if match:
                score = int(match.group(1))
                if 0 <= score <= 100:
                    return float(score)

        return 0.0


def format_analysis_report(analysis: AnalysisResult) -> str:
    """Format an analysis result as a human-readable report."""
    lines = [
        "=" * 60,
        "REASONING TRACE ANALYSIS REPORT",
        "=" * 60,
        "",
        f"Overall Score: {analysis.overall_score}/100",
        "",
        "Scores:",
        f"  - Reasoning Clarity: {analysis.reasoning_clarity}/100",
        f"  - Goal Adherence: {analysis.goal_adherence}/100",
        f"  - Tool Usage Quality: {analysis.tool_usage_quality}/100",
        f"  - Error Recovery: {analysis.error_recovery}/100",
        "",
    ]

    if analysis.patterns:
        lines.append("Detected Patterns:")
        for p in analysis.patterns:
            lines.append(f"\n  [{p.severity.value.upper()}] {p.type.value}")
            lines.append(f"    {p.description}")
            lines.append(f"    Suggestion: {p.suggestion}")

    if analysis.strengths:
        lines.append("\nStrengths:")
        for s in analysis.strengths:
            lines.append(f"  + {s}")

    if analysis.weaknesses:
        lines.append("\nWeaknesses:")
        for w in analysis.weaknesses:
            lines.append(f"  - {w}")

    if analysis.recommendations:
        lines.append("\nRecommendations:")
        for i, r in enumerate(analysis.recommendations, 1):
            lines.append(f"  {i}. {r}")

    return "\n".join(lines)
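
The fence handling in `_parse_analysis_response` can be exercised in isolation. What follows is a minimal sketch and not part of analyzer.py: `extract_json_block` is a hypothetical helper that uses a regular expression instead of `str.split`, and it assumes the reply holds at most one fenced block worth parsing. The fence marker is built from single backticks only so the example can sit inside a fenced block itself.

```python
import json
import re

# Hypothetical helper, not part of analyzer.py.
FENCE = "`" * 3  # three backticks, built indirectly on purpose
_FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json_block(reply: str) -> dict:
    """Return the JSON object found in a reply, or {} if none can be parsed."""
    match = _FENCE_RE.search(reply)
    # Prefer the first fenced block; otherwise treat the whole reply as JSON.
    candidate = match.group(1) if match else reply
    try:
        data = json.loads(candidate)
    except json.JSONDecodeError:
        return {}
    # The analysis format expects an object at the top level.
    return data if isinstance(data, dict) else {}


fenced = (
    "Here is my analysis:\n"
    + FENCE + "json\n"
    + '{"scores": {"overall": 66}}\n'
    + FENCE + "\nDone."
)
bare = '{"scores": {"overall": 80}}'

print(extract_json_block(fenced)["scores"]["overall"])
print(extract_json_block(bare)["scores"]["overall"])
print(extract_json_block("no json here"))
```

Returning an empty dict on failure keeps the caller's `data.get(...)` lookups safe; whether to fall back to regex score extraction at that point is a separate decision, as it is in the module above.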
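
The character budget in `_get_thinking_excerpts` is easier to follow with small numbers. This sketch re-implements the same loop as a standalone function so it runs without the package installed; `ThinkingBlock` is a stand-in dataclass with only the two fields the loop reads, and `thinking_excerpts` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass
class ThinkingBlock:
    """Stand-in for the project's thinking-block model (assumed fields only)."""
    turn_index: int
    content: str


def thinking_excerpts(blocks: list[ThinkingBlock], max_chars: int = 2000) -> str:
    """Concatenate block excerpts until the character budget is used up."""
    excerpts = []
    remaining = max_chars
    for block in blocks:
        if remaining <= 0:
            break
        excerpt = block.content[:remaining]
        excerpts.append(f"[Turn {block.turn_index}]: {excerpt}")
        # Charge the excerpt plus a flat 20 characters for the label.
        remaining -= len(excerpt) + 20
    return "\n\n".join(excerpts)


blocks = [
    ThinkingBlock(0, "a" * 50),
    ThinkingBlock(1, "b" * 50),
    ThinkingBlock(2, "c" * 50),
]
result = thinking_excerpts(blocks, max_chars=100)
print(result)
```

With a 100-character budget, the first block is kept whole (leaving 30), the second is cut to 30 characters, and the third is dropped because the budget is negative by then. Early turns therefore dominate the excerpt, which is worth keeping in mind when a failure only shows up late in a trace.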