Source from repo
A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems.
researcher/claims/index.jsonl
{"claim_id":"claim-evaluation-browsecomp-variance","claim_text":"BrowseComp-style browsing performance is dominated by token usage, with tool calls and model choice as secondary drivers.","owning_skill":"evaluation","section":"Core Concepts / Performance Drivers","source_url":"docs/blogs.md","retrieved_at":"2026-05-15","evidence_strength":"secondary","volatility":"high","last_reviewed":"2026-05-15"}
{"claim_id":"claim-multi-agent-token-multiplier","claim_text":"Multi-agent systems can cost substantially more tokens than single-agent chat and should be justified by context isolation or parallel exploration.","owning_skill":"multi-agent-patterns","section":"Core Concepts / Token Economics","source_url":"docs/blogs.md","retrieved_at":"2026-05-15","evidence_strength":"secondary","volatility":"high","last_reviewed":"2026-05-15"}
{"claim_id":"claim-context-optimization-tool-output-dominance","claim_text":"Tool outputs frequently dominate agent trajectory tokens, so observation masking often yields the largest context-capacity gain.","owning_skill":"context-optimization","section":"Core Concepts / Observation masking","source_url":"docs/claude_research.md","retrieved_at":"2026-05-15","evidence_strength":"secondary","volatility":"medium","last_reviewed":"2026-05-15"}
{"claim_id":"claim-memory-locomo-filesystem-baseline","claim_text":"Filesystem-style memory baselines can outperform more specialized memory tooling on some long-conversation memory benchmarks.","owning_skill":"memory-systems","section":"Core Concepts","source_url":"skills/memory-systems/references/implementation.md","retrieved_at":"2026-05-15","evidence_strength":"secondary","volatility":"high","last_reviewed":"2026-05-15"}
{"claim_id":"claim-advanced-evaluation-position-swap","claim_text":"Pairwise LLM evaluation should mitigate position bias by judging both response orders and treating disagreement as lower confidence.","owning_skill":"advanced-evaluation","section":"Pairwise Comparison Implementation","source_url":"examples/llm-as-judge-skills/src/tools/evaluation/pairwise-compare.ts","retrieved_at":"2026-05-15","evidence_strength":"derived","volatility":"medium","last_reviewed":"2026-05-15"}
{"claim_id":"claim-harness-locked-evaluator","claim_text":"Autonomous loops need locked evaluators and narrow editable surfaces to prevent agents from approving their own weakened metrics.","owning_skill":"harness-engineering","section":"Core Concepts / Harness Boundary","source_url":"https://github.com/karpathy/autoresearch/blob/master/program.md","retrieved_at":"2026-05-15","evidence_strength":"primary","volatility":"low","last_reviewed":"2026-05-15"}
{"claim_id":"claim-context-compression-factory-benchmark","claim_text":"Structured, anchored compression preserves agent task continuity better than generic compression in a production-session probe evaluation, while artifact tracking remains weak across methods.","owning_skill":"context-compression","section":"Core Concepts / Artifact Trail","source_url":"docs/compression.md","retrieved_at":"2026-05-15","evidence_strength":"secondary","volatility":"medium","last_reviewed":"2026-05-15"}
{"claim_id":"claim-context-degradation-lost-middle-ruler","claim_text":"Long-context systems show middle-position recall degradation and advertised context length does not guarantee task performance at that length.","owning_skill":"context-degradation","section":"Core Concepts / Lost-in-Middle","source_url":"docs/claude_research.md","retrieved_at":"2026-05-15","evidence_strength":"secondary","volatility":"medium","last_reviewed":"2026-05-15"}
{"claim_id":"claim-context-degradation-distractor-shuffled","claim_text":"Distractors and context ordering can materially affect retrieval behavior; some shuffled haystack setups outperform coherent ordering for specific retrieval tasks.","owning_skill":"context-degradation","section":"Detailed Topics / Counterintuitive Findings","source_url":"docs/claude_research.md","retrieved_at":"2026-05-15","evidence_strength":"secondary","volatility":"medium","last_reviewed":"2026-05-15"}
{"claim_id":"claim-tool-design-vercel-d0-reduction","claim_text":"Vercel's d0 case study reports better measured outcomes after reducing an agent from many specialized tools to a small primitive tool set.","owning_skill":"tool-design","section":"Core Concepts / Consolidation Principle","source_url":"docs/vercel_tool.md","retrieved_at":"2026-05-15","evidence_strength":"secondary","volatility":"medium","last_reviewed":"2026-05-15"}
{"claim_id":"claim-project-development-vercel-d0-reduction","claim_text":"Vercel's d0 case study shows architectural reduction can improve agent success, latency, token usage, and step count when the underlying data layer is well documented.","owning_skill":"project-development","section":"Detailed Topics / Architectural Reduction","source_url":"docs/vercel_tool.md","retrieved_at":"2026-05-15","evidence_strength":"secondary","volatility":"medium","last_reviewed":"2026-05-15"}
{"claim_id":"claim-latent-briefing-public-results","claim_text":"Public Latent Briefing results report substantial worker-token reduction, material total-token savings, and low-single-digit-second compaction overhead on long-document QA workloads.","owning_skill":"latent-briefing","section":"Core Concepts / Reference result shape","source_url":"skills/latent-briefing/references/attention-matching-formulation.md","retrieved_at":"2026-05-15","evidence_strength":"secondary","volatility":"high","last_reviewed":"2026-05-15"}
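
Each record in the index carries the same nine fields, so it can be read one JSON object per line and grouped by any of them. The following is a minimal sketch of that, not tooling from the skill package: the helper names `load_claims` and `claim_ids_by` are hypothetical, and the required-field set is inferred from the records shown here rather than from a published schema. The two sample records are copied from the listing.

```python
import json
from collections import defaultdict

# Field names observed in the records above (assumed, not a documented schema).
REQUIRED_FIELDS = {
    "claim_id", "claim_text", "owning_skill", "section", "source_url",
    "retrieved_at", "evidence_strength", "volatility", "last_reviewed",
}


def load_claims(lines):
    """Parse JSONL lines into dicts; skip blank lines, reject incomplete records."""
    claims = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"line {number}: missing fields {sorted(missing)}")
        claims.append(record)
    return claims


def claim_ids_by(claims, field):
    """Group claim IDs by the value of one field, e.g. 'volatility'."""
    groups = defaultdict(list)
    for claim in claims:
        groups[claim[field]].append(claim["claim_id"])
    return dict(groups)


# Two records copied verbatim from the index.
SAMPLE = """\
{"claim_id":"claim-harness-locked-evaluator","claim_text":"Autonomous loops need locked evaluators and narrow editable surfaces to prevent agents from approving their own weakened metrics.","owning_skill":"harness-engineering","section":"Core Concepts / Harness Boundary","source_url":"https://github.com/karpathy/autoresearch/blob/master/program.md","retrieved_at":"2026-05-15","evidence_strength":"primary","volatility":"low","last_reviewed":"2026-05-15"}
{"claim_id":"claim-multi-agent-token-multiplier","claim_text":"Multi-agent systems can cost substantially more tokens than single-agent chat and should be justified by context isolation or parallel exploration.","owning_skill":"multi-agent-patterns","section":"Core Concepts / Token Economics","source_url":"docs/blogs.md","retrieved_at":"2026-05-15","evidence_strength":"secondary","volatility":"high","last_reviewed":"2026-05-15"}
"""

claims = load_claims(SAMPLE.splitlines())
by_volatility = claim_ids_by(claims, "volatility")
by_evidence = claim_ids_by(claims, "evidence_strength")
```

Grouping by `volatility` is one plausible way to decide which claims to re-check first; the same helper works for `owning_skill` or `evidence_strength`.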