Source from repo
A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems.
researcher/benchmarks/router/prompts.jsonl
{"prompt_id":"p001","prompt":"Explain why context windows degrade as they fill, and how attention mechanics make middle-of-context information less recoverable.","expected_primary_skill":"context-fundamentals","acceptable_secondary_skills":["context-degradation"],"rejected_skills":["bdi-mental-states"],"reason":"Foundational explanation of context behavior is core to context-fundamentals."}
{"prompt_id":"p002","prompt":"My agent is silently failing on long tasks. I see it ignoring information from the middle of the conversation. Help me diagnose what is going wrong.","expected_primary_skill":"context-degradation","acceptable_secondary_skills":["context-fundamentals","context-optimization"],"rejected_skills":["memory-systems"],"reason":"Diagnosing lost-in-middle is the defining use case for context-degradation."}
{"prompt_id":"p003","prompt":"My chat sessions are too long. Compress the conversation history so the agent can continue without losing decisions, files touched, or risks raised.","expected_primary_skill":"context-compression","acceptable_secondary_skills":["context-optimization"],"rejected_skills":["memory-systems"],"reason":"Compaction with preservation of structured state is context-compression."}
{"prompt_id":"p004","prompt":"Reduce my agent's per-trajectory token cost. I want to mask verbose tool outputs and partition work across sub-agents.","expected_primary_skill":"context-optimization","acceptable_secondary_skills":["multi-agent-patterns","filesystem-context"],"rejected_skills":["bdi-mental-states"],"reason":"Masking and partitioning for token efficiency is context-optimization."}
{"prompt_id":"p005","prompt":"I have an orchestrator that calls workers many times. I want to share its growing trajectory with workers without replaying full text. Is there a KV-cache approach?","expected_primary_skill":"latent-briefing","acceptable_secondary_skills":["context-optimization","multi-agent-patterns"],"rejected_skills":["memory-systems"],"reason":"Worker KV cache memory sharing is the latent-briefing pattern."}
{"prompt_id":"p006","prompt":"Pick a coordination pattern for an agent system that needs three specialists handed off to in sequence. Supervisor, swarm, or hierarchical?","expected_primary_skill":"multi-agent-patterns","acceptable_secondary_skills":["context-optimization"],"rejected_skills":["bdi-mental-states"],"reason":"Choosing supervisor vs swarm vs hierarchical is multi-agent-patterns."}
{"prompt_id":"p007","prompt":"Persist user preferences across sessions and track entities mentioned over time. Recommend a memory architecture with retrieval and update semantics.","expected_primary_skill":"memory-systems","acceptable_secondary_skills":["filesystem-context"],"rejected_skills":["context-compression"],"reason":"Persistent cross-session memory with entities is memory-systems."}
{"prompt_id":"p008","prompt":"Design a tool interface for a database query agent. Consolidate redundant search tools, write actionable error messages, and use MCP-style naming.","expected_primary_skill":"tool-design","acceptable_secondary_skills":["project-development"],"rejected_skills":["context-compression"],"reason":"Tool contracts and consolidation are tool-design."}
{"prompt_id":"p009","prompt":"Store giant tool outputs in files instead of returning them to context. Build scratchpads and let the agent grep them later.","expected_primary_skill":"filesystem-context","acceptable_secondary_skills":["context-optimization"],"rejected_skills":["memory-systems"],"reason":"File-backed offloading and durable scratchpads are filesystem-context."}
{"prompt_id":"p010","prompt":"Set up background coding agents that run in sandboxed VMs with warm pools and multiplayer support.","expected_primary_skill":"hosted-agents","acceptable_secondary_skills":["multi-agent-patterns"],"rejected_skills":["bdi-mental-states"],"reason":"Hosted, sandboxed, multiplayer agent infrastructure is hosted-agents."}
{"prompt_id":"p011","prompt":"Add deterministic regression checks and quality gates to my agent pipeline so we catch behavior changes before deployment.","expected_primary_skill":"evaluation","acceptable_secondary_skills":["project-development","advanced-evaluation"],"rejected_skills":["harness-engineering"],"reason":"Quality gates and regression suites are evaluation."}
{"prompt_id":"p012","prompt":"Build an LLM-as-judge with position-bias mitigation, calibrated rubrics, and pairwise comparison for model output evaluation.","expected_primary_skill":"advanced-evaluation","acceptable_secondary_skills":["evaluation"],"rejected_skills":["project-development"],"reason":"Judge design and pairwise bias mitigation are advanced-evaluation."}
{"prompt_id":"p013","prompt":"Design an autonomous research loop with locked rubrics, editable drafts, novelty gates, rollback, and human merge approval.","expected_primary_skill":"harness-engineering","acceptable_secondary_skills":["evaluation","project-development"],"rejected_skills":["hosted-agents"],"reason":"Locked/editable surfaces and human approval boundaries are harness-engineering."}
{"prompt_id":"p014","prompt":"Decide if my idea is suited to an LLM batch pipeline. If yes, sketch the acquire/prepare/process/parse/render stages and estimate cost.","expected_primary_skill":"project-development","acceptable_secondary_skills":["evaluation"],"rejected_skills":["harness-engineering"],"reason":"Task-model fit and batch pipeline shape are project-development."}
{"prompt_id":"p015","prompt":"Model an agent's beliefs, desires, and intentions in a way I can audit. Convert a small RDF graph into a structured belief state.","expected_primary_skill":"bdi-mental-states","acceptable_secondary_skills":[],"rejected_skills":["memory-systems"],"reason":"BDI beliefs/desires/intentions and RDF transforms are bdi-mental-states."}
{"prompt_id":"p016","prompt":"I want to ship a custom rubric for grading agent answers. The rubric needs four dimensions with weighted scoring.","expected_primary_skill":"evaluation","acceptable_secondary_skills":["advanced-evaluation"],"rejected_skills":["harness-engineering"],"reason":"Generic rubric construction is evaluation; calibration and bias would push to advanced-evaluation."}
{"prompt_id":"p017","prompt":"My LLM-as-judge gives different scores depending on whether response A or response B appears first. Help me mitigate position bias.","expected_primary_skill":"advanced-evaluation","acceptable_secondary_skills":["evaluation"],"rejected_skills":["multi-agent-patterns"],"reason":"Bias-mitigation for judges is advanced-evaluation."}
{"prompt_id":"p018","prompt":"Summarize a long agent session into a short handoff that preserves files touched, decisions made, risks, and next steps.","expected_primary_skill":"context-compression","acceptable_secondary_skills":["filesystem-context"],"rejected_skills":["context-optimization"],"reason":"Compaction with structured preservation is context-compression."}
{"prompt_id":"p019","prompt":"Improve cache hit rate, mask tool outputs, partition long-running work across subagents to cut tokens per task.","expected_primary_skill":"context-optimization","acceptable_secondary_skills":["context-compression","multi-agent-patterns"],"rejected_skills":["memory-systems"],"reason":"Multi-strategy token efficiency is context-optimization."}
{"prompt_id":"p020","prompt":"My agent forgets across sessions. Recommend a memory store, retrieval strategy, and update semantics for a long-running assistant.","expected_primary_skill":"memory-systems","acceptable_secondary_skills":["filesystem-context"],"rejected_skills":["context-compression"],"reason":"Cross-session memory store and retrieval are memory-systems."}
{"prompt_id":"p021","prompt":"I want to offload terminal output and log files so the agent can search them on demand instead of stuffing them into context.","expected_primary_skill":"filesystem-context","acceptable_secondary_skills":["context-optimization"],"rejected_skills":["memory-systems"],"reason":"File-backed log/terminal offloading is filesystem-context."}
{"prompt_id":"p022","prompt":"Build an autonomous research workflow that surfaces sources, scores them by rubric, drafts proposals, and prepares PRs without auto-merging.","expected_primary_skill":"harness-engineering","acceptable_secondary_skills":["project-development","evaluation"],"rejected_skills":["bdi-mental-states"],"reason":"Autonomous research with locked rubrics and human merge is harness-engineering."}
{"prompt_id":"p023","prompt":"I have a batch of 5000 documents to grade with an LLM. Walk me through pipeline structure, structured output, cost estimation, and iteration.","expected_primary_skill":"project-development","acceptable_secondary_skills":["evaluation","tool-design"],"rejected_skills":["harness-engineering"],"reason":"Batch pipeline structure and cost estimation are project-development."}
{"prompt_id":"p024","prompt":"My orchestrator-worker system explodes tokens on every worker call because it replays the full trajectory. Recommend a KV-based alternative if the runtime supports it.","expected_primary_skill":"latent-briefing","acceptable_secondary_skills":["multi-agent-patterns","context-optimization"],"rejected_skills":["memory-systems"],"reason":"Task-guided KV cache compaction for cross-agent state is latent-briefing."}
{"prompt_id":"p025","prompt":"How should I split work between a supervisor and three specialists when context isolation is more important than parallelism?","expected_primary_skill":"multi-agent-patterns","acceptable_secondary_skills":["context-optimization"],"rejected_skills":["harness-engineering"],"reason":"Supervisor coordination with isolation is multi-agent-patterns."}
{"prompt_id":"p026","prompt":"I am about to start a project that may not need LLMs. Help me evaluate task-model fit before I commit.","expected_primary_skill":"project-development","acceptable_secondary_skills":[],"rejected_skills":["harness-engineering"],"reason":"Task-model fit analysis is project-development."}
{"prompt_id":"p027","prompt":"Help me make my agent's tools cheaper to invoke by giving them clear descriptions, response-format options, and actionable errors.","expected_primary_skill":"tool-design","acceptable_secondary_skills":["context-optimization"],"rejected_skills":["memory-systems"],"reason":"Tool description quality and response formats are tool-design."}
{"prompt_id":"p028","prompt":"Diagnose why my agent gets distracted by irrelevant context in the middle of long sessions.","expected_primary_skill":"context-degradation","acceptable_secondary_skills":["context-optimization"],"rejected_skills":["memory-systems"],"reason":"Distraction by middle-context noise is context-degradation."}
{"prompt_id":"p029","prompt":"Plan the architecture for a multi-agent debate that requires consensus across heterogeneous specialists.","expected_primary_skill":"multi-agent-patterns","acceptable_secondary_skills":["evaluation","advanced-evaluation"],"rejected_skills":["bdi-mental-states"],"reason":"Coordination, consensus, and agent collaboration are multi-agent-patterns."}
{"prompt_id":"p030","prompt":"My agent loses long-horizon goals after context compaction. Help me keep its objectives intact.","expected_primary_skill":"context-compression","acceptable_secondary_skills":["context-degradation","filesystem-context"],"rejected_skills":["memory-systems"],"reason":"Goal preservation across compaction is context-compression."}
{"prompt_id":"p031","prompt":"Migrate an agent from token-stuffed prompts to a leaner system prompt with on-demand context loading from files.","expected_primary_skill":"filesystem-context","acceptable_secondary_skills":["context-optimization"],"rejected_skills":["memory-systems"],"reason":"Just-in-time file-backed context retrieval is filesystem-context."}
{"prompt_id":"p032","prompt":"Pick a memory framework for my assistant. I'm weighing Mem0, Letta, Zep, and Cognee.","expected_primary_skill":"memory-systems","acceptable_secondary_skills":[],"rejected_skills":["filesystem-context"],"reason":"Choosing among memory frameworks is memory-systems."}
{"prompt_id":"p033","prompt":"Design a quality gate that blocks deploys when the agent's pass rate drops below 0.85.","expected_primary_skill":"evaluation","acceptable_secondary_skills":["project-development"],"rejected_skills":["harness-engineering"],"reason":"Pass-rate quality gates for deployment are evaluation."}
{"prompt_id":"p034","prompt":"Choose evaluation metrics for an LLM-as-judge system handling pairwise creative writing comparisons.","expected_primary_skill":"advanced-evaluation","acceptable_secondary_skills":["evaluation"],"rejected_skills":["multi-agent-patterns"],"reason":"Pairwise judge metric design is advanced-evaluation."}
{"prompt_id":"p035","prompt":"Set up a background agent that lives on a sandboxed VM, picks tasks from a queue, and opens PRs.","expected_primary_skill":"hosted-agents","acceptable_secondary_skills":["multi-agent-patterns","harness-engineering"],"rejected_skills":["bdi-mental-states"],"reason":"Sandboxed VM background agents are hosted-agents."}
{"prompt_id":"p036","prompt":"Sketch a self-improving autonomous loop with locked metrics, rollback, novelty gates, and a parked human review queue.","expected_primary_skill":"harness-engineering","acceptable_secondary_skills":["evaluation","project-development"],"rejected_skills":["hosted-agents"],"reason":"Self-improvement governance with parked review is harness-engineering."}
{"prompt_id":"p037","prompt":"Outline why structured output design improves downstream parsing and recommend prompt patterns for it.","expected_primary_skill":"project-development","acceptable_secondary_skills":["tool-design"],"rejected_skills":["context-degradation"],"reason":"Pipeline-time structured output design is project-development."}
{"prompt_id":"p038","prompt":"Convert a JSON-LD/RDF context graph into the agent's belief base so its goal selection becomes auditable.","expected_primary_skill":"bdi-mental-states","acceptable_secondary_skills":["memory-systems"],"rejected_skills":["harness-engineering"],"reason":"RDF-to-beliefs and audit trails are bdi-mental-states."}
{"prompt_id":"p039","prompt":"Where in a long context should I place critical safety instructions to avoid lost-in-middle?","expected_primary_skill":"context-degradation","acceptable_secondary_skills":["context-fundamentals"],"rejected_skills":["memory-systems"],"reason":"Placement under U-shaped attention is context-degradation."}
{"prompt_id":"p040","prompt":"Explain the trade-offs between adding more context vs. retrieving smaller, more targeted context at inference time.","expected_primary_skill":"context-fundamentals","acceptable_secondary_skills":["context-optimization"],"rejected_skills":["memory-systems"],"reason":"Foundational context-quantity-vs-quality is context-fundamentals."}
{"prompt_id":"p041","prompt":"Audit my skill descriptions and tell me which ones overlap so my router does not pick the wrong skill.","expected_primary_skill":"tool-design","acceptable_secondary_skills":["advanced-evaluation","evaluation"],"rejected_skills":["bdi-mental-states"],"reason":"Description-quality auditing best matches tool-design (skills are tools); evaluation/advanced-evaluation are reasonable alternates."}
{"prompt_id":"p042","prompt":"Calibrate confidence thresholds for an automated reviewer so it routes only uncertain cases to humans.","expected_primary_skill":"advanced-evaluation","acceptable_secondary_skills":["evaluation","harness-engineering"],"rejected_skills":["memory-systems"],"reason":"Confidence calibration for judges is advanced-evaluation."}
{"prompt_id":"p043","prompt":"My production agent's daily token spend is up 40 percent month over month. Diagnose where the tokens are going and propose fixes.","expected_primary_skill":"context-optimization","acceptable_secondary_skills":["context-compression","filesystem-context","tool-design"],"rejected_skills":["bdi-mental-states"],"reason":"Token cost diagnosis and reduction is context-optimization."}
{"prompt_id":"p044","prompt":"Help me decide whether two sub-agents would help me solve this task or whether one agent with more tools is fine.","expected_primary_skill":"multi-agent-patterns","acceptable_secondary_skills":["tool-design","project-development"],"rejected_skills":["bdi-mental-states"],"reason":"When to introduce sub-agents is multi-agent-patterns."}
{"prompt_id":"p045","prompt":"Compute the area of a triangle given base 12 and height 7.","expected_primary_skill":"context-fundamentals","acceptable_secondary_skills":[],"rejected_skills":["latent-briefing","bdi-mental-states","harness-engineering"],"reason":"Negative control: no skill is a strong fit. Most generic match would be context-fundamentals as the catch-all."}
{"prompt_id":"p046","prompt":"Reformat this Python file with consistent indentation and remove trailing whitespace.","expected_primary_skill":"tool-design","acceptable_secondary_skills":[],"rejected_skills":["latent-briefing","bdi-mental-states","memory-systems"],"reason":"Negative control: a generic code-formatting task is not a strong fit. Closest is tool-design as a generic tool task."}
{"prompt_id":"p047","prompt":"Translate this English paragraph to French.","expected_primary_skill":"context-fundamentals","acceptable_secondary_skills":[],"rejected_skills":["latent-briefing","memory-systems","harness-engineering","bdi-mental-states"],"reason":"Negative control: no skill fits. Catch-all is context-fundamentals."}
{"prompt_id":"p048","prompt":"Plan how to evaluate whether my latent-briefing-style KV compaction actually preserves task accuracy. Include ablations and baselines.","expected_primary_skill":"advanced-evaluation","acceptable_secondary_skills":["latent-briefing","evaluation","harness-engineering"],"rejected_skills":["bdi-mental-states"],"reason":"Designing ablations and baselines for evaluation is advanced-evaluation; the topic is latent-briefing-adjacent."}
{"prompt_id":"p049","prompt":"My agent loop has been running for three days unattended. Tell me what should be in the durable scratchpad so a different agent could resume it tomorrow.","expected_primary_skill":"filesystem-context","acceptable_secondary_skills":["harness-engineering","memory-systems"],"rejected_skills":["bdi-mental-states"],"reason":"Durable scratchpad design for resumption is filesystem-context, with harness-engineering as the governance frame."}
{"prompt_id":"p050","prompt":"My agent decides which tools to use based on the system prompt. The tool descriptions are vague and the agent picks wrong tools half the time. Fix this end-to-end.","expected_primary_skill":"tool-design","acceptable_secondary_skills":["context-fundamentals"],"rejected_skills":["memory-systems","latent-briefing"],"reason":"Vague tool descriptions causing wrong tool selection is tool-design."}
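Each record in this file names one expected skill, a list of acceptable alternates, and a list of skills the router should not pick. The file itself does not say how a router's output is scored against these fields, so the following is only a minimal sketch under stated assumptions: the function names (`load_cases`, `grade`, `summarize`) and the four outcome labels are hypothetical, and the precedence order (primary, then secondary, then rejected, then other) is one plausible reading of the field names, not the repository's actual scorer.

```python
import json

# Two records copied from the listing above, used as sample input.
SAMPLE_LINES = [
    '{"prompt_id":"p026","prompt":"I am about to start a project that may not need LLMs. Help me evaluate task-model fit before I commit.","expected_primary_skill":"project-development","acceptable_secondary_skills":[],"rejected_skills":["harness-engineering"],"reason":"Task-model fit analysis is project-development."}',
    '{"prompt_id":"p034","prompt":"Choose evaluation metrics for an LLM-as-judge system handling pairwise creative writing comparisons.","expected_primary_skill":"advanced-evaluation","acceptable_secondary_skills":["evaluation"],"rejected_skills":["multi-agent-patterns"],"reason":"Pairwise judge metric design is advanced-evaluation."}',
]


def load_cases(lines):
    """Parse JSONL lines into dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def grade(case, predicted_skill):
    """Classify one prediction (hypothetical labels, assumed precedence)."""
    if predicted_skill == case["expected_primary_skill"]:
        return "primary"
    if predicted_skill in case["acceptable_secondary_skills"]:
        return "secondary"
    if predicted_skill in case["rejected_skills"]:
        return "rejected"
    return "other"  # includes a missing prediction (None)


def summarize(cases, predictions):
    """Count outcomes; `predictions` maps prompt_id to a predicted skill name."""
    counts = {"primary": 0, "secondary": 0, "rejected": 0, "other": 0}
    for case in cases:
        counts[grade(case, predictions.get(case["prompt_id"]))] += 1
    return counts


cases = load_cases(SAMPLE_LINES)
result = summarize(cases, {"p026": "project-development", "p034": "evaluation"})
print(result)  # {'primary': 1, 'secondary': 1, 'rejected': 0, 'other': 0}
```

To run this against the whole file, one could pass an open file handle for `researcher/benchmarks/router/prompts.jsonl` to `load_cases` in place of `SAMPLE_LINES`, since iterating a text file yields one line at a time.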