Source from repo
A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems.
researcher/benchmarks/PLAN.md
# Benchmark Architecture Plan

The current benchmark harness verifies that the researcher OS itself is hard to game (deterministic structural checks and seven adversarial scenarios). It does not yet measure the thing users actually care about: **do these skills make agents better at the tasks they claim to help with?**

This document is the plan to close that gap, in four staged releases, with research-paper-grade methodology and the Cursor SDK as the execution layer.

## Status

| Stage | Release | What it measures | Cost | Status |
| --- | --- | --- | --- | --- |
| 0 | v2.2.0 (shipped) | Harness resistance to gaming, structural validity | $0 | done |
| 1 | v2.3.0 (shipped) | Per-skill health metrics (deterministic) | $0 | done; corpus 0.814 aggregate, 2 of 15 flagged |
| 2 | v2.3.0 (shipped) | Skill router accuracy (LLM-as-router) | Cursor credits (~$7 per full sweep) | done; baseline + post-fix delta published |
| 3 | v2.4.0 | Skill effectiveness on real agent tasks | Cursor credits, larger | scaffolded, one task built |
| 4 | v2.5.0 | Cross-skill composition | Cursor credits | future |

### Shipped Stage 2 results (v2.3.0)

Two full 600-run sweeps across `composer-2`, `claude-opus-4-7`, `gpt-5.5`, `gemini-3.1-pro` at seed=1, 3 replications per (prompt, model):

- Baseline: `researcher/benchmarks/router/results-published/2026-05-15.md` (566 of 600; v1 runner died mid-sweep).
- Post-fix (description rewrites + hardened runner): `researcher/benchmarks/router/results-published/2026-05-15-v2.md` (600 of 600, includes delta-vs-baseline section).

Headline finding: targeted description rewrites moved `context-fundamentals` top-1 from 0.255 to 0.489 (+23.4pp) and `project-development` from 0.750 to 1.000 (+25pp, now perfect). Three of four models gained on top-1; all four gained on top-3.

## Goals And Non-Goals

### Goals

- Produce reproducible, model-agnostic evidence that skill loading improves agent behavior or, where it does not, surface that honestly.
- Make every measurement reproducible from a single CLI invocation plus a pinned config.
- Track results longitudinally so regressions are detectable.
- Disclose methodology fully: prompts, tasks, ground truth, scoring, raw outputs.
- Use deterministic checks first, model-judged measurements second, real task execution third.

### Non-Goals

- Build a general-purpose agent benchmarking framework. We benchmark this skill collection on representative tasks.
- Replace SWE-bench, BrowseComp, or other public benchmarks. We can subset them or use them as comparison points, not redo them.
- Train models. This is evaluation only.
- Run benchmarks that require paid APIs other than Cursor. Stages 2 and 3 spend Cursor credits via the SDK.

## Methodology Principles

These apply to every stage.

### Reproducibility

- Every benchmark run is described by a frozen config (model id, seed, fixture revision, repo commit SHA).
- Raw outputs (transcripts, judge JSON, task workspace diffs) are persisted with the run record.
- Each run appends a single line to a history JSONL with the config hash and pointers to raw artifacts.
- CLI flags (`--seed`, `--config`) let a third party reproduce a run exactly.

### Statistical Discipline

- Minimum 3 replications per condition (model x skill-state x task) for variance estimation.
- Report effect sizes with bootstrap 95% confidence intervals, not bare point estimates.
- Paired comparisons where possible (same task, different conditions) using a Wilcoxon signed-rank test on per-task differences.
- Sample size targets per stage are stated below; underpowered runs land as "preliminary" with that label visible in the dashboard.

### Bias Mitigation

- **Position bias** in router benchmarks: shuffle skill order across replications, report consistency.
- **Self-preference** in judge models: never use the same model as judge and candidate. Use a different family (e.g. Composer evaluates Claude outputs, GPT evaluates Composer outputs).
- **Length bias** in pairwise comparisons: include length-normalized scoring and a "shorter wins ties" rule.
- **Selection bias** in tasks: tasks include some where the relevant skill should not help (negative controls) so we can measure false-positive rates.

### Ablations

- Per-skill: with vs without each skill, holding all others constant.
- Leave-one-out: remove one skill at a time from the full corpus to find which skills carry the work.
- Order: run skill subsets in different orders to surface order-dependent effects.

### Disclosure

- All prompts published in `researcher/benchmarks/<stage>/`.
- All ground truth published as fixtures.
- All scoring code published in `researcher/scripts/`.
- Raw run outputs (with secrets redacted) committed under `researcher/benchmarks/<stage>/results/` or attached to release notes.

## Stage 1: Deterministic Skill Health (v2.2.1, $0)

The cheap and uncontroversial floor. Per-skill, deterministic scoring of structural quality. Catches drift, missing sections, stale claims, and underspecified gotchas before any user notices.

### Metrics

For each `skills/<name>/SKILL.md`:

- `line_count`: must be at or below 500. Fail above.
- `frontmatter_valid`: name matches directory, description present, description third-person, description length within 1024 chars.
- `required_sections`: When to Activate, Core Concepts, Practical Guidance, Gotchas, Integration, References.
- `gotcha_count`: number of numbered items in Gotchas. Target >= 3.
- `code_example_count`: fenced code blocks.
- `internal_links_resolved`: every `skills/<other>/SKILL.md` link resolves to a real file.
- `external_link_count`: presence only; reachability is opt-in (`--check-urls`) because it requires network.
- `claim_coverage`: the fraction of numeric claims (regex `\b\d+(\.\d+)?%\b`, `\b\d+x\b`, benchmark names) that have a `claim_id` referencing `researcher/claims/index.jsonl`, i.e. claims with a `claim_id` divided by all numeric claims found.
- `mechanism_coverage`: number of mechanisms in `researcher/mechanisms/registry.jsonl` owned by this skill.
- `activation_case_coverage`: number of activation cases in `researcher/fixtures/activation-cases.jsonl` whose `expected_primary_skill` is this skill.

### Output

`researcher/reports/skill-health.json`, regenerated by `researcher/scripts/skill_health.py`. Per-metric scores, a weighted aggregate per skill, and a corpus-wide aggregate.

`researcher/reports/skill-health-history.jsonl` for daily trend tracking, written by `loop_daily.py`.

### Scoring

Aggregate is a weighted sum with weights tuned to surface drift early:

```
0.20 * normalize(required_sections)
0.15 * normalize(gotcha_count, target=3)
0.10 * normalize(code_example_count, target=2)
0.15 * normalize(internal_links_resolved)
0.10 * normalize(activation_case_coverage)
0.15 * normalize(claim_coverage)
0.10 * normalize(mechanism_coverage)
0.05 * binary(frontmatter_valid)
```

Anything below 0.75 is flagged in the daily snapshot.

### Why this matters

Skill rot is invisible without metrics. A skill that loses its gotcha section, accumulates dead internal links, or drifts past 500 lines is structurally weaker; catching that in CI is cheap insurance.

## Stage 2: Skill Router Benchmark (v2.3.0, ~$5 per full run with Cursor SDK)

The first benchmark that exercises a real model. Tests whether the skill descriptions are good enough to route the right skill to a given task.

### Hypothesis

The activation-scenario descriptions in v2.2.0 frontmatter (replacing v2.1.x keyword triggers) should let a frontier model route prompts to the correct skill at high top-1 accuracy and very high top-3 accuracy.

### Procedure

1. **Fixture**: `researcher/benchmarks/router/prompts.jsonl` with 100 prompts. Each line: `{prompt_id, prompt, expected_primary_skill, acceptable_secondary_skills, rejected_skills, reason}`. Stage 1 ships with 50; expand to 100 over time.

2. **Routing prompt**: A standard template (`researcher/benchmarks/router/routing-prompt.md`) that presents the 15 skill descriptions (shuffled per replication) and the task, and asks the model to return a strict-JSON ranked list with confidence.

3. **Runner**: `researcher/benchmarks/sdk-runner/src/runRouter.ts`. For each prompt x model x replication:
   - Build the routing prompt with shuffled skill order.
   - Call `Agent.prompt(routingPrompt, { settingSources: [], model: { id }, local: { cwd: temp } })`.
   - `settingSources: []` ensures the router agent has no skill loaded; the descriptions in the prompt are the only signal.
   - Parse JSON. If parse fails, record as `format_failure` (don't reward bad output).
   - Compare ranked list to ground truth. Record top-1 and top-3 accuracy.

4. **Models**: `composer-2`, `claude-opus-4-7`, `gpt-5.5`, `gemini-3.1-pro`. The list comes from `Cursor.models.list()` at run time; if a model is unavailable it is recorded as `model_unavailable` and the run continues.

5. **Replications**: 3 per (prompt, model). 100 prompts x 4 models x 3 reps = 1200 calls per full run.

### Cost analysis

- Routing prompt is small (~3-5k input tokens, ~500 output tokens).
- At Cursor's free-with-credits pricing: well under $5 per full run.
- Even at unfavorable retail rates: estimated $5-15 per full run.

### Reporting

- Per-model leaderboard: top-1 accuracy with 95% CI, top-3 accuracy, format-failure rate.
- Per-skill confusion matrix: which skills get confused with which.
- Per-prompt drill-down for failures: which models failed, with what alternative skill.
- Append to `researcher/reports/router-history.jsonl` with model, fixture rev, repo SHA, accuracy.

### Why this matters

Skill descriptions are the only signal a deployed agent uses to decide whether to load a skill. If they don't route correctly, the rest of the harness is academic. This benchmark directly validates the v2.2.0 activation-scenario refactor.

## Stage 3: Skill Effectiveness Benchmark (v2.4.0, ~$50-200 per full run)

The benchmark that proves skills actually help.

### Hypothesis

Loading a relevant skill into an agent's context improves outcome quality, token efficiency, or both, on tasks that the skill claims to address. Loading an irrelevant skill should have no effect or only mild noise.

### Procedure

1. **Fixture**: `researcher/benchmarks/effectiveness/tasks/<id>-<slug>/`. Each task directory has:
   - `task.md`: the prompt the agent receives.
   - `starting/`: workspace seed copied into a temp directory before the run.
   - `verify.sh`: deterministic ground-truth check returning exit code 0 if the task succeeded.
   - `metadata.json`: relevant skills, irrelevant skills (for negative control), category, expected difficulty.

2. **Conditions**: For each task, run six conditions:
   - `control`: `settingSources: []`. No skills loaded.
   - `target`: `settingSources: ["project"]` with only the target skill present. Other skills are temporarily moved out.
   - `negative`: `settingSources: ["project"]` with only a known-irrelevant skill present.
   - `full`: `settingSources: ["project"]` with all 15 skills present.
   - `target_plus_one`: target skill plus one related skill.
   - `target_plus_unrelated`: target skill plus one unrelated skill (interaction control).

3. **Runner**: `researcher/benchmarks/sdk-runner/src/runEffectiveness.ts`. For each task x condition x model x replication:
   - Build the task workspace from `starting/`.
   - Copy only the in-scope skills into `.cursor/skills/` of the task workspace.
   - Call `Agent.prompt(taskPrompt, { settingSources: ["project"], model: { id }, local: { cwd: taskWorkspace } })` (or the cloud runtime option for parallel runs).
   - On completion, run `verify.sh`; record exit code, durationMs, transcript token counts (from `run.conversation()`).
   - Persist transcript JSON, workspace diff, verify output.

4. **Initial task set**: 20 tasks across categories:
   - **filesystem-context**: agent must offload a 5,000-line tool output then retrieve specific data from it.
   - **context-compression**: agent gets a 100k-token chat history and must produce a 2k-token handoff that preserves named entities.
   - **multi-agent-patterns**: agent must decide whether to use subagents for a parallelizable task and justify it.
   - **memory-systems**: agent must persist a user preference across two simulated sessions.
   - **tool-design**: agent must consolidate three overlapping tool calls into one.
   - **evaluation**: agent must produce a rubric for a given task description.
   - **advanced-evaluation**: agent must run a position-bias-mitigated pairwise comparison.
   - **harness-engineering**: agent must identify which of four agent configurations is missing a locked evaluator.
   - **context-degradation**: agent must place critical info at U-curve endpoints for a long context.
   - **context-optimization**: agent must mask tool outputs above 2k tokens.
   - **latent-briefing**: agent must decide whether KV cache compaction applies (positive case + negative case).
   - **bdi-mental-states**: agent must convert a small RDF graph into a structured belief state.
   - **hosted-agents**: agent must propose a warm-pool config for a multiplayer scenario.
   - **project-development**: agent must evaluate task-model fit and propose a pipeline.
   - **context-fundamentals**: agent must explain a context degradation pattern.
   - Plus 5 negative-control tasks where no skill should help (basic arithmetic, plain code reformatting, etc.).

5. **Models**: same as Stage 2.

6. **Replications**: 3 per (task, condition, model). 20 tasks x 6 conditions x 4 models x 3 reps = 1440 agent runs per full sweep.

### Cost analysis

- Average effectiveness task is larger than routing prompts: 10-50k input tokens, 1-5k output tokens, multiple tool calls.
- Cursor free-with-credits: should fit in monthly allotment for one full sweep.
- Retail equivalent estimate: $50-200 per full sweep depending on which models are active.

### Reporting

- Per-skill effect size: success rate delta, token cost delta, durationMs delta between control and target.
- Per-skill effect plot: bar chart with 95% CI.
- Negative-control validation: irrelevant skill should show effect size near zero; if not, the test is biased.
- Per-model leaderboard: which model benefits most from skills.
- Append to `researcher/reports/effectiveness-history.jsonl`.

### Why this matters

This is the headline result. "Loading filesystem-context reduces tokens by N% with zero quality loss on tasks where it applies" is the kind of claim that justifies the existence of the skill collection.

## Stage 4: Cross-Skill Composition (v2.5.0)

Composition is where curated collections add or destroy value compared to individual skills. Tests:

- Do two skills loaded together produce additive, synergistic, or conflicting guidance?
- Are integration sections accurate? When skill A's integration mentions skill B, does loading both actually compose?
- Are there ordering effects in how skills appear in context?

Deferred to v2.5.0 because it requires Stage 3 infrastructure plus task design specifically targeting interactions. Sketched here, not designed in detail.

## SDK Integration Details

### Why the Cursor SDK

- Free with the existing Cursor team credits the user already has.
- Same agent loop as the IDE and CLI, so results transfer to production usage.
- Multi-model: composer-2, claude-opus-4-7, gpt-5.5, gemini-3.1-pro through one interface.
- `settingSources` gives precise control over which skills load (this is the key affordance for ablation).
- Cloud runtime for parallel execution when local resources are insufficient.

### settingSources and skill loading

This is the central mechanism we exploit.

- `settingSources: []` (default): no on-disk settings load. Agent has no skills. Use this as the **control condition** and for the router benchmark where we want descriptions to be the only signal.
- `settingSources: ["project"]`: loads `.cursor/` from the cwd. Used to load only specific skills by copying them into a controlled workspace.
- `settingSources: ["project", "plugins"]`: also loads plugin skills. Less useful for benchmarks; we want full control.
- `settingSources: "all"`: loads everything including user and team settings. Avoid in benchmarks; it leaks the caller's environment.

For Stage 3, each task workspace is built fresh per condition: copy the `starting/` directory to a temp dir, then conditionally place skills into `.cursor/skills/`.

### Result schema we rely on

From the Cursor SDK reference:

```typescript
interface RunResult {
  id: string;
  status: "finished" | "error" | "cancelled";
  result?: string; // final assistant text
  model?: ModelSelection; // resolved model
  durationMs?: number;
  git?: RunGitInfo;
}
```

Plus `run.conversation(): Promise<ConversationTurn[]>` for the per-turn transcript. Token counts come from the conversation events (precise schema depends on SDK version; the runner abstracts this).

### Runtime choice per stage

- Stage 2 (router): local runtime, no cwd dependence beyond a temp dir. Fast.
- Stage 3 (effectiveness): local runtime for most tasks; cloud runtime for tasks that need filesystem isolation or parallel execution.
- Stage 4: cloud runtime by default for full parallelism.

### Safety

- Privacy Mode enabled on the Cursor account that runs benchmarks, per Cursor's [privacy docs](https://cursor.help/security-and-privacy/privacy.md). Eval data stays out of training.
- `apiKey` always passed explicitly, never relied on from env, so cross-tenant mistakes are impossible.
- `await using` syntax for every agent so disposal is automatic.
- Concrete rate-limit handling: backoff schedule per Cursor's docs, exponential 1s/2s/4s/8s/16s on 429.

### Cost gates

The runner implements per-run cost caps:

- `--max-runs N`: hard cap on total agent invocations.
- `--max-budget-usd N`: estimated cost cap, fail fast if exceeded.
- `--dry-run`: print plan, do not call.
- `--models <list>`: subset to one model for development.

Every runner prints a cost forecast before any agent call.

## Implementation Order

1. **Stage 1 (this PR, v2.2.1)**: `researcher/scripts/skill_health.py`, output file, integration with `loop_daily.py`. No API cost.

2. **SDK runner scaffolding (this PR)**: `researcher/benchmarks/sdk-runner/` with package.json, tsconfig, common utilities, dry-run mode. Compiles and exits cleanly without an API key.

3. **Router fixtures (this PR)**: 50 prompts in `researcher/benchmarks/router/prompts.jsonl`. Adversarial pairs for the v2.2.0 boundary-confusion cases (evaluation vs advanced-evaluation, etc.) plus single-skill positive controls.

4. **First effectiveness task (this PR)**: `researcher/benchmarks/effectiveness/tasks/001-filesystem-context-offload/` fully built. Pattern for the other 19.

5. **Verify (this PR)**: compile, run dry, run skill_health for real, all existing gates still pass.

6. **Execute Stage 1 in CI (next PR after merge)**: add to `loop_daily.py` so skill health updates daily.

7. **Execute Stage 2 (when env key provided)**: run router benchmark, publish results, iterate descriptions where confusions appear.

8. **Build remaining 19 effectiveness tasks (rolling)**: prioritized by which skills carry the most user-facing claims.

9. **Execute Stage 3 (when ready)**: full effectiveness sweep, publish per-skill effect sizes.

10. **Stage 4 design and execution (v2.5.0)**.

## Open Decisions

These need user input before Stage 2 execution. They do not block scaffolding.

1. **Privacy mode**: confirm enabled on the Cursor account that will run benchmarks.
2. **Higher rate limits**: confirm whether to email `[email protected]` for benchmark-grade limits per Cursor's docs, or rely on default limits for a smaller initial sweep.
3. **Models to include in v2.3.0 first run**: ship with composer-2 only, or include claude/gpt/gemini from day one?
4. **Publication policy**: full raw transcripts committed to the repo, or hosted separately and linked? Transcript size at full scale is multi-MB per sweep.
5. **Comparison points**: include public benchmark subsets (BrowseComp, SWE-bench) for cross-reference, or keep our task set self-contained for v2.4.0 and add cross-reference in v2.5.0?

## What This Plan Does Not Solve

- Tokens-per-task accounting depends on the per-turn data the SDK exposes. If `conversation()` does not include token counts we fall back to wall-clock and request-count as cost proxies until we instrument it.
- Cursor's model catalog is not stable across time; we record the resolved model id per run for reproducibility, but cross-version comparisons require care.
- The seed user-curated task set is small. Public credibility requires either growing it to 100+ tasks or aligning with an existing public benchmark.
- Real-world deployment differs from benchmark conditions. Effect sizes here are upper bounds, not guarantees.

## How To Read Results

When a future release shows benchmark numbers, the dashboard answers four questions in order:

1. **Did the harness pass deterministic gates?** (always required, Stage 0 + 1)
2. **Can the descriptions route the right skill?** (Stage 2, per model)
3. **Does the skill actually help?** (Stage 3, per skill x model x task)
4. **Do skills compose?** (Stage 4, future)

A skill that fails an earlier stage can be flagged for removal or rework on that result alone; it does not need later-stage results to justify the decision.
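
The ordered reading above can be sketched as a small gate function. This is a minimal illustration, not part of the harness: `first_blocking_stage`, `verdict`, and the pass/fail mapping are hypothetical names, and each stage's pass or fail decision (for example the 0.75 health threshold in Stage 1) is assumed to be computed elsewhere and passed in.

```python
from typing import Mapping, Optional

# Questions in the order the dashboard answers them.
# Stages 0 and 1 share the deterministic-gates question.
STAGE_QUESTIONS = {
    0: "Did the harness pass deterministic gates?",
    1: "Did the harness pass deterministic gates?",
    2: "Can the descriptions route the right skill?",
    3: "Does the skill actually help?",
    4: "Do skills compose?",
}


def first_blocking_stage(results: Mapping[int, Optional[bool]]) -> Optional[int]:
    """Return the earliest stage that failed, or None if no recorded stage failed.

    `results` maps a stage number to True (passed), False (failed), or
    None / missing (not run yet). Stages that have not run never block.
    """
    for stage in sorted(STAGE_QUESTIONS):
        if results.get(stage) is False:
            return stage
    return None


def verdict(skill: str, results: Mapping[int, Optional[bool]]) -> str:
    """Summarize a skill's status from its earliest failed stage, if any."""
    blocked = first_blocking_stage(results)
    if blocked is None:
        return f"{skill}: no failed stage on record"
    return (
        f"{skill}: flag for rework or removal "
        f"(failed Stage {blocked}: {STAGE_QUESTIONS[blocked]})"
    )


# Hypothetical skill that passes the deterministic gates but fails routing;
# its Stage 3 result is never consulted.
print(verdict("example-skill", {0: True, 1: True, 2: False, 3: True}))
```

The function returns at the first failure, so a later-stage pass can never mask an earlier-stage fail.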