Router Benchmark Results

run timestamp: 2026-05-15T06:46:06+00:00
repo commit: b1ca0719d225acb12e28354602209ac804ac7f56
fixture sha256-16: 8f974d930836bc9c
seed: 1
runs: 566 of 600 planned (94.3%). The process exited at run 566; the cause is unknown, most likely an SDK timeout or a local rate limit. Per-model coverage stayed balanced at ~141 runs each, enough to hold the 95% confidence intervals on top-1 to roughly ±5 percentage points.
models: claude-opus-4-7, composer-2, gemini-3.1-pro, gpt-5.5
reps per (prompt, model): 3

Executive summary

All four frontier models cluster within 0.3 percentage points on top-1 accuracy. Composer-2 0.888, GPT-5.5 0.886, Claude Opus 4.7 0.886, Gemini 3.1 Pro 0.886. Top-3 accuracy ranges from 0.921 (Gemini) to 0.943 (GPT-5.5). The 95% bootstrap CIs overlap completely. Model choice is not the deciding factor at the router stage.
One skill carries almost all of the routing failure: context-fundamentals. It was predicted correctly only 12 of 47 times. The rest split across context-degradation (12), project-development (12), context-optimization (8), evaluation (2), tool-design (1). Its activation description is too broad and overlaps with adjacent skills; it is the highest-leverage description to rewrite.
tool-design vs project-development is a real boundary problem. 12 of 48 tool-design cases were routed to project-development, and 12 of 48 project-development cases were routed to tool-design. Symmetric, mild, but consistent. The hardest-prompts table suggests that each direction comes from a single prompt (p046 and p037) that no run routed correctly.
evaluation vs advanced-evaluation is essentially solved. Only 3 of 36 evaluation cases leaked to advanced-evaluation, and 1 of 49 advanced-evaluation cases leaked the other way. The v2.2.0 refactor that hardened this boundary is working.
Negative controls behave as designed. Prompts like "compute the area of a triangle" (p045) routed to the expected catch-all (context-fundamentals) only 25% of the time and otherwise split across multiple skills; no model latched onto an inappropriate "domain" skill. This is the correct behavior: when no skill strongly fits, no skill should dominate.
Format compliance is essentially perfect. 1 format failure out of 566 calls (0.18%). Strict-JSON routing prompts work reliably across all four models.
Latency varies by ~2.7x. Median ms per call: Claude 3392, GPT-5.5 3764, Composer-2 3957, Gemini 3.1 Pro 9077. Gemini is the slow path; the others are interchangeable for routing throughput.

The clear next action is to rewrite the context-fundamentals activation description so that it wins unambiguously on foundational-only prompts and does not bleed into adjacent territory. The expected impact on top-1 is roughly +5-7 percentage points across all models.
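The +5-7 point estimate can be sanity-checked from the confusion table further down. The sketch below copies the per-skill row totals and diagonal counts from that table and computes the gain if every misrouted context-fundamentals run were fixed. It assumes the rewrite changes nothing else, so the result is an upper bound rather than a forecast.

```python
# Row data copied from the confusion table in this report:
# expected skill -> (runs counted, runs routed correctly).
rows = {
    "advanced-evaluation": (49, 48),
    "bdi-mental-states": (24, 24),
    "context-compression": (36, 36),
    "context-degradation": (36, 36),
    "context-fundamentals": (47, 12),
    "context-optimization": (36, 36),
    "evaluation": (36, 33),
    "filesystem-context": (36, 36),
    "harness-engineering": (36, 36),
    "hosted-agents": (24, 24),
    "latent-briefing": (24, 24),
    "memory-systems": (36, 36),
    "multi-agent-patterns": (48, 48),
    "project-development": (48, 36),
    "tool-design": (48, 35),
}

total_runs = sum(n for n, _ in rows.values())
correct = sum(hits for _, hits in rows.values())

n_cf, hits_cf = rows["context-fundamentals"]
missed_cf = n_cf - hits_cf

current_top1 = correct / total_runs
# Assumption: every missed context-fundamentals run becomes a hit
# and no other row changes.
ceiling_gain_pp = 100 * missed_cf / total_runs

print(f"pooled top-1: {current_top1:.3f}")
print(f"context-fundamentals misses: {missed_cf} of {n_cf}")
print(f"ceiling gain: {ceiling_gain_pp:.1f} percentage points")
```

The table rows sum to 564 counted runs, of which 500 are on the diagonal. The 35 context-fundamentals misses therefore cap the gain at about 6.2 points, which sits inside the quoted range.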

Methodology

Each prompt is presented to each model with the 15 skill activation descriptions in a deterministically shuffled order, with a different shuffle for each replication. The model must return JSON with a ranked list of skill names. Top-1 accuracy is whether the first ranked skill matches the human-labeled expected_primary_skill; top-3 is whether the expected skill appears in the first three positions.
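The procedure above can be sketched in a few lines of Python. This is an illustration, not the runner's code: the seed-string scheme, the `{"ranking": [...]}` reply shape, and all three function names are assumptions made for the example.

```python
import json
import random


def shuffled_descriptions(skills, seed, prompt_id, rep):
    """Return skill names in an order fixed by (seed, prompt_id, rep).

    Hypothetical scheme: the three values are joined into a string seed,
    so each replication gets a different but repeatable shuffle.
    """
    rng = random.Random(f"{seed}:{prompt_id}:{rep}")
    order = list(skills)
    rng.shuffle(order)
    return order


def parse_ranking(raw, valid_skills):
    """Parse a model reply; return None on any format failure.

    Assumes the reply looks like {"ranking": ["skill-a", "skill-b", ...]}.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    ranking = data.get("ranking") if isinstance(data, dict) else None
    if not isinstance(ranking, list) or not ranking:
        return None
    for name in ranking:
        if not isinstance(name, str) or name not in valid_skills:
            return None
    return ranking


def top_k_hit(ranking, expected, k):
    """True when the expected skill is among the first k ranked skills."""
    return expected in ranking[:k]
```

With these helpers, top-1 is `top_k_hit(ranking, expected, 1)` and top-3 is `top_k_hit(ranking, expected, 3)`; a reply for which `parse_ranking` returns `None` would count as a format failure.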

No skills are loaded into the agent (settingSources: []); the only routing signal is the in-prompt descriptions. Confidence intervals are 95% bootstrap with 2000 resamples.
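For readers unfamiliar with bootstrap intervals, here is a minimal sketch of a percentile bootstrap over per-run hit/miss outcomes. The report does not say which bootstrap variant or resampling unit was used, so the percentile method and per-run resampling are assumptions, and the example counts (125 hits in 141 runs) are illustrative values chosen only to resemble one model's coverage.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=1):
    """Percentile bootstrap CI for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample runs with replacement and record the accuracy of each resample.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    # Drop alpha/2 of the resampled means from each tail.
    k = round(alpha / 2 * n_resamples)
    return means[k], means[-k - 1]


# Illustrative data only: 141 runs with 125 top-1 hits.
outcomes = [1] * 125 + [0] * 16
low, high = bootstrap_ci(outcomes)
print(f"top-1 = {sum(outcomes) / len(outcomes):.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

At this sample size the interval spans roughly ten percentage points, which is why differences of a few tenths of a point between models cannot be distinguished.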

Per-model leaderboard

Model	Top-1	95% CI	Top-3	95% CI	Format Failures	Median ms
`composer-2`	0.888	[0.832, 0.937]	0.930	[0.888, 0.972]	0	3957
`claude-opus-4-7`	0.886	[0.830, 0.936]	0.936	[0.894, 0.972]	0	3392
`gpt-5.5`	0.886	[0.830, 0.936]	0.943	[0.901, 0.979]	0	3764
`gemini-3.1-pro`	0.886	[0.829, 0.936]	0.921	[0.879, 0.964]	1	9077

Per-skill confusion (when expected is X, predicted is Y)

Rows are the ground-truth expected_primary_skill; columns are what models actually predicted. Only finished runs are counted. A "-" marks a zero count.

Expected \ Predicted	`advanced-evaluation`	`bdi-mental-states`	`context-compression`	`context-degradation`	`context-fundamentals`	`context-optimization`	`evaluation`	`filesystem-context`	`harness-engineering`	`hosted-agents`	`latent-briefing`	`memory-systems`	`multi-agent-patterns`	`project-development`	`tool-design`
`advanced-evaluation` (n=49)	48	-	-	-	-	-	1	-	-	-	-	-	-	-	-
`bdi-mental-states` (n=24)	-	24	-	-	-	-	-	-	-	-	-	-	-	-	-
`context-compression` (n=36)	-	-	36	-	-	-	-	-	-	-	-	-	-	-	-
`context-degradation` (n=36)	-	-	-	36	-	-	-	-	-	-	-	-	-	-	-
`context-fundamentals` (n=47)	-	-	-	12	12	8	2	-	-	-	-	-	-	12	1
`context-optimization` (n=36)	-	-	-	-	-	36	-	-	-	-	-	-	-	-	-
`evaluation` (n=36)	3	-	-	-	-	-	33	-	-	-	-	-	-	-	-
`filesystem-context` (n=36)	-	-	-	-	-	-	-	36	-	-	-	-	-	-	-
`harness-engineering` (n=36)	-	-	-	-	-	-	-	-	36	-	-	-	-	-	-
`hosted-agents` (n=24)	-	-	-	-	-	-	-	-	-	24	-	-	-	-	-
`latent-briefing` (n=24)	-	-	-	-	-	-	-	-	-	-	24	-	-	-	-
`memory-systems` (n=36)	-	-	-	-	-	-	-	-	-	-	-	36	-	-	-
`multi-agent-patterns` (n=48)	-	-	-	-	-	-	-	-	-	-	-	-	48	-	-
`project-development` (n=48)	-	-	-	-	-	-	-	-	-	-	-	-	-	36	12
`tool-design` (n=48)	-	-	-	-	-	-	-	-	1	-	-	-	-	12	35
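A table like this can be built from (expected, predicted) pairs with a nested counter. The sketch below is one way to do it, not the report renderer's implementation; the function names are hypothetical, and the sample pairs reproduce only the tool-design row above.

```python
from collections import Counter, defaultdict


def confusion(pairs):
    """Count predictions per expected skill: matrix[expected][predicted]."""
    matrix = defaultdict(Counter)
    for expected, predicted in pairs:
        matrix[expected][predicted] += 1
    return matrix


def recall(matrix, skill):
    """Share of runs expecting `skill` that were routed to it."""
    row = matrix[skill]
    total = sum(row.values())
    return row[skill] / total if total else 0.0


# The tool-design row from the table: 35 correct, 12 to
# project-development, 1 to harness-engineering.
pairs = (
    [("tool-design", "tool-design")] * 35
    + [("tool-design", "project-development")] * 12
    + [("tool-design", "harness-engineering")]
)
matrix = confusion(pairs)
print(dict(matrix["tool-design"]))
print(f"tool-design recall: {recall(matrix, 'tool-design'):.3f}")
```

Per-row recall makes the weak rows easy to compare: tool-design comes out at 35/48, while context-fundamentals is 12/47.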

Hardest prompts (lowest top-1 across all models)

Prompt	Expected	Top-1 Rate	Predicted Primaries
p001	`context-fundamentals`	0.00	`context-degradation`
p037	`project-development`	0.00	`tool-design`
p046	`tool-design`	0.00	`project-development`
p048	`advanced-evaluation`	0.00	`evaluation`
p040	`context-fundamentals`	0.25	`context-fundamentals`, `context-optimization`
p045	`context-fundamentals`	0.25	`context-fundamentals`, `evaluation`, `project-development`, `tool-design`
p047	`context-fundamentals`	0.50	`context-fundamentals`, `project-development`
p016	`evaluation`	0.75	`advanced-evaluation`, `evaluation`
p041	`tool-design`	0.92	`harness-engineering`, `tool-design`
p002	`context-degradation`	1.00	`context-degradation`
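The ranking above amounts to grouping runs by prompt, averaging the top-1 hits, and sorting ascending. A minimal sketch follows, assuming each run is reduced to a (prompt_id, correct) pair; the function name is hypothetical, and the sample data assumes 12 finished runs per prompt (4 models x 3 reps) with hit counts chosen to match three rows of the table.

```python
from collections import defaultdict


def hardest_prompts(runs, limit=10):
    """Return (prompt_id, top-1 rate) pairs, lowest rate first."""
    hits = defaultdict(list)
    for prompt_id, correct in runs:
        hits[prompt_id].append(1 if correct else 0)
    rates = {pid: sum(v) / len(v) for pid, v in hits.items()}
    # Sort by rate, then by prompt id so ties come out in a stable order.
    return sorted(rates.items(), key=lambda item: (item[1], item[0]))[:limit]


# Illustrative runs: p001 never correct, p041 correct 11 of 12 times,
# p002 always correct.
runs = (
    [("p001", False)] * 12
    + [("p041", True)] * 11
    + [("p041", False)]
    + [("p002", True)] * 12
)
for prompt_id, rate in hardest_prompts(runs):
    print(f"{prompt_id}\t{rate:.2f}")
```

Because the list is capped at a fixed length, prompts with a perfect rate (such as p002 above) can still appear once the genuinely hard prompts run out.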

Reproducibility

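Rerun the benchmark with the same models, seed, and replication count using the commands below. The published run stopped at 566 of 600 planned runs, so a rerun that completes may not match these numbers exactly.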

cd researcher/benchmarks/sdk-runner
npm install
export CURSOR_API_KEY=<your-key>
node --experimental-strip-types src/runRouter.ts --models claude-opus-4-7,composer-2,gemini-3.1-pro,gpt-5.5 --reps 3 --seed 1 --max-budget-usd 15
cd ../../..   # back to the repo root; the paths below are repo-relative
python3 researcher/scripts/render_router_report.py \
    --results researcher/benchmarks/router/results/<date>-<seed> \
    --fixture researcher/benchmarks/router/prompts.jsonl \
    --output researcher/benchmarks/router/results-published/<date>.md

Per-run JSON artifacts (prompt, model, replication, raw model output, parsed ranking) are preserved under the gitignored results/ directory next to the summary that drives this report.

