Router Benchmark Results
- run timestamp: 2026-05-15T06:46:06+00:00
- repo commit: b1ca0719d225acb12e28354602209ac804ac7f56
- fixture sha256-16: 8f974d930836bc9c
- seed: 1
- runs: 566 of 600 planned (94.3%). The process exited at run 566; the cause is unknown, most likely an SDK timeout or a local rate limit. Per-model coverage stayed balanced at ~141 runs each, which gives 95% CIs of roughly ±0.05 on top-1 accuracy (see the leaderboard).
- models: claude-opus-4-7, composer-2, gemini-3.1-pro, gpt-5.5
- reps per (prompt, model): 3
Executive summary
- All four frontier models cluster within 0.3 percentage points on top-1 accuracy. Composer-2 0.888, GPT-5.5 0.886, Claude Opus 4.7 0.886, Gemini 3.1 Pro 0.886. Top-3 accuracy ranges from 0.921 (Gemini) to 0.943 (GPT-5.5). The 95% bootstrap CIs overlap completely. Model choice is not the deciding factor at the router stage.
- One skill carries more than half of the routing failures: context-fundamentals. It was predicted correctly only 12 of 47 times. The rest split across context-degradation (12), project-development (12), context-optimization (8), evaluation (2), and tool-design (1), which is 35 of the 64 off-diagonal runs in the confusion matrix. This activation description is too broad and overlaps with adjacent skills; it is the highest-leverage description to rewrite.
- tool-design vs project-development is a real boundary problem. 12 of 48 tool-design cases were routed to project-development, and 12 of 48 project-development cases were routed to tool-design. Symmetric, mild, but consistent.
- evaluation vs advanced-evaluation is essentially solved. Only 3 of 36 evaluation cases leaked to advanced-evaluation, and 1 of 49 advanced-evaluation cases leaked the other way. The v2.2.0 refactor that hardened this boundary is working.
- Negative controls behave as designed. Prompts like "compute the area of a triangle" (p045) routed to the expected catch-all only 25% of the time and split across multiple skills; no model latched onto an inappropriate "domain" skill. This is the correct behavior: when no skill strongly fits, no skill should dominate.
- Format compliance is essentially perfect. 1 format failure out of 566 calls (0.18%). Strict-JSON routing prompts work reliably across all four models.
- Latency varies by up to ~2.7x. Median ms per call: Claude 3392, GPT-5.5 3764, Composer-2 3957, Gemini 3.1 Pro 9077. Gemini is the slow path; the others are interchangeable for routing throughput.
The clear next action is to rewrite the context-fundamentals activation description so that it wins decisively on foundational-only prompts and does not bleed into adjacent territory. The expected impact on top-1 is roughly +5-7 percentage points across all models.
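The +5-7 point estimate can be sanity-checked against the confusion counts reported later in this document. The sketch below is a back-of-envelope calculation, not part of the benchmark code. It assumes the 564 runs in the confusion table are the denominator and that a rewritten description recovers some fraction of the 35 context-fundamentals misses without creating new errors elsewhere. The function name and the `recovered_fraction` parameter are illustrative.

```python
# Counts copied from the per-skill confusion table in this report.
SCORED_RUNS = 564          # sum of all row totals in the confusion table
CF_EXPECTED = 47           # runs whose expected skill was context-fundamentals
CF_CORRECT = 12            # of those, runs routed correctly
CF_MISSES = CF_EXPECTED - CF_CORRECT  # 35


def top1_gain(recovered_fraction: float) -> float:
    """Pooled top-1 gain, in percentage points, if this fraction of the
    context-fundamentals misses were fixed and nothing else changed."""
    return 100.0 * recovered_fraction * CF_MISSES / SCORED_RUNS


for fraction in (0.8, 1.0):
    print(f"recover {fraction:.0%}: +{top1_gain(fraction):.1f} points")
```

Under these assumptions the ceiling is about +6.2 points when every miss is recovered, and about +5.0 points at 80% recovery, which is in line with the stated range.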
Methodology
Each prompt is presented to each model with the 15 skill activation descriptions in a deterministically-shuffled order (different shuffle per replication). The model must return JSON with a ranked list of skill names. Top-1 accuracy is whether the first ranked skill matches the human-labeled expected_primary_skill; top-3 is whether the expected skill appears in the first three positions.
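The shuffle-and-score procedure described above can be sketched as follows. This is a minimal Python illustration, not the runner's actual code (the runner is `src/runRouter.ts`). The seeding scheme `"{seed}:{rep}"`, the JSON key `ranking`, and the function names are all assumptions made for the example.

```python
import json
import random


def shuffled_skills(skills, seed, rep):
    """Return the skills in an order fixed by (seed, rep)."""
    rng = random.Random(f"{seed}:{rep}")  # string seeds are hashed deterministically
    order = list(skills)
    rng.shuffle(order)
    return order


def score(raw_output, expected, known_skills):
    """Parse one model reply and score it against the expected skill."""
    failure = {"format_ok": False, "top1": False, "top3": False}
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return failure
    # Assumed reply shape: {"ranking": ["skill-a", "skill-b", ...]}
    ranking = parsed.get("ranking") if isinstance(parsed, dict) else None
    if not isinstance(ranking, list) or not ranking:
        return failure
    if any(not isinstance(s, str) or s not in known_skills for s in ranking):
        return failure
    return {
        "format_ok": True,
        "top1": ranking[0] == expected,
        "top3": expected in ranking[:3],
    }


skills = ["context-fundamentals", "context-degradation", "tool-design", "evaluation"]
reply = '{"ranking": ["context-degradation", "context-fundamentals", "tool-design"]}'
print(shuffled_skills(skills, seed=1, rep=0))
print(score(reply, "context-fundamentals", skills))
```

In this example the reply counts as a top-1 miss and a top-3 hit, because the expected skill is ranked second. A reply that is not valid JSON, or that names an unknown skill, is treated here as a format failure.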
No skills are loaded into the agent (settingSources: []); the only routing signal is the in-prompt descriptions. Confidence intervals are 95% bootstrap with 2000 resamples.
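For readers unfamiliar with bootstrap intervals, here is a minimal sketch of a percentile bootstrap with 2000 resamples. The report does not say which bootstrap variant is used or whether resampling is over runs or over prompts, so this version (percentile method, resampling individual runs) is an assumption, and the input data is illustrative.

```python
import random


def bootstrap_ci(outcomes, resamples=2000, alpha=0.05, seed=1):
    """Percentile bootstrap CI for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample n outcomes with replacement and record the mean each time.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(resamples)
    )
    tail = int(resamples * alpha / 2)  # resamples cut from each tail
    return means[tail], means[resamples - tail - 1]


# Illustrative input: 141 runs with 125 correct (about 0.887), not the real per-run data.
outcomes = [1] * 125 + [0] * 16
low, high = bootstrap_ci(outcomes)
print(f"top-1 = {sum(outcomes) / len(outcomes):.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

With about 141 runs per model the interval is roughly ten percentage points wide, which is why differences of a fraction of a point between models cannot be separated.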
Per-model leaderboard
| Model | Top-1 | 95% CI | Top-3 | 95% CI | Format Failures | Median ms |
|---|---|---|---|---|---|---|
| composer-2 | 0.888 | [0.832, 0.937] | 0.930 | [0.888, 0.972] | 0 | 3957 |
| claude-opus-4-7 | 0.886 | [0.830, 0.936] | 0.936 | [0.894, 0.972] | 0 | 3392 |
| gpt-5.5 | 0.886 | [0.830, 0.936] | 0.943 | [0.901, 0.979] | 0 | 3764 |
| gemini-3.1-pro | 0.886 | [0.829, 0.936] | 0.921 | [0.879, 0.964] | 1 | 9077 |
Per-skill confusion (when expected is X, predicted is Y)
Rows are the ground-truth expected_primary_skill; columns are what models actually predicted. Only finished runs counted.
| Expected \ Predicted | advanced-evaluation | bdi-mental-states | context-compression | context-degradation | context-fundamentals | context-optimization | evaluation | filesystem-context | harness-engineering | hosted-agents | latent-briefing | memory-systems | multi-agent-patterns | project-development | tool-design |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| advanced-evaluation (n=49) | 48 | - | - | - | - | - | 1 | - | - | - | - | - | - | - | - |
| bdi-mental-states (n=24) | - | 24 | - | - | - | - | - | - | - | - | - | - | - | - | - |
| context-compression (n=36) | - | - | 36 | - | - | - | - | - | - | - | - | - | - | - | - |
| context-degradation (n=36) | - | - | - | 36 | - | - | - | - | - | - | - | - | - | - | - |
| context-fundamentals (n=47) | - | - | - | 12 | 12 | 8 | 2 | - | - | - | - | - | - | 12 | 1 |
| context-optimization (n=36) | - | - | - | - | - | 36 | - | - | - | - | - | - | - | - | - |
| evaluation (n=36) | 3 | - | - | - | - | - | 33 | - | - | - | - | - | - | - | - |
| filesystem-context (n=36) | - | - | - | - | - | - | - | 36 | - | - | - | - | - | - | - |
| harness-engineering (n=36) | - | - | - | - | - | - | - | - | 36 | - | - | - | - | - | - |
| hosted-agents (n=24) | - | - | - | - | - | - | - | - | - | 24 | - | - | - | - | - |
| latent-briefing (n=24) | - | - | - | - | - | - | - | - | - | - | 24 | - | - | - | - |
| memory-systems (n=36) | - | - | - | - | - | - | - | - | - | - | - | 36 | - | - | - |
| multi-agent-patterns (n=48) | - | - | - | - | - | - | - | - | - | - | - | - | 48 | - | - |
| project-development (n=48) | - | - | - | - | - | - | - | - | - | - | - | - | - | 36 | 12 |
| tool-design (n=48) | - | - | - | - | - | - | - | - | 1 | - | - | - | - | 12 | 35 |
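
One way to read the matrix is to compute per-skill recall (diagonal count divided by the row total) and list where each row's misses went. The sketch below does this for the five rows that have any off-diagonal count; the numbers are copied from the table, while the dictionary layout and function names are illustrative and not taken from the report renderer.

```python
# Non-zero cells of the five rows that contain any off-diagonal count,
# copied from the confusion table. The other ten rows are fully on the diagonal.
confusion = {
    "advanced-evaluation": {"advanced-evaluation": 48, "evaluation": 1},
    "context-fundamentals": {
        "context-degradation": 12,
        "context-fundamentals": 12,
        "context-optimization": 8,
        "evaluation": 2,
        "project-development": 12,
        "tool-design": 1,
    },
    "evaluation": {"advanced-evaluation": 3, "evaluation": 33},
    "project-development": {"project-development": 36, "tool-design": 12},
    "tool-design": {"harness-engineering": 1, "project-development": 12, "tool-design": 35},
}


def recall(expected):
    """Share of runs for this expected skill that were routed correctly."""
    row = confusion[expected]
    return row.get(expected, 0) / sum(row.values())


def misses(expected):
    """Wrong predictions for this row, largest count first, ties by name."""
    row = confusion[expected]
    wrong = [(skill, n) for skill, n in row.items() if skill != expected]
    return sorted(wrong, key=lambda item: (-item[1], item[0]))


for skill in sorted(confusion, key=recall):
    print(f"{skill}: recall {recall(skill):.3f}, misses {misses(skill)}")
```

This puts context-fundamentals at the bottom with a recall of about 0.255, followed by tool-design (about 0.729) and project-development (0.750).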
Hardest prompts (lowest top-1 across all models)
| Prompt | Expected | Top-1 Rate | Predicted Primaries |
|---|---|---|---|
| p001 | context-fundamentals | 0.00 | context-degradation |
| p037 | project-development | 0.00 | tool-design |
| p046 | tool-design | 0.00 | project-development |
| p048 | advanced-evaluation | 0.00 | evaluation |
| p040 | context-fundamentals | 0.25 | context-fundamentals, context-optimization |
| p045 | context-fundamentals | 0.25 | context-fundamentals, evaluation, project-development, tool-design |
| p047 | context-fundamentals | 0.50 | context-fundamentals, project-development |
| p016 | evaluation | 0.75 | advanced-evaluation, evaluation |
| p041 | tool-design | 0.92 | harness-engineering, tool-design |
| p002 | context-degradation | 1.00 | context-degradation |
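
The per-prompt rows above can be derived by grouping runs by prompt and pooling across models and replications. The sketch below assumes each run is a record with `prompt_id`, `expected`, and `predicted` fields; these field names are hypothetical and the real per-run JSON schema may differ. The sample records are illustrative, shaped to match the p016 row (9 correct, 3 routed to advanced-evaluation).

```python
from collections import defaultdict


def top1_rate_by_prompt(records):
    """Map prompt id -> (top-1 rate, sorted distinct predicted primaries)."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    predicted = defaultdict(set)
    for record in records:
        pid = record["prompt_id"]
        totals[pid] += 1
        hits[pid] += record["predicted"] == record["expected"]  # True counts as 1
        predicted[pid].add(record["predicted"])
    return {pid: (hits[pid] / totals[pid], sorted(predicted[pid])) for pid in totals}


# Illustrative records only; field names are assumptions.
correct = {"prompt_id": "p016", "expected": "evaluation", "predicted": "evaluation"}
leaked = {"prompt_id": "p016", "expected": "evaluation", "predicted": "advanced-evaluation"}
records = [correct] * 9 + [leaked] * 3

print(top1_rate_by_prompt(records))
```

Sorting the resulting rates in ascending order and keeping the first ten entries would give a table of the same shape as the one above.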
Reproducibility
Reproduce these numbers exactly with:

    cd researcher/benchmarks/sdk-runner
    npm install
    export CURSOR_API_KEY=<your-key>
    node --experimental-strip-types src/runRouter.ts --models claude-opus-4-7,composer-2,gemini-3.1-pro,gpt-5.5 --reps 3 --seed 1 --max-budget-usd 15
    python3 researcher/scripts/render_router_report.py \
      --results researcher/benchmarks/router/results/<date>-<seed> \
      --fixture researcher/benchmarks/router/prompts.jsonl \
      --output researcher/benchmarks/router/results-published/<date>.md

Per-run JSON artifacts (prompt, model, replication, raw model output, parsed ranking) are preserved under the gitignored results/ directory next to the summary that drives this report.