Router Benchmark Results (run 2)
- run timestamp: 2026-05-15T14:08:51Z
- repo commit: 358c36b461df4c0cb7a5c972e2f789dce8c12d3a (description fixes + hardened runner)
- fixture sha256-16: 8f974d930836bc9c (unchanged from baseline)
- seed: 1
- runs: 600 of 600 (100%; resume + concurrency=4; ~15 minute wall time vs ~60 minutes sequential)
- models: claude-opus-4-7, composer-2, gemini-3.1-pro, gpt-5.5
- reps per (prompt, model): 3
Executive summary
This run measures the effect of the description rewrites and harness improvements that landed after the baseline at results-published/2026-05-15.md.
Headline results:
- 3 of 4 models improved on top-1 accuracy: Composer-2 0.888 -> 0.913 (+2.5pp), GPT-5.5 0.886 -> 0.913 (+2.7pp), Gemini 3.1 Pro 0.886 -> 0.925 (+3.9pp). Claude Opus 4.7 went 0.886 -> 0.867 (-2.0pp) on top-1 but improved on top-3 (+1.7pp). Its baseline top-1 of 0.886 falls inside the new run's 95% CI of [0.813, 0.920], so the drop is probably noise.
- All 4 models improved on top-3 accuracy. Composer-2 top-3 jumped from 0.930 to 0.973 (+4.3pp).
- The targeted description rewrites worked. context-fundamentals went from 12/47 = 0.255 to 22/45 = 0.489 (+23.4pp). project-development went from 0.750 to 1.000 (+25pp, the largest single-skill improvement and now perfect routing). tool-design went from 0.729 to 0.807 (+7.8pp).
- The previously-hardest prompts are mostly fixed. p001 ("Explain why context windows degrade") went from 0.00 to 0.83 top-1. p037 ("Why structured output design") went from 0.00 to 1.00. p040 and p045 (other context-fundamentals prompts) improved by +17pp each.
- One apparent regression is mostly an artifact. advanced-evaluation looks like it dropped from 0.980 to 0.797, but the baseline completed only 49 of 60 expected runs (the v1 process died at 566/600), while the new run attempted all 60 and finished 59 (one was lost to a format failure). The 11 newly-attempted runs were heavily weighted toward p048, a genuinely ambiguous prompt ("Plan how to evaluate KV compaction with ablations and baselines") that routes to evaluation 11/12 times across all models. The absolute correct count is 48 (baseline) vs 47 (new), essentially the same.
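The arithmetic behind the advanced-evaluation artifact can be checked from the counts reported in this document. The sketch below is illustrative only: the variable names are not taken from the benchmark code, and it assumes that all 12 misses in the new run belong to p048, which is consistent with that prompt's 0.00 top-1 rate over 12 runs.

```python
# Counts taken from the per-skill table in this report.
baseline_correct, baseline_n = 48, 49   # baseline sweep died before finishing
new_correct, new_n = 47, 59             # one run lost to a format failure

baseline_rate = baseline_correct / baseline_n
new_rate = new_correct / new_n

# Assumption: every miss in the new run is a p048 run (12 runs, 0.00 top-1).
p048_runs = 12
new_rate_without_p048 = new_correct / (new_n - p048_runs)

print(f"baseline rate:         {baseline_rate:.3f}")
print(f"new rate:              {new_rate:.3f}")
print(f"new rate without p048: {new_rate_without_p048:.3f}")
print(f"correct-count change:  {new_correct - baseline_correct:+d}")
```

Under that assumption the rate drop comes from the larger denominator, while the number of correct runs moves by only one.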
What the data says we should do next:
- The remaining failure modes are now concentrated in context-fundamentals (still only 49% top-1) and a few specific prompts:
  - p047 ("Translate this English paragraph to French") regressed by 33pp on top-1. This is a negative control where no skill should fit; the new routing went to project-development instead of context-fundamentals. That behavior is acceptable; the fixture's expected primary is debatable.
  - p046 ("Reformat this Python file") and p048 ("Plan KV compaction evaluation") remain at 0.00 top-1 across all models. Both are genuinely ambiguous; consider re-labeling.
- context-fundamentals still has 14 confusions to project-development. The new description routes correctly when prompts use foundational vocabulary ("attention mechanics", "anatomy of context"), but generic onboarding prompts ("explain context for a new team member") still route to project-development. It may need one more description pass.
Methodological notes:
- Wall time improvement: 60min -> 15min via concurrency=4. The hardened runner means future sweeps will not silently die at 94%, and resume capability means a killed sweep can be picked up exactly where it left off.
- Format compliance: 597 of 600 (99.5%). All three format failures came from Gemini; the other three models had zero. Gemini's strict-JSON adherence is measurably weaker than theirs.
- Latency: Gemini median 9130ms vs 3269-4201ms for the others, roughly 2.2-2.8x slower in this run. The pattern matches the baseline: Gemini is consistently the slowest model.
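The resume and concurrency behavior described in the notes above can be sketched as follows. This is a minimal Python illustration of the idea, not the runner's implementation (the real runner is the TypeScript file src/runRouter.ts). The file naming scheme, the function names, and the stand-in `run_one` are all hypothetical.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from pathlib import Path


def pending_runs(prompts, models, reps, results_dir):
    """Yield (prompt, model, rep) keys that have no result file yet."""
    for prompt, model, rep in product(prompts, models, range(reps)):
        if not (results_dir / f"{prompt}__{model}__{rep}.json").exists():
            yield prompt, model, rep


def run_one(key, results_dir):
    """Stand-in for a real model call; writes one artifact per run."""
    prompt, model, rep = key
    path = results_dir / f"{prompt}__{model}__{rep}.json"
    path.write_text(json.dumps({"prompt": prompt, "model": model, "rep": rep}))
    return key


def sweep(prompts, models, reps, results_dir, concurrency=4):
    """Run only the missing keys, at most `concurrency` at a time."""
    todo = list(pending_runs(prompts, models, reps, results_dir))
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(lambda key: run_one(key, results_dir), todo))


results_dir = Path(tempfile.mkdtemp())
first = sweep(["p001", "p002"], ["model-a"], reps=3, results_dir=results_dir)
second = sweep(["p001", "p002"], ["model-a"], reps=3, results_dir=results_dir)
print(len(first), len(second))  # the second sweep finds nothing left to do
```

Because each finished run leaves its own artifact, a killed sweep only needs to recompute the keys whose files are missing.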
Methodology
Each prompt is presented to each model with the 15 skill activation descriptions in a deterministically-shuffled order (a different shuffle per replication). The model must return JSON with a ranked list of skill names. A run counts as a top-1 hit when the first ranked skill matches the human-labeled expected_primary_skill, and as a top-3 hit when the expected skill appears in the first three positions; the reported accuracies are the fraction of finished runs that hit.
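A minimal sketch of the shuffle and the hit rule described above, in Python. How the runner actually derives a per-replication shuffle from the seed is not stated in this report, so seeding with the string `"{seed}:{rep}"` is an assumption, as are the function names.

```python
import random


def shuffled_descriptions(skills, seed, rep):
    """Return the skills in an order fixed by (seed, rep)."""
    rng = random.Random(f"{seed}:{rep}")  # assumed seeding scheme
    order = list(skills)
    rng.shuffle(order)
    return order


def top_k_hit(ranking, expected, k):
    """True when the expected skill is among the first k ranked skills."""
    return expected in ranking[:k]


skills = ["evaluation", "advanced-evaluation", "tool-design", "memory-systems"]
order_a = shuffled_descriptions(skills, seed=1, rep=0)
order_b = shuffled_descriptions(skills, seed=1, rep=0)
print(order_a == order_b)  # same (seed, rep) gives the same order

ranking = ["evaluation", "advanced-evaluation", "tool-design"]
print(top_k_hit(ranking, "advanced-evaluation", k=1))
print(top_k_hit(ranking, "advanced-evaluation", k=3))
```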
No skills are loaded into the agent (settingSources: []); the only routing signal is the in-prompt descriptions. Confidence intervals are 95% bootstrap with 2000 resamples.
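For readers unfamiliar with bootstrap intervals, here is a minimal percentile-bootstrap sketch with 2000 resamples. It assumes resampling happens over individual runs; the report does not say whether the real script resamples runs or prompts, and the function name and seed handling are illustrative.

```python
import random


def bootstrap_ci(hits, resamples=2000, level=0.95, seed=1):
    """Percentile bootstrap CI for the mean of a list of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(hits)
    # Resample n outcomes with replacement and record each resample's mean.
    means = sorted(sum(rng.choices(hits, k=n)) / n for _ in range(resamples))
    lo_idx = int((1 - level) / 2 * resamples)
    hi_idx = int((1 + level) / 2 * resamples) - 1
    return means[lo_idx], means[hi_idx]


# Illustrative input: 130 hits out of 150 runs (accuracy 0.867).
hits = [1] * 130 + [0] * 20
low, high = bootstrap_ci(hits)
print(f"{sum(hits) / len(hits):.3f} [{low:.3f}, {high:.3f}]")
```

The exact bounds depend on the random seed and on the resampling unit, so they will not necessarily match the intervals in the leaderboard.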
Per-model leaderboard
| Model | Top-1 | 95% CI | Top-3 | 95% CI | Format Failures | Median ms |
|---|---|---|---|---|---|---|
| gemini-3.1-pro | 0.925 | [0.884, 0.966] | 0.932 | [0.884, 0.973] | 3 | 9130 |
| composer-2 | 0.913 | [0.867, 0.953] | 0.973 | [0.947, 0.993] | 0 | 3269 |
| gpt-5.5 | 0.913 | [0.867, 0.953] | 0.953 | [0.913, 0.980] | 0 | 4201 |
| claude-opus-4-7 | 0.867 | [0.813, 0.920] | 0.953 | [0.920, 0.987] | 0 | 3355 |
Per-skill confusion (when expected is X, predicted is Y)
Rows are the ground-truth expected_primary_skill; columns are what models actually predicted. Only finished runs counted.
| Expected \ Predicted | advanced-evaluation | bdi-mental-states | context-compression | context-degradation | context-fundamentals | context-optimization | evaluation | filesystem-context | harness-engineering | hosted-agents | latent-briefing | memory-systems | multi-agent-patterns | project-development | tool-design |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| advanced-evaluation (n=59) | 47 | - | - | - | - | - | 11 | - | - | - | 1 | - | - | - | - |
| bdi-mental-states (n=24) | - | 24 | - | - | - | - | - | - | - | - | - | - | - | - | - |
| context-compression (n=35) | - | 1 | 34 | - | - | - | - | - | - | - | - | - | - | - | - |
| context-degradation (n=36) | - | - | - | 36 | - | - | - | - | - | - | - | - | - | - | - |
| context-fundamentals (n=45) | - | - | - | 2 | 22 | 7 | - | - | - | - | - | - | - | 14 | - |
| context-optimization (n=36) | - | - | - | - | - | 36 | - | - | - | - | - | - | - | - | - |
| evaluation (n=36) | 3 | - | - | - | - | - | 33 | - | - | - | - | - | - | - | - |
| filesystem-context (n=48) | - | - | - | - | - | 2 | - | 46 | - | - | - | - | - | - | - |
| harness-engineering (n=36) | - | - | - | - | - | - | - | - | 36 | - | - | - | - | - | - |
| hosted-agents (n=24) | - | - | - | - | - | - | - | - | - | 24 | - | - | - | - | - |
| latent-briefing (n=24) | - | - | - | - | - | - | - | - | - | - | 24 | - | - | - | - |
| memory-systems (n=36) | - | - | - | - | - | - | - | - | - | - | - | 36 | - | - | - |
| multi-agent-patterns (n=48) | - | - | - | - | - | - | - | - | - | - | - | - | 48 | - | - |
| project-development (n=48) | - | - | - | - | - | - | - | - | - | - | - | - | - | 48 | - |
| tool-design (n=57) | - | - | - | - | 1 | - | - | 1 | 1 | - | - | - | - | 8 | 46 |
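A confusion table like the one above can be built by counting (expected, predicted primary) pairs over finished runs. The sketch below assumes each run record has `expected` and `ranking` fields; those field names are hypothetical and may not match the per-run JSON artifacts.

```python
from collections import Counter, defaultdict


def confusion(records):
    """Map expected skill -> Counter of predicted primary skills."""
    matrix = defaultdict(Counter)
    for rec in records:
        ranking = rec.get("ranking")
        if not ranking:  # skip unfinished or unparseable runs
            continue
        matrix[rec["expected"]][ranking[0]] += 1
    return matrix


records = [
    {"expected": "evaluation", "ranking": ["evaluation", "advanced-evaluation"]},
    {"expected": "evaluation", "ranking": ["advanced-evaluation", "evaluation"]},
    {"expected": "evaluation", "ranking": None},  # format failure, not counted
]
matrix = confusion(records)
print(dict(matrix["evaluation"]))
```

Each row's n is then the sum of that row's counts, which is why format failures reduce n instead of appearing as a column.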
Hardest prompts (lowest top-1 across all models)
| Prompt | Expected | Top-1 Rate | Predicted Primaries |
|---|---|---|---|
| p046 | tool-design | 0.00 | filesystem-context, harness-engineering, project-development |
| p048 | advanced-evaluation | 0.00 | evaluation, latent-briefing |
| p047 | context-fundamentals | 0.17 | context-fundamentals, project-development |
| p040 | context-fundamentals | 0.42 | context-fundamentals, context-optimization |
| p045 | context-fundamentals | 0.42 | context-fundamentals, project-development |
| p016 | evaluation | 0.75 | advanced-evaluation, evaluation |
| p001 | context-fundamentals | 0.83 | context-degradation, context-fundamentals |
| p030 | context-compression | 0.83 | bdi-mental-states, context-compression |
| p031 | filesystem-context | 0.83 | context-optimization, filesystem-context |
| p041 | tool-design | 0.83 | context-fundamentals, project-development, tool-design |
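The ranking above amounts to grouping finished runs by prompt, computing each prompt's top-1 rate, and sorting ascending. A minimal sketch follows, using hypothetical record fields (`prompt`, `expected`, `ranking`) and an assumed tie-break on prompt id.

```python
from collections import defaultdict


def hardest_prompts(records, limit=10):
    """Return (prompt, top-1 rate) pairs, lowest rate first."""
    hits = defaultdict(list)
    for rec in records:
        hits[rec["prompt"]].append(rec["ranking"][0] == rec["expected"])
    rates = {prompt: sum(h) / len(h) for prompt, h in hits.items()}
    # Sort by rate, then by prompt id so ties come out in a stable order.
    return sorted(rates.items(), key=lambda item: (item[1], item[0]))[:limit]


records = [
    {"prompt": "p048", "expected": "advanced-evaluation", "ranking": ["evaluation"]},
    {"prompt": "p048", "expected": "advanced-evaluation", "ranking": ["latent-briefing"]},
    {"prompt": "p016", "expected": "evaluation", "ranking": ["evaluation"]},
    {"prompt": "p016", "expected": "evaluation", "ranking": ["advanced-evaluation"]},
]
hardest = hardest_prompts(records)
print(hardest)
```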
Reproducibility
Reproduce these numbers exactly with:
```bash
# Run the sweep from the sdk-runner package.
cd researcher/benchmarks/sdk-runner
npm install
export CURSOR_API_KEY=<your-key>
node --experimental-strip-types src/runRouter.ts --models claude-opus-4-7,composer-2,gemini-3.1-pro,gpt-5.5 --reps 3 --seed 1 --max-budget-usd 15

# Render the report (these paths are relative to the repository root, not sdk-runner).
python3 researcher/scripts/render_router_report.py \
  --results researcher/benchmarks/router/results/<date>-<seed> \
  --fixture researcher/benchmarks/router/prompts.jsonl \
  --output researcher/benchmarks/router/results-published/<date>.md
```

Per-run JSON artifacts (prompt, model, replication, raw model output, parsed ranking) are preserved under the gitignored results/ directory next to the summary that drives this report.
Delta vs baseline
baseline: 2026-05-15 v2.2.0 descriptions (commit a865a8e)
Per-model accuracy change
| Model | Baseline Top-1 | New Top-1 | Delta | Baseline Top-3 | New Top-3 | Delta |
|---|---|---|---|---|---|---|
| claude-opus-4-7 | 0.886 | 0.867 | -0.020 | 0.936 | 0.953 | +0.017 |
| composer-2 | 0.888 | 0.913 | +0.025 | 0.930 | 0.973 | +0.043 |
| gemini-3.1-pro | 0.886 | 0.925 | +0.039 | 0.921 | 0.932 | +0.011 |
| gpt-5.5 | 0.886 | 0.913 | +0.027 | 0.943 | 0.953 | +0.010 |
Per-skill top-1 rate change
Counts a row as correct when the predicted primary equals the expected primary.
| Skill (expected) | Baseline | New | Delta |
|---|---|---|---|
| advanced-evaluation | 48/49 = 0.980 | 47/59 = 0.797 | -0.183 <- regressed |
| bdi-mental-states | 24/24 = 1.000 | 24/24 = 1.000 | 0.000 |
| context-compression | 36/36 = 1.000 | 34/35 = 0.971 | -0.029 |
| context-degradation | 36/36 = 1.000 | 36/36 = 1.000 | 0.000 |
| context-fundamentals | 12/47 = 0.255 | 22/45 = 0.489 | +0.234 <- improved |
| context-optimization | 36/36 = 1.000 | 36/36 = 1.000 | 0.000 |
| evaluation | 33/36 = 0.917 | 33/36 = 0.917 | 0.000 |
| filesystem-context | 36/36 = 1.000 | 46/48 = 0.958 | -0.042 |
| harness-engineering | 36/36 = 1.000 | 36/36 = 1.000 | 0.000 |
| hosted-agents | 24/24 = 1.000 | 24/24 = 1.000 | 0.000 |
| latent-briefing | 24/24 = 1.000 | 24/24 = 1.000 | 0.000 |
| memory-systems | 36/36 = 1.000 | 36/36 = 1.000 | 0.000 |
| multi-agent-patterns | 48/48 = 1.000 | 48/48 = 1.000 | 0.000 |
| project-development | 36/48 = 0.750 | 48/48 = 1.000 | +0.250 <- improved |
| tool-design | 35/48 = 0.729 | 46/57 = 0.807 | +0.078 <- improved |
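The deltas and flags in this table can be reproduced from the raw counts. The flag threshold is not stated in the report; the sketch below assumes 0.05, which happens to be consistent with the rows shown (-0.042 is unflagged, +0.078 is flagged). The function name is illustrative.

```python
def skill_delta(baseline, new, threshold=0.05):
    """Return (delta, flag) for (correct, n) pairs; threshold is assumed."""
    (b_correct, b_n), (n_correct, n_n) = baseline, new
    delta = n_correct / n_n - b_correct / b_n
    if delta >= threshold:
        flag = "improved"
    elif delta <= -threshold:
        flag = "regressed"
    else:
        flag = ""
    return round(delta, 3), flag


print(skill_delta((12, 47), (22, 45)))  # context-fundamentals
print(skill_delta((48, 49), (47, 59)))  # advanced-evaluation
print(skill_delta((36, 36), (46, 48)))  # filesystem-context
```

Since the baseline and new denominators differ for several skills, reading the raw counts alongside the delta gives a fairer picture than the delta alone.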
Previously-hardest prompts
| Prompt | Expected | Baseline Top-1 Rate | New Top-1 Rate | Delta |
|---|---|---|---|---|
| p001 | context-fundamentals | 0.00 | 0.83 | +0.833 |
| p002 | context-degradation | 1.00 | 1.00 | 0.000 |
| p016 | evaluation | 0.75 | 0.75 | 0.000 |
| p037 | project-development | 0.00 | 1.00 | +1.000 |
| p040 | context-fundamentals | 0.25 | 0.42 | +0.167 |
| p041 | tool-design | 0.92 | 0.83 | -0.083 |
| p045 | context-fundamentals | 0.25 | 0.42 | +0.167 |
| p046 | tool-design | 0.00 | 0.00 | 0.000 |
| p047 | context-fundamentals | 0.50 | 0.17 | -0.333 |
| p048 | advanced-evaluation | 0.00 | 0.00 | 0.000 |