Router Benchmark Results (run 2)

run timestamp: 2026-05-15T14:08:51Z
repo commit: 358c36b461df4c0cb7a5c972e2f789dce8c12d3a (description fixes + hardened runner)
fixture sha256-16: 8f974d930836bc9c (unchanged from baseline)
seed: 1
runs: 600 of 600 (100%; resume + concurrency=4; ~15 minute wall time vs ~60 minutes sequential)
models: claude-opus-4-7, composer-2, gemini-3.1-pro, gpt-5.5
reps per (prompt, model): 3

Executive summary

This run measures the effect of the description rewrites and harness improvements that landed after the baseline at results-published/2026-05-15.md.

Headline results:

3 of 4 models improved on top-1 accuracy. Composer-2 0.888 -> 0.913 (+2.5pp), GPT-5.5 0.886 -> 0.913 (+2.7pp), Gemini 3.1 Pro 0.886 -> 0.925 (+3.9pp). Claude Opus 4.7 went 0.886 -> 0.867 (-2.0pp top-1) but improved on top-3 (+1.7pp); its baseline top-1 still falls inside the new run's 95% CI of [0.813, 0.920], so the drop is most likely noise.
All 4 models improved on top-3 accuracy. Composer-2 top-3 jumped from 0.930 to 0.973 (+4.3pp).
The targeted description rewrites worked. project-development went from 0.750 to 1.000 (+25.0pp, the largest single-skill improvement and now perfect routing). context-fundamentals went from 12/47 = 0.255 to 22/45 = 0.489 (+23.4pp, nearly doubling its top-1 rate). tool-design went from 0.729 to 0.807 (+7.8pp).
The previously-hardest prompts are mostly fixed. p001 ("Explain why context windows degrade") went from 0.00 to 0.83 top-1. p037 ("Why structured output design") went from 0.00 to 1.00. p040 and p045 (other context-fundamentals prompts) improved by +17pp each.
One apparent regression is mostly an artifact. advanced-evaluation appears to drop from 0.980 to 0.797, but the baseline completed only 49 of the 60 expected runs (the v1 process died at 566/600), while the new run attempted all 60 and scored 59 after one format failure. The 11 runs missing from the baseline were heavily weighted toward p048, a genuinely ambiguous prompt ("Plan how to evaluate KV compaction with ablations and baselines") that routes to evaluation in 11 of 12 runs across all models. The absolute correct count is 48 (baseline) vs 47 (new), essentially unchanged.

What the data says we should do next:

The remaining failure modes are now concentrated in context-fundamentals (still only 49% top-1) and a few specific prompts:
p047 ("Translate this English paragraph to French") regressed by 33pp on top-1. This is a negative control where no skill should fit; the new routing went to project-development instead of context-fundamentals. That behavior is acceptable, since the fixture's expected primary for this prompt is debatable.
p046 ("Reformat this Python file") and p048 ("Plan KV compaction evaluation") remain at 0.00 top-1 across all models. Both are genuinely ambiguous; consider re-labeling.
context-fundamentals still has 14 confusions to project-development. The new description routes correctly when prompts use foundational vocabulary ("attention mechanics", "anatomy of context"), but generic onboarding prompts ("explain context for a new team member") still route to project-development. The description may need one more pass.

Methodological notes:

Wall time improvement: 60min -> 15min via concurrency=4. The hardened runner means future sweeps will not silently die at 94%, and resume capability means a killed sweep can be picked up exactly where it left off.
Format compliance: 597 of 600 (99.5%). All three format failures came from Gemini; the other three models had zero. Gemini's strict-JSON adherence is measurably weaker than that of the other three.
Latency: Gemini median 9130ms vs 3269-4201ms for the others, which is roughly 2.2-2.8x slower in this run. The baseline showed the same pattern.

Methodology

Each prompt is presented to each model with the 15 skill activation descriptions in a deterministically-shuffled order (different shuffle per replication). The model must return JSON with a ranked list of skill names. Top-1 accuracy is whether the first ranked skill matches the human-labeled expected_primary_skill; top-3 is whether the expected skill appears in the first three positions.
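The presentation and scoring rules above can be sketched as follows. This is a minimal illustration, not the runner's code: the function names and the way the shuffle key is built from (seed, prompt, replication) are assumptions made for the example.

```python
import random

def shuffled_descriptions(skills, seed, prompt_id, rep):
    """Return the skills in an order that is reproducible for (seed, prompt_id, rep)."""
    # Assumption: the real runner may derive its shuffle key differently.
    # Seeding random.Random with a string is deterministic across processes.
    rng = random.Random(f"{seed}:{prompt_id}:{rep}")
    order = list(skills)
    rng.shuffle(order)
    return order

def top_k_hit(ranking, expected, k):
    """True when the expected skill appears in the first k ranked positions."""
    return expected in ranking[:k]

# A ranking that misses on top-1 but hits on top-3.
ranking = ["evaluation", "advanced-evaluation", "latent-briefing"]
print(top_k_hit(ranking, "advanced-evaluation", 1))  # False
print(top_k_hit(ranking, "advanced-evaluation", 3))  # True
```

Keying the shuffle on the replication index is what gives each replication a different but repeatable order.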

No skills are loaded into the agent (settingSources: []); the only routing signal is the in-prompt descriptions. Confidence intervals are 95% bootstrap with 2000 resamples.
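For readers unfamiliar with bootstrap intervals, a minimal sketch follows. It assumes a percentile bootstrap that resamples individual runs with replacement; the report only states "95% bootstrap with 2000 resamples", so the exact variant, the resampling unit, and the name `bootstrap_ci` are assumptions.

```python
import random

def bootstrap_ci(outcomes, n_resamples=2000, level=0.95, seed=1):
    """Percentile bootstrap interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample the runs with replacement and record each resample's accuracy.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    # Take the empirical quantiles at the two tails.
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]

# 137 correct out of 150 runs gives a point estimate of about 0.913.
outcomes = [1] * 137 + [0] * 13
low, high = bootstrap_ci(outcomes)
print(f"top-1 = {sum(outcomes) / len(outcomes):.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

The interval endpoints depend on the random seed, so they will not match the leaderboard digit for digit.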

Per-model leaderboard

Model	Top-1	Top-1 95% CI	Top-3	Top-3 95% CI	Format Failures	Median Latency (ms)
`gemini-3.1-pro`	0.925	[0.884, 0.966]	0.932	[0.884, 0.973]	3	9130
`composer-2`	0.913	[0.867, 0.953]	0.973	[0.947, 0.993]	0	3269
`gpt-5.5`	0.913	[0.867, 0.953]	0.953	[0.913, 0.980]	0	4201
`claude-opus-4-7`	0.867	[0.813, 0.920]	0.953	[0.920, 0.987]	0	3355

Per-skill confusion (when expected is X, predicted is Y)

Rows are the ground-truth expected_primary_skill; columns are what models actually predicted as the primary (top-1) skill. Only finished runs are counted, and a dash marks a zero count.

Expected \ Predicted	`advanced-evaluation`	`bdi-mental-states`	`context-compression`	`context-degradation`	`context-fundamentals`	`context-optimization`	`evaluation`	`filesystem-context`	`harness-engineering`	`hosted-agents`	`latent-briefing`	`memory-systems`	`multi-agent-patterns`	`project-development`	`tool-design`
`advanced-evaluation` (n=59)	47	-	-	-	-	-	11	-	-	-	1	-	-	-	-
`bdi-mental-states` (n=24)	-	24	-	-	-	-	-	-	-	-	-	-	-	-	-
`context-compression` (n=35)	-	1	34	-	-	-	-	-	-	-	-	-	-	-	-
`context-degradation` (n=36)	-	-	-	36	-	-	-	-	-	-	-	-	-	-	-
`context-fundamentals` (n=45)	-	-	-	2	22	7	-	-	-	-	-	-	-	14	-
`context-optimization` (n=36)	-	-	-	-	-	36	-	-	-	-	-	-	-	-	-
`evaluation` (n=36)	3	-	-	-	-	-	33	-	-	-	-	-	-	-	-
`filesystem-context` (n=48)	-	-	-	-	-	2	-	46	-	-	-	-	-	-	-
`harness-engineering` (n=36)	-	-	-	-	-	-	-	-	36	-	-	-	-	-	-
`hosted-agents` (n=24)	-	-	-	-	-	-	-	-	-	24	-	-	-	-	-
`latent-briefing` (n=24)	-	-	-	-	-	-	-	-	-	-	24	-	-	-	-
`memory-systems` (n=36)	-	-	-	-	-	-	-	-	-	-	-	36	-	-	-
`multi-agent-patterns` (n=48)	-	-	-	-	-	-	-	-	-	-	-	-	48	-	-
`project-development` (n=48)	-	-	-	-	-	-	-	-	-	-	-	-	-	48	-
`tool-design` (n=57)	-	-	-	-	1	-	-	1	1	-	-	-	-	8	46
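A confusion table like the one above can be tallied from (expected, predicted) pairs. The sketch below is illustrative only: `confusion` and `top1_rate` are hypothetical helper names, and the input pairs are rebuilt from the context-fundamentals row rather than read from the run artifacts.

```python
from collections import Counter, defaultdict

def confusion(pairs):
    """Tally predicted primaries per expected skill."""
    table = defaultdict(Counter)
    for expected, predicted in pairs:
        table[expected][predicted] += 1
    return table

def top1_rate(row, expected):
    """Diagonal count divided by the row total."""
    n = sum(row.values())
    return row[expected] / n if n else float("nan")

# Rebuild the context-fundamentals row (n=45) from the table above.
cf = "context-fundamentals"
pairs = (
    [(cf, "context-degradation")] * 2
    + [(cf, cf)] * 22
    + [(cf, "context-optimization")] * 7
    + [(cf, "project-development")] * 14
)
table = confusion(pairs)
print(sum(table[cf].values()))            # 45
print(round(top1_rate(table[cf], cf), 3))  # 0.489
```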

Hardest prompts (lowest top-1 across all models)

Prompt	Expected	Top-1 Rate	Predicted Primaries
p046	`tool-design`	0.00	`filesystem-context`, `harness-engineering`, `project-development`
p048	`advanced-evaluation`	0.00	`evaluation`, `latent-briefing`
p047	`context-fundamentals`	0.17	`context-fundamentals`, `project-development`
p040	`context-fundamentals`	0.42	`context-fundamentals`, `context-optimization`
p045	`context-fundamentals`	0.42	`context-fundamentals`, `project-development`
p016	`evaluation`	0.75	`advanced-evaluation`, `evaluation`
p001	`context-fundamentals`	0.83	`context-degradation`, `context-fundamentals`
p030	`context-compression`	0.83	`bdi-mental-states`, `context-compression`
p031	`filesystem-context`	0.83	`context-optimization`, `filesystem-context`
p041	`tool-design`	0.83	`context-fundamentals`, `project-development`, `tool-design`
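The ranking above pools every model and replication for a prompt. A minimal sketch of that aggregation follows; the record fields (`prompt`, `expected`, `predicted`) and the tie-break by prompt id are assumptions, not the renderer's actual schema.

```python
from collections import defaultdict

def hardest_prompts(records, limit=10):
    """Rank prompts by pooled top-1 rate, lowest first."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    predicted = defaultdict(set)
    for rec in records:
        pid = rec["prompt"]
        totals[pid] += 1
        hits[pid] += rec["predicted"] == rec["expected"]
        predicted[pid].add(rec["predicted"])
    rows = [(pid, hits[pid] / totals[pid], sorted(predicted[pid])) for pid in totals]
    # Assumption: ties are broken by prompt id.
    rows.sort(key=lambda row: (row[1], row[0]))
    return rows[:limit]

def runs(prompt, expected, counts):
    """Expand {predicted: count} into one record per run."""
    return [
        {"prompt": prompt, "expected": expected, "predicted": skill}
        for skill, count in counts.items()
        for _ in range(count)
    ]

records = (
    runs("p016", "evaluation", {"evaluation": 9, "advanced-evaluation": 3})
    + runs("p048", "advanced-evaluation", {"evaluation": 11, "latent-briefing": 1})
    + runs("p047", "context-fundamentals", {"context-fundamentals": 2, "project-development": 10})
)
for pid, rate, seen in hardest_prompts(records):
    print(pid, f"{rate:.2f}", ", ".join(seen))
```

With 4 models and 3 replications, each prompt has at most 12 scored runs, which is why the rates in the table move in steps of about 0.08.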

Reproducibility

Reproduce these numbers exactly with:

# Run the sweep from the SDK runner directory.
cd researcher/benchmarks/sdk-runner
npm install
export CURSOR_API_KEY="<your-key>"
node --experimental-strip-types src/runRouter.ts --models claude-opus-4-7,composer-2,gemini-3.1-pro,gpt-5.5 --reps 3 --seed 1 --max-budget-usd 15
# Return to the repository root; the paths below are relative to it.
cd ../../..
# Replace <date> and <seed> with the values of the results directory to render.
python3 researcher/scripts/render_router_report.py \
    --results researcher/benchmarks/router/results/<date>-<seed> \
    --fixture researcher/benchmarks/router/prompts.jsonl \
    --output researcher/benchmarks/router/results-published/<date>.md

Per-run JSON artifacts (prompt, model, replication, raw model output, parsed ranking) are preserved under the gitignored results/ directory next to the summary that drives this report.

Delta vs baseline

baseline: 2026-05-15 v2.2.0 descriptions (commit a865a8e)

Per-model accuracy change

Model	Baseline Top-1	New Top-1	Top-1 Delta	Baseline Top-3	New Top-3	Top-3 Delta
`claude-opus-4-7`	0.886	0.867	-0.020	0.936	0.953	+0.017
`composer-2`	0.888	0.913	+0.025	0.930	0.973	+0.043
`gemini-3.1-pro`	0.886	0.925	+0.039	0.921	0.932	+0.011
`gpt-5.5`	0.886	0.913	+0.027	0.943	0.953	+0.010

Per-skill top-1 rate change

Counts a row as correct when the predicted primary equals the expected primary.

Skill (expected)	Baseline	New	Delta
`advanced-evaluation`	48/49 = 0.980	47/59 = 0.797	-0.183 <- regressed
`bdi-mental-states`	24/24 = 1.000	24/24 = 1.000	0.000
`context-compression`	36/36 = 1.000	34/35 = 0.971	-0.029
`context-degradation`	36/36 = 1.000	36/36 = 1.000	0.000
`context-fundamentals`	12/47 = 0.255	22/45 = 0.489	+0.234 <- improved
`context-optimization`	36/36 = 1.000	36/36 = 1.000	0.000
`evaluation`	33/36 = 0.917	33/36 = 0.917	0.000
`filesystem-context`	36/36 = 1.000	46/48 = 0.958	-0.042
`harness-engineering`	36/36 = 1.000	36/36 = 1.000	0.000
`hosted-agents`	24/24 = 1.000	24/24 = 1.000	0.000
`latent-briefing`	24/24 = 1.000	24/24 = 1.000	0.000
`memory-systems`	36/36 = 1.000	36/36 = 1.000	0.000
`multi-agent-patterns`	48/48 = 1.000	48/48 = 1.000	0.000
`project-development`	36/48 = 0.750	48/48 = 1.000	+0.250 <- improved
`tool-design`	35/48 = 0.729	46/57 = 0.807	+0.078 <- improved
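Because several rows change denominator between runs, a rate delta alone can mislead. The sketch below recomputes three rows from the table and reports the change in correct counts next to the rate change; `rate_change` is a hypothetical helper written for this illustration.

```python
def rate_change(baseline, new):
    """Compare two (correct, total) pairs for one skill."""
    (b_hits, b_n), (n_hits, n_n) = baseline, new
    return {
        "delta": round(n_hits / n_n - b_hits / b_n, 3),
        "hits_delta": n_hits - b_hits,
        # A changed denominator means the two rates cover different run sets.
        "n_changed": b_n != n_n,
    }

rows = {
    "advanced-evaluation": ((48, 49), (47, 59)),
    "context-fundamentals": ((12, 47), (22, 45)),
    "project-development": ((36, 48), (48, 48)),
}
for skill, (baseline, new) in rows.items():
    print(skill, rate_change(baseline, new))
# advanced-evaluation {'delta': -0.183, 'hits_delta': -1, 'n_changed': True}
# context-fundamentals {'delta': 0.234, 'hits_delta': 10, 'n_changed': True}
# project-development {'delta': 0.25, 'hits_delta': 12, 'n_changed': False}
```

For advanced-evaluation, the -0.183 rate delta sits next to a loss of a single correct run, which is the artifact described in the executive summary.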

Previously-hardest prompts

Prompt	Expected	Baseline Top-1 Rate	New Top-1 Rate	Delta
p001	`context-fundamentals`	0.00	0.83	+0.833
p002	`context-degradation`	1.00	1.00	0.000
p016	`evaluation`	0.75	0.75	0.000
p037	`project-development`	0.00	1.00	+1.000
p040	`context-fundamentals`	0.25	0.42	+0.167
p041	`tool-design`	0.92	0.83	-0.084
p045	`context-fundamentals`	0.25	0.42	+0.167
p046	`tool-design`	0.00	0.00	0.000
p047	`context-fundamentals`	0.50	0.17	-0.333
p048	`advanced-evaluation`	0.00	0.00	0.000

Agent Skills for Context Engineering

researcher/benchmarks/router/results-published/2026-05-15-v2.md
