Router Benchmark Results

run timestamp: 2026-05-19T05:52:52Z repo commit: 272702e0bb1ff4f78d45fb7253da872da170d458 fixture sha256-16: 8f974d930836bc9c seed: 1 runs: 600 models: claude-opus-4-7, composer-2, gemini-3.1-pro, gpt-5.5 reps per (prompt, model): 3

Executive Summary

This sweep validates the corpus-wide hardening pass after all 15 skill bodies, mechanism mappings, claim provenance records, corpus index entries, and activation fixtures were updated.

600/600 usable records, 0 format failures. The first pass produced transient empty/format-failed SDK outputs; those records were rerun through the runner's resume path and all completed successfully. The runner now retries transient format failures once.
No broad routing collapse from the corpus-wide pass. Three models stayed at or above 0.913 top-1. Claude Opus 4.7 is lower at 0.840, but its remaining misses are concentrated in the same known ambiguous boundaries as the previous report.
Newly hardened skills route cleanly. bdi-mental-states, hosted-agents, latent-briefing, memory-systems, and multi-agent-patterns are all perfect in this sweep.
Remaining failures are concentrated and already understood. p046 is a negative-control Python formatting task with no true matching skill, p048 is a genuinely ambiguous advanced-evaluation/evaluation/latent-briefing prompt, and context-fundamentals remains the weakest catch-all boundary.

Compared with 2026-05-15-v2.md, the material conclusion is unchanged: the description-benchmark loop worked, and the corpus-wide body/metadata hardening did not introduce a broad router regression. The next benchmark investment should be Stage 3 effectiveness, where full skill bodies are loaded.

Methodology

Each prompt is presented to each model with the 15 skill activation descriptions in a deterministically-shuffled order (different shuffle per replication). The model must return JSON with a ranked list of skill names. Top-1 accuracy is whether the first ranked skill matches the human-labeled expected_primary_skill; top-3 is whether the expected skill appears in the first three positions.

No skills are loaded into the agent (settingSources: []); the only routing signal is the in-prompt descriptions. Confidence intervals are 95% bootstrap with 2000 resamples.

Per-model leaderboard

Model	Top-1	95% CI	Top-3	95% CI	Median ms
`gemini-3.1-pro`	0.920	[0.873, 0.960]	0.933	[0.893, 0.973]	8631
`composer-2`	0.913	[0.867, 0.953]	0.947	[0.907, 0.980]	3004
`gpt-5.5`	0.913	[0.867, 0.953]	0.973	[0.947, 0.993]	4050
`claude-opus-4-7`	0.840	[0.780, 0.893]	0.933	[0.893, 0.973]	3178

Per-skill confusion (when expected is X, predicted is Y)

Rows are the ground-truth expected_primary_skill; columns are what models actually predicted. Only finished runs counted.

Expected \ Predicted	`advanced-evaluation`	`bdi-mental-states`	`context-compression`	`context-degradation`	`context-fundamentals`	`context-optimization`	`evaluation`	`filesystem-context`	`harness-engineering`	`hosted-agents`	`latent-briefing`	`memory-systems`	`multi-agent-patterns`	`project-development`	`tool-design`
`advanced-evaluation` (n=60)	48	-	-	-	-	-	10	-	-	-	2	-	-	-	-
`bdi-mental-states` (n=24)	-	24	-	-	-	-	-	-	-	-	-	-	-	-	-
`context-compression` (n=36)	-	2	34	-	-	-	-	-	-	-	-	-	-	-	-
`context-degradation` (n=36)	-	-	-	36	-	-	-	-	-	-	-	-	-	-	-
`context-fundamentals` (n=42)	-	-	-	5	19	6	-	-	-	-	-	-	-	12	-
`context-optimization` (n=36)	-	-	-	-	-	36	-	-	-	-	-	-	-	-	-
`evaluation` (n=36)	4	-	-	-	-	-	32	-	-	-	-	-	-	-	-
`filesystem-context` (n=48)	-	-	-	-	-	-	-	48	-	-	-	-	-	-	-
`harness-engineering` (n=36)	-	-	-	-	-	-	-	-	36	-	-	-	-	-	-
`hosted-agents` (n=24)	-	-	-	-	-	-	-	-	-	24	-	-	-	-	-
`latent-briefing` (n=24)	-	-	-	-	-	-	-	-	-	-	24	-	-	-	-
`memory-systems` (n=36)	-	-	-	-	-	-	-	-	-	-	-	36	-	-	-
`multi-agent-patterns` (n=48)	-	-	-	-	-	-	-	-	-	-	-	-	48	-	-
`project-development` (n=48)	-	-	-	-	-	-	-	-	-	-	-	-	-	48	-
`tool-design` (n=57)	-	-	-	-	1	-	-	4	-	-	-	-	-	7	45

Hardest prompts (lowest top-1 across all models)

Prompt	Expected	Top-1 Rate	Predicted Primaries
p046	`tool-design`	0.00	`filesystem-context`, `project-development`
p048	`advanced-evaluation`	0.00	`evaluation`, `latent-briefing`
p047	`context-fundamentals`	0.17	`context-fundamentals`, `project-development`
p045	`context-fundamentals`	0.33	`context-fundamentals`, `project-development`
p040	`context-fundamentals`	0.50	`context-fundamentals`, `context-optimization`
p001	`context-fundamentals`	0.58	`context-degradation`, `context-fundamentals`
p016	`evaluation`	0.67	`advanced-evaluation`, `evaluation`
p041	`tool-design`	0.75	`context-fundamentals`, `project-development`, `tool-design`
p030	`context-compression`	0.83	`bdi-mental-states`, `context-compression`
p002	`context-degradation`	1.00	`context-degradation`

Reproducibility

Reproduce these numbers exactly with:

cd researcher/benchmarks/sdk-runner
npm install
export CURSOR_API_KEY=<your-key>
node --experimental-strip-types src/runRouter.ts --models claude-opus-4-7,composer-2,gemini-3.1-pro,gpt-5.5 --reps 3 --seed 1 --max-budget-usd 15
python3 ../../scripts/render_router_report.py \
    --results ../router/results/<date>-<seed> \
    --fixture ../router/prompts.jsonl \
    --output ../router/results-published/<date>.md

Per-run JSON artifacts (prompt, model, replication, raw model output, parsed ranking) are preserved under the gitignored results/ directory next to the summary that drives this report.

Preparing the source view

Agent Skills for Context Engineering