Router Benchmark Results
- run timestamp: 2026-05-19T05:52:52Z
- repo commit: 272702e0bb1ff4f78d45fb7253da872da170d458
- fixture sha256-16: 8f974d930836bc9c
- seed: 1
- runs: 600
- models: claude-opus-4-7, composer-2, gemini-3.1-pro, gpt-5.5
- reps per (prompt, model): 3
Executive Summary
This sweep validates the corpus-wide hardening pass after all 15 skill bodies, mechanism mappings, claim provenance records, corpus index entries, and activation fixtures were updated.
- 600/600 usable records, 0 format failures. The first pass produced transient empty/format-failed SDK outputs; those records were rerun through the runner's resume path and all completed successfully. The runner now retries transient format failures once.
- No broad routing collapse from the corpus-wide pass. Three models stayed at or above 0.913 top-1. Claude Opus 4.7 is lower at 0.840, but its remaining misses are concentrated in the same known ambiguous boundaries as the previous report.
- Newly hardened skills route cleanly. bdi-mental-states, hosted-agents, latent-briefing, memory-systems, and multi-agent-patterns are all perfect in this sweep.
- Remaining failures are concentrated and already understood. p046 is a negative-control Python formatting task with no true matching skill, p048 is a genuinely ambiguous advanced-evaluation/evaluation/latent-briefing prompt, and context-fundamentals remains the weakest catch-all boundary.
Compared with 2026-05-15-v2.md, the material conclusion is unchanged: the description-benchmark loop worked, and the corpus-wide body/metadata hardening did not introduce a broad router regression. The next benchmark investment should be Stage 3 effectiveness, where full skill bodies are loaded.
Methodology
Each prompt is presented to each model with the 15 skill activation descriptions in a deterministically shuffled order (a different shuffle per replication). The model must return JSON with a ranked list of skill names. A run counts as a top-1 hit when the first-ranked skill matches the human-labeled expected_primary_skill, and as a top-3 hit when the expected skill appears anywhere in the first three positions. Top-1 and top-3 accuracy are the fractions of runs that hit.
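The shuffle and scoring rules above can be sketched roughly as follows. This is an illustration rather than the runner's actual code: the function names, the `ranked_skills` JSON key, and the way the shuffle seed is combined with the prompt id and replication index are all assumptions made for the example.

```python
import json
import random


def shuffled_descriptions(skills, seed, prompt_id, rep):
    """Return skill names in a reproducible order that varies per replication.

    Assumption: the shuffle is keyed on (seed, prompt id, replication);
    the real runner may derive its ordering differently.
    """
    rng = random.Random(f"{seed}:{prompt_id}:{rep}")
    order = list(skills)
    rng.shuffle(order)
    return order


def score(raw_output, expected, k=3):
    """Score one response as (top1_hit, topk_hit), or None on a format failure.

    Assumption: the response is a JSON object whose hypothetical
    "ranked_skills" key holds skill names, best first.
    """
    try:
        ranking = json.loads(raw_output)["ranked_skills"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None
    if not isinstance(ranking, list) or not ranking:
        return None
    return ranking[0] == expected, expected in ranking[:k]
```

Returning `None` for unparseable output keeps format failures separate from routing misses, which mirrors how the leaderboard reports them in their own column.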
No skills are loaded into the agent (settingSources: []); the only routing signal is the in-prompt descriptions. Confidence intervals are 95% bootstrap with 2000 resamples.
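For readers unfamiliar with bootstrap intervals, here is a minimal sketch of one common variant, the percentile bootstrap over individual runs. The report does not say which variant the renderer uses or whether resampling is clustered by prompt, so the method, the function name, and the fixed RNG seed are assumptions.

```python
import random


def bootstrap_ci(hits, resamples=2000, alpha=0.05, seed=1):
    """Percentile bootstrap confidence interval for the mean of 0/1 outcomes.

    Assumption: runs are resampled independently with replacement.
    """
    rng = random.Random(seed)
    n = len(hits)
    # Mean of each resample, sorted so percentiles can be read off by index.
    means = sorted(sum(rng.choices(hits, k=n)) / n for _ in range(resamples))
    lo = means[int((alpha / 2) * resamples)]
    hi = means[int((1 - alpha / 2) * resamples) - 1]
    return lo, hi


# Example input: 138 hits in 150 runs, i.e. a top-1 rate of 0.920.
hits = [1] * 138 + [0] * 12
low, high = bootstrap_ci(hits)
print(f"top-1 = {sum(hits) / len(hits):.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

The exact bounds depend on the RNG stream and on the resampling scheme, so this sketch should not be expected to match the published intervals digit for digit.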
Per-model leaderboard
| Model | Top-1 | 95% CI | Top-3 | 95% CI | Format Failures | Median ms |
|---|---|---|---|---|---|---|
| gemini-3.1-pro | 0.920 | [0.873, 0.960] | 0.933 | [0.893, 0.973] | 0 | 8631 |
| composer-2 | 0.913 | [0.867, 0.953] | 0.947 | [0.907, 0.980] | 0 | 3004 |
| gpt-5.5 | 0.913 | [0.867, 0.953] | 0.973 | [0.947, 0.993] | 0 | 4050 |
| claude-opus-4-7 | 0.840 | [0.780, 0.893] | 0.933 | [0.893, 0.973] | 0 | 3178 |
Per-skill confusion (when expected is X, predicted is Y)
Rows are the ground-truth expected_primary_skill; columns are the skill each model ranked first. Counts are pooled across all models and replications, and only finished runs are counted.
| Expected \ Predicted | advanced-evaluation | bdi-mental-states | context-compression | context-degradation | context-fundamentals | context-optimization | evaluation | filesystem-context | harness-engineering | hosted-agents | latent-briefing | memory-systems | multi-agent-patterns | project-development | tool-design |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| advanced-evaluation (n=60) | 48 | - | - | - | - | - | 10 | - | - | - | 2 | - | - | - | - |
| bdi-mental-states (n=24) | - | 24 | - | - | - | - | - | - | - | - | - | - | - | - | - |
| context-compression (n=36) | - | 2 | 34 | - | - | - | - | - | - | - | - | - | - | - | - |
| context-degradation (n=36) | - | - | - | 36 | - | - | - | - | - | - | - | - | - | - | - |
| context-fundamentals (n=42) | - | - | - | 5 | 19 | 6 | - | - | - | - | - | - | - | 12 | - |
| context-optimization (n=36) | - | - | - | - | - | 36 | - | - | - | - | - | - | - | - | - |
| evaluation (n=36) | 4 | - | - | - | - | - | 32 | - | - | - | - | - | - | - | - |
| filesystem-context (n=48) | - | - | - | - | - | - | - | 48 | - | - | - | - | - | - | - |
| harness-engineering (n=36) | - | - | - | - | - | - | - | - | 36 | - | - | - | - | - | - |
| hosted-agents (n=24) | - | - | - | - | - | - | - | - | - | 24 | - | - | - | - | - |
| latent-briefing (n=24) | - | - | - | - | - | - | - | - | - | - | 24 | - | - | - | - |
| memory-systems (n=36) | - | - | - | - | - | - | - | - | - | - | - | 36 | - | - | - |
| multi-agent-patterns (n=48) | - | - | - | - | - | - | - | - | - | - | - | - | 48 | - | - |
| project-development (n=48) | - | - | - | - | - | - | - | - | - | - | - | - | - | 48 | - |
| tool-design (n=57) | - | - | - | - | 1 | - | - | 4 | - | - | - | - | - | 7 | 45 |
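
A confusion table of this shape can be built by counting the top-ranked prediction for each expected skill. The sketch below assumes each run is a dict with hypothetical `expected` and `ranking` keys, and that an unfinished run has an empty or missing ranking. The real per-run JSON schema may differ.

```python
from collections import Counter, defaultdict


def confusion(records):
    """Count top-1 predictions per expected skill, skipping unfinished runs.

    Assumption: "expected" and "ranking" are illustrative field names.
    """
    table = defaultdict(Counter)
    for rec in records:
        if not rec.get("ranking"):
            continue  # only finished runs are counted
        table[rec["expected"]][rec["ranking"][0]] += 1
    return table


# Toy records, not taken from the benchmark results.
records = [
    {"expected": "evaluation", "ranking": ["evaluation", "advanced-evaluation"]},
    {"expected": "evaluation", "ranking": ["advanced-evaluation", "evaluation"]},
    {"expected": "evaluation", "ranking": None},
]
table = confusion(records)
for expected, row in table.items():
    print(f"{expected} (n={sum(row.values())})", dict(row))
```

Under this construction the `n` shown for a row is the sum of its cells, so it can be smaller than the number of runs attempted for that skill.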
Hardest prompts (lowest top-1 across all models)
| Prompt | Expected | Top-1 Rate | Predicted Primaries |
|---|---|---|---|
| p046 | tool-design | 0.00 | filesystem-context, project-development |
| p048 | advanced-evaluation | 0.00 | evaluation, latent-briefing |
| p047 | context-fundamentals | 0.17 | context-fundamentals, project-development |
| p045 | context-fundamentals | 0.33 | context-fundamentals, project-development |
| p040 | context-fundamentals | 0.50 | context-fundamentals, context-optimization |
| p001 | context-fundamentals | 0.58 | context-degradation, context-fundamentals |
| p016 | evaluation | 0.67 | advanced-evaluation, evaluation |
| p041 | tool-design | 0.75 | context-fundamentals, project-development, tool-design |
| p030 | context-compression | 0.83 | bdi-mental-states, context-compression |
| p002 | context-degradation | 1.00 | context-degradation |
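
The ranking above pools every model and replication for a prompt. With 4 models and 3 replications there are 12 runs per prompt, so a rate of 0.17 is consistent with 2 hits out of 12. A minimal sketch of that aggregation follows. The record fields (`prompt_id`, `expected`, `ranking`) and the tie-break on prompt id are assumptions for illustration.

```python
from collections import defaultdict


def hardest_prompts(records, limit=10):
    """Return (prompt_id, top1_rate, predicted_primaries), lowest rate first.

    Assumption: ties are broken by prompt id; the renderer may order
    tied prompts differently.
    """
    hits = defaultdict(list)
    predicted = defaultdict(set)
    for rec in records:
        top = rec["ranking"][0]
        hits[rec["prompt_id"]].append(top == rec["expected"])
        predicted[rec["prompt_id"]].add(top)
    rows = [
        (pid, sum(h) / len(h), sorted(predicted[pid]))
        for pid, h in hits.items()
    ]
    return sorted(rows, key=lambda row: (row[1], row[0]))[:limit]


# Toy records, not taken from the benchmark results.
records = [
    {"prompt_id": "p1", "expected": "tool-design", "ranking": ["tool-design"]},
    {"prompt_id": "p1", "expected": "tool-design", "ranking": ["project-development"]},
    {"prompt_id": "p2", "expected": "evaluation", "ranking": ["advanced-evaluation"]},
    {"prompt_id": "p2", "expected": "evaluation", "ranking": ["advanced-evaluation"]},
]
rows = hardest_prompts(records)
```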
Reproducibility
Reproduce these numbers exactly with:
```bash
cd researcher/benchmarks/sdk-runner
npm install
export CURSOR_API_KEY=<your-key>
node --experimental-strip-types src/runRouter.ts --models claude-opus-4-7,composer-2,gemini-3.1-pro,gpt-5.5 --reps 3 --seed 1 --max-budget-usd 15
python3 ../../scripts/render_router_report.py \
  --results ../router/results/<date>-<seed> \
  --fixture ../router/prompts.jsonl \
  --output ../router/results-published/<date>.md
```

Per-run JSON artifacts (prompt, model, replication, raw model output, parsed ranking) are preserved under the gitignored results/ directory next to the summary that drives this report.