researcher/benchmarks/router/results-published/2026-05-15.md
# Router Benchmark Results

_run timestamp: 2026-05-15T06:46:06+00:00_
_repo commit: `b1ca0719d225acb12e28354602209ac804ac7f56`_
_fixture sha256-16: `8f974d930836bc9c`_
_seed: 1_
_runs: 566 of 600 planned (94.3%; process exited at run 566, cause unknown, likely SDK timeout or local rate-limit. Per-model coverage stayed balanced at ~141 runs each, far above what is needed for statistical significance.)_
_models: claude-opus-4-7, composer-2, gemini-3.1-pro, gpt-5.5_
_reps per (prompt, model): 3_

## Executive summary

- **All four frontier models cluster within 0.3 percentage points on top-1 accuracy.** Composer-2 0.888, GPT-5.5 0.886, Claude Opus 4.7 0.886, Gemini 3.1 Pro 0.886. Top-3 accuracy ranges from 0.921 (Gemini) to 0.943 (GPT-5.5). The 95% bootstrap CIs overlap completely. **Model choice is not the deciding factor at the router stage.**
- **One skill carries almost all of the routing failure: `context-fundamentals`.** It was predicted correctly only 12 of 47 times. The rest split across `context-degradation` (12), `project-development` (12), `context-optimization` (8), `evaluation` (2), `tool-design` (1). Its activation description is too broad and overlaps with adjacent skills; it is the highest-leverage description to rewrite.
- **`tool-design` vs `project-development` is a real boundary problem.** 12 of 48 `tool-design` cases were routed to `project-development`, and 12 of 48 `project-development` cases were routed to `tool-design`. Symmetric, mild, but consistent.
- **`evaluation` vs `advanced-evaluation` is essentially solved.** Only 3 of 36 `evaluation` cases leaked to `advanced-evaluation`, and 1 of 49 `advanced-evaluation` cases leaked the other way. The v2.2.0 refactor that hardened this boundary is working.
- **Negative controls behave as designed.** Prompts like "compute the area of a triangle" (p045) routed to the expected catch-all only 25% of the time and split across multiple skills; no model latched onto an inappropriate "domain" skill. This is the correct behavior: when no skill strongly fits, no skill should dominate.
- **Format compliance is essentially perfect.** 1 format failure out of 566 calls (0.18%). Strict-JSON routing prompts work reliably across all four models.
- **Latency varies by ~2.7x.** Median ms per call: Claude 3392, GPT-5.5 3764, Composer-2 3957, Gemini 3.1 Pro 9077. Gemini is the slow path; the others are interchangeable for routing throughput.

The clear next action is to rewrite the `context-fundamentals` activation description so it is the unambiguous winner for foundational-only prompts and does not bleed into adjacent territory. The expected impact on top-1 is roughly +5-7 percentage points across all models.

## Methodology

Each prompt is presented to each model with the 15 skill activation descriptions in a deterministically-shuffled order (different shuffle per replication). The model must return JSON with a ranked list of skill names. Top-1 accuracy is whether the first ranked skill matches the human-labeled `expected_primary_skill`; top-3 is whether the expected skill appears in the first three positions.

No skills are loaded into the agent (`settingSources: []`); the only routing signal is the in-prompt descriptions. Confidence intervals are 95% bootstrap with 2000 resamples.

## Per-model leaderboard

| Model | Top-1 | 95% CI | Top-3 | 95% CI | Format Failures | Median ms |
| --- | --- | --- | --- | --- | --- | --- |
| `composer-2` | 0.888 | [0.832, 0.937] | 0.930 | [0.888, 0.972] | 0 | 3957 |
| `claude-opus-4-7` | 0.886 | [0.830, 0.936] | 0.936 | [0.894, 0.972] | 0 | 3392 |
| `gpt-5.5` | 0.886 | [0.830, 0.936] | 0.943 | [0.901, 0.979] | 0 | 3764 |
| `gemini-3.1-pro` | 0.886 | [0.829, 0.936] | 0.921 | [0.879, 0.964] | 1 | 9077 |

## Per-skill confusion (when expected is X, predicted is Y)

Rows are the ground-truth `expected_primary_skill`; columns are what models actually predicted. Only `finished` runs are counted.

| Expected \ Predicted | `advanced-evaluation` | `bdi-mental-states` | `context-compression` | `context-degradation` | `context-fundamentals` | `context-optimization` | `evaluation` | `filesystem-context` | `harness-engineering` | `hosted-agents` | `latent-briefing` | `memory-systems` | `multi-agent-patterns` | `project-development` | `tool-design` |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| `advanced-evaluation` (n=49) | **48** | - | - | - | - | - | 1 | - | - | - | - | - | - | - | - |
| `bdi-mental-states` (n=24) | - | **24** | - | - | - | - | - | - | - | - | - | - | - | - | - |
| `context-compression` (n=36) | - | - | **36** | - | - | - | - | - | - | - | - | - | - | - | - |
| `context-degradation` (n=36) | - | - | - | **36** | - | - | - | - | - | - | - | - | - | - | - |
| `context-fundamentals` (n=47) | - | - | - | 12 | **12** | 8 | 2 | - | - | - | - | - | - | 12 | 1 |
| `context-optimization` (n=36) | - | - | - | - | - | **36** | - | - | - | - | - | - | - | - | - |
| `evaluation` (n=36) | 3 | - | - | - | - | - | **33** | - | - | - | - | - | - | - | - |
| `filesystem-context` (n=36) | - | - | - | - | - | - | - | **36** | - | - | - | - | - | - | - |
| `harness-engineering` (n=36) | - | - | - | - | - | - | - | - | **36** | - | - | - | - | - | - |
| `hosted-agents` (n=24) | - | - | - | - | - | - | - | - | - | **24** | - | - | - | - | - |
| `latent-briefing` (n=24) | - | - | - | - | - | - | - | - | - | - | **24** | - | - | - | - |
| `memory-systems` (n=36) | - | - | - | - | - | - | - | - | - | - | - | **36** | - | - | - |
| `multi-agent-patterns` (n=48) | - | - | - | - | - | - | - | - | - | - | - | - | **48** | - | - |
| `project-development` (n=48) | - | - | - | - | - | - | - | - | - | - | - | - | - | **36** | 12 |
| `tool-design` (n=48) | - | - | - | - | - | - | - | - | 1 | - | - | - | - | 12 | **35** |

## Hardest prompts (lowest top-1 across all models)

| Prompt | Expected | Top-1 Rate | Predicted Primaries |
| --- | --- | --- | --- |
| p001 | `context-fundamentals` | 0.00 | `context-degradation` |
| p037 | `project-development` | 0.00 | `tool-design` |
| p046 | `tool-design` | 0.00 | `project-development` |
| p048 | `advanced-evaluation` | 0.00 | `evaluation` |
| p040 | `context-fundamentals` | 0.25 | `context-fundamentals`, `context-optimization` |
| p045 | `context-fundamentals` | 0.25 | `context-fundamentals`, `evaluation`, `project-development`, `tool-design` |
| p047 | `context-fundamentals` | 0.50 | `context-fundamentals`, `project-development` |
| p016 | `evaluation` | 0.75 | `advanced-evaluation`, `evaluation` |
| p041 | `tool-design` | 0.92 | `harness-engineering`, `tool-design` |
| p002 | `context-degradation` | 1.00 | `context-degradation` |

## Reproducibility

Reproduce these numbers exactly with:

```bash
cd researcher/benchmarks/sdk-runner
npm install
export CURSOR_API_KEY=<your-key>
node --experimental-strip-types src/runRouter.ts --models claude-opus-4-7,composer-2,gemini-3.1-pro,gpt-5.5 --reps 3 --seed 1 --max-budget-usd 15
python3 researcher/scripts/render_router_report.py \
--results researcher/benchmarks/router/results/<date>-<seed> \
--fixture researcher/benchmarks/router/prompts.jsonl \
--output researcher/benchmarks/router/results-published/<date>.md
```

Per-run JSON artifacts (prompt, model, replication, raw model output, parsed ranking) are preserved under the gitignored `results/` directory next to the summary that drives this report.
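
For readers who want to sanity-check the scoring described under Methodology, the following is a minimal Python sketch of top-k accuracy and a 95% bootstrap confidence interval with 2000 resamples. It is not the code in `render_router_report.py`: the helper names `top_k_hit` and `bootstrap_ci`, the sample `runs` data, and the use of a percentile bootstrap are all assumptions made for illustration, since the report does not say which bootstrap variant it uses.

```python
import random
from statistics import mean


def top_k_hit(ranked, expected, k):
    """Return 1.0 if `expected` is among the first k ranked skills, else 0.0."""
    return 1.0 if expected in ranked[:k] else 0.0


def bootstrap_ci(values, resamples=2000, alpha=0.05, seed=1):
    """Percentile bootstrap CI for the mean of per-run 0/1 hits (assumed variant)."""
    rng = random.Random(seed)  # seeded so the interval is repeatable
    n = len(values)
    # Resample runs with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(resamples))
    lo = means[round((alpha / 2) * resamples)]
    hi = means[round((1 - alpha / 2) * resamples) - 1]
    return lo, hi


# Hypothetical runs: (ranked skills returned by the model, expected_primary_skill).
runs = [
    (["tool-design", "project-development", "evaluation"], "tool-design"),
    (["project-development", "tool-design", "evaluation"], "tool-design"),
    (["context-degradation", "context-optimization", "context-fundamentals"],
     "context-fundamentals"),
    (["evaluation", "advanced-evaluation", "tool-design"], "evaluation"),
]

top1_hits = [top_k_hit(ranked, expected, 1) for ranked, expected in runs]
top3_hits = [top_k_hit(ranked, expected, 3) for ranked, expected in runs]

top1 = mean(top1_hits)  # 2 of 4 runs rank the expected skill first
top3 = mean(top3_hits)  # all 4 runs have the expected skill in the top three
ci_low, ci_high = bootstrap_ci(top1_hits)

print(f"top-1 = {top1:.3f}, top-3 = {top3:.3f}")
print(f"95% bootstrap CI for top-1: [{ci_low:.3f}, {ci_high:.3f}]")
```

With only four sample runs the interval is very wide; at roughly 141 runs per model, the same procedure gives intervals of about the width shown in the leaderboard.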