researcher/benchmarks/router/results-published/2026-05-15-v2.md
# Router Benchmark Results (run 2)

_run timestamp: 2026-05-15T14:08:51Z_
_repo commit: `358c36b461df4c0cb7a5c972e2f789dce8c12d3a` (description fixes + hardened runner)_
_fixture sha256-16: `8f974d930836bc9c` (unchanged from baseline)_
_seed: 1_
_runs: 600 of 600 (100%; resume + concurrency=4; ~15 minute wall time vs ~60 minutes sequential)_
_models: claude-opus-4-7, composer-2, gemini-3.1-pro, gpt-5.5_
_reps per (prompt, model): 3_

## Executive summary

This run measures the effect of the description rewrites and harness improvements that landed after the baseline at `results-published/2026-05-15.md`.

**Headline results:**

- **3 of 4 models improved on top-1 accuracy.** Composer-2 0.888 -> 0.913 (+2.5pp), GPT-5.5 0.886 -> 0.913 (+2.7pp), Gemini 3.1 Pro 0.886 -> 0.925 (+3.9pp). Claude Opus 4.7 went 0.886 -> 0.867 (-2.0pp) on top-1 but improved on top-3 (+1.7pp). Its baseline top-1 sits inside the new run's 95% CI of [0.813, 0.920], so the drop is most likely noise.
- **All 4 models improved on top-3 accuracy.** Composer-2 top-3 jumped from 0.930 to 0.973 (+4.3pp).
- **The targeted description rewrites worked.** `context-fundamentals` went from 12/47 = 0.255 to 22/45 = 0.489 (**+23.4pp**, nearly doubling its top-1 rate). `project-development` went from 0.750 to 1.000 (**+25.0pp**, the largest single-skill gain and now perfect routing). `tool-design` went from 0.729 to 0.807 (+7.8pp).
- **The previously-hardest prompts are mostly fixed.** p001 ("Explain why context windows degrade") went from 0.00 to 0.83 top-1. p037 ("Why structured output design") went from 0.00 to 1.00. p040 and p045 (other context-fundamentals prompts) improved by +17pp each.
- **One apparent regression is mostly an artifact.** `advanced-evaluation` looks like it dropped from 0.980 to 0.797, but the baseline only completed 49 of 60 expected runs (the v1 process died at 566/600); the new run attempted all 60, of which 59 finished (one format failure).
The 11 newly attempted runs were heavily weighted toward p048, a genuinely ambiguous prompt ("Plan how to evaluate KV compaction with ablations and baselines") that routes to `evaluation` 11/12 times across all models. The absolute correct count is 48 (baseline) vs 47 (new), essentially unchanged.

**What the data says we should do next:**

- The remaining failure modes are now concentrated in `context-fundamentals` (still only 49% top-1) and a few specific prompts:
  - p047 ("Translate this English paragraph to French") regressed by 33pp on top-1 (0.50 -> 0.17). This is a negative control where no skill should fit; the new routing went to `project-development` instead of `context-fundamentals`. This is acceptable behavior; the fixture's expected primary is debatable.
  - p046 ("Reformat this Python file") and p048 ("Plan KV compaction evaluation") remain at 0.00 top-1 across all models. Both are genuinely ambiguous; consider re-labeling.
  - `context-fundamentals` still has 14 confusions to `project-development`. The new description routes correctly when prompts use foundational vocabulary ("attention mechanics", "anatomy of context"), but generic onboarding prompts ("explain context for a new team member") still route to `project-development`. It may need one more description pass.

**Methodological notes:**

- **Wall time improvement:** 60min -> 15min via concurrency=4. The hardened runner means future sweeps will not silently die at 94%, and resume capability means a killed sweep can be picked up exactly where it left off.
- **Format compliance:** 597 of 600 (99.5%). All three format failures came from Gemini; the other three models had zero. Gemini's strict-JSON adherence is weaker than the other three (3 failures in 150 runs vs 0).
- **Latency:** Gemini median 9130ms vs 3269-4201ms for others.
Same pattern as baseline; in this run Gemini is roughly 2.2-2.8x slower than the other models.

## Methodology

Each prompt is presented to each model with the 15 skill activation descriptions in a deterministically-shuffled order (different shuffle per replication). The model must return JSON with a ranked list of skill names. Top-1 accuracy is the fraction of runs in which the first-ranked skill matches the human-labeled `expected_primary_skill`; top-3 accuracy is the fraction in which the expected skill appears in the first three positions.

No skills are loaded into the agent (`settingSources: []`); the only routing signal is the in-prompt descriptions. Confidence intervals are 95% bootstrap with 2000 resamples.

## Per-model leaderboard

| Model | Top-1 | 95% CI | Top-3 | 95% CI | Format Failures | Median ms |
| --- | --- | --- | --- | --- | --- | --- |
| `gemini-3.1-pro` | 0.925 | [0.884, 0.966] | 0.932 | [0.884, 0.973] | 3 | 9130 |
| `composer-2` | 0.913 | [0.867, 0.953] | 0.973 | [0.947, 0.993] | 0 | 3269 |
| `gpt-5.5` | 0.913 | [0.867, 0.953] | 0.953 | [0.913, 0.980] | 0 | 4201 |
| `claude-opus-4-7` | 0.867 | [0.813, 0.920] | 0.953 | [0.920, 0.987] | 0 | 3355 |

## Per-skill confusion (when expected is X, predicted is Y)

Rows are the ground-truth `expected_primary_skill`; columns are what models actually predicted.
Only `finished` runs are counted.

| Expected \ Predicted | `advanced-evaluation` | `bdi-mental-states` | `context-compression` | `context-degradation` | `context-fundamentals` | `context-optimization` | `evaluation` | `filesystem-context` | `harness-engineering` | `hosted-agents` | `latent-briefing` | `memory-systems` | `multi-agent-patterns` | `project-development` | `tool-design` |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| `advanced-evaluation` (n=59) | **47** | - | - | - | - | - | 11 | - | - | - | 1 | - | - | - | - |
| `bdi-mental-states` (n=24) | - | **24** | - | - | - | - | - | - | - | - | - | - | - | - | - |
| `context-compression` (n=35) | - | 1 | **34** | - | - | - | - | - | - | - | - | - | - | - | - |
| `context-degradation` (n=36) | - | - | - | **36** | - | - | - | - | - | - | - | - | - | - | - |
| `context-fundamentals` (n=45) | - | - | - | 2 | **22** | 7 | - | - | - | - | - | - | - | 14 | - |
| `context-optimization` (n=36) | - | - | - | - | - | **36** | - | - | - | - | - | - | - | - | - |
| `evaluation` (n=36) | 3 | - | - | - | - | - | **33** | - | - | - | - | - | - | - | - |
| `filesystem-context` (n=48) | - | - | - | - | - | 2 | - | **46** | - | - | - | - | - | - | - |
| `harness-engineering` (n=36) | - | - | - | - | - | - | - | - | **36** | - | - | - | - | - | - |
| `hosted-agents` (n=24) | - | - | - | - | - | - | - | - | - | **24** | - | - | - | - | - |
| `latent-briefing` (n=24) | - | - | - | - | - | - | - | - | - | - | **24** | - | - | - | - |
| `memory-systems` (n=36) | - | - | - | - | - | - | - | - | - | - | - | **36** | - | - | - |
| `multi-agent-patterns` (n=48) | - | - | - | - | - | - | - | - | - | - | - | - | **48** | - | - |
| `project-development` (n=48) | - | - | - | - | - | - | - | - | - | - | - | - | - | **48** | - |
| `tool-design` (n=57) | - | - | - | - | 1 | - | - | 1 | 1 | - | - | - | - | 8 | **46** |

## Hardest prompts (lowest top-1 across all models)

| Prompt | Expected | Top-1 Rate | Predicted Primaries |
| --- | --- | --- | --- |
| p046 | `tool-design` | 0.00 | `filesystem-context`, `harness-engineering`, `project-development` |
| p048 | `advanced-evaluation` | 0.00 | `evaluation`, `latent-briefing` |
| p047 | `context-fundamentals` | 0.17 | `context-fundamentals`, `project-development` |
| p040 | `context-fundamentals` | 0.42 | `context-fundamentals`, `context-optimization` |
| p045 | `context-fundamentals` | 0.42 | `context-fundamentals`, `project-development` |
| p016 | `evaluation` | 0.75 | `advanced-evaluation`, `evaluation` |
| p001 | `context-fundamentals` | 0.83 | `context-degradation`, `context-fundamentals` |
| p030 | `context-compression` | 0.83 | `bdi-mental-states`, `context-compression` |
| p031 | `filesystem-context` | 0.83 | `context-optimization`, `filesystem-context` |
| p041 | `tool-design` | 0.83 | `context-fundamentals`, `project-development`, `tool-design` |

## Reproducibility

Reproduce these numbers exactly with:

```bash
cd researcher/benchmarks/sdk-runner
npm install
export CURSOR_API_KEY=<your-key>
node --experimental-strip-types src/runRouter.ts --models claude-opus-4-7,composer-2,gemini-3.1-pro,gpt-5.5 --reps 3 --seed 1 --max-budget-usd 15
# The renderer paths below are relative to the repo root, so return there first.
cd ../../..
python3 researcher/scripts/render_router_report.py \
  --results researcher/benchmarks/router/results/<date>-<seed> \
  --fixture researcher/benchmarks/router/prompts.jsonl \
  --output researcher/benchmarks/router/results-published/<date>.md
```

Per-run JSON artifacts (prompt, model, replication, raw model output, parsed ranking) are preserved under the gitignored `results/` directory next to the summary that drives this report.

## Delta vs baseline

_baseline: 2026-05-15 v2.2.0 descriptions (commit a865a8e)_

### Per-model accuracy change

| Model | Baseline Top-1 | New Top-1 | Top-1 Delta | Baseline Top-3 | New Top-3 | Top-3 Delta |
| --- | --- | --- | --- | --- | --- | --- |
| `claude-opus-4-7` | 0.886 | 0.867 | -0.020 | 0.936 | 0.953 | +0.017 |
| `composer-2` | 0.888 | 0.913 | +0.025 | 0.930 | 0.973 | +0.043 |
| `gemini-3.1-pro` | 0.886 | 0.925 | +0.039 | 0.921 | 0.932 | +0.011 |
| `gpt-5.5` | 0.886 | 0.913 | +0.027 | 0.943 | 0.953 | +0.010 |

### Per-skill top-1 rate change

A run counts as correct when the predicted primary equals the expected primary.

| Skill (expected) | Baseline | New | Delta |
| --- | --- | --- | --- |
| `advanced-evaluation` | 48/49 = 0.980 | 47/59 = 0.797 | -0.183 <- regressed |
| `bdi-mental-states` | 24/24 = 1.000 | 24/24 = 1.000 | 0.000 |
| `context-compression` | 36/36 = 1.000 | 34/35 = 0.971 | -0.029 |
| `context-degradation` | 36/36 = 1.000 | 36/36 = 1.000 | 0.000 |
| `context-fundamentals` | 12/47 = 0.255 | 22/45 = 0.489 | +0.234 <- improved |
| `context-optimization` | 36/36 = 1.000 | 36/36 = 1.000 | 0.000 |
| `evaluation` | 33/36 = 0.917 | 33/36 = 0.917 | 0.000 |
| `filesystem-context` | 36/36 = 1.000 | 46/48 = 0.958 | -0.042 |
| `harness-engineering` | 36/36 = 1.000 | 36/36 = 1.000 | 0.000 |
| `hosted-agents` | 24/24 = 1.000 | 24/24 = 1.000 | 0.000 |
| `latent-briefing` | 24/24 = 1.000 | 24/24 = 1.000 | 0.000 |
| `memory-systems` | 36/36 = 1.000 | 36/36 = 1.000 | 0.000 |
| `multi-agent-patterns` | 48/48 = 1.000 | 48/48 = 1.000 | 0.000 |
| `project-development` | 36/48 = 0.750 | 48/48 = 1.000 | +0.250 <- improved |
| `tool-design` | 35/48 = 0.729 | 46/57 = 0.807 | +0.078 <- improved |

### Previously-hardest prompts

| Prompt | Expected | Baseline Top-1 Rate | New Top-1 Rate | Delta |
| --- | --- | --- | --- | --- |
| p001 | `context-fundamentals` | 0.00 | 0.83 | +0.833 |
| p002 | `context-degradation` | 1.00 | 1.00 | 0.000 |
| p016 | `evaluation` | 0.75 | 0.75 | 0.000 |
| p037 | `project-development` | 0.00 | 1.00 | +1.000 |
| p040 | `context-fundamentals` | 0.25 | 0.42 | +0.167 |
| p041 | `tool-design` | 0.92 | 0.83 | -0.084 |
| p045 | `context-fundamentals` | 0.25 | 0.42 | +0.167 |
| p046 | `tool-design` | 0.00 | 0.00 | 0.000 |
| p047 | `context-fundamentals` | 0.50 | 0.17 | -0.333 |
| p048 | `advanced-evaluation` | 0.00 | 0.00 | 0.000 |
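
For readers who want to sanity-check the metrics defined under Methodology, the following is a minimal sketch of top-k accuracy and a 95% bootstrap confidence interval with 2000 resamples. It is not the code in `render_router_report.py`. The function names, the percentile-bootstrap variant, and the sample counts are all assumptions made for illustration, since the report does not say which bootstrap variant the renderer uses.

```python
import random
from typing import Sequence, Tuple


def top_k_hit(ranking: Sequence[str], expected: str, k: int) -> bool:
    """True when the expected skill appears in the first k ranked skills."""
    return expected in ranking[:k]


def accuracy(hits: Sequence[bool]) -> float:
    """Fraction of runs that were hits."""
    return sum(hits) / len(hits)


def bootstrap_ci(
    hits: Sequence[bool],
    resamples: int = 2000,
    level: float = 0.95,
    seed: int = 1,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for the mean of a list of hit/miss outcomes.

    Assumption: a plain percentile bootstrap over runs; the real report
    may resample differently (for example, by prompt).
    """
    rng = random.Random(seed)
    n = len(hits)
    # Resample the runs with replacement and record each resample's accuracy.
    means = sorted(sum(rng.choices(hits, k=n)) / n for _ in range(resamples))
    lo_idx = max(0, round((1 - level) / 2 * resamples))
    hi_idx = min(resamples - 1, round((1 + level) / 2 * resamples) - 1)
    return means[lo_idx], means[hi_idx]


# Hypothetical ranking for one run: wrong at top-1, right within top-3.
ranking = ["evaluation", "advanced-evaluation", "latent-briefing"]
top1 = top_k_hit(ranking, "advanced-evaluation", k=1)
top3 = top_k_hit(ranking, "advanced-evaluation", k=3)

# Illustrative counts only: 130 hits out of 150 runs gives 0.867.
hits = [True] * 130 + [False] * 20
point = accuracy(hits)
low, high = bootstrap_ci(hits)
print(f"top-1={point:.3f}  95% CI=[{low:.3f}, {high:.3f}]")
```

Seeding the resampler keeps the interval stable across re-renders of the same result set, which is useful when two reports are diffed against each other.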