researcher/benchmarks/router/results-published/2026-05-19.md
# Router Benchmark Results

_run timestamp: 2026-05-19T05:52:52Z_
_repo commit: `272702e0bb1ff4f78d45fb7253da872da170d458`_
_fixture sha256-16: `8f974d930836bc9c`_
_seed: 1_
_runs: 600_
_models: claude-opus-4-7, composer-2, gemini-3.1-pro, gpt-5.5_
_reps per (prompt, model): 3_

## Executive Summary

This sweep validates the corpus-wide hardening pass after all 15 skill bodies, mechanism mappings, claim provenance records, corpus index entries, and activation fixtures were updated.

- **600/600 usable records, 0 format failures.** The first pass produced transient empty/format-failed SDK outputs; those records were rerun through the runner's resume path and all completed successfully. The runner now retries transient format failures once.
- **No broad routing collapse from the corpus-wide pass.** Three models stayed at or above 0.913 top-1. Claude Opus 4.7 is lower at 0.840, but its remaining misses are concentrated in the same known ambiguous boundaries as the previous report.
- **Newly hardened skills route cleanly.** `bdi-mental-states`, `hosted-agents`, `latent-briefing`, `memory-systems`, and `multi-agent-patterns` are all perfect in this sweep.
- **Remaining failures are concentrated and already understood.** `p046` is a negative-control Python formatting task with no true matching skill, `p048` is a genuinely ambiguous advanced-evaluation/evaluation/latent-briefing prompt, and `context-fundamentals` remains the weakest catch-all boundary.

Compared with `2026-05-15-v2.md`, the material conclusion is unchanged: the description-benchmark loop worked, and the corpus-wide body/metadata hardening did not introduce a broad router regression. The next benchmark investment should be Stage 3 effectiveness, where full skill bodies are loaded.

## Methodology

Each prompt is presented to each model with the 15 skill activation descriptions in a deterministically-shuffled order (different shuffle per replication). The model must return JSON with a ranked list of skill names. Top-1 accuracy is whether the first ranked skill matches the human-labeled `expected_primary_skill`; top-3 is whether the expected skill appears in the first three positions.

No skills are loaded into the agent (`settingSources: []`); the only routing signal is the in-prompt descriptions. Confidence intervals are 95% bootstrap with 2000 resamples.

## Per-model leaderboard

| Model | Top-1 | 95% CI | Top-3 | 95% CI | Format Failures | Median ms |
| --- | --- | --- | --- | --- | --- | --- |
| `gemini-3.1-pro` | 0.920 | [0.873, 0.960] | 0.933 | [0.893, 0.973] | 0 | 8631 |
| `composer-2` | 0.913 | [0.867, 0.953] | 0.947 | [0.907, 0.980] | 0 | 3004 |
| `gpt-5.5` | 0.913 | [0.867, 0.953] | 0.973 | [0.947, 0.993] | 0 | 4050 |
| `claude-opus-4-7` | 0.840 | [0.780, 0.893] | 0.933 | [0.893, 0.973] | 0 | 3178 |

## Per-skill confusion (when expected is X, predicted is Y)

Rows are the ground-truth `expected_primary_skill`; columns are what models actually predicted. Only `finished` runs counted.

| Expected \ Predicted | `advanced-evaluation` | `bdi-mental-states` | `context-compression` | `context-degradation` | `context-fundamentals` | `context-optimization` | `evaluation` | `filesystem-context` | `harness-engineering` | `hosted-agents` | `latent-briefing` | `memory-systems` | `multi-agent-patterns` | `project-development` | `tool-design` |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| `advanced-evaluation` (n=60) | **48** | - | - | - | - | - | 10 | - | - | - | 2 | - | - | - | - |
| `bdi-mental-states` (n=24) | - | **24** | - | - | - | - | - | - | - | - | - | - | - | - | - |
| `context-compression` (n=36) | - | 2 | **34** | - | - | - | - | - | - | - | - | - | - | - | - |
| `context-degradation` (n=36) | - | - | - | **36** | - | - | - | - | - | - | - | - | - | - | - |
| `context-fundamentals` (n=42) | - | - | - | 5 | **19** | 6 | - | - | - | - | - | - | - | 12 | - |
| `context-optimization` (n=36) | - | - | - | - | - | **36** | - | - | - | - | - | - | - | - | - |
| `evaluation` (n=36) | 4 | - | - | - | - | - | **32** | - | - | - | - | - | - | - | - |
| `filesystem-context` (n=48) | - | - | - | - | - | - | - | **48** | - | - | - | - | - | - | - |
| `harness-engineering` (n=36) | - | - | - | - | - | - | - | - | **36** | - | - | - | - | - | - |
| `hosted-agents` (n=24) | - | - | - | - | - | - | - | - | - | **24** | - | - | - | - | - |
| `latent-briefing` (n=24) | - | - | - | - | - | - | - | - | - | - | **24** | - | - | - | - |
| `memory-systems` (n=36) | - | - | - | - | - | - | - | - | - | - | - | **36** | - | - | - |
| `multi-agent-patterns` (n=48) | - | - | - | - | - | - | - | - | - | - | - | - | **48** | - | - |
| `project-development` (n=48) | - | - | - | - | - | - | - | - | - | - | - | - | - | **48** | - |
| `tool-design` (n=57) | - | - | - | - | 1 | - | - | 4 | - | - | - | - | - | 7 | **45** |

## Hardest prompts (lowest top-1 across all models)

| Prompt | Expected | Top-1 Rate | Predicted Primaries |
| --- | --- | --- | --- |
| p046 | `tool-design` | 0.00 | `filesystem-context`, `project-development` |
| p048 | `advanced-evaluation` | 0.00 | `evaluation`, `latent-briefing` |
| p047 | `context-fundamentals` | 0.17 | `context-fundamentals`, `project-development` |
| p045 | `context-fundamentals` | 0.33 | `context-fundamentals`, `project-development` |
| p040 | `context-fundamentals` | 0.50 | `context-fundamentals`, `context-optimization` |
| p001 | `context-fundamentals` | 0.58 | `context-degradation`, `context-fundamentals` |
| p016 | `evaluation` | 0.67 | `advanced-evaluation`, `evaluation` |
| p041 | `tool-design` | 0.75 | `context-fundamentals`, `project-development`, `tool-design` |
| p030 | `context-compression` | 0.83 | `bdi-mental-states`, `context-compression` |
| p002 | `context-degradation` | 1.00 | `context-degradation` |

## Reproducibility

Reproduce these numbers exactly with:

```bash
cd researcher/benchmarks/sdk-runner
npm install
export CURSOR_API_KEY=<your-key>
node --experimental-strip-types src/runRouter.ts --models claude-opus-4-7,composer-2,gemini-3.1-pro,gpt-5.5 --reps 3 --seed 1 --max-budget-usd 15
python3 ../../scripts/render_router_report.py \
  --results ../router/results/<date>-<seed> \
  --fixture ../router/prompts.jsonl \
  --output ../router/results-published/<date>.md
```

Per-run JSON artifacts (prompt, model, replication, raw model output, parsed ranking) are preserved under the gitignored `results/` directory next to the summary that drives this report.
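
As a rough illustration of the scoring described under Methodology, the sketch below computes top-1 and top-3 hit rates and a percentile bootstrap interval with 2000 resamples. The `ranking` field name, the helper names, and the toy records are assumptions made for illustration; the actual artifact schema and the exact bootstrap variant used by `render_router_report.py` are not shown in this report and may differ.

```python
import random


def hit_at_k(ranking, expected, k):
    """Return 1.0 if the expected skill is among the first k ranked skills."""
    return 1.0 if expected in ranking[:k] else 0.0


def bootstrap_ci(values, resamples=2000, alpha=0.05, seed=1):
    """Percentile bootstrap interval for the mean of `values`.

    Assumption: a plain percentile bootstrap over individual records;
    the report does not say which bootstrap variant it uses.
    """
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, recording the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(resamples))
    lo_idx = int(round((alpha / 2) * (resamples - 1)))
    hi_idx = int(round((1 - alpha / 2) * (resamples - 1)))
    return means[lo_idx], means[hi_idx]


# Hypothetical parsed records; real artifacts may use different field names.
records = [
    {"expected_primary_skill": "evaluation",
     "ranking": ["evaluation", "advanced-evaluation", "tool-design"]},
    {"expected_primary_skill": "tool-design",
     "ranking": ["tool-design", "filesystem-context", "project-development"]},
    {"expected_primary_skill": "memory-systems",
     "ranking": ["memory-systems", "context-compression", "latent-briefing"]},
    {"expected_primary_skill": "advanced-evaluation",
     "ranking": ["evaluation", "advanced-evaluation", "latent-briefing"]},
]

top1 = [hit_at_k(r["ranking"], r["expected_primary_skill"], 1) for r in records]
top3 = [hit_at_k(r["ranking"], r["expected_primary_skill"], 3) for r in records]

top1_rate = sum(top1) / len(top1)
top3_rate = sum(top3) / len(top3)
top1_ci = bootstrap_ci(top1)
top3_ci = bootstrap_ci(top3)

print(f"top-1 {top1_rate:.3f} CI [{top1_ci[0]:.3f}, {top1_ci[1]:.3f}]")
print(f"top-3 {top3_rate:.3f} CI [{top3_ci[0]:.3f}, {top3_ci[1]:.3f}]")
```

With only four toy records the interval is far wider than the ones in the leaderboard, which are computed over 150 records per model. Fixing the resampling seed keeps the interval itself reproducible from run to run.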