Source from repo
Agent Skills for Context Engineering

A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems.
Source repo: muratcankoylan (GitHub)
Files: 339 · Size: 4.3 MB · Entrypoint: SKILL.md · Format: git-repo
researcher/benchmarks/router/results-published/2026-05-15.md

# Router Benchmark Results

_run timestamp: 2026-05-15T06:46:06+00:00_  
_repo commit: `b1ca0719d225acb12e28354602209ac804ac7f56`_  
_fixture sha256-16: `8f974d930836bc9c`_  
_seed: 1_  
_runs: 566 of 600 planned (94.3%). The process exited at run 566; the cause is unknown, likely an SDK timeout or a local rate limit. Per-model coverage stayed balanced at ~141 runs each, enough to support the bootstrap confidence intervals reported below._  
_models: claude-opus-4-7, composer-2, gemini-3.1-pro, gpt-5.5_  
_reps per (prompt, model): 3_

## Executive summary

- **All four frontier models cluster within 0.3 percentage points on top-1 accuracy.** Composer-2 0.888, GPT-5.5 0.886, Claude Opus 4.7 0.886, Gemini 3.1 Pro 0.886. Top-3 accuracy ranges from 0.921 (Gemini) to 0.943 (GPT-5.5). The 95% bootstrap CIs overlap almost entirely. **Model choice is not the deciding factor at the router stage.**
- **One skill carries almost all of the routing failure: `context-fundamentals`.** It was predicted correctly only 12 of 47 times. The other 35 runs split across `context-degradation` (12), `project-development` (12), `context-optimization` (8), `evaluation` (2), and `tool-design` (1). Its activation description is too broad and overlaps with adjacent skills, which makes it the highest-leverage description to rewrite.
- **`tool-design` vs `project-development` is a real boundary problem.** 12 of 48 `tool-design` cases were routed to `project-development`, and 12 of 48 `project-development` cases were routed to `tool-design`. The confusion is symmetric and consistent, at 25% in each direction.
- **`evaluation` vs `advanced-evaluation` is essentially solved.** Only 3 of 36 `evaluation` cases leaked to `advanced-evaluation`, and 1 of 49 `advanced-evaluation` cases leaked the other way. The v2.2.0 refactor that hardened this boundary is working.
- **Negative controls behave as designed.** Prompts like "compute the area of a triangle" (p045) were routed to the expected catch-all (`context-fundamentals`) only 25% of the time; the rest split across `evaluation`, `project-development`, and `tool-design`, and no model latched onto a single inappropriate "domain" skill. This is the correct behavior: when no skill strongly fits, no skill should dominate.
- **Format compliance is essentially perfect.** 1 format failure out of 566 calls (0.18%). Strict-JSON routing prompts work reliably across all four models.
- **Latency varies by ~2.7x.** Median ms per call: Claude 3392, GPT-5.5 3764, Composer-2 3957, Gemini 3.1 Pro 9077. Gemini is the slow path; the others are interchangeable for routing throughput.

The clear next action is to rewrite the `context-fundamentals` activation description so that it wins unambiguously for foundational-only prompts and does not bleed into adjacent territory. The expected impact on top-1 is roughly +5-7 percentage points across all models, consistent with the 35 misrouted `context-fundamentals` runs making up about 6% of the 564 runs in the confusion table.

## Methodology

Each prompt is presented to each model with the 15 skill activation descriptions in a deterministically shuffled order (a different shuffle per replication). The model must return JSON with a ranked list of skill names. Top-1 accuracy is the fraction of runs in which the first-ranked skill matches the human-labeled `expected_primary_skill`; top-3 accuracy is the fraction in which the expected skill appears in the first three positions.
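
The routing loop described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the seed derivation from `(seed, prompt_id, rep)`, the `"ranking"` JSON key, and all function names are assumptions made for the example.

```python
import hashlib
import json
import random


def shuffled_descriptions(skills, seed, prompt_id, rep):
    """Return the skills in an order fully determined by (seed, prompt_id, rep)."""
    key = f"{seed}:{prompt_id}:{rep}".encode("utf-8")
    # Use a stable hash: Python's built-in hash() of strings is salted per process.
    derived = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    order = list(skills)
    random.Random(derived).shuffle(order)
    return order


def parse_ranking(raw_output, known_skills):
    """Parse strict JSON; return the ranked list, or None on a format failure."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return None
    # Hypothetical schema: {"ranking": ["skill-a", "skill-b", ...]}
    ranking = payload.get("ranking") if isinstance(payload, dict) else None
    if not isinstance(ranking, list) or not ranking:
        return None
    for name in ranking:
        if not isinstance(name, str) or name not in known_skills:
            return None
    return ranking


def top_k_hit(ranking, expected, k):
    """True when the expected skill is among the first k ranked skills."""
    return ranking is not None and expected in ranking[:k]
```

Under these assumptions, a format failure simply counts as a miss for both top-1 and top-3, and rerunning with the same seed presents each model with the same description order.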

No skills are loaded into the agent (`settingSources: []`); the only routing signal is the in-prompt descriptions. Confidence intervals are 95% bootstrap with 2000 resamples.
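
For readers unfamiliar with bootstrap intervals, here is a minimal sketch of one common variant, the percentile bootstrap, applied to per-run hit/miss outcomes. The report does not say which bootstrap variant it uses, so the percentile method, the function name, and the example counts (124 hits in 140 runs) are assumptions for illustration only.

```python
import random


def bootstrap_ci(outcomes, resamples=2000, level=0.95, seed=1):
    """Percentile-bootstrap confidence interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample n outcomes with replacement and record each resample's accuracy.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(resamples)
    )
    alpha = (1.0 - level) / 2.0
    lower = means[int(alpha * resamples)]
    upper = means[min(resamples - 1, int((1.0 - alpha) * resamples))]
    return lower, upper


# Illustrative data only: 124 correct top-1 predictions out of 140 runs.
outcomes = [1] * 124 + [0] * 16
point = sum(outcomes) / len(outcomes)
lower, upper = bootstrap_ci(outcomes)
print(f"top-1 = {point:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

With roughly 140 runs per model, intervals of this kind are about ten percentage points wide, which is why differences of a few tenths of a point between models cannot be distinguished.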

## Per-model leaderboard

| Model | Top-1 | Top-1 95% CI | Top-3 | Top-3 95% CI | Format Failures | Median latency (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| `composer-2` | 0.888 | [0.832, 0.937] | 0.930 | [0.888, 0.972] | 0 | 3957 |
| `claude-opus-4-7` | 0.886 | [0.830, 0.936] | 0.936 | [0.894, 0.972] | 0 | 3392 |
| `gpt-5.5` | 0.886 | [0.830, 0.936] | 0.943 | [0.901, 0.979] | 0 | 3764 |
| `gemini-3.1-pro` | 0.886 | [0.829, 0.936] | 0.921 | [0.879, 0.964] | 1 | 9077 |

## Per-skill confusion (when expected is X, predicted is Y)

Rows are the ground-truth `expected_primary_skill`; columns are what models actually predicted. Only `finished` runs are counted.

| Expected \ Predicted | `advanced-evaluation` | `bdi-mental-states` | `context-compression` | `context-degradation` | `context-fundamentals` | `context-optimization` | `evaluation` | `filesystem-context` | `harness-engineering` | `hosted-agents` | `latent-briefing` | `memory-systems` | `multi-agent-patterns` | `project-development` | `tool-design` |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| `advanced-evaluation` (n=49) | **48** | - | - | - | - | - | 1 | - | - | - | - | - | - | - | - |
| `bdi-mental-states` (n=24) | - | **24** | - | - | - | - | - | - | - | - | - | - | - | - | - |
| `context-compression` (n=36) | - | - | **36** | - | - | - | - | - | - | - | - | - | - | - | - |
| `context-degradation` (n=36) | - | - | - | **36** | - | - | - | - | - | - | - | - | - | - | - |
| `context-fundamentals` (n=47) | - | - | - | 12 | **12** | 8 | 2 | - | - | - | - | - | - | 12 | 1 |
| `context-optimization` (n=36) | - | - | - | - | - | **36** | - | - | - | - | - | - | - | - | - |
| `evaluation` (n=36) | 3 | - | - | - | - | - | **33** | - | - | - | - | - | - | - | - |
| `filesystem-context` (n=36) | - | - | - | - | - | - | - | **36** | - | - | - | - | - | - | - |
| `harness-engineering` (n=36) | - | - | - | - | - | - | - | - | **36** | - | - | - | - | - | - |
| `hosted-agents` (n=24) | - | - | - | - | - | - | - | - | - | **24** | - | - | - | - | - |
| `latent-briefing` (n=24) | - | - | - | - | - | - | - | - | - | - | **24** | - | - | - | - |
| `memory-systems` (n=36) | - | - | - | - | - | - | - | - | - | - | - | **36** | - | - | - |
| `multi-agent-patterns` (n=48) | - | - | - | - | - | - | - | - | - | - | - | - | **48** | - | - |
| `project-development` (n=48) | - | - | - | - | - | - | - | - | - | - | - | - | - | **36** | 12 |
| `tool-design` (n=48) | - | - | - | - | - | - | - | - | 1 | - | - | - | - | 12 | **35** |
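
As a cross-check on the table, the sketch below recomputes per-skill recall and the share of all runs lost to `context-fundamentals` misroutes. The counts are transcribed from the table above; the variable and function names are illustrative, and treating "every `context-fundamentals` miss gets fixed" as an upper bound on the gain is an assumption of this example, not a result from the benchmark.

```python
# Rows with at least one misroute: expected skill -> {predicted skill: count}.
confusion = {
    "advanced-evaluation": {"advanced-evaluation": 48, "evaluation": 1},
    "context-fundamentals": {
        "context-degradation": 12, "context-fundamentals": 12,
        "context-optimization": 8, "evaluation": 2,
        "project-development": 12, "tool-design": 1,
    },
    "evaluation": {"advanced-evaluation": 3, "evaluation": 33},
    "project-development": {"project-development": 36, "tool-design": 12},
    "tool-design": {
        "harness-engineering": 1, "project-development": 12, "tool-design": 35,
    },
}

# Skills routed correctly in every finished run, with their n from the table.
perfect = {
    "bdi-mental-states": 24, "context-compression": 36,
    "context-degradation": 36, "context-optimization": 36,
    "filesystem-context": 36, "harness-engineering": 36,
    "hosted-agents": 24, "latent-briefing": 24,
    "memory-systems": 36, "multi-agent-patterns": 48,
}
for skill, n in perfect.items():
    confusion[skill] = {skill: n}


def recall(skill):
    """Fraction of runs expecting `skill` that were routed to it."""
    row = confusion[skill]
    return row.get(skill, 0) / sum(row.values())


total = sum(sum(row.values()) for row in confusion.values())
correct = sum(row.get(skill, 0) for skill, row in confusion.items())
cf_row = confusion["context-fundamentals"]
cf_misses = sum(cf_row.values()) - cf_row["context-fundamentals"]

print(f"scored runs: {total}, correct: {correct}")
print(f"context-fundamentals recall: {recall('context-fundamentals'):.3f}")
# Upper bound on the pooled top-1 gain if every one of these misses were fixed.
print(f"headroom: {100 * cf_misses / total:.1f} percentage points")
```

The headroom figure lands inside the +5-7 percentage point range quoted in the executive summary, which suggests that estimate assumes most `context-fundamentals` misroutes can be recovered.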

## Hardest prompts (lowest top-1 across all models)

| Prompt | Expected | Top-1 Rate | Predicted Primaries |
| --- | --- | --- | --- |
| p001 | `context-fundamentals` | 0.00 | `context-degradation` |
| p037 | `project-development` | 0.00 | `tool-design` |
| p046 | `tool-design` | 0.00 | `project-development` |
| p048 | `advanced-evaluation` | 0.00 | `evaluation` |
| p040 | `context-fundamentals` | 0.25 | `context-fundamentals`, `context-optimization` |
| p045 | `context-fundamentals` | 0.25 | `context-fundamentals`, `evaluation`, `project-development`, `tool-design` |
| p047 | `context-fundamentals` | 0.50 | `context-fundamentals`, `project-development` |
| p016 | `evaluation` | 0.75 | `advanced-evaluation`, `evaluation` |
| p041 | `tool-design` | 0.92 | `harness-engineering`, `tool-design` |
| p002 | `context-degradation` | 1.00 | `context-degradation` |
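
A table like this can be derived by pooling runs per prompt across models and replications. The sketch below assumes each run is a dict with hypothetical keys `prompt_id`, `expected`, `ranking`, and `status`; the real per-run artifact schema is not shown in this report and may differ.

```python
from collections import defaultdict


def hardest_prompts(runs, limit=10):
    """Rank prompts by ascending top-1 rate, counting finished runs only."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    predicted = defaultdict(set)
    expected = {}
    for run in runs:
        if run["status"] != "finished":
            continue
        pid = run["prompt_id"]
        top = run["ranking"][0]  # first-ranked skill is the predicted primary
        expected[pid] = run["expected"]
        counts[pid] += 1
        hits[pid] += int(top == run["expected"])
        predicted[pid].add(top)
    rows = [
        (pid, expected[pid], hits[pid] / counts[pid], sorted(predicted[pid]))
        for pid in counts
    ]
    # Lowest top-1 rate first; ties broken by prompt id.
    rows.sort(key=lambda row: (row[2], row[0]))
    return rows[:limit]


# Tiny illustrative input, not benchmark data.
sample_runs = [
    {"prompt_id": "p002", "expected": "context-degradation",
     "ranking": ["context-degradation"], "status": "finished"},
    {"prompt_id": "p001", "expected": "context-fundamentals",
     "ranking": ["context-degradation"], "status": "finished"},
    {"prompt_id": "p001", "expected": "context-fundamentals",
     "ranking": ["context-fundamentals"], "status": "error"},
]
for row in hardest_prompts(sample_runs):
    print(row)
```

Because unfinished runs are skipped, a prompt near the end of a truncated run can show a rate based on very few samples, so the per-prompt rates are best read alongside the per-skill `n` values in the confusion table.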

## Reproducibility

Rerun the benchmark with the same configuration using the commands below. Because this run stopped early at 566 of 600 planned runs, a complete rerun may not match these numbers exactly.

```bash
cd researcher/benchmarks/sdk-runner
npm install
export CURSOR_API_KEY="<your-key>"
node --experimental-strip-types src/runRouter.ts --models claude-opus-4-7,composer-2,gemini-3.1-pro,gpt-5.5 --reps 3 --seed 1 --max-budget-usd 15
# The report script paths are relative to the repository root.
cd ../../..
# Replace <date> and <seed> with the values from your run.
python3 researcher/scripts/render_router_report.py \
    --results researcher/benchmarks/router/results/<date>-<seed> \
    --fixture researcher/benchmarks/router/prompts.jsonl \
    --output researcher/benchmarks/router/results-published/<date>.md
```

Per-run JSON artifacts (prompt, model, replication, raw model output, parsed ranking) are preserved under the gitignored `results/` directory next to the summary that drives this report.