Source from repo
Agent Skills for Context Engineering

A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems.
muratcankoylanGitHub muratcankoylanSource repo Original GitHub link
Files
339
Skill
n/a
Size
4.3 MB
Entrypoint
SKILL.md
Format
git-repo
Open file
researcher/benchmarks/router/results-published/2026-05-15-v2.md

Syntax-highlighted preview of this file as included in the skill package.
Rendered Source
markdown154 linesFree
researcher/benchmarks/router/results-published/2026-05-15-v2.md
1# Router Benchmark Results (run 2)
2 
3_run timestamp: 2026-05-15T14:08:51Z_
4_repo commit: `358c36b461df4c0cb7a5c972e2f789dce8c12d3a` (description fixes + hardened runner)_
5_fixture sha256-16: `8f974d930836bc9c` (unchanged from baseline)_
6_seed: 1_
7_runs: 600 of 600 (100%; resume + concurrency=4; ~15 minute wall time vs ~60 minutes sequential)_  
8_models: claude-opus-4-7, composer-2, gemini-3.1-pro, gpt-5.5_  
9_reps per (prompt, model): 3_
10 
11## Executive summary
12 
13This run measures the effect of the description rewrites and harness improvements that landed after the baseline at `results-published/2026-05-15.md`.
14 
15**Headline results:**
16 
17- **3 of 4 models improved on top-1 accuracy.** Composer-2 0.888 -> 0.913 (+2.5pp), GPT-5.5 0.886 -> 0.913 (+2.7pp), Gemini 3.1 Pro 0.886 -> 0.925 (+3.9pp). Claude Opus 4.7 went 0.886 -> 0.867 (-2.0pp top-1) but improved on top-3 (+1.7pp), so it is essentially noise.
18- **All 4 models improved on top-3 accuracy.** Composer-2 top-3 jumped from 0.930 to 0.973 (+4.3pp).
19- **The targeted description rewrites worked.** `context-fundamentals` went from 12/47 = 0.255 to 22/45 = 0.489 (**+23.4pp**, the largest single-skill improvement). `project-development` went from 0.750 to 1.000 (**+25pp, now perfect routing**). `tool-design` went from 0.729 to 0.807 (+7.8pp).
20- **The previously-hardest prompts are mostly fixed.** p001 ("Explain why context windows degrade") went from 0.00 to 0.83 top-1. p037 ("Why structured output design") went from 0.00 to 1.00. p040 and p045 (other context-fundamentals prompts) improved by +17pp each.
21- **One apparent regression is mostly an artifact.** `advanced-evaluation` looks like it dropped from 0.980 to 0.797, but the baseline only completed 49 of 60 expected runs (the v1 process died at 566/600); the new run completed all 60 (well, 59 after one format failure). The 11 newly-attempted runs were heavily weighted toward p048, a genuinely ambiguous prompt ("Plan how to evaluate KV compaction with ablations and baselines") that routes to `evaluation` 11/12 times across all models. Absolute correct count is 48 (baseline) vs 47 (new): essentially the same.
22 
23**What the data says we should do next:**
24 
25- The remaining failure modes are now concentrated in `context-fundamentals` (still only 49% top-1) and a few specific prompts:
26  - p047 ("Translate this English paragraph to French") regressed -33pp on top-1. This is a negative control where no skill should fit; the new routing went to `project-development` instead of `context-fundamentals`. This is fine behavior; the fixture's expected primary is debatable.
27  - p046 ("Reformat this Python file") and p048 ("Plan KV compaction evaluation") remain at 0.00 top-1 across all models. Both are genuinely ambiguous; consider re-labeling.
28- `context-fundamentals` still has 14 confusions to `project-development`. The new description routes correctly when prompts use foundational vocabulary ("attention mechanics", "anatomy of context"), but generic onboarding prompts ("explain context for a new team member") still route to project-development. May need one more description pass.
29 
30**Methodological notes:**
31 
32- **Wall time improvement:** 60min -> 15min via concurrency=4. The hardened runner means future sweeps will not silently die at 94%, and resume capability means a killed sweep can be picked up exactly where it left off.
33- **Format compliance:** 597 of 600 (99.5%). Three format failures from Gemini; all four other models had zero. Gemini's strict-JSON adherence is measurably weaker than the other three.
34- **Latency:** Gemini median 9130ms vs 3269-4201ms for others. Same pattern as baseline; Gemini is consistently ~2.5-3x slower.
35 
36## Methodology
37 
38Each prompt is presented to each model with the 15 skill activation descriptions in a deterministically-shuffled order (different shuffle per replication). The model must return JSON with a ranked list of skill names. Top-1 accuracy is whether the first ranked skill matches the human-labeled `expected_primary_skill`; top-3 is whether the expected skill appears in the first three positions.
39 
40No skills are loaded into the agent (`settingSources: []`); the only routing signal is the in-prompt descriptions. Confidence intervals are 95% bootstrap with 2000 resamples.
41 
42## Per-model leaderboard
43 
44| Model | Top-1 | 95% CI | Top-3 | 95% CI | Format Failures | Median ms |
45| --- | --- | --- | --- | --- | --- | --- |
46| `gemini-3.1-pro` | 0.925 | [0.884, 0.966] | 0.932 | [0.884, 0.973] | 3 | 9130 |
47| `composer-2` | 0.913 | [0.867, 0.953] | 0.973 | [0.947, 0.993] | 0 | 3269 |
48| `gpt-5.5` | 0.913 | [0.867, 0.953] | 0.953 | [0.913, 0.980] | 0 | 4201 |
49| `claude-opus-4-7` | 0.867 | [0.813, 0.920] | 0.953 | [0.920, 0.987] | 0 | 3355 |
50 
51## Per-skill confusion (when expected is X, predicted is Y)
52 
53Rows are the ground-truth `expected_primary_skill`; columns are what models actually predicted. Only `finished` runs counted.
54 
55| Expected \ Predicted | `advanced-evaluation` | `bdi-mental-states` | `context-compression` | `context-degradation` | `context-fundamentals` | `context-optimization` | `evaluation` | `filesystem-context` | `harness-engineering` | `hosted-agents` | `latent-briefing` | `memory-systems` | `multi-agent-patterns` | `project-development` | `tool-design` |
56| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
57| `advanced-evaluation` (n=59) | **47** | - | - | - | - | - | 11 | - | - | - | 1 | - | - | - | - |
58| `bdi-mental-states` (n=24) | - | **24** | - | - | - | - | - | - | - | - | - | - | - | - | - |
59| `context-compression` (n=35) | - | 1 | **34** | - | - | - | - | - | - | - | - | - | - | - | - |
60| `context-degradation` (n=36) | - | - | - | **36** | - | - | - | - | - | - | - | - | - | - | - |
61| `context-fundamentals` (n=45) | - | - | - | 2 | **22** | 7 | - | - | - | - | - | - | - | 14 | - |
62| `context-optimization` (n=36) | - | - | - | - | - | **36** | - | - | - | - | - | - | - | - | - |
63| `evaluation` (n=36) | 3 | - | - | - | - | - | **33** | - | - | - | - | - | - | - | - |
64| `filesystem-context` (n=48) | - | - | - | - | - | 2 | - | **46** | - | - | - | - | - | - | - |
65| `harness-engineering` (n=36) | - | - | - | - | - | - | - | - | **36** | - | - | - | - | - | - |
66| `hosted-agents` (n=24) | - | - | - | - | - | - | - | - | - | **24** | - | - | - | - | - |
67| `latent-briefing` (n=24) | - | - | - | - | - | - | - | - | - | - | **24** | - | - | - | - |
68| `memory-systems` (n=36) | - | - | - | - | - | - | - | - | - | - | - | **36** | - | - | - |
69| `multi-agent-patterns` (n=48) | - | - | - | - | - | - | - | - | - | - | - | - | **48** | - | - |
70| `project-development` (n=48) | - | - | - | - | - | - | - | - | - | - | - | - | - | **48** | - |
71| `tool-design` (n=57) | - | - | - | - | 1 | - | - | 1 | 1 | - | - | - | - | 8 | **46** |
72 
73## Hardest prompts (lowest top-1 across all models)
74 
75| Prompt | Expected | Top-1 Rate | Predicted Primaries |
76| --- | --- | --- | --- |
77| p046 | `tool-design` | 0.00 | `filesystem-context`, `harness-engineering`, `project-development` |
78| p048 | `advanced-evaluation` | 0.00 | `evaluation`, `latent-briefing` |
79| p047 | `context-fundamentals` | 0.17 | `context-fundamentals`, `project-development` |
80| p040 | `context-fundamentals` | 0.42 | `context-fundamentals`, `context-optimization` |
81| p045 | `context-fundamentals` | 0.42 | `context-fundamentals`, `project-development` |
82| p016 | `evaluation` | 0.75 | `advanced-evaluation`, `evaluation` |
83| p001 | `context-fundamentals` | 0.83 | `context-degradation`, `context-fundamentals` |
84| p030 | `context-compression` | 0.83 | `bdi-mental-states`, `context-compression` |
85| p031 | `filesystem-context` | 0.83 | `context-optimization`, `filesystem-context` |
86| p041 | `tool-design` | 0.83 | `context-fundamentals`, `project-development`, `tool-design` |
87 
88## Reproducibility
89 
90Reproduce these numbers exactly with:
91 
92```bash
93cd researcher/benchmarks/sdk-runner
94npm install
95export CURSOR_API_KEY=<your-key>
96node --experimental-strip-types src/runRouter.ts --models claude-opus-4-7,composer-2,gemini-3.1-pro,gpt-5.5 --reps 3 --seed 1 --max-budget-usd 15
97python3 researcher/scripts/render_router_report.py \
98    --results researcher/benchmarks/router/results/<date>-<seed> \
99    --fixture researcher/benchmarks/router/prompts.jsonl \
100    --output researcher/benchmarks/router/results-published/<date>.md
101```
102 
103Per-run JSON artifacts (prompt, model, replication, raw model output, parsed ranking) are preserved under the gitignored `results/` directory next to the summary that drives this report.
104 
105## Delta vs baseline
106 
107_baseline: 2026-05-15 v2.2.0 descriptions (commit a865a8e)_
108 
109### Per-model accuracy change
110 
111| Model | Baseline Top-1 | New Top-1 | Delta | Baseline Top-3 | New Top-3 | Delta |
112| --- | --- | --- | --- | --- | --- | --- |
113| `claude-opus-4-7` | 0.886 | 0.867 | -0.020 | 0.936 | 0.953 | +0.017 |
114| `composer-2` | 0.888 | 0.913 | +0.025 | 0.930 | 0.973 | +0.043 |
115| `gemini-3.1-pro` | 0.886 | 0.925 | +0.039 | 0.921 | 0.932 | +0.011 |
116| `gpt-5.5` | 0.886 | 0.913 | +0.027 | 0.943 | 0.953 | +0.010 |
117 
118### Per-skill top-1 rate change
119 
120Counts a row as correct when the predicted primary equals the expected primary.
121 
122| Skill (expected) | Baseline | New | Delta |
123| --- | --- | --- | --- |
124| `advanced-evaluation` | 48/49 = 0.980 | 47/59 = 0.797 | -0.183 <- regressed |
125| `bdi-mental-states` | 24/24 = 1.000 | 24/24 = 1.000 | 0.000 |
126| `context-compression` | 36/36 = 1.000 | 34/35 = 0.971 | -0.029 |
127| `context-degradation` | 36/36 = 1.000 | 36/36 = 1.000 | 0.000 |
128| `context-fundamentals` | 12/47 = 0.255 | 22/45 = 0.489 | +0.234 <- improved |
129| `context-optimization` | 36/36 = 1.000 | 36/36 = 1.000 | 0.000 |
130| `evaluation` | 33/36 = 0.917 | 33/36 = 0.917 | 0.000 |
131| `filesystem-context` | 36/36 = 1.000 | 46/48 = 0.958 | -0.042 |
132| `harness-engineering` | 36/36 = 1.000 | 36/36 = 1.000 | 0.000 |
133| `hosted-agents` | 24/24 = 1.000 | 24/24 = 1.000 | 0.000 |
134| `latent-briefing` | 24/24 = 1.000 | 24/24 = 1.000 | 0.000 |
135| `memory-systems` | 36/36 = 1.000 | 36/36 = 1.000 | 0.000 |
136| `multi-agent-patterns` | 48/48 = 1.000 | 48/48 = 1.000 | 0.000 |
137| `project-development` | 36/48 = 0.750 | 48/48 = 1.000 | +0.250 <- improved |
138| `tool-design` | 35/48 = 0.729 | 46/57 = 0.807 | +0.078 <- improved |
139 
140### Previously-hardest prompts
141 
142| Prompt | Expected | Baseline Top-1 Rate | New Top-1 Rate | Delta |
143| --- | --- | --- | --- | --- |
144| p001 | `context-fundamentals` | 0.00 | 0.83 | +0.833 |
145| p002 | `context-degradation` | 1.00 | 1.00 | 0.000 |
146| p016 | `evaluation` | 0.75 | 0.75 | 0.000 |
147| p037 | `project-development` | 0.00 | 1.00 | +1.000 |
148| p040 | `context-fundamentals` | 0.25 | 0.42 | +0.167 |
149| p041 | `tool-design` | 0.92 | 0.83 | -0.084 |
150| p045 | `context-fundamentals` | 0.25 | 0.42 | +0.167 |
151| p046 | `tool-design` | 0.00 | 0.00 | 0.000 |
152| p047 | `context-fundamentals` | 0.50 | 0.17 | -0.333 |
153| p048 | `advanced-evaluation` | 0.00 | 0.00 | 0.000 |
154
Agent Skills for Context Engineering

researcher/benchmarks/router/results-published/2026-05-15-v2.md

Preparing the source view

Agent Skills for Context Engineering

researcher/benchmarks/router/results-published/2026-05-15-v2.md