Agent Skills for Context Engineering

A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems.
Source repo: muratcankoylan on GitHub
researcher/benchmarks/router/results-published/2026-05-19.md

# Router Benchmark Results

_run timestamp: 2026-05-19T05:52:52Z_
_repo commit: `272702e0bb1ff4f78d45fb7253da872da170d458`_
_fixture sha256-16: `8f974d930836bc9c`_
_seed: 1_
_runs: 600_  
_models: claude-opus-4-7, composer-2, gemini-3.1-pro, gpt-5.5_  
_reps per (prompt, model): 3_

## Executive Summary

This sweep validates the corpus-wide hardening pass after all 15 skill bodies, mechanism mappings, claim provenance records, corpus index entries, and activation fixtures were updated.

- **600/600 usable records, 0 format failures.** The first pass produced transient empty/format-failed SDK outputs; those records were rerun through the runner's resume path and all completed successfully. The runner now retries transient format failures once.
- **No broad routing collapse from the corpus-wide pass.** Three models stayed at or above 0.913 top-1. Claude Opus 4.7 is lower at 0.840, but its remaining misses are concentrated in the same known ambiguous boundaries as the previous report.
- **Newly hardened skills route cleanly.** `bdi-mental-states`, `hosted-agents`, `latent-briefing`, `memory-systems`, and `multi-agent-patterns` are all perfect in this sweep.
- **Remaining failures are concentrated and already understood.** `p046` is a negative-control Python formatting task with no true matching skill, `p048` is a genuinely ambiguous advanced-evaluation/evaluation/latent-briefing prompt, and `context-fundamentals` remains the weakest catch-all boundary.
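
The retry-once behavior mentioned in the first bullet can be pictured roughly as follows. This is a minimal sketch and not the runner's actual code (the runner is TypeScript); `parse_ranking`, `route_with_one_retry`, and the `{"ranking": [...]}` reply shape are hypothetical names chosen for illustration.

```python
import json
from typing import Callable, List, Optional


def parse_ranking(raw: str) -> Optional[List[str]]:
    """Return the ranked skill names, or None if the reply is unusable.

    The {"ranking": [...]} shape is an assumption made for this sketch.
    """
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None  # empty or non-JSON output
    ranking = data.get("ranking") if isinstance(data, dict) else None
    if isinstance(ranking, list) and ranking and all(isinstance(s, str) for s in ranking):
        return ranking
    return None


def route_with_one_retry(call_model: Callable[[], str]) -> Optional[List[str]]:
    """Call the model; on an empty or malformed reply, retry exactly once."""
    for _attempt in range(2):  # first try plus one retry
        ranking = parse_ranking(call_model())
        if ranking is not None:
            return ranking
    return None  # still a format failure after the retry
```

Capping the loop at two attempts keeps a persistently malformed reply visible as a format failure instead of hiding it behind unbounded retries.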

Compared with `2026-05-15-v2.md`, the material conclusion is unchanged: the description-benchmark loop worked, and the corpus-wide body/metadata hardening did not introduce a broad router regression. The next benchmark investment should be Stage 3 effectiveness, where full skill bodies are loaded.

## Methodology

Each prompt is presented to each model with the 15 skill activation descriptions in a deterministically shuffled order (a different shuffle per replication). The model must return JSON with a ranked list of skill names. A run counts as a top-1 hit when the first ranked skill matches the human-labeled `expected_primary_skill`, and as a top-3 hit when the expected skill appears anywhere in the first three positions.
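
The shuffle and the two hit criteria described above can be sketched as follows. The function names and the choice to key the shuffle on seed, prompt id, and replication are assumptions for illustration; the report only states that the order is deterministic and differs per replication.

```python
import random
from typing import List, Sequence


def shuffled_descriptions(skills: Sequence[str], seed: int, prompt_id: str, rep: int) -> List[str]:
    """Return a reproducible ordering for one (prompt, replication) pair.

    Seeding random.Random with a string gives the same order in every
    process. The key format is hypothetical.
    """
    rng = random.Random(f"{seed}:{prompt_id}:{rep}")
    order = list(skills)  # copy so the caller's list is left untouched
    rng.shuffle(order)
    return order


def top_k_hit(ranking: Sequence[str], expected: str, k: int) -> bool:
    """True when the expected skill is among the first k ranked names."""
    return expected in ranking[:k]
```

With `k=1` this is the top-1 criterion and with `k=3` the top-3 criterion, so every top-1 hit is also a top-3 hit.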

No skills are loaded into the agent (`settingSources: []`); the only routing signal is the in-prompt descriptions. Confidence intervals are 95% bootstrap with 2000 resamples.
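
For readers unfamiliar with bootstrap intervals, a percentile bootstrap over per-run hit flags looks roughly like this. The report does not say which bootstrap variant or resampling unit it uses, so the percentile method, the per-run resampling, and the `bootstrap_ci` name are all assumptions.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(
    hits: Sequence[int],
    resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 1,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of 0/1 hit flags."""
    rng = random.Random(seed)
    n = len(hits)
    # Resample n runs with replacement and record each resample's accuracy.
    means = sorted(sum(rng.choices(hits, k=n)) / n for _ in range(resamples))
    lo_idx = int(resamples * alpha / 2)   # 50 for 2000 resamples at 95%
    hi_idx = resamples - lo_idx - 1       # 1949 for 2000 resamples at 95%
    return means[lo_idx], means[hi_idx]
```

Calling `bootstrap_ci([1] * 138 + [0] * 12)` models a top-1 rate of 0.920 over 150 runs. The exact bounds depend on the random seed and on the resampling unit, so they need not match the published interval digit for digit.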

## Per-model leaderboard

| Model | Top-1 | Top-1 95% CI | Top-3 | Top-3 95% CI | Format Failures | Median latency (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| `gemini-3.1-pro` | 0.920 | [0.873, 0.960] | 0.933 | [0.893, 0.973] | 0 | 8631 |
| `composer-2` | 0.913 | [0.867, 0.953] | 0.947 | [0.907, 0.980] | 0 | 3004 |
| `gpt-5.5` | 0.913 | [0.867, 0.953] | 0.973 | [0.947, 0.993] | 0 | 4050 |
| `claude-opus-4-7` | 0.840 | [0.780, 0.893] | 0.933 | [0.893, 0.973] | 0 | 3178 |
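
As a sanity check, the top-1 rates can be converted back into hit counts. The sketch below assumes the 600 runs split evenly across the four models (150 each), which the header metadata implies but does not state outright.

```python
runs, models = 600, 4
per_model = runs // models  # 150 runs per model, assuming an even split

top1 = {
    "gemini-3.1-pro": 0.920,
    "composer-2": 0.913,
    "gpt-5.5": 0.913,
    "claude-opus-4-7": 0.840,
}

# Recover integer hit counts from the rounded rates.
hits = {model: round(rate * per_model) for model, rate in top1.items()}
total_hits = sum(hits.values())
```

Under that assumption the counts are 138, 137, 137, and 126, for 538 top-1 hits in total, which equals the sum of the bold diagonal in the confusion table.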

## Per-skill confusion (when expected is X, predicted is Y)

Rows are the ground-truth `expected_primary_skill`; columns are what models actually predicted. Only `finished` runs are counted.

| Expected \ Predicted | `advanced-evaluation` | `bdi-mental-states` | `context-compression` | `context-degradation` | `context-fundamentals` | `context-optimization` | `evaluation` | `filesystem-context` | `harness-engineering` | `hosted-agents` | `latent-briefing` | `memory-systems` | `multi-agent-patterns` | `project-development` | `tool-design` |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| `advanced-evaluation` (n=60) | **48** | - | - | - | - | - | 10 | - | - | - | 2 | - | - | - | - |
| `bdi-mental-states` (n=24) | - | **24** | - | - | - | - | - | - | - | - | - | - | - | - | - |
| `context-compression` (n=36) | - | 2 | **34** | - | - | - | - | - | - | - | - | - | - | - | - |
| `context-degradation` (n=36) | - | - | - | **36** | - | - | - | - | - | - | - | - | - | - | - |
| `context-fundamentals` (n=42) | - | - | - | 5 | **19** | 6 | - | - | - | - | - | - | - | 12 | - |
| `context-optimization` (n=36) | - | - | - | - | - | **36** | - | - | - | - | - | - | - | - | - |
| `evaluation` (n=36) | 4 | - | - | - | - | - | **32** | - | - | - | - | - | - | - | - |
| `filesystem-context` (n=48) | - | - | - | - | - | - | - | **48** | - | - | - | - | - | - | - |
| `harness-engineering` (n=36) | - | - | - | - | - | - | - | - | **36** | - | - | - | - | - | - |
| `hosted-agents` (n=24) | - | - | - | - | - | - | - | - | - | **24** | - | - | - | - | - |
| `latent-briefing` (n=24) | - | - | - | - | - | - | - | - | - | - | **24** | - | - | - | - |
| `memory-systems` (n=36) | - | - | - | - | - | - | - | - | - | - | - | **36** | - | - | - |
| `multi-agent-patterns` (n=48) | - | - | - | - | - | - | - | - | - | - | - | - | **48** | - | - |
| `project-development` (n=48) | - | - | - | - | - | - | - | - | - | - | - | - | - | **48** | - |
| `tool-design` (n=57) | - | - | - | - | 1 | - | - | 4 | - | - | - | - | - | 7 | **45** |
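
A table like the one above can be tallied from (expected, predicted) pairs. The sketch below is one way to do it and is not the repository's `render_router_report.py`; the helper names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def confusion(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count predicted top-1 skills for each expected skill."""
    table: Dict[str, Counter] = defaultdict(Counter)
    for expected, predicted in pairs:
        table[expected][predicted] += 1
    return dict(table)


def row_accuracy(table: Dict[str, Counter], skill: str) -> float:
    """Share of a row's runs where the prediction matched the expected skill."""
    row = table[skill]
    return row[skill] / sum(row.values())


# The `evaluation` row from the table: 32 correct, 4 routed to advanced-evaluation.
pairs = [("evaluation", "evaluation")] * 32 + [("evaluation", "advanced-evaluation")] * 4
evaluation_table = confusion(pairs)
```

For that row, `row_accuracy(evaluation_table, "evaluation")` is 32/36, or about 0.889.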

## Hardest prompts (lowest top-1 across all models)

| Prompt | Expected | Top-1 Rate | Predicted Primaries |
| --- | --- | --- | --- |
| p046 | `tool-design` | 0.00 | `filesystem-context`, `project-development` |
| p048 | `advanced-evaluation` | 0.00 | `evaluation`, `latent-briefing` |
| p047 | `context-fundamentals` | 0.17 | `context-fundamentals`, `project-development` |
| p045 | `context-fundamentals` | 0.33 | `context-fundamentals`, `project-development` |
| p040 | `context-fundamentals` | 0.50 | `context-fundamentals`, `context-optimization` |
| p001 | `context-fundamentals` | 0.58 | `context-degradation`, `context-fundamentals` |
| p016 | `evaluation` | 0.67 | `advanced-evaluation`, `evaluation` |
| p041 | `tool-design` | 0.75 | `context-fundamentals`, `project-development`, `tool-design` |
| p030 | `context-compression` | 0.83 | `bdi-mental-states`, `context-compression` |
| p002 | `context-degradation` | 1.00 | `context-degradation` |
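
Each prompt's top-1 rate pools its runs across all models and replications, which with 4 models and 3 reps is 12 runs per prompt. A minimal sketch of that ranking follows; the `hardest_prompts` name and the tie-break by prompt id are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def hardest_prompts(records: Iterable[Tuple[str, bool]], limit: int = 10) -> List[Tuple[str, float]]:
    """Rank prompts by pooled top-1 rate, lowest first.

    Each record is (prompt_id, was_top1_hit). Ties are broken by prompt id,
    which is an assumption made for this sketch.
    """
    flags: Dict[str, List[bool]] = defaultdict(list)
    for prompt_id, hit in records:
        flags[prompt_id].append(hit)
    rates = {pid: sum(f) / len(f) for pid, f in flags.items()}
    return sorted(rates.items(), key=lambda item: (item[1], item[0]))[:limit]


# 7 hits out of 12 runs rounds to the 0.58 shown for p001; p046 has no hits.
records = (
    [("p001", True)] * 7 + [("p001", False)] * 5
    + [("p046", False)] * 12
)
ranked = hardest_prompts(records)
```

Here `ranked` lists `p046` first at 0.0, followed by `p001` at 7/12.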

## Reproducibility

Reproduce these numbers exactly with:

```bash
cd researcher/benchmarks/sdk-runner
npm install
# Replace the placeholder with your key; the quotes stop the shell from reading < and > as redirections.
export CURSOR_API_KEY="<your-key>"
node --experimental-strip-types src/runRouter.ts --models claude-opus-4-7,composer-2,gemini-3.1-pro,gpt-5.5 --reps 3 --seed 1 --max-budget-usd 15
# Substitute <date> and <seed> with the date and seed of the run being rendered.
python3 ../../scripts/render_router_report.py \
    --results "../router/results/<date>-<seed>" \
    --fixture ../router/prompts.jsonl \
    --output "../router/results-published/<date>.md"
```

Per-run JSON artifacts (prompt, model, replication, raw model output, parsed ranking) are preserved under the gitignored `results/` directory next to the summary that drives this report.