Agent Skills for Context Engineering

A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems.
Source repo: muratcankoylan (GitHub)
researcher/benchmarks/PLAN.md

# Benchmark Architecture Plan

The current benchmark harness verifies that the researcher OS itself is hard to game (deterministic structural checks and seven adversarial scenarios). It does not yet measure the thing users actually care about: **do these skills make agents better at the tasks they claim to help with?**

This document is the plan to close that gap, in four staged releases, with research-paper-grade methodology and the Cursor SDK as the execution layer.

## Status

| Stage | Release | What it measures | Cost | Status |
| --- | --- | --- | --- | --- |
| 0 | v2.2.0 (shipped) | Harness resistance to gaming, structural validity | $0 | done |
| 1 | v2.3.0 (shipped) | Per-skill health metrics (deterministic) | $0 | done; corpus 0.814 aggregate, 2 of 15 flagged |
| 2 | v2.3.0 (shipped) | Skill router accuracy (LLM-as-router) | Cursor credits (~$7 per full sweep) | done; baseline + post-fix delta published |
| 3 | v2.4.0 | Skill effectiveness on real agent tasks | Cursor credits, larger | scaffolded, one task built |
| 4 | v2.5.0 | Cross-skill composition | Cursor credits | future |

### Shipped Stage 2 results (v2.3.0)

Two full 600-run sweeps across `composer-2`, `claude-opus-4-7`, `gpt-5.5`, `gemini-3.1-pro` at seed=1, 3 replications per (prompt, model):

- Baseline: `researcher/benchmarks/router/results-published/2026-05-15.md` (566 of 600; v1 runner died mid-sweep).
- Post-fix (description rewrites + hardened runner): `researcher/benchmarks/router/results-published/2026-05-15-v2.md` (600 of 600, includes delta-vs-baseline section).

Headline finding: targeted description rewrites moved `context-fundamentals` top-1 from 0.255 to 0.489 (+23.4pp) and `project-development` from 0.750 to 1.000 (+25pp, now perfect). Three of four models gained on top-1; all four gained on top-3.

## Goals And Non-Goals

### Goals

- Produce reproducible, model-agnostic evidence that skill loading improves agent behavior or, where it does not, surface that honestly.
- Make every measurement reproducible from a single CLI invocation plus a pinned config.
- Track results longitudinally so regressions are detectable.
- Disclose methodology fully: prompts, tasks, ground truth, scoring, raw outputs.
- Use deterministic checks first, model-judged measurements second, real task execution third.

### Non-Goals

- Build a general-purpose agent benchmarking framework. We benchmark this skill collection on representative tasks.
- Replace SWE-bench, BrowseComp, or other public benchmarks. We can subset them or use them as comparison points, not redo them.
- Train models. This is evaluation only.
- Run benchmarks that require paid APIs other than Cursor. Stage 2 and 3 spend Cursor credits via the SDK.

## Methodology Principles

These apply to every stage.

### Reproducibility

- Every benchmark run is described by a frozen config (model id, seed, fixture revision, repo commit SHA).
- Raw outputs (transcripts, judge JSON, task workspace diffs) are persisted with the run record.
- Each run appends a single line to a history JSONL with the config hash and pointers to raw artifacts.
- A CLI flag (`--seed`, `--config`) lets a third party reproduce a run exactly.

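The config hash and history line described above could be produced along these lines. This is a minimal sketch, not the harness's actual code: the field names (`model_id`, `fixture_rev`, `repo_sha`, `artifacts`) and the example values are hypothetical, and hashing canonical JSON with SHA-256 is an assumed choice.

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    """Hash a run config deterministically (sorted keys, fixed separators)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def history_line(config: dict, artifacts: dict) -> str:
    """Build the single JSONL line that one run appends to the history file."""
    record = {
        "config_hash": config_hash(config),
        "config": config,
        "artifacts": artifacts,  # pointers to raw outputs, not the outputs themselves
    }
    return json.dumps(record, sort_keys=True)


# Hypothetical frozen config; every value here is a placeholder.
config = {"model_id": "composer-2", "seed": 1, "fixture_rev": "r1", "repo_sha": "abc123"}
line = history_line(config, {"transcript": "runs/0001/transcript.json"})
```

Because keys are sorted before hashing, two configs with the same content hash identically regardless of key order, which is what lets a third party confirm they are reproducing the same run.
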
### Statistical Discipline

- Minimum 3 replications per condition (model x skill-state x task) for variance estimation.
- Report effect sizes with bootstrap 95% confidence intervals, not bare point estimates.
- Paired comparisons where possible (same task, different conditions) using a Wilcoxon signed-rank test on per-task differences.
- Sample size targets per stage are stated below; underpowered runs land as "preliminary" with that label visible in the dashboard.

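A percentile bootstrap over per-task paired differences is one way to get the 95% interval mentioned above. The sketch below uses only the standard library; the resample count, the percentile method, and the example differences are assumptions rather than the plan's fixed choices.

```python
import random
from statistics import mean


def bootstrap_ci(diffs, n_resamples=2000, alpha=0.05, seed=1):
    """Percentile-bootstrap CI for the mean of paired per-task differences.

    Returns (point_estimate, lower, upper). Seeded so the interval is reproducible.
    """
    rng = random.Random(seed)
    n = len(diffs)
    # Resample the differences with replacement and record each resample's mean.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lower, upper


# Hypothetical per-task success-rate differences (target minus control).
diffs = [0.33, 0.0, 0.67, 0.33, -0.33, 0.33, 0.0, 0.67]
estimate, lower, upper = bootstrap_ci(diffs)
```

Resampling whole per-task differences (rather than individual runs) keeps the pairing intact, which matches the paired-comparison design.
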
### Bias Mitigation

- **Position bias** in router benchmarks: shuffle skill order across replications, report consistency.
- **Self-preference** in judge models: never use the same model as judge and candidate. Use a different family (e.g. Composer evaluates Claude outputs, GPT evaluates Composer outputs).
- **Length bias** in pairwise comparisons: include length-normalized scoring and a "shorter wins ties" rule.
- **Selection bias** in tasks: tasks include some where the relevant skill should not help (negative controls) so we can measure false-positive rates.

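For the position-bias item, the shuffle needs to vary across replications while staying reproducible from the run seed. One possible sketch derives a per-call RNG from `(seed, prompt_id, replication)`; the derivation scheme and the `prompt_id` value are assumptions.

```python
import hashlib
import random


def shuffled_skills(skills, seed, prompt_id, replication):
    """Return a reproducible shuffle of `skills` for one (prompt, replication)."""
    key = f"{seed}:{prompt_id}:{replication}".encode("utf-8")
    # Derive an integer RNG seed from the key so every call is independent
    # of call order and of any global random state.
    rng = random.Random(int.from_bytes(hashlib.sha256(key).digest()[:8], "big"))
    order = list(skills)  # copy so the caller's list is left untouched
    rng.shuffle(order)
    return order


SKILLS = ["evaluation", "advanced-evaluation", "tool-design", "memory-systems"]
orders = [shuffled_skills(SKILLS, seed=1, prompt_id="p-001", replication=r) for r in range(3)]
```

Keying the RNG on the replication index means a rerun with the same `--seed` presents the same orders, so consistency across positions can be reported per prompt.
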
### Ablations

- Per-skill: with vs without each skill, holding all others constant.
- Leave-one-out: remove one skill at a time from the full corpus to find which skills carry the work.
- Order: run skill subsets in different orders to surface order-dependent effects.

### Disclosure

- All prompts published in `researcher/benchmarks/<stage>/`.
- All ground truth published as fixtures.
- All scoring code published in `researcher/scripts/`.
- Raw run outputs (with secrets redacted) committed under `researcher/benchmarks/<stage>/results/` or attached to release notes.

## Stage 1: Deterministic Skill Health (v2.2.1, $0)

The cheap and uncontroversial floor. Per-skill, deterministic scoring of structural quality. Catches drift, missing sections, stale claims, and underspecified gotchas before any user notices.

### Metrics

For each `skills/<name>/SKILL.md`:

- `line_count`: must be at or below 500. Fail above.
- `frontmatter_valid`: name matches directory, description present, description third-person, description length within 1024 chars.
- `required_sections`: When to Activate, Core Concepts, Practical Guidance, Gotchas, Integration, References.
- `gotcha_count`: number of numbered items in Gotchas. Target >= 3.
- `code_example_count`: fenced code blocks.
- `internal_links_resolved`: every `skills/<other>/SKILL.md` link resolves to a real file.
- `external_link_count`: presence only; reachability is opt-in (`--check-urls`) because it requires network.
- `claim_coverage`: the number of numeric claims (regex `\b\d+(\.\d+)?%`, `\b\d+x\b`, benchmark names) that carry a `claim_id` referencing `researcher/claims/index.jsonl`, divided by the total number of such claims.
- `mechanism_coverage`: number of mechanisms in `researcher/mechanisms/registry.jsonl` owned by this skill.
- `activation_case_coverage`: number of activation cases in `researcher/fixtures/activation-cases.jsonl` whose `expected_primary_skill` is this skill.

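As an illustration of the `claim_coverage` ratio, the sketch below counts numeric claims per line and treats a claim as covered when the same line carries a claim marker. The marker syntax (`<!-- claim_id: ... -->`) is a hypothetical convention, benchmark-name matching is omitted, and the percent pattern has no trailing `\b` because a word boundary after `%` only matches when a word character follows.

```python
import re

# Percentages such as "40%" or "12.5%", and multipliers such as "3x".
NUMERIC_CLAIM = re.compile(r"\b\d+(?:\.\d+)?%|\b\d+x\b")
# Hypothetical inline marker that ties a line's claims to the claims index.
CLAIM_MARKER = re.compile(r"<!--\s*claim_id:\s*[\w-]+\s*-->")


def claim_coverage(markdown: str) -> float:
    """Fraction of numeric claims that sit on a line carrying a claim marker."""
    total = covered = 0
    for line in markdown.splitlines():
        count = len(NUMERIC_CLAIM.findall(line))
        total += count
        if count and CLAIM_MARKER.search(line):
            covered += count
    # A skill with no numeric claims has nothing left uncovered.
    return covered / total if total else 1.0


sample = (
    "Compaction cut tokens by 40% in our runs. <!-- claim_id: C-001 -->\n"
    "Caching gives a 3x speedup.\n"
    "No numbers here."
)
coverage = claim_coverage(sample)  # one of the two claims is covered
```
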
### Output

`researcher/reports/skill-health.json`, regenerated by `researcher/scripts/skill_health.py`. Per-skill scores plus a weighted aggregate per skill plus a corpus-wide aggregate.

`researcher/reports/skill-health-history.jsonl` for daily trend tracking, written by `loop_daily.py`.

### Scoring

Aggregate is a weighted sum with weights tuned to surface drift early:

```
0.20 * normalize(required_sections)
0.15 * normalize(gotcha_count, target=3)
0.10 * normalize(code_example_count, target=2)
0.15 * normalize(internal_links_resolved)
0.10 * normalize(activation_case_coverage)
0.15 * normalize(claim_coverage)
0.10 * normalize(mechanism_coverage)
0.05 * binary(frontmatter_valid)
```

Anything below 0.75 is flagged in the daily snapshot.

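The weighted sum can be sketched as follows. The weights and targets are taken from the formula above; the definition of `normalize` (value divided by target, clamped to [0, 1]) and the assumption that fraction-style metrics arrive already scaled to [0, 1] are guesses at what `skill_health.py` does.

```python
WEIGHTS = {
    "required_sections": 0.20,
    "gotcha_count": 0.15,
    "code_example_count": 0.10,
    "internal_links_resolved": 0.15,
    "activation_case_coverage": 0.10,
    "claim_coverage": 0.15,
    "mechanism_coverage": 0.10,
    "frontmatter_valid": 0.05,
}
TARGETS = {"gotcha_count": 3, "code_example_count": 2}  # other metrics use target 1.0
FLAG_THRESHOLD = 0.75


def normalize(value, target=1.0):
    """Assumed normalization: value/target clamped to the range [0, 1]."""
    return max(0.0, min(value / target, 1.0))


def aggregate(metrics: dict) -> float:
    """Weighted per-skill aggregate; `frontmatter_valid` is 1.0 or 0.0."""
    total = sum(w * normalize(metrics[name], TARGETS.get(name, 1.0))
                for name, w in WEIGHTS.items())
    return round(total, 3)


perfect = {name: TARGETS.get(name, 1.0) for name in WEIGHTS}
one_gotcha = dict(perfect, gotcha_count=1)  # only 1 of the 3 target gotchas
flagged = aggregate(one_gotcha) < FLAG_THRESHOLD
```

With every other metric at target, a skill with a single gotcha scores 0.9, so it stays above the 0.75 flag line; it takes several weak metrics at once to trip the flag.
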
### Why this matters

Skill rot is invisible without metrics. A skill that loses its gotcha section, accumulates dead internal links, or drifts past 500 lines is structurally weaker; catching that in CI is cheap insurance.

## Stage 2: Skill Router Benchmark (v2.3.0, ~$5 per full run with Cursor SDK)

The first benchmark that exercises a real model. Tests whether the skill descriptions are good enough to route the right skill to a given task.

### Hypothesis

The activation-scenario descriptions in v2.2.0 frontmatter (replacing v2.1.x keyword triggers) should let a frontier model route prompts to the correct skill at high top-1 accuracy and very high top-3 accuracy.

### Procedure

1. **Fixture**: `researcher/benchmarks/router/prompts.jsonl`, targeting 100 prompts. Each line: `{prompt_id, prompt, expected_primary_skill, acceptable_secondary_skills, rejected_skills, reason}`. The initial fixture ships with 50 prompts; expand to 100 over time.

2. **Routing prompt**: A standard template (`researcher/benchmarks/router/routing-prompt.md`) that presents the 15 skill descriptions (shuffled per replication) and the task, then asks the model to return a strict-JSON ranked list with confidence.

3. **Runner**: `researcher/benchmarks/sdk-runner/src/runRouter.ts`. For each prompt x model x replication:
   - Build the routing prompt with shuffled skill order.
   - Call `Agent.prompt(routingPrompt, { settingSources: [], model: { id }, local: { cwd: temp } })`.
   - `settingSources: []` ensures the router agent has no skill loaded; the descriptions in the prompt are the only signal.
   - Parse JSON. If parse fails, record as `format_failure` (don't reward bad output).
   - Compare ranked list to ground truth. Record top-1 and top-3 accuracy.

4. **Models**: `composer-2`, `claude-opus-4-7`, `gpt-5.5`, `gemini-3.1-pro`. The list comes from `Cursor.models.list()` at run time; if a model is unavailable it is recorded as `model_unavailable` and the run continues.

5. **Replications**: 3 per (prompt, model). 100 prompts x 4 models x 3 reps = 1200 calls per full run (600 with the initial 50 prompts).

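The parse-and-score step of the runner could look like the sketch below. It is written in Python for brevity even though the runner is TypeScript, and the response shape (`ranked_skills`, `skill`, `confidence`) is a hypothetical schema, since the strict-JSON format is defined by the routing prompt template rather than by this plan.

```python
import json


def score_response(raw: str, expected_primary_skill: str) -> dict:
    """Score one router response against ground truth.

    Unparseable or malformed output is recorded as a format failure
    and earns no credit on either metric.
    """
    try:
        ranked = json.loads(raw)["ranked_skills"]
        names = [item["skill"] for item in ranked]
    except (json.JSONDecodeError, KeyError, TypeError):
        return {"outcome": "format_failure", "top1": False, "top3": False}
    return {
        "outcome": "ok",
        "top1": names[:1] == [expected_primary_skill],
        "top3": expected_primary_skill in names[:3],
    }


good = ('{"ranked_skills": [{"skill": "evaluation", "confidence": 0.8}, '
        '{"skill": "advanced-evaluation", "confidence": 0.2}]}')
hit = score_response(good, "evaluation")
near_miss = score_response(good, "advanced-evaluation")
broken = score_response("I think the answer is evaluation", "evaluation")
```

Counting format failures as misses, instead of dropping them, keeps a model from improving its accuracy by emitting unparseable output on hard prompts.
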
### Cost analysis

- Routing prompt is small (~3-5k input tokens, ~500 output tokens).
- At Cursor's free-with-credits pricing: well under $5 per full run.
- Even at unfavorable retail rates: estimated $5-15 per full run.

### Reporting

- Per-model leaderboard: top-1 accuracy with 95% CI, top-3 accuracy, format-failure rate.
- Per-skill confusion matrix: which skills get confused with which.
- Per-prompt drill-down for failures: which models failed, with what alternative skill.
- Append to `researcher/reports/router-history.jsonl` with model, fixture rev, repo SHA, accuracy.

### Why this matters

Skill descriptions are the only signal a deployed agent uses to decide whether to load a skill. If they don't route correctly, the rest of the harness is academic. This benchmark directly validates the v2.2.0 activation-scenario refactor.

## Stage 3: Skill Effectiveness Benchmark (v2.4.0, ~$50-200 per full run)

The benchmark that proves skills actually help.

### Hypothesis

Loading a relevant skill into an agent's context improves outcome quality, token efficiency, or both, on tasks that the skill claims to address. Loading an irrelevant skill should have no effect or only mild noise.

### Procedure

1. **Fixture**: `researcher/benchmarks/effectiveness/tasks/<id>-<slug>/`. Each task directory has:
   - `task.md`: the prompt the agent receives.
   - `starting/`: workspace seed copied into a temp directory before the run.
   - `verify.sh`: deterministic ground-truth check returning exit code 0 if the task succeeded.
   - `metadata.json`: relevant skills, irrelevant skills (for negative control), category, expected difficulty.

2. **Conditions**: For each task, run six conditions:
   - `control`: `settingSources: []`. No skills loaded.
   - `target`: `settingSources: ["project"]` with only the target skill present. Other skills are temporarily moved out.
   - `negative`: `settingSources: ["project"]` with only a known-irrelevant skill present.
   - `full`: `settingSources: ["project"]` with all 15 skills present.
   - `target_plus_one`: target skill plus one related skill.
   - `target_plus_unrelated`: target skill plus one unrelated skill (interaction control).

3. **Runner**: `researcher/benchmarks/sdk-runner/src/runEffectiveness.ts`. For each task x condition x model x replication:
   - Build the task workspace from `starting/`.
   - Copy only the in-scope skills into `.cursor/skills/` of the task workspace.
   - Call `Agent.prompt(taskPrompt, { settingSources: ["project"], model: { id }, local: { cwd: taskWorkspace } })`, using `settingSources: []` for the `control` condition (or the cloud option instead of `local` for parallel runs).
   - On completion, run `verify.sh`; record exit code, durationMs, transcript token counts (from `run.conversation()`).
   - Persist transcript JSON, workspace diff, verify output.

4. **Initial task set**: 20 tasks across categories:
   - **filesystem-context**: agent must offload a 5,000-line tool output then retrieve specific data from it.
   - **context-compression**: agent gets a 100k-token chat history and must produce a 2k-token handoff that preserves named entities.
   - **multi-agent-patterns**: agent must decide whether to use subagents for a parallelizable task and justify it.
   - **memory-systems**: agent must persist a user preference across two simulated sessions.
   - **tool-design**: agent must consolidate three overlapping tool calls into one.
   - **evaluation**: agent must produce a rubric for a given task description.
   - **advanced-evaluation**: agent must run a position-bias-mitigated pairwise comparison.
   - **harness-engineering**: agent must identify which of four agent configurations is missing a locked evaluator.
   - **context-degradation**: agent must place critical info at U-curve endpoints for a long context.
   - **context-optimization**: agent must mask tool outputs above 2k tokens.
   - **latent-briefing**: agent must decide whether KV cache compaction applies (positive case + negative case).
   - **bdi-mental-states**: agent must convert a small RDF graph into a structured belief state.
   - **hosted-agents**: agent must propose a warm-pool config for a multiplayer scenario.
   - **project-development**: agent must evaluate task-model fit and propose a pipeline.
   - **context-fundamentals**: agent must explain a context degradation pattern.
   - Plus 5 negative-control tasks where no skill should help (basic arithmetic, plain code reformatting, etc.).

5. **Models**: same as Stage 2.

6. **Replications**: 3 per (task, condition, model). 20 tasks x 6 conditions x 4 models x 3 reps = 1440 agent runs per full sweep.

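The six conditions and the sweep size can be made concrete with a short sketch. The metadata keys (`target`, `related`, `irrelevant`) are hypothetical stand-ins for whatever `metadata.json` actually uses, and Python is used for illustration only since the runner is TypeScript.

```python
from itertools import product

CONDITIONS = ["control", "target", "negative", "full",
              "target_plus_one", "target_plus_unrelated"]
MODELS = ["composer-2", "claude-opus-4-7", "gpt-5.5", "gemini-3.1-pro"]


def skills_for(condition: str, meta: dict, all_skills: list) -> list:
    """Return the skills to place in the workspace for one condition."""
    target = meta["target"]
    return {
        "control": [],                                   # no skills at all
        "target": [target],
        "negative": [meta["irrelevant"]],                # known-irrelevant skill only
        "full": list(all_skills),
        "target_plus_one": [target, meta["related"]],
        "target_plus_unrelated": [target, meta["irrelevant"]],
    }[condition]


def run_matrix(task_ids, models=MODELS, replications=3):
    """Enumerate every (task, condition, model, replication) in a full sweep."""
    return list(product(task_ids, CONDITIONS, models, range(replications)))


meta = {"target": "filesystem-context", "related": "context-optimization",
        "irrelevant": "bdi-mental-states"}
sweep = run_matrix(range(20))  # 20 tasks x 6 conditions x 4 models x 3 reps
```

Enumerating the matrix up front also gives the runner the exact run count it needs for a cost forecast before any agent call.
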
### Cost analysis

- Average effectiveness task is larger than routing prompts: 10-50k input tokens, 1-5k output tokens, multiple tool calls.
- Cursor free-with-credits: should fit in monthly allotment for one full sweep.
- Retail equivalent estimate: $50-200 per full sweep depending on which models are active.

### Reporting

- Per-skill effect size: success rate delta, token cost delta, durationMs delta between control and target.
- Per-skill effect plot: bar chart with 95% CI.
- Negative-control validation: irrelevant skill should show effect size near zero; if not, the test is biased.
- Per-model leaderboard: which model benefits most from skills.
- Append to `researcher/reports/effectiveness-history.jsonl`.

### Why this matters

This is the headline result. "Loading filesystem-context reduces tokens by N% with zero quality loss on tasks where it applies" is the kind of claim that justifies the existence of the skill collection.

## Stage 4: Cross-Skill Composition (v2.5.0)

Composition is where curated collections add or destroy value compared to individual skills. Tests:

- Do two skills loaded together produce additive, synergistic, or conflicting guidance?
- Are integration sections accurate? When skill A's integration mentions skill B, does loading both actually compose?
- Are there ordering effects in how skills appear in context?

Deferred to v2.5.0 because it requires Stage 3 infrastructure plus task design specifically targeting interactions. Sketched here, not designed in detail.

## SDK Integration Details

### Why the Cursor SDK

- Free with the Cursor team credits the user already has.
- Same agent loop as the IDE and CLI, so results transfer to production usage.
- Multi-model: composer-2, claude-opus-4-7, gpt-5.5, gemini-3.1-pro through one interface.
- `settingSources` gives precise control over which skills load (this is the key affordance for ablation).
- Cloud runtime for parallel execution when local resources are insufficient.

### settingSources and skill loading

This is the central mechanism we exploit.

- `settingSources: []` (default): no on-disk settings load. Agent has no skills. Use this as the **control condition** and for the router benchmark where we want descriptions to be the only signal.
- `settingSources: ["project"]`: loads `.cursor/` from the cwd. Used to load only specific skills by copying them into a controlled workspace.
- `settingSources: ["project", "plugins"]`: also loads plugin skills. Less useful for benchmarks; we want full control.
- `settingSources: "all"`: loads everything including user and team settings. Avoid in benchmarks; it leaks the caller's environment.

For Stage 3, each task workspace is built fresh per condition: copy the `starting/` directory to a temp dir, then conditionally place skills into `.cursor/skills/`.

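The per-condition workspace build could be sketched like this. It assumes each skill lives in its own directory under a `skills_root` folder; the function name and the temp-dir prefix are illustrative, and the real runner is TypeScript.

```python
import shutil
import tempfile
from pathlib import Path


def build_workspace(starting: Path, skills_root: Path, skills: list) -> Path:
    """Create a fresh workspace: seed files plus only the in-scope skills."""
    workspace = Path(tempfile.mkdtemp(prefix="bench-"))
    # Copy the task's seed files into the (already existing) temp directory.
    shutil.copytree(starting, workspace, dirs_exist_ok=True)
    for name in skills:
        # copytree creates .cursor/skills/ on demand; with an empty skill
        # list (the control condition) no .cursor directory appears at all.
        shutil.copytree(skills_root / name, workspace / ".cursor" / "skills" / name)
    return workspace
```

A call such as `build_workspace(task_dir / "starting", Path("skills"), ["filesystem-context"])` would yield the `target` workspace for that skill. Building from scratch for every condition avoids state leaking between runs, at the cost of some extra copying.
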
### Result schema we rely on

From the Cursor SDK reference:

```typescript
interface RunResult {
  id: string;
  status: "finished" | "error" | "cancelled";
  result?: string;          // final assistant text
  model?: ModelSelection;   // resolved model
  durationMs?: number;
  git?: RunGitInfo;
}
```

Plus `run.conversation(): Promise<ConversationTurn[]>` for the per-turn transcript. Token counts come from the conversation events (precise schema depends on SDK version; the runner abstracts this).

### Runtime choice per stage

- Stage 2 (router): local runtime, no cwd dependence beyond a temp dir. Fast.
- Stage 3 (effectiveness): local runtime for most tasks; cloud runtime for tasks that need filesystem isolation or parallel execution.
- Stage 4: cloud runtime by default for full parallelism.

### Safety

- Privacy Mode enabled on the Cursor account that runs benchmarks, per Cursor's [privacy docs](https://cursor.help/security-and-privacy/privacy.md). Eval data stays out of training.
- `apiKey` is always passed explicitly rather than picked up implicitly from the environment, which guards against cross-tenant mistakes.
- `await using` syntax for every agent so disposal is automatic.
- Concrete rate-limit handling: backoff schedule per Cursor's docs, exponential 1s/2s/4s/8s/16s on 429.

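The 1s/2s/4s/8s/16s schedule amounts to a bounded retry loop. The sketch below shows the logic in Python for illustration; the `RateLimited` exception is a hypothetical stand-in for however the SDK surfaces a 429, and making one final attempt after the last delay is an assumption.

```python
import time

BACKOFF_SECONDS = (1, 2, 4, 8, 16)


class RateLimited(Exception):
    """Hypothetical stand-in for an HTTP 429 surfaced by the client."""


def with_backoff(call, sleep=time.sleep):
    """Run `call`, retrying on RateLimited with the fixed exponential schedule."""
    for delay in BACKOFF_SECONDS:
        try:
            return call()
        except RateLimited:
            sleep(delay)
    # Schedule exhausted: make one last attempt and let any error propagate.
    return call()
```

Passing `sleep` as a parameter lets tests substitute a recorder for `time.sleep`, so the schedule can be checked without waiting 31 seconds.
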
### Cost gates

The runner implements per-run cost caps:

- `--max-runs N`: hard cap on total agent invocations.
- `--max-budget-usd N`: estimated cost cap, fail fast if exceeded.
- `--dry-run`: print plan, do not call.
- `--models <list>`: subset to one model for development.

Every runner prints a cost forecast before any agent call.

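A forecast-then-gate check might look like the following sketch. The per-million-token prices are placeholders chosen only to make the arithmetic visible, not real Cursor or vendor rates, and the function names are hypothetical.

```python
def forecast_usd(n_runs, input_tokens, output_tokens, usd_per_m_input, usd_per_m_output):
    """Estimate sweep cost from per-run token counts and per-million-token prices."""
    per_run = input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    return n_runs * per_run / 1_000_000


def check_gates(n_runs, estimated_usd, max_runs=None, max_budget_usd=None):
    """Fail fast, before any agent call, if the plan exceeds either cap."""
    if max_runs is not None and n_runs > max_runs:
        raise ValueError(f"{n_runs} runs exceeds --max-runs {max_runs}")
    if max_budget_usd is not None and estimated_usd > max_budget_usd:
        raise ValueError(f"${estimated_usd:.2f} exceeds --max-budget-usd {max_budget_usd}")


# Full router sweep: 1200 calls at ~4k input / ~500 output tokens each.
# Prices (2 and 8 USD per million tokens) are placeholders, not real rates.
estimate = forecast_usd(1200, 4000, 500, usd_per_m_input=2, usd_per_m_output=8)
check_gates(1200, estimate, max_runs=1200, max_budget_usd=15)
```

Under these placeholder prices the estimate comes to 14.40 USD, which passes a 15 USD cap; a `--dry-run` could print the same figure and stop.
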
## Implementation Order

1. **Stage 1 (this PR, v2.2.1)**: `researcher/scripts/skill_health.py`, output file, integration with `loop_daily.py`. No API cost.

2. **SDK runner scaffolding (this PR)**: `researcher/benchmarks/sdk-runner/` with package.json, tsconfig, common utilities, dry-run mode. Compiles and exits cleanly without an API key.

3. **Router fixtures (this PR)**: 50 prompts in `researcher/benchmarks/router/prompts.jsonl`. Adversarial pairs for the v2.2.0 boundary-confusion cases (evaluation vs advanced-evaluation, etc.) plus single-skill positive controls.

4. **First effectiveness task (this PR)**: `researcher/benchmarks/effectiveness/tasks/001-filesystem-context-offload/` fully built. Pattern for the other 19.

5. **Verify (this PR)**: compile, run dry, run skill_health for real, all existing gates still pass.

6. **Execute Stage 1 in CI (next PR after merge)**: add to `loop_daily.py` so skill health updates daily.

7. **Execute Stage 2 (when env key provided)**: run router benchmark, publish results, iterate descriptions where confusions appear.

8. **Build remaining 19 effectiveness tasks (rolling)**: prioritized by which skills carry the most user-facing claims.

9. **Execute Stage 3 (when ready)**: full effectiveness sweep, publish per-skill effect sizes.

10. **Stage 4 design and execution (v2.5.0)**.

## Open Decisions

These need user input before Stage 2 execution. They do not block scaffolding.

1. **Privacy mode**: confirm enabled on the Cursor account that will run benchmarks.
2. **Higher rate limits**: confirm whether to email `[email protected]` for benchmark-grade limits per Cursor's docs, or rely on default limits for a smaller initial sweep.
3. **Models to include in v2.3.0 first run**: ship with composer-2 only, or include claude/gpt/gemini from day one?
4. **Publication policy**: full raw transcripts committed to the repo, or hosted separately and linked? Transcript size at full scale is multi-MB per sweep.
5. **Comparison points**: include public benchmark subsets (BrowseComp, SWE-bench) for cross-reference, or keep our task set self-contained for v2.4.0 and add cross-reference in v2.5.0?

## What This Plan Does Not Solve

- Tokens-per-task accounting depends on the per-turn data the SDK exposes. If `conversation()` does not include token counts we fall back to wall-clock and request-count as cost proxies until we instrument it.
- Cursor's model catalog is not stable across time; we record the resolved model id per run for reproducibility, but cross-version comparisons require care.
- The seed user-curated task set is small. Public credibility requires either growing it to 100+ tasks or aligning with an existing public benchmark.
- Real-world deployment differs from benchmark conditions. Effect sizes here are upper bounds, not guarantees.

## How To Read Results

When a future release shows benchmark numbers, the dashboard answers four questions in order:

1. **Did the harness pass deterministic gates?** (always required, Stage 0 + 1)
2. **Can the descriptions route the right skill?** (Stage 2, per model)
3. **Does the skill actually help?** (Stage 3, per skill x model x task)
4. **Do skills compose?** (Stage 4, future)

A skill that fails at an earlier stage can be removed or reworked on that evidence alone; later stages are not needed to justify the decision.