A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems.
researcher/benchmarks/effectiveness/README.md
# Effectiveness Benchmark (Stage 3)

Real agent tasks executed via the Cursor SDK. Each task is run under multiple skill-loading conditions; the difference in outcome quality, token cost, and wall time is the skill's measured effect size.

See `researcher/benchmarks/PLAN.md` for methodology and `researcher/benchmarks/effectiveness/tasks/001-filesystem-context-offload/` for the canonical task template.

## Task layout

```
researcher/benchmarks/effectiveness/tasks/<NNN>-<slug>/
  README.md        # human-readable task description and grading criteria
  task.md          # the exact prompt given to the agent
  metadata.json    # machine-readable metadata (target_skill, difficulty, category)
  starting/        # workspace seed copied into a temp directory before each run
  verify.sh        # deterministic check returning exit 0 on success
```

`metadata.json` shape:

```json
{
  "id": "001",
  "slug": "filesystem-context-offload",
  "target_skill": "filesystem-context",
  "irrelevant_skill": "bdi-mental-states",
  "category": "context-management",
  "difficulty": "easy",
  "notes": "Optional rationale for picking this task."
}
```

## Conditions

For each task, six conditions are evaluated per model:

| Condition | settingSources | Skills present in `.cursor/skills/` |
| --- | --- | --- |
| `control` | `[]` | none (no skills loaded) |
| `target` | `["project"]` | only `target_skill` |
| `negative` | `["project"]` | only `irrelevant_skill` (negative control) |
| `full` | `["project"]` | all 15 skills |
| `target_plus_one` | `["project"]` | `target_skill` plus one related skill |
| `target_plus_unrelated` | `["project"]` | `target_skill` plus one unrelated skill (interaction control) |

The runner builds a fresh task workspace per (task, condition, model, replication) by copying `starting/` to a temp dir and then placing only the in-scope skills into `.cursor/skills/`.

## Reporting

After each run the runner calls `verify.sh`. Exit code 0 means the task succeeded. Tokens are read from `run.conversation()` and `durationMs` from the SDK result. The runner persists:

- per-condition raw result JSON
- workspace diff before/after
- `verify.sh` stdout/stderr
- `summary.json` with per-task per-condition aggregates

Aggregated results land in `researcher/reports/effectiveness-history.jsonl` as a single line per benchmark sweep.

## Adding a task

1. Create a new directory under `tasks/` with a 3-digit ID and slug.
2. Copy the structure from `001-filesystem-context-offload/`.
3. Write `task.md` so the agent has a self-contained prompt referencing the workspace.
4. Write `verify.sh` so it can be run inside any temp directory and exits 0 on success.
5. Fill `metadata.json` honestly: pick a real `irrelevant_skill` that genuinely should not help.
6. Validate by running `npm run effectiveness:dry-run`.

Negative-control tasks (where no skill should help) live alongside positive tasks. Set `target_skill: "none"` and `irrelevant_skill: "none"` to mark them; the runner skips the `target` and `target_plus_*` conditions and runs only `control`, `full`, and a sanity check.
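
The condition table and the negative-control rule above can be read as a small planning function. The Python sketch below is illustrative only and is not the runner's actual code: the names `CONDITIONS` and `plan_conditions` are hypothetical, and the "sanity check" run for negative-control tasks is left out because this README does not specify what it is.

```python
import json

# Condition name -> settingSources value, copied from the Conditions table.
CONDITIONS = {
    "control": [],
    "target": ["project"],
    "negative": ["project"],
    "full": ["project"],
    "target_plus_one": ["project"],
    "target_plus_unrelated": ["project"],
}


def plan_conditions(metadata: dict) -> list[str]:
    """Return the condition names to run for one task (hypothetical helper)."""
    is_negative_control = (
        metadata.get("target_skill") == "none"
        and metadata.get("irrelevant_skill") == "none"
    )
    if is_negative_control:
        # Per the README, only `control`, `full`, and a sanity check run here.
        # The sanity check is unspecified, so this sketch omits it.
        return ["control", "full"]
    # Regular tasks run all six conditions, in table order.
    return list(CONDITIONS)


# Example metadata in the shape shown under "Task layout".
example = json.loads("""
{
  "id": "001",
  "slug": "filesystem-context-offload",
  "target_skill": "filesystem-context",
  "irrelevant_skill": "bdi-mental-states",
  "category": "context-management",
  "difficulty": "easy"
}
""")

print(plan_conditions(example))
print(plan_conditions({"target_skill": "none", "irrelevant_skill": "none"}))
```

A regular task yields all six condition names, while a task marked with `"none"` for both skills yields only `control` and `full`.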