Agent Skills for Context Engineering

A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems.

Source repo: muratcankoylan on GitHub

researcher/benchmarks/effectiveness/README.md

# Effectiveness Benchmark (Stage 3)

Real agent tasks executed via the Cursor SDK. Each task is run under multiple skill-loading conditions; the difference in outcome quality, token cost, and wall time is the skill's measured effect size.

See `researcher/benchmarks/PLAN.md` for methodology and `researcher/benchmarks/effectiveness/tasks/001-filesystem-context-offload/` for the canonical task template.

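One plausible reading of "effect size" here is the per-metric difference between a condition and the `control` baseline, averaged over replications. The sketch below illustrates that reading only; the authoritative definition is in `PLAN.md`, and the record fields (`condition`, `success`, `tokens`, `duration_ms`) are hypothetical names chosen for this example.

```python
from statistics import mean


def effect_size(runs, condition, baseline="control"):
    """Return mean(condition) - mean(baseline) for each metric.

    `runs` is a list of dicts with hypothetical keys:
    condition, success (bool), tokens (int), duration_ms (int).
    """

    def summarize(name):
        rows = [r for r in runs if r["condition"] == name]
        if not rows:
            raise ValueError(f"no runs recorded for condition {name!r}")
        return {
            "success_rate": mean(1.0 if r["success"] else 0.0 for r in rows),
            "tokens": mean(r["tokens"] for r in rows),
            "duration_ms": mean(r["duration_ms"] for r in rows),
        }

    base = summarize(baseline)
    cond = summarize(condition)
    # Positive success_rate delta is good; negative tokens/duration deltas are good.
    return {metric: cond[metric] - base[metric] for metric in base}
```

A positive `success_rate` delta together with a negative `tokens` delta would suggest the skill helps and is cheap to load; the two should be read side by side rather than collapsed into one number.
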
## Task layout

```
researcher/benchmarks/effectiveness/tasks/<NNN>-<slug>/
  README.md         # human-readable task description and grading criteria
  task.md           # the exact prompt given to the agent
  metadata.json     # machine-readable metadata (target_skill, difficulty, category)
  starting/         # workspace seed copied into a temp directory before each run
  verify.sh         # deterministic check returning exit 0 on success
```

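As a quick illustration of the layout above, the following sketch lists which expected entries are missing from a task directory. It is not part of the repo's runner; the function name `missing_entries` is hypothetical.

```python
from pathlib import Path

# Entries taken from the layout listing above.
REQUIRED_FILES = ("README.md", "task.md", "metadata.json", "verify.sh")
REQUIRED_DIRS = ("starting",)


def missing_entries(task_dir):
    """Return the names of required files/directories absent from task_dir."""
    task_dir = Path(task_dir)
    missing = [name for name in REQUIRED_FILES if not (task_dir / name).is_file()]
    missing += [name + "/" for name in REQUIRED_DIRS if not (task_dir / name).is_dir()]
    return missing
```

An empty list means the directory has every entry the layout calls for; it says nothing about whether the contents are valid.
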
`metadata.json` shape:

```json
{
  "id": "001",
  "slug": "filesystem-context-offload",
  "target_skill": "filesystem-context",
  "irrelevant_skill": "bdi-mental-states",
  "category": "context-management",
  "difficulty": "easy",
  "notes": "Optional rationale for picking this task."
}
```

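A minimal sketch of loading and sanity-checking this shape follows. Which keys are mandatory is an assumption made for the example (everything except `notes`, which the sample describes as optional); the runner's real validation may differ.

```python
import json

# Assumed-required keys; "notes" is treated as optional per the sample above.
REQUIRED_KEYS = ("id", "slug", "target_skill", "irrelevant_skill", "category", "difficulty")


def load_metadata(text):
    """Parse metadata.json text and check the assumed-required keys."""
    data = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"metadata.json is missing keys: {missing}")
    task_id = data["id"]
    # Task directories use a 3-digit ID, so the id is expected to look like "001".
    if not (isinstance(task_id, str) and len(task_id) == 3 and task_id.isdigit()):
        raise ValueError("id must be a three-digit string such as '001'")
    return data
```
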
## Conditions

For each task, six conditions are evaluated per model:

| Condition | settingSources | Skills present in `.cursor/skills/` |
| --- | --- | --- |
| `control` | `[]` | none (no skills loaded) |
| `target` | `["project"]` | only `target_skill` |
| `negative` | `["project"]` | only `irrelevant_skill` (negative control) |
| `full` | `["project"]` | all 15 skills |
| `target_plus_one` | `["project"]` | `target_skill` plus one related skill |
| `target_plus_unrelated` | `["project"]` | `target_skill` plus one unrelated skill (interaction control) |

The runner builds a fresh task workspace per (task, condition, model, replication) by copying `starting/` to a temp dir and then placing only the in-scope skills into `.cursor/skills/`.

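The workspace construction described above can be sketched as follows. This is an illustration under stated assumptions, not the runner's code: it assumes each skill is a directory under a hypothetical `skills_root`, that the related and unrelated skills for the `target_plus_*` conditions are supplied by the caller (the README does not say how they are chosen), and it does not model `settingSources`.

```python
import shutil
import tempfile
from pathlib import Path


def skills_for_condition(condition, metadata, all_skills, related=None, unrelated=None):
    """Map a condition name to the skill names that belong in .cursor/skills/."""
    target = metadata["target_skill"]
    table = {
        "control": [],
        "target": [target],
        "negative": [metadata["irrelevant_skill"]],
        "full": list(all_skills),
        "target_plus_one": [target, related],
        "target_plus_unrelated": [target, unrelated],
    }
    skills = table[condition]
    if any(name is None for name in skills):
        raise ValueError(f"condition {condition!r} needs an extra skill name")
    return skills


def build_workspace(task_dir, skills_root, skill_names):
    """Copy starting/ into a fresh temp dir, then add only the in-scope skills."""
    workspace = Path(tempfile.mkdtemp(prefix="effectiveness-"))
    shutil.copytree(Path(task_dir) / "starting", workspace, dirs_exist_ok=True)
    skills_dir = workspace / ".cursor" / "skills"
    skills_dir.mkdir(parents=True, exist_ok=True)
    for name in skill_names:
        # Assumption: a skill is a directory named after the skill.
        shutil.copytree(Path(skills_root) / name, skills_dir / name)
    return workspace
```

Creating a new temp directory for every (task, condition, model, replication) tuple keeps runs from contaminating each other, which is what makes the per-condition comparison meaningful.
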
## Reporting

After each run the runner calls `verify.sh`. Exit code 0 means the task succeeded. Tokens are read from `run.conversation()` and `durationMs` from the SDK result. The runner persists:

- per-condition raw result JSON
- workspace diff before/after
- `verify.sh` stdout/stderr
- `summary.json` with per-task per-condition aggregates

Aggregated results land in `researcher/reports/effectiveness-history.jsonl` as a single line per benchmark sweep.

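The verification and history steps above might look roughly like this. It is a sketch: invoking the script through `sh` and the shape of the returned record are assumptions, and token and duration capture from the Cursor SDK is left out because it depends on the SDK itself.

```python
import json
import subprocess
from pathlib import Path


def run_verify(workspace, verify_script):
    """Run verify.sh inside the workspace; exit code 0 counts as success."""
    proc = subprocess.run(
        ["sh", str(Path(verify_script).resolve())],  # assumption: POSIX sh suffices
        cwd=workspace,
        capture_output=True,
        text=True,
    )
    return {
        "success": proc.returncode == 0,
        "exit_code": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }


def append_sweep(history_path, sweep_summary):
    """Append one benchmark sweep as a single JSON line."""
    path = Path(history_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(sweep_summary, sort_keys=True) + "\n")
```

Appending one line per sweep keeps the history file easy to diff and to stream, since each line parses on its own.
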
## Adding a task

1. Create a new directory under `tasks/` with a 3-digit ID and slug.
2. Copy the structure from `001-filesystem-context-offload/`.
3. Write `task.md` so the agent has a self-contained prompt referencing the workspace.
4. Write `verify.sh` so it can be run inside any temp directory and exits 0 on success.
5. Fill `metadata.json` honestly: pick a real `irrelevant_skill` that genuinely should not help.
6. Validate by running `npm run effectiveness:dry-run`.

Negative-control tasks (where no skill should help) live alongside positive tasks. Set `target_skill: "none"` and `irrelevant_skill: "none"` to mark them; the runner skips the `target` and `target_plus_*` conditions and runs only `control`, `full`, and a sanity check.
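
The condition selection for negative-control tasks can be pictured with the sketch below. The function name is hypothetical, and the "sanity check" is not modeled because the README does not describe what it runs.

```python
ALL_CONDITIONS = (
    "control",
    "target",
    "negative",
    "full",
    "target_plus_one",
    "target_plus_unrelated",
)


def conditions_for_task(metadata):
    """Return the condition names to run for a task, per the rule above."""
    is_negative_control = (
        metadata["target_skill"] == "none" and metadata["irrelevant_skill"] == "none"
    )
    if is_negative_control:
        # No skill to target: only the baseline and the full skill set remain.
        # The additional sanity check mentioned in the README is omitted here.
        return ["control", "full"]
    return list(ALL_CONDITIONS)
```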