Agent Skills for Context Engineering

A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems.

Source repo: muratcankoylan on GitHub

- Files: 339
- Skill: n/a
- Size: 4.3 MB
- Entrypoint: SKILL.md
- Format: git-repo

researcher/benchmarks/sdk-runner/README.md

# Researcher SDK Runner

TypeScript runner for the router (Stage 2) and effectiveness (Stage 3) benchmarks. Uses the [Cursor SDK](https://cursor.com/docs/sdk/typescript) to run real agents against the skill corpus.

See `researcher/benchmarks/PLAN.md` for methodology, hypothesis, statistical design, and reproducibility rules.

## Setup

```bash
cd researcher/benchmarks/sdk-runner
npm install
export CURSOR_API_KEY="cursor_..."
```

API keys come from [Cursor Dashboard > Integrations](https://cursor.com/dashboard/integrations). The runner reads `CURSOR_API_KEY` only from the explicit environment variable and passes it through to every SDK call, which guards against cross-tenant mistakes.

Enable [Privacy Mode](https://cursor.com/help/security-and-privacy/privacy) on the account before any benchmark run so eval traffic stays out of training data.

## Commands

```bash
npm run typecheck

npm run router:dry-run       # print plan and cost forecast, no agent calls
npm run router:run           # execute Stage 2 router benchmark

npm run effectiveness:dry-run
npm run effectiveness:run    # execute Stage 3 effectiveness benchmark
```

Flags shared by both runners:

- `--models <id,id,...>`: subset to specific models (default: `composer-2`).
- `--reps <N>`: replications per condition (default: 3).
- `--max-runs <N>`: hard cap on agent invocations.
- `--max-budget-usd <N>`: estimated cost cap; the runner refuses to start if the forecast exceeds it.
- `--seed <N>`: deterministic shuffling of skill order and tie-breaking.
- `--fixture <path>`: alternate fixture JSONL.
- `--dry-run`: print plan, do not call the SDK.
- `--concurrency <N>`: bounded parallel SDK calls (default: 1). 4 to 8 is reasonable; respect Cursor rate limits.
- `--no-resume`: ignore existing per-run results in the destination directory and re-execute everything. Default is to resume by skipping any plan item that already has a result file.

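To illustrate what "deterministic shuffling" under `--seed` can look like, here is a minimal sketch. It assumes a small linear congruential generator feeding a Fisher-Yates shuffle; the helper names `makeRng` and `seededShuffle` and the skill ids are hypothetical, and the runner's actual PRNG is not documented here.

```typescript
// Hypothetical helpers; the runner's real PRNG and shuffle are not shown in this README.

/** Small linear congruential generator: same seed, same sequence of values in [0, 1). */
function makeRng(seed: number): () => number {
  let state = seed >>> 0; // coerce to an unsigned 32-bit integer
  return () => {
    // Numerical Recipes LCG constants; the product stays below 2^53, so it is exact.
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

/** Fisher-Yates shuffle driven by the seeded generator; returns a new array. */
function seededShuffle<T>(items: readonly T[], seed: number): T[] {
  const out = items.slice();
  const rng = makeRng(seed);
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1)); // index in [0, i]
    const tmp = out[i];
    out[i] = out[j];
    out[j] = tmp;
  }
  return out;
}

const skills = ["skill-a", "skill-b", "skill-c", "skill-d"]; // placeholder ids
const first = seededShuffle(skills, 42);
const second = seededShuffle(skills, 42);
console.log(first.join(",") === second.join(",")); // true: same seed, same order
```

Because the generator is a pure function of the seed, two runs with the same `--seed` and the same input list visit skills in the same order, which is what makes a captured config replayable.
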
## Output

Runs append a single-line summary to:

- `researcher/reports/router-history.jsonl` (Stage 2)
- `researcher/reports/effectiveness-history.jsonl` (Stage 3)

Per-run raw outputs land under:

- `researcher/benchmarks/router/results/<timestamp>-<seed>/`
- `researcher/benchmarks/effectiveness/results/<timestamp>-<seed>/`

Both `results/` directories are gitignored. Curated released results live in release notes or a published-results file.

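The single-line summaries above follow the usual JSONL convention of one JSON object per line. A minimal sketch of writing and reading such records is shown below; the `RunSummary` fields are hypothetical, since this README does not list what the runner stores in each summary.

```typescript
// Hypothetical summary shape; the real fields are defined by the runner, not by this README.
interface RunSummary {
  stage: "router" | "effectiveness";
  seed: number;
  models: string[];
  completedRuns: number;
}

/** Serialize one summary as a single JSONL record (one JSON object, one trailing newline). */
function toJsonlLine(summary: RunSummary): string {
  // JSON.stringify without an indent argument never emits raw newlines,
  // so the record occupies exactly one line.
  return JSON.stringify(summary) + "\n";
}

/** Parse a history file's contents back into summaries, skipping blank lines. */
function parseHistory(contents: string): RunSummary[] {
  return contents
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as RunSummary);
}

// Two appended runs, as they would appear in a history file.
const history =
  toJsonlLine({ stage: "router", seed: 42, models: ["composer-2"], completedRuns: 12 }) +
  toJsonlLine({ stage: "router", seed: 43, models: ["composer-2"], completedRuns: 12 });
const parsed = parseHistory(history);
console.log(parsed.length, parsed[1].seed); // 2 43
```

Appending whole lines keeps the history file readable with line-oriented tools and means a partially written run never corrupts earlier records.
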
## Cost gates

The runner refuses to call the SDK unless either `--max-runs` or `--max-budget-usd` is set (or `--unsafe-no-cost-cap` is explicitly passed). Default budgets are intentionally conservative.

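The gate described above can be sketched as a pure pre-flight check. This is an illustration under assumptions, not the runner's code: `CostGateOptions`, `Forecast`, and `checkCostGate` are hypothetical names that mirror the documented flags, and the forecast numbers are made up.

```typescript
// Hypothetical option and field names that mirror the documented flags.
interface CostGateOptions {
  maxRuns?: number;          // --max-runs
  maxBudgetUsd?: number;     // --max-budget-usd
  unsafeNoCostCap?: boolean; // --unsafe-no-cost-cap
}

interface Forecast {
  plannedRuns: number;
  estimatedUsd: number;
}

/** Returns a refusal message, or null when the run may proceed. */
function checkCostGate(opts: CostGateOptions, forecast: Forecast): string | null {
  const hasCap = opts.maxRuns !== undefined || opts.maxBudgetUsd !== undefined;
  if (!hasCap && !opts.unsafeNoCostCap) {
    return "refusing to call the SDK: set --max-runs or --max-budget-usd";
  }
  if (opts.maxBudgetUsd !== undefined && forecast.estimatedUsd > opts.maxBudgetUsd) {
    return `forecast $${forecast.estimatedUsd.toFixed(2)} exceeds budget cap $${opts.maxBudgetUsd.toFixed(2)}`;
  }
  // How --max-runs is enforced (refuse up front vs. stop early) is not specified
  // in this README, so the sketch only counts it as "a cap is set".
  return null;
}

const forecast: Forecast = { plannedRuns: 30, estimatedUsd: 12.5 }; // made-up numbers
const noCap = checkCostGate({}, forecast);
const overBudget = checkCostGate({ maxBudgetUsd: 10 }, forecast);
const allowed = checkCostGate({ maxRuns: 50 }, forecast);
console.log(noCap, overBudget, allowed);
```

Running the check before any SDK call means a misconfigured invocation fails fast instead of part-way through a paid benchmark.
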
## Reproducibility

Every run records:

- Runner package version
- `CURSOR_API_KEY` fingerprint (last 4 chars only)
- Repo commit SHA at runtime
- Model ids resolved at start
- Fixture revision SHA
- Seed
- Full config snapshot

Third parties can reproduce a run with `node src/runRouter.ts --config <captured-config.json>`.

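As a rough picture of the recorded metadata, the list above could map onto a manifest type like the following. The field names, `fingerprintApiKey`, and all values are hypothetical placeholders; only the set of recorded items and the "last 4 chars" rule come from this README.

```typescript
// Hypothetical field names; only the list of recorded items comes from the README.
interface RunManifest {
  runnerVersion: string;
  apiKeyFingerprint: string; // last 4 characters only, never the full key
  repoCommitSha: string;
  resolvedModelIds: string[];
  fixtureRevisionSha: string;
  seed: number;
  config: Record<string, unknown>; // full config snapshot
}

/** Keep only the last 4 characters so the manifest never stores the secret itself. */
function fingerprintApiKey(apiKey: string): string {
  return apiKey.slice(-4);
}

const manifest: RunManifest = {
  runnerVersion: "0.0.0", // placeholder
  apiKeyFingerprint: fingerprintApiKey("cursor_example_key_9f3a"), // fake key for illustration
  repoCommitSha: "<commit-sha>",
  resolvedModelIds: ["composer-2"],
  fixtureRevisionSha: "<fixture-sha>",
  seed: 42,
  config: { reps: 3, concurrency: 1 },
};
console.log(manifest.apiKeyFingerprint); // prints 9f3a
```

Storing a short fingerprint rather than the key lets a reader confirm which credential produced a run without the manifest becoming a secret in its own right.
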