researcher/benchmarks/sdk-runner/README.md
# Researcher SDK Runner

TypeScript runner for the router (Stage 2) and effectiveness (Stage 3) benchmarks. Uses the [Cursor SDK](https://cursor.com/docs/sdk/typescript) to run real agents against the skill corpus.

See `researcher/benchmarks/PLAN.md` for methodology, hypothesis, statistical design, and reproducibility rules.

## Setup

```bash
cd researcher/benchmarks/sdk-runner
npm install
export CURSOR_API_KEY="cursor_..."
```

API keys come from [Cursor Dashboard > Integrations](https://cursor.com/dashboard/integrations). The runner never reads `CURSOR_API_KEY` from anywhere except the explicit env var; it is passed through to every SDK call so cross-tenant mistakes are impossible.

Enable [Privacy Mode](https://cursor.com/help/security-and-privacy/privacy) on the account before any benchmark run so eval traffic stays out of training data.

## Commands

```bash
npm run typecheck

npm run router:dry-run          # print plan and cost forecast, no agent calls
npm run router:run              # execute Stage 2 router benchmark

npm run effectiveness:dry-run
npm run effectiveness:run       # execute Stage 3 effectiveness benchmark
```

Flags shared by both runners:

- `--models <id,id,...>`: subset to specific models (default: `composer-2`).
- `--reps <N>`: replications per condition (default: 3).
- `--max-runs <N>`: hard cap on agent invocations.
- `--max-budget-usd <N>`: estimated cost cap; runner refuses to start if forecast exceeds.
- `--seed <N>`: deterministic shuffling of skill order and tie-breaking.
- `--fixture <path>`: alternate fixture JSONL.
- `--dry-run`: print plan, do not call the SDK.
- `--concurrency <N>`: bounded parallel SDK calls (default 1). 4 to 8 is reasonable; respect Cursor rate limits.
- `--no-resume`: ignore existing per-run results in the destination directory and re-execute everything. Default is to resume by skipping any plan item that already has a result file.

## Output

Runs append a single-line summary to:

- `researcher/reports/router-history.jsonl` (Stage 2)
- `researcher/reports/effectiveness-history.jsonl` (Stage 3)

Per-run raw outputs land under:

- `researcher/benchmarks/router/results/<timestamp>-<seed>/`
- `researcher/benchmarks/effectiveness/results/<timestamp>-<seed>/`

Both `results/` directories are gitignored. Curated released results live in release notes or a published-results file.

## Cost gates

The runner refuses to call the SDK unless either `--max-runs` or `--max-budget-usd` is set (or `--unsafe-no-cost-cap` is explicitly passed). Default budgets are intentionally conservative.

## Reproducibility

Every run records:

- Runner package version
- `CURSOR_API_KEY` fingerprint (last 4 chars only)
- Repo commit SHA at runtime
- Model ids resolved at start
- Fixture revision SHA
- Seed
- Full config snapshot

Third parties can reproduce a run with `node src/runRouter.ts --config <captured-config.json>`.
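
As a rough illustration of the reproducibility record listed above, the sketch below models the recorded fields and the key fingerprint. The `RunRecord` shape, its field names, and the `fingerprintApiKey` and `toJsonLine` helpers are hypothetical and are not taken from the runner's source; the minimum-length guard is an extra precaution added for this example.

```typescript
// Hypothetical record shape; field names are illustrative, not the runner's schema.
interface RunRecord {
  runnerVersion: string;
  apiKeyFingerprint: string; // last 4 characters only
  repoCommitSha: string;
  modelIds: string[];
  fixtureRevisionSha: string;
  seed: number;
  config: Record<string, unknown>;
}

// Keep only the last 4 characters so the full key never reaches a report.
// The length guard (an assumption of this sketch) avoids exposing most of a very short key.
function fingerprintApiKey(apiKey: string): string {
  if (apiKey.length < 8) {
    throw new Error("API key too short to fingerprint safely");
  }
  return apiKey.slice(-4);
}

// Serialize a record as one JSON line, the usual shape for a .jsonl file.
// JSON.stringify without an indent argument never emits raw newlines.
function toJsonLine(record: RunRecord): string {
  return JSON.stringify(record);
}

const exampleRecord: RunRecord = {
  runnerVersion: "0.0.0-example",
  apiKeyFingerprint: fingerprintApiKey("cursor_example_key_abcd"),
  repoCommitSha: "<commit-sha>",
  modelIds: ["composer-2"],
  fixtureRevisionSha: "<fixture-sha>",
  seed: 42,
  config: { reps: 3, concurrency: 1 },
};

console.log(toJsonLine(exampleRecord));
```

Storing only a suffix of the key lets two runs be matched to the same credential without the record itself becoming a secret.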
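
The rule in the Cost gates section can be pictured as a pure check that runs before any SDK call. This is a minimal sketch under stated assumptions, not the runner's implementation: `GateOptions`, `Forecast`, and `checkCostGate` are hypothetical names, and treating `--max-runs` as truncating the plan (rather than refusing to start) is an assumption made here for illustration.

```typescript
// Hypothetical option shape mirroring the documented flags.
interface GateOptions {
  maxRuns?: number;          // --max-runs
  maxBudgetUsd?: number;     // --max-budget-usd
  unsafeNoCostCap?: boolean; // --unsafe-no-cost-cap
}

// Hypothetical forecast produced by a dry run.
interface Forecast {
  plannedRuns: number;
  estimatedUsd: number;
}

type GateResult =
  | { ok: true; allowedRuns: number }
  | { ok: false; reason: string };

function checkCostGate(opts: GateOptions, forecast: Forecast): GateResult {
  const hasCap = opts.maxRuns !== undefined || opts.maxBudgetUsd !== undefined;

  // Refuse unless a cap is set or the explicit escape hatch was passed.
  if (!hasCap && !opts.unsafeNoCostCap) {
    return {
      ok: false,
      reason: "set --max-runs or --max-budget-usd, or pass --unsafe-no-cost-cap",
    };
  }

  // Refuse to start when the forecast exceeds the budget cap.
  if (opts.maxBudgetUsd !== undefined && forecast.estimatedUsd > opts.maxBudgetUsd) {
    return {
      ok: false,
      reason: `forecast $${forecast.estimatedUsd.toFixed(2)} exceeds --max-budget-usd ${opts.maxBudgetUsd}`,
    };
  }

  // Assumption: the run cap truncates the plan instead of rejecting it.
  const allowedRuns =
    opts.maxRuns !== undefined
      ? Math.min(opts.maxRuns, forecast.plannedRuns)
      : forecast.plannedRuns;

  return { ok: true, allowedRuns };
}

console.log(checkCostGate({ maxBudgetUsd: 5 }, { plannedRuns: 30, estimatedUsd: 7.5 }));
```

Keeping the gate free of I/O makes it straightforward to exercise from a dry run, since the same function can report whether a real run would be allowed.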
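
The `--seed` flag promises deterministic shuffling of skill order. One common way to get that property is a small seeded PRNG driving a Fisher-Yates shuffle. The sketch below uses the mulberry32 generator, which is a choice made for this example and not necessarily what the runner uses; `mulberry32` and `seededShuffle` are hypothetical helper names.

```typescript
// mulberry32: a tiny 32-bit seeded PRNG returning floats in [0, 1).
// Chosen here for illustration only.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle on a copy, so the caller's array is left untouched.
// The same seed and input always produce the same order.
function seededShuffle<T>(items: readonly T[], seed: number): T[] {
  const out = [...items];
  const rand = mulberry32(seed);
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    const tmp = out[i] as T;
    out[i] = out[j] as T;
    out[j] = tmp;
  }
  return out;
}

// Placeholder skill names, not entries from the real corpus.
const skills = ["skill-a", "skill-b", "skill-c", "skill-d", "skill-e"];
console.log(seededShuffle(skills, 42));
console.log(seededShuffle(skills, 42)); // same order as the line above
```

Avoiding `Math.random()` matters here because it cannot be seeded, so a captured seed would not be enough to replay the same ordering.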