Router Benchmark (Stage 2)
Tests whether the activation-scenario descriptions in v2.2.0 skill frontmatter are good enough to route the right skill to a given prompt.
See researcher/benchmarks/PLAN.md for full methodology.
Files
prompts.jsonl: ground-truth prompts. Each line hasprompt_id,prompt,expected_primary_skill, optionalacceptable_secondary_skillsandrejected_skills, and areason.routing-prompt.md: the template given to the LLM. Uses{{SKILL_BLOCK}},{{USER_PROMPT}},{{SKILL_COUNT}}placeholders.results/<date>-<seed>/: per-run JSON outputs (gitignored).
Running
From the SDK runner:
cd researcher/benchmarks/sdk-runner
npm install
npm run router:dry-run # see the plan and cost forecast
npm run router:run -- --max-budget-usd 5 # execute (after exporting CURSOR_API_KEY)Ground truth
Initial fixtures are 50 prompts covering:
- Single-skill positive controls (one per skill, 15 cases)
- Adversarial boundary pairs from the v2.2.0 boundary-confusion list (15 cases across 5 pairs x 3 variants)
- Combined-skill prompts where multiple are acceptable (10 cases)
- Negative controls where no skill should fit well (5 cases)
- Subtle activation cases that should still resolve (5 cases)
Expand to 100 by adding prompts as new boundary confusions surface in the wild.