Researcher Benchmarks
Benchmarks test whether the research-to-skill harness resists common failure modes. Deterministic checks always run first. Model-judged evaluation can be added later as advisory evidence, but it must not override deterministic failures.
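The precedence rule above can be sketched as follows. This is a minimal illustration, not the harness's actual code: `BenchmarkResult`, `run_benchmark`, and the shape of `checks` and `judge` are hypothetical names chosen for the example. Deterministic checks run first, the optional judge runs afterwards, and its score is stored but never consulted when deciding pass or fail.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class BenchmarkResult:
    """Hypothetical result record; field names are assumptions."""
    name: str
    deterministic_failures: List[str] = field(default_factory=list)
    judge_score: Optional[float] = None  # advisory evidence only

    @property
    def passed(self) -> bool:
        # Only deterministic failures decide the outcome.
        # judge_score is deliberately ignored here.
        return not self.deterministic_failures


def run_benchmark(
    name: str,
    checks: List[Tuple[str, Callable[[], bool]]],
    judge: Optional[Callable[[], float]] = None,
) -> BenchmarkResult:
    """Run every deterministic check first, then the optional judge."""
    failures = [label for label, check in checks if not check()]
    score = judge() if judge is not None else None
    return BenchmarkResult(name, failures, score)
```

With this shape, a benchmark whose deterministic check fails stays failed even if the judge returns a perfect score; the score is kept only so it can be reported alongside the result.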
Benchmark results may be appended to researcher/reports/benchmark-history.jsonl for longitudinal tracking.
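A minimal sketch of the append step, assuming each result is a flat dictionary written as one JSON object per line. The helper name `append_result` and the record fields in the usage note are hypothetical; only the file path comes from the text above.

```python
import json
from pathlib import Path

# Path taken from the text above; everything else here is an assumption.
HISTORY_PATH = Path("researcher/reports/benchmark-history.jsonl")


def append_result(record: dict, history_path: Path = HISTORY_PATH) -> None:
    """Append one benchmark record as a single JSON line."""
    path = Path(history_path)
    # Create the reports directory on first use.
    path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "a" adds to the end, so earlier history is never rewritten.
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")
```

For example, `append_result({"benchmark": "demo", "passed": False})` would add one line to the history file. Appending rather than rewriting keeps earlier runs intact, which is what makes the file usable for longitudinal comparison.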