Researcher Benchmarks

Benchmarks test whether the research-to-skill harness resists common failure modes. Deterministic checks always run first. Model-judged evaluation can be added later as advisory evidence, but it must not override deterministic failures.
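The ordering rule above can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: the names `Verdict`, `evaluate`, `checks`, and `judge` are hypothetical, and the two example checks are invented for the demonstration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    """Outcome of one benchmark case (hypothetical structure)."""
    passed: bool
    deterministic_failures: list[str]
    advisory_score: Optional[float] = None  # informational only


def evaluate(
    output: str,
    checks: dict[str, Callable[[str], bool]],
    judge: Optional[Callable[[str], float]] = None,
) -> Verdict:
    # Deterministic checks run first; each returns True when the output passes.
    failures = [name for name, check in checks.items() if not check(output)]

    # The model judge (if any) runs afterwards. Its score is recorded as
    # advisory evidence and is never used to compute `passed`.
    score = judge(output) if judge is not None else None

    return Verdict(
        passed=not failures,
        deterministic_failures=failures,
        advisory_score=score,
    )


# Example checks, invented for illustration.
checks = {
    "non_empty": lambda s: bool(s.strip()),
    "cites_source": lambda s: "source:" in s.lower(),
}

# A high advisory score does not rescue a deterministic failure.
verdict = evaluate("A claim with no citation.", checks, judge=lambda s: 0.95)
```

Here `verdict.passed` is `False` even though the stand-in judge returned a high score, because `passed` depends only on the deterministic failures.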

Benchmark results may be appended to researcher/reports/benchmark-history.jsonl for longitudinal tracking.
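Appending to a JSONL history file might look like the sketch below. The record fields (`benchmark`, `passed`, `recorded_at`) and the helper names `append_result` and `load_history` are assumptions made for illustration; the source does not specify a schema. The demo writes to a temporary directory rather than the real report path.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def append_result(history_path: Path, record: dict) -> None:
    """Append one result as a single JSON line (hypothetical helper)."""
    history_path.parent.mkdir(parents=True, exist_ok=True)
    entry = {"recorded_at": datetime.now(timezone.utc).isoformat(), **record}
    # Mode "a" adds to the end of the file, so earlier runs are preserved.
    with history_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, sort_keys=True) + "\n")


def load_history(history_path: Path) -> list[dict]:
    """Read every recorded result back, skipping blank lines."""
    if not history_path.exists():
        return []
    with history_path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Demo against a temporary directory that mirrors the documented layout.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "researcher" / "reports" / "benchmark-history.jsonl"
    append_result(path, {"benchmark": "example-case", "passed": True})
    append_result(path, {"benchmark": "example-case", "passed": False})
    history = load_history(path)
```

One JSON object per line keeps each run independent, so a later run can be appended without rewriting or re-parsing earlier entries.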
