Researcher Operating System

This directory defines the repo-native workflow for turning external research into skill changes. It is intentionally file-based so agents can inspect, resume, and audit work without requiring a hosted scheduler.

Mission

Maintain this repository as the source of truth for context engineering and harness engineering by continuously:

Discovering credible papers, engineering posts, benchmark reports, and lab notes.
Evaluating sources against explicit rubrics.
Extracting implementable mechanisms, not generic takeaways.
Mapping mechanisms to new skills, existing skill updates, or reference-only notes.
Preparing reviewable PRs after gates pass.

Agents may prepare branches and PR content after passing gates, but humans decide what merges. No workflow in this directory authorizes auto-merge.

Lifecycle

discover -> triage -> evaluate -> extract -> map -> draft -> validate -> prepare-pr -> human-merge

Stage	Output
Discover	Candidate source with URL, author, date, and why it matters
Triage	Source class and exclusion check from `source-registry.md`
Evaluate	JSON matching `templates/source-evaluation.json`
Extract	Mechanisms, artifacts, evidence, and failure modes
Map	Skill proposal using `templates/skill-proposal.md`
Draft	Skill or reference changes in normal repo structure
Validate	Rubric scores plus deterministic structure checks
Prepare PR	PR-ready summary, test plan, and unresolved review notes
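The stage order above, together with the rule that only a human completes the merge, can be sketched as a small state machine. This is a minimal illustration and not code from `researcher/scripts/`. The `Stage` enum, `next_stage`, and `can_advance` are hypothetical names, and `gates_passed` stands in for whatever rubric and structure checks a real run records.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    # Hypothetical enum mirroring the lifecycle line, in order.
    DISCOVER = 1
    TRIAGE = 2
    EVALUATE = 3
    EXTRACT = 4
    MAP = 5
    DRAFT = 6
    VALIDATE = 7
    PREPARE_PR = 8
    HUMAN_MERGE = 9


def next_stage(current: Stage) -> Optional[Stage]:
    """Return the stage after `current`, or None once human-merge is reached."""
    if current is Stage.HUMAN_MERGE:
        return None
    return Stage(current.value + 1)


def can_advance(current: Stage, gates_passed: bool, human_approved: bool = False) -> bool:
    """Agents advance only after gates pass; the merge step needs a human."""
    nxt = next_stage(current)
    if nxt is None:
        return False  # human-merge is terminal
    if not gates_passed:
        return False
    if nxt is Stage.HUMAN_MERGE:
        # No auto-merge: an agent can prepare the PR but cannot complete this step.
        return human_approved
    return True
```

Keeping the human check inside the transition function, rather than in a later step, means no code path can skip it.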

Directory Map

source-registry.md - source classes, priorities, and rejection rules.
mechanisms/registry.jsonl - accepted mechanisms used for novelty and skill-delta checks.
mechanisms/ledgers/ - append-only accepted and rejected mechanism promotion events.
claims/index.jsonl - provenance for volatile, numeric, or benchmark claims.
corpus/index.json - machine-readable map of skills, mechanisms, claims, and activation scenarios.
benchmarks/ - adversarial scenarios and goldens for the researcher harness.
rubrics/content-curation.md - gates for accepting external content.
rubrics/skill-change.md - gates for changing skills.
rubrics/harness-change.md - gates for changing research or evaluation harnesses.
rubrics/pairwise-skill-revision.md - comparison rubric for competing skill drafts.
templates/source-evaluation.json - machine-readable evaluation shape.
templates/skill-proposal.md - source-to-skill delta proposal format.
templates/mechanism-proposal.jsonl - run-local mechanism promotion proposal format.
templates/research-thread.md - durable thread log for long-running agents.
runbooks/autonomous-research-loop.md - operating loop for autonomous researchers.
runbooks/pr-readiness.md - pre-PR checklist.
scripts/validate_repo.py - deterministic repository and harness validator.
scripts/validate_run.py - publish-readiness validator for a single research run.
scripts/research_loop.py - creates durable run directories and validation reports.
scripts/novelty_check.py - checks proposal overlap against existing skills and prior runs.
scripts/compare_skill_revisions.py - deterministic pre-check for pairwise skill revisions.
scripts/check_activation_cases.py - deterministic activation-boundary regression checks.
scripts/run_benchmarks.py - deterministic benchmark harness with optional history recording.
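The `mechanisms/ledgers/` entry above describes append-only promotion events. A minimal sketch of how such an event might be appended and later replayed, assuming one JSON object per line with hypothetical `mechanism_id`, `decision`, and `run_id` fields (the repo's actual schema may differ):

```python
import json
from pathlib import Path


def append_ledger_event(ledger: Path, event: dict) -> None:
    """Append one promotion event as a single JSON line; earlier lines are never rewritten."""
    if event.get("decision") not in {"accepted", "rejected"}:
        raise ValueError("decision must be 'accepted' or 'rejected'")
    ledger.parent.mkdir(parents=True, exist_ok=True)
    with ledger.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event, sort_keys=True) + "\n")


def latest_decisions(ledger: Path) -> dict:
    """Replay the ledger in file order; a later event for the same mechanism wins."""
    decisions = {}
    if not ledger.exists():
        return decisions
    with ledger.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            event = json.loads(line)
            decisions[event["mechanism_id"]] = event["decision"]
    return decisions
```

Replaying the file instead of editing earlier lines keeps every past accept or reject decision available for audit.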

Governance Rules

Keep rubrics harder to change than outputs. A source cannot relax the rubric used to admit it.
Cite only retrieved sources. If a source failed to load, record the failure and do not cite it as evidence.
Separate source quality from skill quality. A strong paper may still produce no actionable skill delta.
Prefer updating existing skills over adding new ones unless the activation scenario, mechanism, and operating procedure are distinct.
Require human review when evidence is anecdotal, source claims are volatile, or a skill change affects repo-wide guidance.
Keep all generated skill changes aligned with template/SKILL.md, the 500-line cap, and manifest sync rules.
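Several of these rules reduce to mechanical checks. The sketch below encodes three of them under stated assumptions: the `ChangeCandidate` fields are hypothetical flags an agent or reviewer would set, sources are plain dicts with hypothetical `url` and `retrieved` keys, and the line count uses `splitlines()`, which may not match how `validate_repo.py` measures the 500-line cap.

```python
from dataclasses import dataclass
from typing import Dict, List

MAX_SKILL_LINES = 500  # the cap named in the governance rules


@dataclass
class ChangeCandidate:
    # Hypothetical fields; the repo's evaluation templates may use different names.
    evidence_is_anecdotal: bool
    has_volatile_claims: bool
    affects_repo_wide_guidance: bool
    skill_text: str


def requires_human_review(c: ChangeCandidate) -> bool:
    """Any one of the three review triggers is enough to require a human."""
    return c.evidence_is_anecdotal or c.has_volatile_claims or c.affects_repo_wide_guidance


def within_line_cap(c: ChangeCandidate) -> bool:
    """True when the drafted skill stays at or under the line cap."""
    return len(c.skill_text.splitlines()) <= MAX_SKILL_LINES


def citable(sources: List[Dict]) -> List[str]:
    """Keep only sources that actually loaded; failed fetches are recorded, not cited."""
    return [s["url"] for s in sources if s.get("retrieved")]
```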

Local Commands

python researcher/scripts/validate_repo.py
python researcher/scripts/validate_run.py --run-dir researcher/runs/<run-id>
python researcher/scripts/research_loop.py init --title "Source title" --url "https://example.com/source"
python researcher/scripts/novelty_check.py --file researcher/fixtures/skill-proposals/harness-engineering-proposal.md
python researcher/scripts/compare_skill_revisions.py skills/evaluation/SKILL.md skills/advanced-evaluation/SKILL.md
python researcher/scripts/check_activation_cases.py
python researcher/scripts/run_benchmarks.py
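When running several of these checks before preparing a PR, a small wrapper can run them in order and stop at the first failure. This sketch assumes each script signals failure with a non-zero exit code and is run from the repository root. `run_checks` and `DEFAULT_CHECKS` are hypothetical names, and only the commands that take no arguments are included.

```python
import subprocess
import sys
from typing import List, Optional, Sequence

# Argument-free commands from the list above, run with the current interpreter.
DEFAULT_CHECKS: List[List[str]] = [
    [sys.executable, "researcher/scripts/validate_repo.py"],
    [sys.executable, "researcher/scripts/check_activation_cases.py"],
    [sys.executable, "researcher/scripts/run_benchmarks.py"],
]


def run_checks(commands: Sequence[Sequence[str]], cwd: str = ".") -> Optional[List[str]]:
    """Run commands in order; return the first one that exits non-zero, or None."""
    for cmd in commands:
        result = subprocess.run(list(cmd), cwd=cwd)
        if result.returncode != 0:
            return list(cmd)
    return None
```

From the repository root, `run_checks(DEFAULT_CHECKS)` returns `None` when every check passes, or the failing command otherwise.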

Current Published Research Skills

The first published skill from this operating system is harness-engineering. Skill evolution remains internal to this directory until the process has enough examples and validation data to justify a standalone published skill.
