researcher/insights/auto-research-experiment.md
# Auto-Research Experiment: Lessons From Building The Researcher OS

This document records what was actually learned across the multi-session experiment that built the v2.2.0 researcher operating system. It is meant to be read alongside `researcher/README.md` and `runbooks/continuous-operation.md`, not in place of them.

## Context

The experiment started with a published skill collection (v2.0/v2.1) and a single question: how do we keep this corpus alive without turning it into a one-person editorial job. The answer became an autonomous research-to-skill operating system with a continuous loop, deterministic gates, mechanism promotion, claim provenance, and explicit run states.

The goal was not autonomy for its own sake. The goal was compounding: each run should leave the next run better off.

## Findings That Changed The Architecture

### 1. Activation scenarios beat keyword triggers

The original skill frontmatter used quoted keyword lists like `"design agent tools"`, `"reduce tool complexity"`, etc. Those decayed fast: any rewording broke discovery, and the model could not distinguish neighboring skills with overlapping vocabularies.

Switching every skill description to a task-boundary statement let `check_activation_cases.py` actually measure routing. The first run of that checker immediately failed on `evaluation` vs `advanced-evaluation` and `project-development` vs `harness-engineering`. Those failures pointed at real wording problems, not contrived edge cases.

Takeaway: trigger lists feel concrete but are unstable scaffolding. Activation scenarios scale because they describe what the skill is for.

### 2. The mechanism registry became the encyclopedia backbone

The mechanism registry started as a way to make novelty detection smarter than token overlap.
By the time it was wired through `novelty_check.py`, `validate_repo.py`, `corpus/index.json`, and the run state machine, it had become the canonical map of which behavior changes the corpus actually endorses.

Once mechanisms were first-class data with ledgers (`accepted.jsonl`, `rejected.jsonl`) and gated promotion, future runs gained free institutional memory. A new proposal can be compared against accepted patterns, candidate patterns, and explicitly rejected patterns from prior runs.

Takeaway: making the unit of knowledge machine-readable is the actual unlock. Markdown is a publication target, not a memory layer.

### 3. Deterministic gates caught the categorical failures

Every time `validate_repo.py` was tightened, it caught real bugs that had already shipped:

- Duplicate JSON keys in `source-evaluation.json`.
- Incorrect rubric math (recorded `weighted_total` did not match the dimension scores).
- `APPROVE` decisions on `partial` or `failed` retrievals.
- Mechanism registry entries with broken evidence paths.
- Activation cases pointing at unpublished skills.
- Run artifacts missing required pieces.
- Raw research dumps left at the repo root.

None of these required model judgment. They were structural. Spending tokens on judges before these gates exist was the wrong order of operations.

Takeaway: deterministic verification is cheaper than LLM judgment and catches a different class of failures. Build the structural validator first.

### 4. Prose logs are not enforceable

`THREAD.md` was useful for humans reading the history of a run. It was useless for telling another agent what the run was actually doing or whether it was safe to advance. The fix was `run-state.json` with explicit transitions and `state_history`, plus `validate_run.py` to enforce publish readiness independent of the prose log.

The first version of the state machine still passed validation while runs were intentionally incomplete.
Splitting "is the corpus healthy" (`validate_repo`) from "is this run ready to ship" (`validate_run`) eliminated that ambiguity.

Takeaway: machine-readable state with explicit transitions is the difference between "the run looks ready" and "the run is ready". Prose logs should be the human surface, not the source of truth.

### 5. Closed-run reaping was the obvious bug

The first continuous loop would have parked closed runs forever. `loop_step.py` only advanced active runs, so a run that transitioned to `closed` via `research_loop.py close` would sit in `parked.jsonl` until the parked queue filled and the loop halted.

`reap_closed_runs()` is six lines. The missing piece was not complexity; it was recognizing that "the loop will eventually pick it up" was wishful thinking when no path existed.

Takeaway: queue-based systems need an explicit reaper for every state that exits the queue. Audit every terminal state and ask "who removes this entry?"

### 6. Concurrency bugs are invisible until you assume the daemon

`loop_discover.py` appended to `inbox.jsonl`. `loop_step.py` read `inbox.jsonl`, popped one item, and wrote the rest back. With manual usage this was fine. With launchd running both on overlapping schedules, discovery's appended items could be silently overwritten.

The fix was atomic writes (`tempfile` + `os.replace`) plus `fcntl` exclusive locks scoped per queue family. Cost was a small dependency on `fcntl` (Unix only, acceptable for macOS). The lock also had to be held through the subprocess that initialized the run, not released right after the pop, because otherwise two concurrent invocations would each pass the budget check and exceed `max_active_runs`.

Takeaway: as soon as the same files are touched by two cron jobs, append/replace patterns need atomic semantics and locks. There is no halfway design.

### 7. Adversarial benchmarks tested the harness, not the outputs

The benchmark harness exists to ask "what would it look like if someone tried to game this loop?" The scenarios are concrete attacks: duplicate mechanism with different wording, credible-author-but-generic content, unretrieved source cited as evidence, valid JSON with wrong rubric math, verbose skill draft that adds no behavior change, proposal that modifies the rubric used to approve itself.

Writing the attack first showed which gates were missing. Most of the validator hardening came from staring at adversarial scenarios and asking "would the current code catch this?"

Takeaway: write the failure modes before claiming the defense works. Adversarial fixtures are the cheapest way to find category-level gaps.

### 8. Claim provenance prevents claim rot

Skills accumulate numeric claims: BrowseComp variance percentages, LoCoMo accuracy numbers, multi-agent token multipliers, masking gain estimates. These age fast and humans never notice until the source is gone.

`researcher/claims/index.jsonl` makes each claim auditable: source URL, evidence strength, volatility, last reviewed date. `loop_daily.py` can flag high-volatility claims that have not been reviewed today. The validator enforces that owning skills exist and source paths resolve.

Takeaway: any claim that could rot needs explicit provenance. Burying them in prose is a guarantee that future agents will repeat them without checking.

### 9. Subagent reviews are a deterministic gate too

Multiple times in this experiment, an implementation was declared complete, then a `code-reviewer` subagent flagged concrete bugs: closed-run reaping, scheme allowlist on fetch, race condition on the inbox lock, dead config flags, redundant heuristics that silently passed broken runs. Each review pass moved the implementation from "looks right" to "is right".

This is cheap.
A single review pass typically costs less than what it saves in production debugging. The review prompt that worked best was: "Focus only on HIGH-CONFIDENCE issues that would cause incorrect behavior, data loss, runaway resource use, or security risk. Skip stylistic comments."

Takeaway: treat code review as part of the build pipeline. Even adversarial scenarios miss class-of-error bugs that a fresh reader catches in minutes.

### 10. Live execution caught what paper review missed

After building the continuous loop, running it for 13 iterations live exposed:

- The pre-existing seed run got parked at "needs evaluation" because its source URL pointed at a Parallel research interaction, not the actual evidence files.
- Daily-budget timing depended on the test runs counting against the same day.
- Status dashboard short-circuited the launchd wrapper when loop_step exited 78 (no-op), preventing dashboard refresh on idle iterations.

None of these were visible from reading the code. All were obvious within minutes of running it.

Takeaway: end-to-end execution is the highest-signal validation for orchestration code. Unit tests of pieces will not catch state-machine timing or wrapper exit-code interactions.

### 11. Continuous operation is a separate architectural concern

The breakthrough roadmap delivered the right per-task primitives: validate_run, mechanism promotion, claim provenance, activation tests, benchmarks, corpus index. None of those answered "what runs this loop when no one is at the keyboard."

That answer required different machinery: queue, source discovery, scheduler (launchd), budgets, parking, reaping, status surfaces, runbook. Features and continuous operation are orthogonal concerns. Treating them as the same milestone hid the gap.

Takeaway: per-task correctness and continuous operation are different problems. Design them as separate layers and only couple them through file-based state.

### 12. Terminal states convert technical debt into documentation

The seed run started this experiment in flight, with a draft evaluation and a template proposal. Each repo-validation pass had to special-case it. Closing it as `reference-only` with a clear rationale converted it from "incomplete work" into "worked example of the lifecycle." Same files, different framing.

Takeaway: when in-flight artifacts will not be completed, give them a terminal state with an explicit reason. They become useful as examples instead of remaining as debt.

### 13. Token cost of structural validation is near zero

Validating the entire 14-skill corpus, all rubrics, all manifests, the mechanism registry, the claims index, the corpus index, activation cases, benchmark scenarios, and all run artifacts takes under a second of CPU and zero LLM tokens. There is no excuse for not running these checks on every PR and on every loop iteration.

Takeaway: cheap structural checks are the right floor. Spend model attention only on what cannot be deterministically verified.

### 14. The description-benchmark loop closes in ~2 hours and is the highest-leverage feedback loop in the system

This was the v2.3.0 finding. Stage 2 of the benchmark plan shipped real numbers across four frontier models, then we rewrote three skill descriptions based on the confusion matrix, then we re-ran the benchmark and measured the delta. End to end:

- Baseline benchmark: 50 prompts x 4 models x 3 reps = 600 runs (sequential, ~60 min; killed at 566 on first attempt and resumed via results-folder scan).
- Description rewrites for `context-fundamentals`, `tool-design`, `project-development` based on the baseline confusion matrix: ~10 minutes of focused writing.
- Re-run benchmark with same seed, same fixture: 600/600 runs in ~15 minutes (concurrency=4) with per-run progress logging.
- Delta report generation: 1 minute.

Total wall time end-to-end: under 2 hours including discussion.
Result: `context-fundamentals` top-1 +23.4pp, `project-development` top-1 +25pp (now perfect), `tool-design` top-1 +7.8pp, three of four models gained on top-1, all four gained on top-3.

Three properties of the loop matter:

1. **Same seed, same fixture, same shuffles**. The only changing variable across the two runs was the descriptions. This is what lets the delta be attributed to the rewrites, not luck.
2. **Per-skill effect size is the unit of action**. Aggregate accuracy moves by ~1pp; individual skills move by 25pp. Reporting only the aggregate hides which descriptions need work.
3. **The confusion matrix tells you what to rewrite**. The baseline showed `context-fundamentals` leaking specifically to `context-degradation`, `project-development`, and `context-optimization`. The rewrite was guided by the exact direction of each leak. It is much harder to fix a description by re-reading it; it is straightforward to fix one by looking at which other skill claimed its territory.

Takeaway: for any system where description quality matters (skill routers, tool routers, RAG document selection), invest in a measurement loop early. The first run identifies the failures; the second run proves whether the fix worked. Without the second run, you do not actually know.

### 15. Description rewrites are incomplete without body alignment, and the router benchmark cannot catch that

The first v2.3.0 description-rewrite pass changed the one-line frontmatter `description` for three skills. The router benchmark validated that change (top-1 rates improved as predicted). The mistake was assuming the work was done.

The frontmatter description is what gets injected into a routing prompt. The SKILL.md body (`When to Activate`, `Practical Guidance`, `Examples`, `Integration`, `Gotchas`) is what gets loaded into the agent's context once the skill activates. These are two different surfaces with two different audiences.
A description can route correctly while the body still steers the agent toward the wrong work.

Concrete instance: after the description rewrite, `context-fundamentals` correctly routed to itself for foundational prompts, but its body still listed "Optimizing context usage to reduce token costs or improve performance" as a `When to Activate` trigger. The moment the skill activated for a real task, the agent saw operational ownership claims that contradicted the description that got it there. The router benchmark (which runs with `settingSources: []` and only sees descriptions) had no way to surface this gap.

Three fixes:

1. Treat "rewrite the description" and "align the body" as two work items, not one. The first is cheap and benchmark-measurable; the second is cheaper still but invisible to the router benchmark and only measurable in Stage 3 effectiveness tests where the body actually loads.
2. The `When to Activate` body section is the highest-leverage body content. It is loaded immediately, parsed by the model, and acts as a second routing surface inside the skill. A `When to Activate` that contradicts the frontmatter `description` is worse than no `When to Activate` at all.
3. The `Integration` section is the cross-reference layer. After every description rewrite, every sibling skill's `Integration` should be checked for stale routing claims. A skill that says "for X, go to old-skill-name" when X has been re-homed introduces real routing damage.

Takeaway: any time a skill's scope is narrowed at the description level, audit the body the same day. A skill is a multi-surface artifact; consistency across surfaces is part of the change.

### 16. Concurrency, resume, and per-run logging are the three non-negotiable runner features

The v1 router runner ran sequentially, had no resume capability, and printed nothing per-run.
The sweep died at 566 of 600 and we did not know:

- That it had died (no per-run output meant the stalled state was invisible).
- How far it had gotten (had to count files in the results directory).
- That we would have to redo everything (no resume meant any restart was a full do-over).

The v2 runner has all three. The v2 sweep ran in 15 minutes with full visibility and complete coverage. Same code path, same SDK, same models. The May 19 corpus-hardening sweep added one more operational lesson: transient empty/format-failed SDK responses should be retried once before being counted. The full 600-record sweep initially had 15 format failures, all of which completed successfully when rerun through resume. The runner now has an automatic format-failure retry.

Takeaway: any runner that calls a paid API in a loop needs: bounded parallelism with a concurrency flag; resume that scans the results directory and skips completed work; per-run progress logging that surfaces stalls in less than the time it takes to finish one call; and one retry for transient empty or unparsable outputs. These are not optimizations. They are baseline competence.

### 17. A skill is not improved until the corpus metadata improves with it

The user was right to call out the shortcut. Rewriting descriptions and aligning three skill bodies was not enough. It improved the router surface, but it left the rest of the corpus with uneven body standards, missing mechanism ownership, weak claim provenance, and untested activation boundaries.

The corpus-wide hardening pass changed the unit of work from "fix the skills the benchmark complained about" to "make every skill pass the same publication standard."
That meant:

- Every skill body got explicit ownership boundaries and adjacent `Do not activate` routes.
- Structurally weak bodies (`bdi-mental-states`, `hosted-agents`) gained practical guidance and examples.
- Every skill was mapped to at least one accepted mechanism in `researcher/mechanisms/registry.jsonl`.
- Volatile numeric/vendor claims were softened or tied to `claim-*` IDs with concrete source paths.
- Activation fixtures grew to cover previously untested skills, not just the benchmark-confused ones.
- `validate_repo.py --strict` was tightened so required body sections and non-activation boundaries are enforced.

The result was measurable without another paid SDK sweep: skill health moved from 0.8111 with two flagged skills to 0.9117 with zero flagged skills; strict validation passed with zero warnings; activation cases passed 19/19; adversarial deterministic benchmarks passed 7/7 scenarios.

Takeaway: auto-research should improve three layers at once: prose, metadata, and gates. If only the prose changes, the corpus becomes prettier but not more self-improving.

## What This Experiment Did Not Solve

Honest list, not aspirations:

- **No LLM judge adapter**: `retrieved -> evaluated` and `proposed -> novelty_checked` still require human or future judge action. The contract is in place; the cost-gated implementation is not.
- **No automated source discovery beyond the manual seed**: feed adapters for Parallel deep research, web search, RSS, and academic preprints are placeholders.
- **No log rotation**: per-script logs grow indefinitely. Acceptable for weeks, not years.
- **No pytest suite**: scripts are smoke-tested via live execution and the deterministic gates.
  Real unit/integration tests would catch refactor regressions earlier.
- **No full-body effectiveness benchmark yet**: deterministic health now measures skill-body structure, but Stage 3 still needs real agent tasks with and without full skills loaded.

## Reusable Patterns For Future Work

Patterns from this experiment that are worth carrying into other agentic systems:

1. **Run state in a JSON file plus an append-only thread log**. State governs automation, thread governs humans.
2. **Queue + inbox + parked + done + quarantine** as the minimum durable surface for any agent that processes work.
3. **Atomic writes (`tempfile` + `os.replace`) plus `fcntl` locks** as the default whenever two processes share a file.
4. **Mechanism registry with gated promotion** as the way to accumulate "what we have learned" without ad hoc edits.
5. **Activation regression fixtures** for any skill-router style component that uses descriptions to route tasks.
6. **Adversarial benchmarks before model judges** for any harness that decides whether to publish or merge.
7. **Closed-state reaper** in every queue.
8. **Split validators by question** ("is the corpus healthy" vs "is this run ready") instead of one validator per artifact.
9. **Run live before declaring done** for any orchestration code.
10. **Treat continuous operation as a separate milestone** from per-task correctness.
11. **The description-benchmark loop** for any system where natural-language descriptions affect routing: baseline benchmark, read the confusion matrix, rewrite based on specific leak directions, re-run with the same seed and fixture, publish the delta.
12. **Resume-by-results-folder-scan + bounded concurrency + per-run progress logging + format-failure retry** as the non-negotiable runner features for any paid-API loop.
13. **Audit the body the same day the description changes.** A skill is a multi-surface artifact; the `When to Activate` body section is a second routing surface and must not contradict the frontmatter description. The router benchmark cannot catch body-vs-description inconsistencies because router prompts only contain descriptions.
14. **Update prose, metadata, and gates together.** A skill improvement is incomplete until the body text, mechanism registry, claim provenance, corpus index, activation fixtures, and validators all agree.

## How To Use This Document

When extending the researcher OS or designing similar systems, read this before writing new infrastructure. Most of the friction in this experiment came from making the same shape of mistake (silent failure modes, missing reapers, prose-only state, deterministic checks deferred) in different concrete forms. Knowing the shape lets you skip the mistake.

If a future change here invalidates one of these findings, update this file in the same PR.
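
For readers who want the shape of the finding 6 fix (atomic writes plus `fcntl` locks) in code, here is a minimal sketch. It assumes a JSON Lines queue file guarded by a sibling `.lock` file. The helper names (`queue_lock`, `read_items`, `atomic_write_items`, `append_item`, `pop_item`) and the `on_pop` callback are hypothetical illustrations, not the repository's actual functions.

```python
import contextlib
import fcntl
import json
import os
import tempfile


@contextlib.contextmanager
def queue_lock(lock_path):
    """Hold an exclusive advisory lock for the duration of the block (Unix only)."""
    with open(lock_path, "a") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def read_items(path):
    """Return every JSON object in the queue file, or [] if it does not exist."""
    if not os.path.exists(path):
        return []
    with open(path) as handle:
        return [json.loads(line) for line in handle if line.strip()]


def atomic_write_items(path, items):
    """Write to a temp file in the same directory, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            for item in items:
                handle.write(json.dumps(item) + "\n")
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # readers see the old file or the new one
    except BaseException:
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_path)
        raise


def append_item(path, item):
    """Producer side: read-modify-write under the queue lock."""
    with queue_lock(path + ".lock"):
        atomic_write_items(path, read_items(path) + [item])


def pop_item(path, on_pop=None):
    """Consumer side: pop the head and run on_pop before releasing the lock."""
    with queue_lock(path + ".lock"):
        items = read_items(path)
        if not items:
            return None
        head, rest = items[0], items[1:]
        atomic_write_items(path, rest)
        if on_pop is not None:
            on_pop(head)  # e.g. budget check plus run initialization
        return head
```

Running `on_pop` inside the lock mirrors the point in finding 6 that the lock must cover run initialization, so two concurrent invocations cannot both pass a budget check. In this sketch an exception in `on_pop` leaves the item already removed from the queue; a real loop would need to decide where such an item goes.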
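
Two of the structural failures listed in finding 3, duplicate JSON keys and rubric math that does not add up, can be checked without any model call. The sketch below is an assumption-heavy illustration: only `weighted_total` is named in this document, so the `dimensions`, `score`, and `weight` fields, the weighted-sum formula, and the function names are hypothetical and may not match what `validate_repo.py` actually does.

```python
import json
import math


class DuplicateKeyError(ValueError):
    """Raised when a JSON object repeats a key."""


def reject_duplicate_keys(pairs):
    """object_pairs_hook that fails loudly instead of keeping the last value."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise DuplicateKeyError(f"duplicate JSON key: {key!r}")
        seen[key] = value
    return seen


def load_strict(text):
    """Parse JSON, rejecting objects that contain duplicate keys."""
    return json.loads(text, object_pairs_hook=reject_duplicate_keys)


def check_weighted_total(evaluation, tolerance=1e-6):
    """Return a list of error strings; empty means the rubric math agrees."""
    # Hypothetical schema: each dimension carries a score and a weight.
    expected = sum(d["score"] * d["weight"] for d in evaluation["dimensions"])
    recorded = evaluation["weighted_total"]
    if not math.isclose(expected, recorded, abs_tol=tolerance):
        return [f"weighted_total is {recorded}, dimension scores give {expected:.4f}"]
    return []
```

The standard `json.loads` silently keeps the last value for a repeated key, which is why the duplicate check has to happen in the `object_pairs_hook` rather than after parsing.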
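
The runner requirements from finding 16 (bounded concurrency, resume by scanning the results folder, per-run progress logging, one retry for empty output) fit in a short function. This is a sketch under stated assumptions, not the v2 runner itself: `run_sweep`, its parameters, the one-JSON-file-per-run layout, and the `call` callable standing in for the paid API are all hypothetical.

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def run_sweep(run_ids, call, results_dir, concurrency=4, log=print):
    """Run call(run_id) for every id without a result file; return failed ids."""
    results_dir = Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)

    # Resume: any id with a result file on disk counts as completed work.
    done = {path.stem for path in results_dir.glob("*.json")}
    pending = [run_id for run_id in run_ids if run_id not in done]
    log(f"resume: {len(done)} already on disk, {len(pending)} pending")

    def one(run_id):
        output = call(run_id)
        if not output:  # transient empty output: retry once before counting it
            output = call(run_id)
        if not output:
            return run_id, False  # no file written, so a later resume retries it
        (results_dir / f"{run_id}.json").write_text(json.dumps(output))
        return run_id, True

    failed = []
    # Bounded parallelism: at most `concurrency` calls in flight.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(one, run_id) for run_id in pending]
        for count, future in enumerate(as_completed(futures), start=1):
            run_id, ok = future.result()
            # Per-run progress line, so a stall is visible within one call.
            log(f"[{count}/{len(pending)}] {run_id} {'ok' if ok else 'FAILED'}")
            if not ok:
                failed.append(run_id)
    return failed
```

Because failed runs leave no file behind, calling `run_sweep` again with the same `run_ids` retries only the missing work. The result write here is not atomic, so a crash mid-write could leave a truncated file that the scan treats as complete; a production runner would want to close that gap.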