researcher/insights/how-we-built-this.md
# How We Built v2.3.0

A narrative companion to `auto-research-experiment.md`. That document captures the technical findings. This one captures the process: what we read, how we worked, what each session produced, what shipped, and how to talk about it without overselling.

If you want the engineering rationale, read `auto-research-experiment.md`. If you want the project story or templates for sharing it, read this.

## Origins

The starting question was small and concrete: how do we keep this skill repository alive without it becoming a one-person editorial backlog. The 10k stars made the answer matter; the absence of a curation pipeline made it urgent.

Three pieces of prior work shaped the design.

### Karpathy's autoresearch

[karpathy/autoresearch](https://github.com/karpathy/autoresearch) is the smallest interesting autonomous experimentation harness. It has one editable file (`train.py`), one locked evaluator (`prepare.py`), one scalar metric, a `results.tsv` results log, fixed wall-clock budgets, and `git` rollback on non-improving attempts. That is the entire harness.

The lesson it teaches is not that every loop needs one scalar metric. The lesson is that ambiguous feedback produces ambiguous autonomy. If the agent can change the evaluator, it will eventually optimize the benchmark instead of the task.

This became the architectural seed for our `locked-editable-surfaces` mechanism: classify every file as locked, editable, append-only, or human-controlled before the loop starts.

### Prime Intellect's auto-NanoGPT

[primeintellect.ai/auto-nanogpt](https://www.primeintellect.ai/auto-nanogpt) showed what durable agent state looks like in practice. Their `THREAD.md` pattern, the source queue files, the rejected-attempt log, the explicit handover summary at the bottom of each thread: these are not nice-to-haves. They are the difference between a loop that survives context compaction and one that does not.

This became our `durable-research-thread` mechanism and the run-directory layout we ship today: every run has `THREAD.md`, `sources/queue.jsonl`, `proposals/`, `reports/`, and `logs/`.

### Parallel deep research on autonomous research frameworks

We then ran a targeted deep-research query through Parallel (interaction `trun_64f5be03055a4b52adf17481e4b865bc`) asking specifically how to build an autonomous research-to-skill system inside a file-based repository. The raw output is preserved under the seed run's `sources/evidence/raw/` directory. The findings that survived rubric scoring became the v2.2.0 architecture:

- Deterministic validation harnesses as the locked evaluator.
- Durable scratchpads and THREAD-style logs for auditability.
- Novelty gates to prevent redundant skill revisions.
- Pairwise skill revision evaluation for competing drafts.
- Auto-PR workflows that prepare changes but never auto-merge.

All five made it into the published harness. The deep-research summary at `researcher/runs/20260515-035228-executable-autonomous-research-frameworks/sources/evidence/deep-research-summary.md` is the worked-example artifact.

## How We Worked

The collaboration ran across multiple sessions. The pattern that worked was always the same:

```
research -> plan -> implement -> review -> harden -> verify
```

### Mode switching

When the next step was unclear or had real architectural trade-offs, the conversation went into plan mode. The plan was written, the user approved it, then the conversation switched back to agent mode for execution. This matters because the plan became a contract that future steps could be checked against, and it stopped premature implementation on ambiguous goals.

### Subagent reviews

After every meaningful batch of implementation, a `code-reviewer` subagent was invoked with a tightly scoped prompt: "find only HIGH-CONFIDENCE issues that would cause incorrect behavior, data loss, runaway resource use, or security risk. Skip stylistic comments."

Those reviews found bugs the implementation pass missed:

- Closed runs would never be reaped (the loop would silently fill `parked.jsonl` until it halted).
- `write_jsonl` was non-atomic and could corrupt queue files on kill mid-write.
- `pop_inbox_item` released the lock before `init_run`, so two concurrent loop_step invocations could exceed `max_active_runs`.
- The HTTP fetcher accepted `file://` and `data:` schemes (SSRF/LFI risk).
- The launchd wrapper short-circuited the status refresh when `loop_step` exited 78 (no work).
- URL deduplication used the raw URL while `source_id` hashed the normalized form.

Each of these had a one-line root cause and a clean fix once surfaced. None would have shown up in unit tests of individual functions because they were interactions between components.

### Live execution as a deterministic gate

The continuous loop was built and then actually run for thirteen iterations during the session: discovered eight sources from a manual seed, initialized six runs (capped by the daily budget), fetched four sources via stdlib HTTP totalling about 1.7 MB, parked five runs at the correct gates, closed one run and watched the reaper move it to `done.jsonl`.

This is mundane in description but high-signal in practice. Half the bugs above were only obvious because the live execution exposed them. Paper review of orchestration code consistently misses interaction bugs.

### Tools that mattered

- `code-reviewer` subagent: caught the compound failures listed above.
- `code-explorer` subagent: mapped the existing skill corpus before designing the breakthrough roadmap.
- `code-architect` subagent: produced the first draft of the evaluation strategy that became the benchmark harness.
- Parallel deep research: provided the architectural recommendations that the rubrics later turned into accepted mechanisms.
- DeepWiki: already linked from the README; useful for reviewers who want a higher-level overview without reading every skill.

## Stage 2: The Description-Benchmark Loop Closes

The narrative above ends with v2.2.0 shipping the scaffolding. v2.3.0 is the chapter where the scaffolding produces measurements that change what gets written.

### The first run was honest about its own failures

The v2.2.0 router benchmark sweep ran 566 of 600 calls before the sequential runner stalled silently. The aggregate numbers across four models (composer-2, claude-opus-4-7, gpt-5.5, gemini-3.1-pro) clustered at ~88.6% top-1, which sounded fine until we looked at the per-skill confusion matrix.

One skill carried almost all of the routing failure: `context-fundamentals` was predicted correctly only 12 of 47 times. The rest split across `context-degradation` (12), `project-development` (12), `context-optimization` (8). A second pair, `tool-design` vs `project-development`, leaked symmetrically at ~25%. The other 11 skills were near-perfect.

### What we changed

Two interventions, both small:

- Rewrote `context-fundamentals` to be unambiguously about conceptual foundations. The old description claimed ownership of "designing agent architectures," "debugging context quality issues," and "setting context budgets" - all of which belong to other skills. The new description explicitly routes operational work to the specialized skills.
- Rewrote `tool-design` and `project-development` with explicit cross-references to each other. The old descriptions both mentioned "architectural" or "structuring" framings that overlapped. The new descriptions define the unit of work: tool-design owns single-tool decisions; project-development owns project-shape decisions; each routes the other.

We also fixed the runner so the next sweep would not silently die: bounded parallelism (concurrency=4), resume capability (scans results folder on startup), and per-run progress logging.

### What we measured

The second sweep ran 600/600 in 15 minutes (4x speedup from concurrency). Per-skill effect sizes for the three targeted descriptions:

- `context-fundamentals`: 0.255 -> 0.489 (**+23.4pp**)
- `project-development`: 0.750 -> 1.000 (**+25pp**, now perfect routing)
- `tool-design`: 0.729 -> 0.807 (**+7.8pp**)

Aggregate top-1: 0.888 -> 0.900. Three of four models gained on top-1; all four gained on top-3. Format compliance: 99.5% across 600 calls. Total Cursor SDK cost: ~$7.20.

### Why this matters more than the absolute numbers

The aggregate moved by ~1pp; individual skills moved by 25pp. **The aggregate is the wrong unit**. The confusion matrix tells you which descriptions need work and which directions the leakage goes. The delta-vs-baseline comparison tells you whether the fix worked. Without per-skill effect sizes, this entire feedback loop is invisible.

End to end from "the v2.2.0 benchmark finished" to "v2.3.0 measured the rewrites" was under two hours of focused work. This is the cheapest, highest-leverage feedback loop in the system. Future contributions should run it whenever a skill description changes.

### The shortcut we corrected

The first post-benchmark fix was too narrow: rewrite three descriptions, align those three bodies, and call the release ready. That improved the measured router surface, but it did not make the whole repo better through auto-research.

The correction was a corpus-wide hardening pass. Three read-only audits looked at activation boundaries, mechanism and claim coverage, and the skill-quality standard. Then every skill was brought under the same rule: explicit owned scope, explicit adjacent routes, practical guidance an agent can execute, examples, gotchas, integration as a boundary map, and machine-readable mechanisms/claims where the prose makes reusable or volatile claims.

Concrete result:

- `bdi-mental-states` and `hosted-agents` were no longer structural outliers; both gained missing practical guidance/examples and passed strict health.
- Mechanisms increased from 5 to 16, so every skill now owns at least one accepted behavior pattern.
- Claim provenance increased from 6 to 12, with concrete repo source paths replacing vague research-run summaries.
- Activation fixtures increased from 14 to 19, covering previously untested skills.
- `validate_repo.py --strict` now enforces full body sections and explicit non-activation boundaries.
- Skill health moved from 0.8111 with 2 flagged skills to 0.9117 with 0 flagged skills.
- A fresh 600-record router sweep on May 19 verified the corpus-wide pass did not introduce broad routing collapse: 0 format failures after retrying transient blanks, Gemini 0.920 top-1, Composer 0.913, GPT-5.5 0.913, Claude Opus 4.7 0.840. The remaining failures are still concentrated in ambiguous/negative-control prompts and the `context-fundamentals` catch-all boundary.

This is the release's real lesson: improving a skill corpus means changing prose, metadata, and gates together. A prettier SKILL.md without updated mechanisms, claims, corpus index, fixtures, and validators is not self-improving.

### What v2.3.0 inherits from v2.2.0

Everything that shipped in v2.2.0 is still here: the file-based researcher operating system, the run state machine, the continuous loop, the launchd service definitions, the deterministic gates, the adversarial benchmark scenarios. v2.3.0 adds the measured results and the description fixes that those results justified.

## What v2.3.0 Actually Ships

Five concrete pieces of infrastructure, plus a corpus-wide hardening pass across all 15 skills:

1. **Researcher operating system** under `researcher/`: source registry, content/skill/harness rubrics, pairwise revision rubric, mechanism registry, mechanism ledgers, claim provenance index, corpus index, activation regression fixtures, adversarial benchmark scenarios with goldens, append-only benchmark history.

2. **Run state machine**: every run has `run-state.json` with explicit transitions (`initialized -> retrieved -> evaluated -> proposed -> novelty_checked -> validated -> pr_ready -> closed`). `research_loop.py` advances state via subcommands; `validate_run.py` enforces publish readiness without false positives on intentionally incomplete runs.

3. **Continuous loop**: `loop_discover.py`, `loop_step.py`, `loop_daily.py`, `loop_status.py`. Backed by `researcher/queue/` (inbox, parked, done, quarantine) with atomic JSONL writes and `fcntl` locks. Runs unattended from launchd; never invokes paid LLMs; HTTP retrieval is stdlib-only with a 1.5 MB cap and a 30-second timeout.

4. **CI**: `.github/workflows/validate.yml` runs Python compile, `validate_repo.py --strict`, `check_activation_cases.py`, and `run_benchmarks.py` on every push and PR.

5. **Documentation**: `CHANGELOG.md`, refreshed `README.md` / `CLAUDE.md` / `CONTRIBUTING.md`, `AGENTS.md` for workspace memory, `researcher/runs/README.md` for operator orientation, and the two insight documents (this one and `auto-research-experiment.md`).

By the numbers: 15 published skills, **16 accepted mechanisms**, **12 provenance-tracked claims**, **19 activation cases** (up from 8), 7 adversarial benchmark scenarios, 1 worked-example seed run, 50 router benchmark prompts, **1,800 completed router benchmark records** across three sweeps (baseline, post-description rewrite, post-corpus hardening), 4 frontier models evaluated, 1 worked-example effectiveness task scaffolded, and a strict skill-health score of **0.9117 with 0 flagged skills**.

## What You Should Tell People

### What was built

A skill collection that can keep itself alive. Not a one-off skill drop. A file-based research-to-skill operating system that:

- Discovers candidate sources from a curated registry.
- Scores them against locked rubrics that the agent cannot relax.
- Extracts implementable mechanisms instead of generic takeaways.
- Promotes only behaviors that survive both deterministic checks and human review.
- Runs the whole thing in a continuous loop without spending a single LLM token on retrieval or judgment.
- Falls back to a human review queue for anything that needs judgment.

### Why it matters

Most public skill collections are anthologies. They accumulate good content and then drift, because there is no process for keeping them current and no machine-readable layer for compounding what they know. v2.2.0 is the bet that an open repository can be more than that: a source of truth that improves with use.

### Concrete artifacts you can point to

- The mechanism registry: `researcher/mechanisms/registry.jsonl`
- The claim provenance index: `researcher/claims/index.jsonl`
- The corpus index: `researcher/corpus/index.json`
- The run state machine in action: `researcher/runs/20260515-035228-executable-autonomous-research-frameworks/`
- The adversarial benchmark harness: `researcher/benchmarks/`
- The continuous loop scripts: `researcher/scripts/loop_*.py`
- The launchd daemon installer: `researcher/orchestration/launchd/install.sh`
- The runbooks: `researcher/runbooks/`
- The technical findings: `researcher/insights/auto-research-experiment.md`

### Sound bites that survive copy-paste

- "Activation scenarios beat keyword triggers." (Documented as finding #1.)
- "Deterministic gates before model judges." (The validator costs zero tokens and catches categorical failures the judge would miss.)
- "Mechanism registry as encyclopedia backbone." (Making behavior changes first-class data is the unlock.)
- "Closed-run reaping is non-negotiable." (Every queue needs an explicit exit path.)
- "Adversarial benchmarks before declaring the harness safe." (Write the attack before claiming the defense.)
- "Continuous operation is a separate milestone from per-task correctness."
- "Live execution is the highest-signal validation for orchestration code."

## Templates For Sharing

Use these as starting points, not as final copy. Strip what is not true for your audience.

### X / Twitter thread (8 tweets)

```
1/ Shipped v2.2.0 of the Agent Skills for Context Engineering repo.

It is no longer just a skill collection. It is a file-based research-to-skill operating system that keeps the corpus alive without becoming editorial debt.

github.com/muratcankoylan/Agent-Skills-for-Context-Engineering

2/ Origins: Karpathy's autoresearch showed that one editable file + one locked evaluator + a results log + git rollback is enough for autonomy. Prime Intellect's auto-nanoGPT showed why durable scratchpads matter. Parallel deep research filled in the rest.

3/ What it does: discovers sources from a curated registry, scores them against locked rubrics, extracts implementable mechanisms (not generic takeaways), promotes only behaviors that pass deterministic checks plus human review.

4/ Mechanism registry as encyclopedia backbone. Every accepted behavior change is a JSONL row with owning_skill, activation_scenario, behavior_change, evidence, and failure_modes. Novelty checks compare against the registry first, then the corpus.

5/ Claim provenance for every volatile number. BrowseComp variance, LoCoMo accuracy, multi-agent token multipliers: each has a claim_id, source_url, evidence_strength, volatility, and last_reviewed. Claims that rot will not silently propagate.

6/ Continuous loop runs from launchd on macOS. No paid LLMs. HTTP retrieval is stdlib-only. Atomic JSONL writes, fcntl locks, closed-run reaping, parked-review queue, daily ops with adversarial benchmarks against the harness itself.

7/ CI is the floor. validate_repo + run_benchmarks + check_activation_cases on every PR. The validator catches duplicate JSON keys, wrong rubric math, APPROVE on partial retrievals, manifest drift, and missing run artifacts before any model judge runs.

8/ Lessons in researcher/insights/. Headline findings: activation scenarios beat keyword triggers, deterministic gates before model judges, closed-run reapers are non-negotiable, adversarial benchmarks find what unit tests miss, live execution > paper review for orchestration.
```

### LinkedIn post

```
Released v2.2.0 of Agent Skills for Context Engineering: an open collection that now ships a file-based research-to-skill operating system with deterministic gates and a continuous loop.

The starting question was small: how do you keep a 10k-star skill repository alive without it becoming a one-person editorial backlog. The answer turned into infrastructure.

Three pieces of prior work shaped the design.

Andrej Karpathy's autoresearch repo demonstrated the minimal autonomy harness: one editable file, one locked evaluator, a results log, git rollback. The lesson is not that every loop needs a single scalar metric. It is that ambiguous feedback produces ambiguous autonomy.

Prime Intellect's auto-nanoGPT work showed why durable agent state matters in practice. Their THREAD.md pattern, source queues, and explicit handover summaries are the difference between a loop that survives context compaction and one that does not.

Targeted deep research filled in the missing pieces: deterministic validation as the locked evaluator, durable run directories, novelty gates, pairwise revision, human-controlled merge.

What shipped:
- Mechanism registry with gated promotion and append-only ledgers
- Claim provenance for volatile benchmark numbers
- Run state machine with enforceable transitions
- Activation regression tests for skill-boundary confusion
- Adversarial benchmark harness with append-only history
- Continuous loop scripts and launchd service definitions
- CI workflow running deterministic gates on every PR
- 15 skills, all migrated from keyword triggers to task-boundary scenarios

The technical findings and the full process narrative live in researcher/insights/. The repo: github.com/muratcankoylan/Agent-Skills-for-Context-Engineering

Honest scope: no LLM-judge adapter yet, no automated source discovery beyond the manual seed, no log rotation. Tracked at the end of CHANGELOG.md.
```

### Blog post outline

```
# Title: From Karpathy's autoresearch to a self-compounding skill encyclopedia

## 1. Why this exists
The 10k-star skill repository was at a fork: stay an anthology, or become infrastructure.

## 2. The three inputs
- Karpathy autoresearch: minimal autonomy harness, locked evaluator, results log
- Prime Intellect auto-nanoGPT: durable scratchpads, THREAD.md, handover discipline
- Parallel deep research on autonomous research frameworks: deterministic gates, novelty, pairwise, human-controlled merge

## 3. The breakthrough roadmap
- Split repo health from run readiness
- Mechanism registry as the unit of accumulated learning
- Claim provenance to prevent claim rot
- Activation regression for skill boundaries
- Adversarial benchmarks instead of self-congratulatory ones
- Corpus index as the machine-readable map

## 4. The continuous loop
- Queue + state machine + reaper + parked review
- launchd daemon, no paid LLMs, atomic writes, fcntl locks
- Live execution as the validation gate

## 5. The lessons
(17 findings from researcher/insights/auto-research-experiment.md)

## 6. What it does not solve yet
Honest list of out-of-scope items

## 7. How to use it
- Install the plugin (Claude Code, Cursor, Open Plugins)
- Install the daemon (macOS launchd)
- Read the runbook before extending
```

## How To Improve The Repo From Here

The 2.2.0 release deliberately left several things off. These are the obvious next steps in roughly the order they pay off:

1. **LLM-judge adapter** for the `retrieved -> evaluated` transition, gated by a per-day budget config. This is the single change that unlocks "the loop actually decides what to publish" instead of "the loop stages everything for human review."

2. **Automated source discovery adapters**. Parallel deep research, a Cohere/OpenAI/Anthropic engineering blog RSS, an arXiv author-watch list. Each is a self-contained file under `researcher/scripts/feeds/`.

3. **Log rotation and benchmark history pruning**. The current implementation appends forever. Acceptable for weeks, painful at one year.

4. **pytest suite** for the loop scripts. Smoke-tested via live execution is fine for v2.2.0; refactor safety needs unit and integration tests.

5. **More activation cases** as skill boundaries get challenged in the wild. Each confusion that shows up in user activations should become a fixture.

6. **More claim provenance**. The current 12 claims cover the highest-volatility ones; the corpus has additional softer claims that would benefit from explicit provenance.

7. **Self-spawning agents for parallel runs**. The loop currently processes one run per step. With proper locking and per-run budget tracking, parallel runs are safe and would amortize cost.

8. **Cross-skill integration tests**. The corpus index already maps skills to mechanisms; the next step is asserting that integration sections in each skill reference real skills and stay accurate as the corpus evolves.

If you want to contribute, look at the open items above. Each is scoped to fit in a single PR.

## Your Learnings (Best Inference From The Process)

This section is necessarily an inference. Adjust before sharing.

- A repository with this many stars is a public surface, and that surface benefits more from infrastructure than from more content. The instinct to add another skill is almost always less valuable than the instinct to add another gate.
- The cheapest way to compound a knowledge base is to make its units machine-readable. Mechanisms, claims, activation scenarios, run states, benchmark scenarios. Prose is publication, not memory.
- "Improve all skills" means improving the corpus substrate, not just editing Markdown. The mechanism registry, claim index, corpus index, activation cases, validators, and template have to move with the skill bodies.
- "Run it for days" is a real architectural milestone, not a feature flag. The system has to be designed for unattended operation from the beginning.
- Subagent code review is the cheapest way to find category-of-error bugs that you would otherwise hit in production.
- Plan mode for irreversible decisions, agent mode for everything else. Mixing them silently is how scope creep happens.
- An AI collaborator works best as a load-bearing technical lead, not as a writing assistant. Direct instructions, named subagents, explicit budgets, deterministic gates: this is how the work compounds rather than dissipates.

## Joint Findings From The Experiment

These are documented in detail in `auto-research-experiment.md`. The five most reusable across other projects:

1. Atomic file writes + fcntl locks are the default for any shared file the loop touches.
2. Closed-state reapers in every queue.
3. Split validators by question, not by artifact.
4. Adversarial benchmarks before model judges.
5. Live execution as the highest-signal validation for orchestration code.

## What This Document Is For

This file is the narrative version of the release. Use it when explaining what was built and why. Use `auto-research-experiment.md` when explaining the engineering decisions. Use `CHANGELOG.md` when you need the precise inventory.

Keep all three accurate as the next versions ship.
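
For readers who want to see the first item under "Joint Findings From The Experiment" in concrete form, here is a minimal sketch of an atomic JSONL write combined with an `fcntl` lock. It is not the repository's implementation: `queue_lock`, `write_jsonl_atomic`, `read_jsonl`, and `pop_first` are hypothetical names chosen for illustration, and the sketch assumes a POSIX system where `fcntl.flock` is available and the temporary file sits on the same filesystem as the queue file.

```python
import fcntl
import json
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def queue_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path for the duration of the block."""
    with open(lock_path, "a") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def write_jsonl_atomic(path, rows):
    """Write rows to a temp file in the target directory, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            for row in rows:
                tmp.write(json.dumps(row) + "\n")
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace is atomic on POSIX when both paths share a filesystem,
        # so a reader sees either the old file or the new one, never a partial write.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def read_jsonl(path):
    """Return all rows from a JSONL file, or an empty list if it does not exist."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def pop_first(queue_path, lock_path):
    """Remove and return the first row while holding the lock; None if the queue is empty."""
    with queue_lock(lock_path):
        rows = read_jsonl(queue_path)
        if not rows:
            return None
        head, rest = rows[0], rows[1:]
        write_jsonl_atomic(queue_path, rest)
        return head
```

The design point is that the read-modify-write in `pop_first` happens entirely inside the lock, and the write itself goes through a rename. A caller that needs to do more work before another process may pop (for example, initializing a run) would keep that work inside the same `with queue_lock(...)` block.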