Agent Skills for Context Engineering

A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems.
muratcankoylan (GitHub)
researcher/insights/how-we-built-this.md

# How We Built v2.3.0

A narrative companion to `auto-research-experiment.md`. That document captures the technical findings. This one captures the process: what we read, how we worked, what each session produced, what shipped, and how to talk about it without overselling.

If you want the engineering rationale, read `auto-research-experiment.md`. If you want the project story or templates for sharing it, read this.

## Origins

The starting question was small and concrete: how do we keep this skill repository alive without it becoming a one-person editorial backlog? The 10k stars made the answer matter; the absence of a curation pipeline made it urgent.

Three pieces of prior work shaped the design.

### Karpathy's autoresearch

[karpathy/autoresearch](https://github.com/karpathy/autoresearch) is the smallest interesting autonomous experimentation harness. It has one editable file (`train.py`), one locked evaluator (`prepare.py`), one scalar metric, a `results.tsv` log, fixed wall-clock budgets, and `git` rollback on non-improving attempts. That is the entire harness.

The lesson it teaches is not that every loop needs one scalar metric. The lesson is that ambiguous feedback produces ambiguous autonomy. If the agent can change the evaluator, it will eventually optimize the benchmark instead of the task.

This became the architectural seed for our `locked-editable-surfaces` mechanism: classify every file as locked, editable, append-only, or human-controlled before the loop starts.
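
The classification step can be sketched as a small rule table checked before any write. This is a minimal illustration, not the repository's implementation: the glob patterns, the `classify` and `may_write` helpers, and the choice to treat unmatched files as human-controlled are all assumptions made for the example.

```python
from fnmatch import fnmatch

# Hypothetical surface rules; the first matching pattern wins.
# The paths are illustrative, not the repository's real configuration.
SURFACE_RULES = [
    ("researcher/rubrics/*", "locked"),
    ("researcher/benchmarks/history.jsonl", "append-only"),
    ("skills/*/SKILL.md", "editable"),
]


def classify(path: str) -> str:
    """Return the surface class for a repo-relative path."""
    for pattern, surface in SURFACE_RULES:
        if fnmatch(path, pattern):
            return surface
    # Assumption: anything unclassified defaults to the most restrictive class.
    return "human-controlled"


def may_write(path: str, appending: bool = False) -> bool:
    """Decide whether the loop may touch a file at all."""
    surface = classify(path)
    if surface == "editable":
        return True
    if surface == "append-only":
        return appending
    return False  # locked and human-controlled files are never written by the loop
```

Defaulting unknown paths to the strictest class means a new file has to be classified explicitly before the loop can edit it, which matches the "before the loop starts" ordering described above.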

### Prime Intellect's auto-NanoGPT

[primeintellect.ai/auto-nanogpt](https://www.primeintellect.ai/auto-nanogpt) showed what durable agent state looks like in practice. Their `THREAD.md` pattern, the source queue files, the rejected-attempt log, the explicit handover summary at the bottom of each thread: these are not nice-to-haves. They are the difference between a loop that survives context compaction and one that does not.

This became our `durable-research-thread` mechanism and the run-directory layout we ship today: every run has `THREAD.md`, `sources/queue.jsonl`, `proposals/`, `reports/`, and `logs/`.
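
As a rough sketch, that layout can be created idempotently with `pathlib`. The `init_run_dir` helper and the placeholder `THREAD.md` contents are hypothetical; only the file and directory names come from the layout described above.

```python
import tempfile
from pathlib import Path


def init_run_dir(root: Path, run_id: str) -> Path:
    """Create the run-directory skeleton without clobbering existing state."""
    run = root / run_id
    for sub in ("sources", "proposals", "reports", "logs"):
        (run / sub).mkdir(parents=True, exist_ok=True)
    thread = run / "THREAD.md"
    if not thread.exists():
        # Placeholder body; the real thread format is not specified here.
        thread.write_text(
            f"# Thread: {run_id}\n\n## Handover summary\n\n(none yet)\n",
            encoding="utf-8",
        )
    (run / "sources" / "queue.jsonl").touch(exist_ok=True)
    return run


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    run_dir = init_run_dir(Path(tmp_dir), "demo-run")
    created = sorted(p.relative_to(run_dir).as_posix() for p in run_dir.rglob("*"))
```

Calling the helper twice leaves an existing `THREAD.md` untouched, which is the property a resumed run needs.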

### Parallel deep research on autonomous research frameworks

We then ran a targeted deep-research query through Parallel (interaction `trun_64f5be03055a4b52adf17481e4b865bc`) asking specifically how to build an autonomous research-to-skill system inside a file-based repository. The raw output is preserved under the seed run's `sources/evidence/raw/` directory. The findings that survived rubric scoring became the v2.2.0 architecture:

- Deterministic validation harnesses as the locked evaluator.
- Durable scratchpads and THREAD-style logs for auditability.
- Novelty gates to prevent redundant skill revisions.
- Pairwise skill revision evaluation for competing drafts.
- Auto-PR workflows that prepare changes but never auto-merge.

All five made it into the published harness. The deep-research summary at `researcher/runs/20260515-035228-executable-autonomous-research-frameworks/sources/evidence/deep-research-summary.md` is the worked-example artifact.

## How We Worked

The collaboration ran across multiple sessions. The pattern that worked was always the same:

```
research -> plan -> implement -> review -> harden -> verify
```

### Mode switching

When the next step was unclear or had real architectural trade-offs, the conversation went into plan mode. The plan was written, the user approved it, then the conversation switched back to agent mode for execution. This matters because the plan became a contract that future steps could be checked against, and it stopped premature implementation on ambiguous goals.

### Subagent reviews

After every meaningful batch of implementation, a `code-reviewer` subagent was invoked with a tightly scoped prompt: "find only HIGH-CONFIDENCE issues that would cause incorrect behavior, data loss, runaway resource use, or security risk. Skip stylistic comments."

Those reviews found bugs the implementation pass missed:

- Closed runs would never be reaped (the loop would silently fill `parked.jsonl` until it halted).
- `write_jsonl` was non-atomic and could corrupt queue files on kill mid-write.
- `pop_inbox_item` released the lock before `init_run`, so two concurrent `loop_step` invocations could exceed `max_active_runs`.
- The HTTP fetcher accepted `file://` and `data:` schemes (SSRF/LFI risk).
- The launchd wrapper short-circuited the status refresh when `loop_step` exited 78 (no work).
- URL deduplication used the raw URL while `source_id` hashed the normalized form.

Each of these had a one-line root cause and a clean fix once surfaced. None would have shown up in unit tests of individual functions because they were interactions between components.
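
Two of the fixes above, the scheme allowlist and deduplicating on the same normalized form that the ID hashes, fit in one short sketch. The normalization rules, the 16-character hash prefix, and the helper names `normalize_url` and `dedupe` are assumptions for illustration; the repository's actual `source_id` logic may differ.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

ALLOWED_SCHEMES = {"http", "https"}


def normalize_url(url: str) -> str:
    """Reject non-HTTP schemes, then reduce the URL to a canonical form."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    if scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"unsupported URL scheme: {scheme or '(none)'}")
    host = (parts.hostname or "").lower()
    if not host:
        raise ValueError("URL has no host")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    path = parts.path.rstrip("/") or "/"
    # Assumed policy: drop fragments and userinfo, keep the query string.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def source_id(url: str) -> str:
    """Hash the normalized form so IDs and deduplication agree."""
    return hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()[:16]


def dedupe(urls):
    """Keep the first URL seen for each source ID."""
    seen, unique = set(), []
    for url in urls:
        sid = source_id(url)
        if sid not in seen:
            seen.add(sid)
            unique.append(url)
    return unique
```

A scheme allowlist closes the `file://` and `data:` paths only. It does not by itself stop requests to internal hosts, so treat it as one layer rather than a complete SSRF defense.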

### Live execution as a deterministic gate

The continuous loop was built and then actually run for thirteen iterations during the session: discovered eight sources from a manual seed, initialized six runs (capped by the daily budget), fetched four sources via stdlib HTTP totalling about 1.7 MB, parked five runs at the correct gates, closed one run and watched the reaper move it to `done.jsonl`.

This is mundane in description but high-signal in practice. Half the bugs above were only obvious because the live execution exposed them. Paper review of orchestration code consistently misses interaction bugs.

### Tools that mattered

- `code-reviewer` subagent: caught the compound failures listed above.
- `code-explorer` subagent: mapped the existing skill corpus before designing the breakthrough roadmap.
- `code-architect` subagent: produced the first draft of the evaluation strategy that became the benchmark harness.
- Parallel deep research: provided the architectural recommendations that the rubrics later turned into accepted mechanisms.
- DeepWiki: already linked from the README; useful for reviewers who want a higher-level overview without reading every skill.

## Stage 2: The Description-Benchmark Loop Closes

The narrative above ends with v2.2.0 shipping the scaffolding. v2.3.0 is the chapter where the scaffolding produces measurements that change what gets written.

### The first run was honest about its own failures

The v2.2.0 router benchmark sweep ran 566 of 600 calls before the sequential runner stalled silently. The aggregate numbers across four models (composer-2, claude-opus-4-7, gpt-5.5, gemini-3.1-pro) clustered at ~88.6% top-1, which sounded fine until we looked at the per-skill confusion matrix.

One skill carried almost all of the routing failure: `context-fundamentals` was predicted correctly only 12 of 47 times. Most of the remaining 35 predictions split across `context-degradation` (12), `project-development` (12), and `context-optimization` (8). A second pair, `tool-design` vs `project-development`, leaked symmetrically at ~25%. The other 11 skills were near-perfect.

### What we changed

Two interventions, both small:

- Rewrote `context-fundamentals` to be unambiguously about conceptual foundations. The old description claimed ownership of "designing agent architectures," "debugging context quality issues," and "setting context budgets," all of which belong to other skills. The new description explicitly routes operational work to the specialized skills.
- Rewrote `tool-design` and `project-development` with explicit cross-references to each other. The old descriptions both mentioned "architectural" or "structuring" framings that overlapped. The new descriptions define the unit of work: tool-design owns single-tool decisions; project-development owns project-shape decisions; each routes to the other.

We also fixed the runner so the next sweep would not silently die: bounded parallelism (concurrency=4), resume capability (scans results folder on startup), and per-run progress logging.
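
Those three runner fixes can be sketched with the standard library alone. Everything here is illustrative: `run_sweep`, the one-JSON-file-per-task results layout, and the stand-in `call_model` callable are assumptions, not the benchmark runner's real interface.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def run_sweep(tasks, call_model, results_dir: Path, concurrency: int = 4) -> int:
    """Run pending tasks with bounded parallelism; return how many were attempted."""
    results_dir.mkdir(parents=True, exist_ok=True)
    # Resume: any task with a result file already on disk is treated as done.
    done = {p.stem for p in results_dir.glob("*.json")}
    pending = [t for t in tasks if t["id"] not in done]

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = {pool.submit(call_model, task): task for task in pending}
        for n, future in enumerate(as_completed(futures), start=1):
            task = futures[future]
            try:
                output = future.result()
            except Exception as exc:
                # Log and continue; the next resume picks the task up again.
                print(f"[{n}/{len(pending)}] {task['id']} failed: {exc}")
                continue
            final = results_dir / f"{task['id']}.json"
            tmp = final.with_name(final.name + ".tmp")
            tmp.write_text(json.dumps({"id": task["id"], "output": output}), encoding="utf-8")
            tmp.replace(final)  # rename so a killed run leaves no half-written result
            print(f"[{n}/{len(pending)}] {task['id']} done")
    return len(pending)


# Demo with a stand-in model call (no network, no SDK).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_tasks = [{"id": f"prompt-{i:03d}"} for i in range(6)]
    first_pass = run_sweep(demo_tasks, lambda t: t["id"].upper(), Path(tmp_dir))
    second_pass = run_sweep(demo_tasks, lambda t: t["id"].upper(), Path(tmp_dir))
```

In the demo the first pass attempts all six tasks and the second attempts none, because every result file already exists. A failed task writes no file, so it is retried on the next start.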

### What we measured

The second sweep ran 600/600 in 15 minutes (4x speedup from concurrency). Per-skill effect sizes for the three targeted descriptions:

- `context-fundamentals`: 0.255 -> 0.489 (**+23.4pp**)
- `project-development`: 0.750 -> 1.000 (**+25pp**, now perfect routing)
- `tool-design`: 0.729 -> 0.807 (**+7.8pp**)

Aggregate top-1: 0.888 -> 0.900. Three of four models gained on top-1; all four gained on top-3. Format compliance: 99.5% across 600 calls. Total Cursor SDK cost: ~$7.20.

### Why this matters more than the absolute numbers

The aggregate moved by ~1pp; individual skills moved by up to 25pp. **The aggregate is the wrong unit**. The confusion matrix tells you which descriptions need work and which directions the leakage goes. The delta-vs-baseline comparison tells you whether the fix worked. Without per-skill effect sizes, this entire feedback loop is invisible.
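
A minimal sketch of the per-skill view, using the figures reported above. The `(expected, predicted)` record shape and the helper names are assumptions, and the three `context-fundamentals` predictions the text does not itemize are grouped under a placeholder `other` label.

```python
from collections import Counter, defaultdict


def per_skill_accuracy(records):
    """records: iterable of (expected_skill, predicted_skill) pairs."""
    total, correct = Counter(), Counter()
    for expected, predicted in records:
        total[expected] += 1
        correct[expected] += expected == predicted
    return {skill: correct[skill] / total[skill] for skill in total}


def confusion(records):
    """Map each expected skill to a Counter of what was actually predicted."""
    matrix = defaultdict(Counter)
    for expected, predicted in records:
        matrix[expected][predicted] += 1
    return matrix


def delta_pp(before, after):
    """Per-skill change in percentage points."""
    return {s: round((after[s] - before[s]) * 100, 1) for s in before if s in after}


# Baseline predictions for context-fundamentals as reported in the first sweep.
baseline = (
    [("context-fundamentals", "context-fundamentals")] * 12
    + [("context-fundamentals", "context-degradation")] * 12
    + [("context-fundamentals", "project-development")] * 12
    + [("context-fundamentals", "context-optimization")] * 8
    + [("context-fundamentals", "other")] * 3  # remainder, not itemized in the text
)
accuracy = per_skill_accuracy(baseline)
leaks = confusion(baseline)["context-fundamentals"]

before = {"context-fundamentals": 0.255, "project-development": 0.750, "tool-design": 0.729}
after = {"context-fundamentals": 0.489, "project-development": 1.000, "tool-design": 0.807}
deltas = delta_pp(before, after)
```

The accuracy works out to 12/47, about 0.255, and the deltas reproduce the +23.4pp, +25pp, and +7.8pp figures. The `leaks` counter is the part that says where to aim a rewrite.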

End to end from "the v2.2.0 benchmark finished" to "v2.3.0 measured the rewrites" was under two hours of focused work. This is the cheapest, highest-leverage feedback loop in the system. Future contributions should run it whenever a skill description changes.

### The shortcut we corrected

The first post-benchmark fix was too narrow: rewrite three descriptions, align those three bodies, and call the release ready. That improved the measured router surface, but it did not make the whole repo better through auto-research.

The correction was a corpus-wide hardening pass. Three read-only audits looked at activation boundaries, mechanism and claim coverage, and the skill-quality standard. Then every skill was brought under the same rule: explicit owned scope, explicit adjacent routes, practical guidance an agent can execute, examples, gotchas, integration as a boundary map, and machine-readable mechanisms/claims where the prose makes reusable or volatile claims.

Concrete result:

- `bdi-mental-states` and `hosted-agents` were no longer structural outliers; both gained missing practical guidance/examples and passed strict health.
- Mechanisms increased from 5 to 16, so every skill now owns at least one accepted behavior pattern.
- Claim provenance increased from 6 to 12, with concrete repo source paths replacing vague research-run summaries.
- Activation fixtures increased from 14 to 19, covering previously untested skills.
- `validate_repo.py --strict` now enforces full body sections and explicit non-activation boundaries.
- Skill health moved from 0.8111 with 2 flagged skills to 0.9117 with 0 flagged skills.
- A fresh 600-record router sweep on May 19 verified the corpus-wide pass did not introduce broad routing collapse: 0 format failures after retrying transient blanks, Gemini 0.920 top-1, Composer 0.913, GPT-5.5 0.913, Claude Opus 4.7 0.840. The remaining failures are still concentrated in ambiguous/negative-control prompts and the `context-fundamentals` catch-all boundary.

This is the release's real lesson: improving a skill corpus means changing prose, metadata, and gates together. A prettier SKILL.md without updated mechanisms, claims, corpus index, fixtures, and validators is not self-improving.

### What v2.3.0 inherits from v2.2.0

Everything that shipped in v2.2.0 is still here: the file-based researcher operating system, the run state machine, the continuous loop, the launchd service definitions, the deterministic gates, the adversarial benchmark scenarios. v2.3.0 adds the measured results and the description fixes that those results justified.

## What v2.3.0 Actually Ships

Five concrete pieces of infrastructure, plus a corpus-wide hardening pass across all 15 skills:

1. **Researcher operating system** under `researcher/`: source registry, content/skill/harness rubrics, pairwise revision rubric, mechanism registry, mechanism ledgers, claim provenance index, corpus index, activation regression fixtures, adversarial benchmark scenarios with goldens, append-only benchmark history.

2. **Run state machine**: every run has `run-state.json` with explicit transitions (`initialized -> retrieved -> evaluated -> proposed -> novelty_checked -> validated -> pr_ready -> closed`). `research_loop.py` advances state via subcommands; `validate_run.py` enforces publish readiness without false positives on intentionally incomplete runs.

3. **Continuous loop**: `loop_discover.py`, `loop_step.py`, `loop_daily.py`, `loop_status.py`. Backed by `researcher/queue/` (inbox, parked, done, quarantine) with atomic JSONL writes and `fcntl` locks. Runs unattended from launchd; never invokes paid LLMs; HTTP retrieval is stdlib-only with a 1.5 MB cap and a 30-second timeout.

4. **CI**: `.github/workflows/validate.yml` runs Python compile, `validate_repo.py --strict`, `check_activation_cases.py`, and `run_benchmarks.py` on every push and PR.

5. **Documentation**: `CHANGELOG.md`, refreshed `README.md` / `CLAUDE.md` / `CONTRIBUTING.md`, `AGENTS.md` for workspace memory, `researcher/runs/README.md` for operator orientation, and the two insight documents (this one and `auto-research-experiment.md`).
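
The run state machine in item 2 can be sketched as an ordered list with a guarded `advance` step. The state names come from the transition chain above. The strictly linear, one-step-at-a-time policy and the `state`/`history` keys are assumptions, and `research_loop.py` may well permit other transitions.

```python
STATES = [
    "initialized",
    "retrieved",
    "evaluated",
    "proposed",
    "novelty_checked",
    "validated",
    "pr_ready",
    "closed",
]


def advance(run_state: dict, target: str) -> dict:
    """Return a new run state moved one step forward, or raise on an illegal move."""
    current = run_state["state"]
    index = STATES.index(current)  # raises ValueError on an unknown state
    if index == len(STATES) - 1:
        raise ValueError("run is closed; no further transitions")
    expected = STATES[index + 1]
    if target != expected:
        raise ValueError(f"illegal transition {current} -> {target}; expected {expected}")
    history = run_state.get("history", []) + [current]
    return {**run_state, "state": target, "history": history}


# Demo: two legal steps from a fresh run.
run = {"run_id": "demo", "state": "initialized"}
run = advance(run, "retrieved")
run = advance(run, "evaluated")
```

Returning a new dict rather than mutating in place keeps the previous state available until the caller has persisted the new one.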

By the numbers: 15 published skills, **16 accepted mechanisms**, **12 provenance-tracked claims**, **19 activation cases** (up from 8), 7 adversarial benchmark scenarios, 1 worked-example seed run, 50 router benchmark prompts, **1,800 completed router benchmark records** across three sweeps (baseline, post-description rewrite, post-corpus hardening), 4 frontier models evaluated, 1 worked-example effectiveness task scaffolded, and a strict skill-health score of **0.9117 with 0 flagged skills**.

## What You Should Tell People

### What was built

A skill collection that can keep itself alive. Not a one-off skill drop. A file-based research-to-skill operating system that:

- Discovers candidate sources from a curated registry.
- Scores them against locked rubrics that the agent cannot relax.
- Extracts implementable mechanisms instead of generic takeaways.
- Promotes only behaviors that survive both deterministic checks and human review.
- Runs the whole thing in a continuous loop without spending a single LLM token on retrieval or judgment.
- Falls back to a human review queue for anything that needs judgment.

### Why it matters

Most public skill collections are anthologies. They accumulate good content and then drift, because there is no process for keeping them current and no machine-readable layer for compounding what they know. v2.2.0 is the bet that an open repository can be more than that: a source of truth that improves with use.

### Concrete artifacts you can point to

- The mechanism registry: `researcher/mechanisms/registry.jsonl`
- The claim provenance index: `researcher/claims/index.jsonl`
- The corpus index: `researcher/corpus/index.json`
- The run state machine in action: `researcher/runs/20260515-035228-executable-autonomous-research-frameworks/`
- The adversarial benchmark harness: `researcher/benchmarks/`
- The continuous loop scripts: `researcher/scripts/loop_*.py`
- The launchd daemon installer: `researcher/orchestration/launchd/install.sh`
- The runbooks: `researcher/runbooks/`
- The technical findings: `researcher/insights/auto-research-experiment.md`

### Sound bites that survive copy-paste

- "Activation scenarios beat keyword triggers." (Documented as finding #1.)
- "Deterministic gates before model judges." (The validator costs zero tokens and catches categorical failures the judge would miss.)
- "Mechanism registry as encyclopedia backbone." (Making behavior changes first-class data is the unlock.)
- "Closed-run reaping is non-negotiable." (Every queue needs an explicit exit path.)
- "Adversarial benchmarks before declaring the harness safe." (Write the attack before claiming the defense.)
- "Continuous operation is a separate milestone from per-task correctness."
- "Live execution is the highest-signal validation for orchestration code."
## Templates For Sharing

Use these as starting points, not as final copy. Strip what is not true for your audience.

### X / Twitter thread (8 tweets)

```
1/  Shipped v2.2.0 of the Agent Skills for Context Engineering repo.

It is no longer just a skill collection. It is a file-based research-to-skill operating system that keeps the corpus alive without becoming editorial debt.

github.com/muratcankoylan/Agent-Skills-for-Context-Engineering

2/  Origins: Karpathy's autoresearch showed that one editable file + one locked evaluator + a results log + git rollback is enough for autonomy. Prime Intellect's auto-nanoGPT showed why durable scratchpads matter. Parallel deep research filled in the rest.

3/  What it does: discovers sources from a curated registry, scores them against locked rubrics, extracts implementable mechanisms (not generic takeaways), promotes only behaviors that pass deterministic checks plus human review.

4/  Mechanism registry as encyclopedia backbone. Every accepted behavior change is a JSONL row with owning_skill, activation_scenario, behavior_change, evidence, and failure_modes. Novelty checks compare against the registry first, then the corpus.

5/  Claim provenance for every volatile number. BrowseComp variance, LoCoMo accuracy, multi-agent token multipliers: each has a claim_id, source_url, evidence_strength, volatility, and last_reviewed. Claims that rot will not silently propagate.

6/  Continuous loop runs from launchd on macOS. No paid LLMs. HTTP retrieval is stdlib-only. Atomic JSONL writes, fcntl locks, closed-run reaping, parked-review queue, daily ops with adversarial benchmarks against the harness itself.

7/  CI is the floor. validate_repo + run_benchmarks + check_activation_cases on every PR. The validator catches duplicate JSON keys, wrong rubric math, APPROVE on partial retrievals, manifest drift, and missing run artifacts before any model judge runs.

8/  Lessons in researcher/insights/. Headline findings: activation scenarios beat keyword triggers, deterministic gates before model judges, closed-run reapers are non-negotiable, adversarial benchmarks find what unit tests miss, live execution > paper review for orchestration.
```

### LinkedIn post

```
Released v2.2.0 of Agent Skills for Context Engineering: an open collection that now ships a file-based research-to-skill operating system with deterministic gates and a continuous loop.

The starting question was small: how do you keep a 10k-star skill repository alive without it becoming a one-person editorial backlog. The answer turned into infrastructure.

Three pieces of prior work shaped the design.

Andrej Karpathy's autoresearch repo demonstrated the minimal autonomy harness: one editable file, one locked evaluator, a results log, git rollback. The lesson is not that every loop needs a single scalar metric. It is that ambiguous feedback produces ambiguous autonomy.

Prime Intellect's auto-nanoGPT work showed why durable agent state matters in practice. Their THREAD.md pattern, source queues, and explicit handover summaries are the difference between a loop that survives context compaction and one that does not.

Targeted deep research filled in the missing pieces: deterministic validation as the locked evaluator, durable run directories, novelty gates, pairwise revision, human-controlled merge.

What shipped:
- Mechanism registry with gated promotion and append-only ledgers
- Claim provenance for volatile benchmark numbers
- Run state machine with enforceable transitions
- Activation regression tests for skill-boundary confusion
- Adversarial benchmark harness with append-only history
- Continuous loop scripts and launchd service definitions
- CI workflow running deterministic gates on every PR
- 15 skills, all migrated from keyword triggers to task-boundary scenarios

The technical findings and the full process narrative live in researcher/insights/. The repo: github.com/muratcankoylan/Agent-Skills-for-Context-Engineering

Honest scope: no LLM-judge adapter yet, no automated source discovery beyond the manual seed, no log rotation. Tracked at the end of CHANGELOG.md.
```

### Blog post outline

```
# Title: From Karpathy's autoresearch to a self-compounding skill encyclopedia

## 1. Why this exists
The 10k-star skill repository was at a fork: stay an anthology, or become infrastructure.

## 2. The three inputs
- Karpathy autoresearch: minimal autonomy harness, locked evaluator, results log
- Prime Intellect auto-nanoGPT: durable scratchpads, THREAD.md, handover discipline
- Parallel deep research on autonomous research frameworks: deterministic gates, novelty, pairwise, human-controlled merge

## 3. The breakthrough roadmap
- Split repo health from run readiness
- Mechanism registry as the unit of accumulated learning
- Claim provenance to prevent claim rot
- Activation regression for skill boundaries
- Adversarial benchmarks instead of self-congratulatory ones
- Corpus index as the machine-readable map

## 4. The continuous loop
- Queue + state machine + reaper + parked review
- launchd daemon, no paid LLMs, atomic writes, fcntl locks
- Live execution as the validation gate

## 5. The lessons
(17 findings from researcher/insights/auto-research-experiment.md)

## 6. What it does not solve yet
Honest list of out-of-scope items

## 7. How to use it
- Install the plugin (Claude Code, Cursor, Open Plugins)
- Install the daemon (macOS launchd)
- Read the runbook before extending
```

## How To Improve The Repo From Here

The v2.2.0 release deliberately left several things off. These are the obvious next steps in roughly the order they pay off:

1. **LLM-judge adapter** for the `retrieved -> evaluated` transition, gated by a per-day budget config. This is the single change that unlocks "the loop actually decides what to publish" instead of "the loop stages everything for human review."

2. **Automated source discovery adapters**. Candidates include Parallel deep research, a Cohere/OpenAI/Anthropic engineering blog RSS, and an arXiv author-watch list. Each is a self-contained file under `researcher/scripts/feeds/`.

3. **Log rotation and benchmark history pruning**. The current implementation appends forever. Acceptable for weeks, painful at one year.

4. **pytest suite** for the loop scripts. Smoke testing via live execution was fine for v2.2.0; refactor safety needs unit and integration tests.

5. **More activation cases** as skill boundaries get challenged in the wild. Each confusion that shows up in user activations should become a fixture.

6. **More claim provenance**. The current 12 claims cover the highest-volatility ones; the corpus has additional softer claims that would benefit from explicit provenance.

7. **Self-spawning agents for parallel runs**. The loop currently processes one run per step. With proper locking and per-run budget tracking, parallel runs are safe and would amortize cost.

8. **Cross-skill integration tests**. The corpus index already maps skills to mechanisms; the next step is asserting that integration sections in each skill reference real skills and stay accurate as the corpus evolves.

If you want to contribute, look at the open items above. Each is scoped to fit in a single PR.

## Your Learnings (Best Inference From The Process)

This section is necessarily an inference. Adjust before sharing.

- A repository with this many stars is a public surface, and that surface benefits more from infrastructure than from more content. The instinct to add another skill is almost always less valuable than the instinct to add another gate.
- The cheapest way to compound a knowledge base is to make its units machine-readable. Mechanisms, claims, activation scenarios, run states, benchmark scenarios. Prose is publication, not memory.
- "Improve all skills" means improving the corpus substrate, not just editing Markdown. The mechanism registry, claim index, corpus index, activation cases, validators, and template have to move with the skill bodies.
- "Run it for days" is a real architectural milestone, not a feature flag. The system has to be designed for unattended operation from the beginning.
- Subagent code review is the cheapest way to find category-of-error bugs that you would otherwise hit in production.
- Plan mode for irreversible decisions, agent mode for everything else. Mixing them silently is how scope creep happens.
- An AI collaborator works best as a load-bearing technical lead, not as a writing assistant. Direct instructions, named subagents, explicit budgets, deterministic gates: this is how the work compounds rather than dissipates.

## Joint Findings From The Experiment

These are documented in detail in `auto-research-experiment.md`. The five most reusable across other projects:

1. Atomic file writes + fcntl locks are the default for any shared file the loop touches.
2. Closed-state reapers in every queue.
3. Split validators by question, not by artifact.
4. Adversarial benchmarks before model judges.
5. Live execution as the highest-signal validation for orchestration code.
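
Finding 1 is the easiest to carry into another project. The sketch below assumes a POSIX system, since `fcntl` is not available on Windows. It writes to a temporary file in the same directory and renames it over the target with `os.replace`, while holding an exclusive `flock` on a sidecar lock file. The helper names and the sidecar `.lock` convention are illustrative, not the repository's actual `write_jsonl`.

```python
import fcntl
import json
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def locked(path: Path):
    """Hold an exclusive advisory lock on a sidecar file next to `path`."""
    lock_path = path.with_name(path.name + ".lock")
    with open(lock_path, "w") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def read_jsonl(path: Path):
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def write_jsonl_atomic(path: Path, rows) -> None:
    """Write all rows to a temp file, then rename it over the target."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            for row in rows:
                tmp.write(json.dumps(row) + "\n")
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, path)  # readers see the old file or the new one, never a partial
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def append_row(path: Path, row: dict) -> None:
    """Read-modify-write under the lock so concurrent steps cannot interleave."""
    with locked(path):
        rows = read_jsonl(path) if path.exists() else []
        rows.append(row)
        write_jsonl_atomic(path, rows)


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    inbox = Path(tmp_dir) / "inbox.jsonl"
    append_row(inbox, {"source_id": "a1"})
    append_row(inbox, {"source_id": "b2"})
    rows = read_jsonl(inbox)
    leftovers = [p.name for p in Path(tmp_dir).iterdir() if p.name.endswith(".tmp")]
```

The lock has to cover the whole read-modify-write, not just the write. Releasing it between reading the queue and acting on the result is the same class of bug as the `pop_inbox_item` race described earlier.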

## What This Document Is For

This file is the narrative version of the release. Use it when explaining what was built and why. Use `auto-research-experiment.md` when explaining the engineering decisions. Use `CHANGELOG.md` when you need the precise inventory.

Keep all three accurate as the next versions ship.