A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems.
CHANGELOG.md
# Changelog

All notable changes to this project are documented here. Versions follow semantic versioning where practical, with skill content treated as data.

## [2.3.0] - 2026-05-15

First release with measured benchmark results across four frontier models, closing the loop from "we wrote skill descriptions" to "we proved they route correctly."

### Added

#### Stage 2 router benchmark, executed end-to-end

- 600 of 600 runs completed across `composer-2`, `claude-opus-4-7`, `gpt-5.5`, `gemini-3.1-pro` at 3 replications per (prompt, model). Initial v2.2.0 baseline at `researcher/benchmarks/router/results-published/2026-05-15.md` (566 of 600 due to the v1 runner dying); updated run after the description fixes at `researcher/benchmarks/router/results-published/2026-05-15-v2.md` with full delta-vs-baseline table.
- 50 ground-truth router prompts at `researcher/benchmarks/router/prompts.jsonl` covering positive controls, adversarial boundary pairs, combined-skill prompts, and negative controls.
- `researcher/scripts/render_router_report.py` with `--baseline` flag for delta reports.
- `researcher/benchmarks/router/results-published/README.md` explains the committed-summary vs gitignored-raw split.

#### Hardened SDK runner (`researcher/benchmarks/sdk-runner/src/`)

- **Resume**: scans the destination directory on startup and skips plan items that already have a per-run JSON. A killed sweep can be picked up exactly where it stopped; no wasted credits, no duplicate runs.
- **Bounded parallelism**: `--concurrency N` runs N agent calls simultaneously. Cuts the 600-run sweep from ~60 minutes (sequential) to ~15 minutes (concurrency=4) with identical correctness.
- **Per-run progress logging**: every completed run prints `[N/total] model prompt rep=R status durationMs T1 ETA=duration`. The v1 sweep silently stalled at 566 of 600 with no signal; the v2 sweep would have surfaced the cause immediately.
- **Format-failure retry**: transient empty or unparsable SDK responses are retried once before being recorded as format failures. This was added after the May 19 sweep produced transient blank outputs that succeeded on rerun.
- `runConcurrently` helper in `common.ts`, reusable by future runners.

#### Skill description rewrites (data-driven)

Targeted at the two routing failures the v2.2.0 baseline benchmark surfaced:

- `context-fundamentals`: rewrote to be unambiguously about conceptual foundations and explicitly route operational work to the specialized skills. Top-1 rate went from **0.255 to 0.489** (+23.4pp).
- `project-development`: tightened with explicit cross-references to `tool-design`. Top-1 rate went from **0.750 to 1.000** (now perfect routing).
- `tool-design`: tightened with explicit cross-references to `project-development`. Top-1 rate went from **0.729 to 0.807** (+7.8pp).

#### Skill body alignment with new descriptions

The router benchmark only sees frontmatter `description` because `settingSources: []` excludes the SKILL.md body. The first description rewrite pass left the bodies (`When to Activate`, `Practical Guidance`, `Integration`) claiming the broader pre-rewrite scope, which would have steered the agent toward operational work the moment the skill actually activated in production. Aligned the bodies in a follow-up pass:

- `context-fundamentals` body: rewrote `When to Activate` to list conceptual triggers and explicit do-not-activate routing; removed the operational `File-System-Based Access` and `Context Budgeting` practical-guidance sections (owned by `filesystem-context` and `context-optimization` respectively); replaced with conceptual application advice plus a reading-order recommendation for new contributors; rewrote `Integration` as an explicit routing map across all 14 sibling skills. Internal version bump 2.0.0 -> 2.1.0.
- `tool-design` body: rewrote `When to Activate` to anchor on the unit of work (single tool or tool catalog) and listed adjacent decisions owned by `project-development`, `multi-agent-patterns`, `context-optimization`; rewrote `Integration` with explicit routing reasons. Internal version bump 2.0.0 -> 2.1.0.
- `project-development` body: rewrote `When to Activate` to anchor on project shape and pipeline; listed adjacent decisions owned by `tool-design`, `context-optimization`, `multi-agent-patterns`, `harness-engineering`; rewrote `Integration` with explicit routing reasons. Internal version bump 1.1.0 -> 1.2.0.

The body changes do not affect router-benchmark numbers (the router sees only descriptions) but they do affect what the agent loads when these skills activate. Stage 3 (effectiveness benchmark, which loads full bodies) is the right place to measure the impact of this alignment.

#### Corpus-wide skill hardening pass

After the targeted three-skill body alignment, every published skill was audited against the same standard: explicit ownership boundary, `Do not activate` routing, executable practical guidance, examples, gotchas, integration boundaries, mechanism coverage, claim provenance, and activation fixtures.

- Updated all 15 skill bodies with scoped improvements, including structural fixes for `bdi-mental-states` and `hosted-agents`, stronger negative routing across older skills, clearer examples for context failure modes, and claim-backed wording for volatile benchmark statements.
- Expanded `researcher/mechanisms/registry.jsonl` from 5 to 16 accepted mechanisms so every published skill owns at least one machine-readable behavior pattern.
- Expanded claim provenance from 6 to 12 records and replaced vague run-summary sources with concrete repo paths for BrowseComp, RULER/lost-in-middle, compression, d0, Latent Briefing, memory, and tool-output claims.
- Expanded activation regression coverage from 14 to 19 cases so every skill has deterministic routing coverage, including `bdi-mental-states`, `context-degradation`, `hosted-agents`, `latent-briefing`, and `multi-agent-patterns`.
- Tightened `validate_repo.py --strict` so `Core Concepts`, `Practical Guidance`, `Examples`, `References`, and explicit non-activation boundaries are now enforced rather than optional.
- Updated `template/SKILL.md` with the new corpus-wide standard: body/frontmatter alignment, mechanism registration, executable guidance, and `claim-*` provenance for volatile claims.
- Re-ran the no-API gates after the pass: `validate_repo.py --strict` passed with 0 errors / 0 warnings; `skill_health.py --strict --no-history` improved from corpus score 0.8111 / 2 flagged skills to 0.9117 / 0 flagged skills; `check_activation_cases.py` passed 19/19; `run_benchmarks.py` passed 3 checks and 7 adversarial scenarios.
- Re-ran the paid Stage 2 router benchmark after the corpus-wide pass: 600/600 usable records, 0 format failures after retrying transient format failures, published at `researcher/benchmarks/router/results-published/2026-05-19.md`. Per-model top-1: Gemini 0.920, Composer 0.913, GPT-5.5 0.913, Claude Opus 4.7 0.840. Remaining failures are concentrated in known ambiguous/negative-control prompts (`p046`, `p048`) and the `context-fundamentals` catch-all boundary.

#### Eleven new boundary regression cases

`researcher/fixtures/activation-cases.jsonl` grew from 8 to 19 cases. The first six new cases target specific confusions observed in the v2.2.0 baseline:

- `activation-fundamentals-vs-degradation`, `activation-fundamentals-onboarding`, `activation-fundamentals-vs-optimization`
- `activation-tool-vs-project-structured-output`, `activation-tool-individual-tool`, `activation-tool-consolidation`

These act as a tripwire so any future description change is held accountable.

The corpus-wide pass added five more cases for previously uncovered skills:

- `activation-bdi-vs-memory`
- `activation-degradation-poisoning`
- `activation-hosted-vs-harness`
- `activation-latent-briefing-vs-memory`
- `activation-multi-agent-topology`

#### Stage 1 skill health (still no API cost)

- `researcher/scripts/skill_health.py`: per-skill structural scoring. Initial corpus baseline: 0.8111 aggregate, 2 of 15 skills flagged (`bdi-mental-states` for missing required section, `hosted-agents` for multiple structural issues). After the corpus-wide hardening pass: 0.9117 aggregate, 0 flagged skills.
- Output at `researcher/reports/skill-health.json` (gitignored runtime artifact) + optional append to `skill-health-history.jsonl`.

### Changed

- Version bumped 2.2.0 -> 2.3.0 across `.claude-plugin/marketplace.json`, `.plugin/plugin.json`, root `SKILL.md`.
- `researcher/benchmarks/PLAN.md` status table reflects Stage 0/1/2 shipped, Stage 3/4 still scaffolded.

### Headline measured results

Per-model top-1 accuracy (baseline -> new descriptions, 600-run sweep at seed=1, fixture sha 8f974d9):

| Model | Baseline | New | Delta |
| --- | --- | --- | --- |
| composer-2 | 0.888 | 0.913 | +2.5pp |
| gpt-5.5 | 0.886 | 0.913 | +2.7pp |
| gemini-3.1-pro | 0.886 | 0.925 | +3.9pp |
| claude-opus-4-7 | 0.886 | 0.867 | -2.0pp |

Per-skill top-1 rate change for the three skills targeted by description rewrites:

| Skill | Baseline | New | Delta |
| --- | --- | --- | --- |
| `context-fundamentals` | 0.255 | 0.489 | +23.4pp |
| `project-development` | 0.750 | 1.000 | +25.0pp |
| `tool-design` | 0.729 | 0.807 | +7.8pp |

Format compliance: 99.5% (3 failures, all Gemini). Latency: Gemini ~9.1s median, others 3.3-4.2s. Total sweep cost approximately 7.20 USD against the 15 USD budget cap.

### Honest scope caveats

- `context-fundamentals` improved a lot but is still the weakest skill (0.489 top-1). Remaining failures route to `project-development` for generic onboarding prompts. One more description pass may push it past 0.75.
- Two prompts remain at 0.00 across all models: p046 (Python reformatting, negative control) and p048 (evaluate KV compaction, genuinely ambiguous). Should be re-labeled or removed from positive-routing tests.
- `advanced-evaluation` looks regressed (-18.3pp) but is largely an artifact of the v2.2.0 baseline missing 11 attempts when the runner died at 566/600. Absolute correct count: 48 baseline -> 47 new.
- Stage 3 (real agent tasks with and without skills loaded) is still scaffolded but not executed; that is the next investment.
- No LLM-judge adapter for the run state machine. No automated source discovery beyond manual seed.

## [2.2.0] - 2026-05-15

### Added

#### Researcher operating system

- **Mechanism registry** (`researcher/mechanisms/registry.jsonl`) seeded with five accepted mechanisms (`locked-editable-surfaces`, `durable-research-thread`, `deterministic-first-validation`, `structured-novelty-gate`, `pairwise-skill-revision`).
- **Mechanism ledgers** (`researcher/mechanisms/ledgers/accepted.jsonl`, `rejected.jsonl`) for append-only promotion events.
- **Claim provenance** (`researcher/claims/index.jsonl`) for six volatile or benchmark-backed claims across `evaluation`, `multi-agent-patterns`, `context-optimization`, `memory-systems`, `advanced-evaluation`, and `harness-engineering`.
- **Corpus index** (`researcher/corpus/index.json`) mapping skills to activation scenarios, mechanism IDs, and claim IDs.
- **Activation regression fixtures** (`researcher/fixtures/activation-cases.jsonl`) covering high-risk skill-boundary pairs.
- **Adversarial benchmark harness** (`researcher/benchmarks/scenarios/adversarial.jsonl` + goldens) with seven scenarios that try to game the loop.
- **Benchmark history** (`researcher/reports/benchmark-history.jsonl`) for longitudinal trend tracking.
- **Pairwise revision rubric and script** (`researcher/rubrics/pairwise-skill-revision.md`, `researcher/scripts/compare_skill_revisions.py`).
- **Run state machine** in `run-state.json` with explicit transitions: `initialized -> retrieved -> evaluated -> proposed -> novelty_checked -> validated -> pr_ready -> closed`.

#### Continuous loop

- **Queue infrastructure** (`researcher/queue/`): inbox, parked, done, quarantine.
- **Orchestration config** (`researcher/orchestration/config.json`) with daily/active/parked/failure budgets.
- **Discovery feeder** (`researcher/scripts/loop_discover.py`) reading from `researcher/discovery/manual-seed.jsonl`.
- **Loop step orchestrator** (`researcher/scripts/loop_step.py`) that reaps closed runs, pulls from inbox, retrieves via stdlib `urllib`, and parks at human-review gates.
- **Daily ops** (`researcher/scripts/loop_daily.py`) running validators, benchmarks, activation cases, and writing dated snapshots.
- **Status dashboard** (`researcher/scripts/loop_status.py`) plus parked-review surface.
- **launchd service definitions** (`researcher/orchestration/launchd/`) with install/uninstall scripts and per-script wrappers.
- **Continuous operation runbook** (`researcher/runbooks/continuous-operation.md`).

#### Scripts

- `researcher/scripts/validate_run.py`: per-run publish readiness, skips closed runs.
- `researcher/scripts/research_loop.py` subcommands: `retrieve`, `evaluate`, `propose`, `novelty`, `validate-run`, `pr-ready`, `close`, `promote-mechanisms`.
- `researcher/scripts/check_activation_cases.py`: deterministic activation regression checker.
- `researcher/scripts/run_benchmarks.py`: runs deterministic gates, scenarios, optional history recording.
- `researcher/scripts/loop_common.py`: shared atomic-write helpers and `fcntl` locks.

#### CI

- `.github/workflows/validate.yml` runs `validate_repo.py --strict`, `run_benchmarks.py`, `check_activation_cases.py`, and Python compile checks on every push and PR.

### Changed

- Skill activation surface refactored from keyword triggers to task-boundary descriptions in frontmatter and README. Affected: all 14 published skills plus the example skills in `examples/`.
- `validate_repo.py` hardened: duplicate JSON keys, exact doc sync, rubric IDs, run artifacts, registry schema, evidence paths, partial-retrieval approvals, root-level raw provenance, claims schema, corpus index consistency, activation cases, benchmark scenarios.
- `novelty_check.py` now compares mechanism registry overlap as the primary signal, with corpus overlap secondary.
- Mechanism registry evidence may now reference claim IDs from `researcher/claims/index.jsonl` in addition to URLs and repo paths.

### Hardened

- All queue mutations use atomic temp-file `os.replace` and `fcntl` exclusive locks scoped per queue family.
- `read_jsonl` is tolerant: malformed lines quarantine to `researcher/reports/jsonl-quarantine/` rather than crashing the loop.
- `fetch_url` allows only `http(s)://` and re-checks scheme after redirect.
- Closed runs are automatically reaped from `parked.jsonl` and recorded in `done.jsonl`.
- Inbox lock now held through `init_run` so concurrent loop_step invocations cannot exceed budgets.
- URL deduplication normalizes case before hashing.

### Repository policy

- Active research runs under `researcher/runs/*/` are runtime state and not committed. The seed run `20260515-035228-executable-autonomous-research-frameworks` is kept as a worked-example fixture.
- Runtime queue and report files (`researcher/queue/*.jsonl`, `researcher/reports/{logs,snapshots,loop-events.jsonl,loop-failures.jsonl,status.md,parked-review.md}`, `researcher/queue/.locks/`) are gitignored.

### Out of scope for 2.2.0

- LLM-judge adapters for advancing `retrieved -> evaluated` automatically.
- Automated source discovery beyond the manual seed file (Parallel deep research and web search adapters are placeholders behind config toggles).
- Log rotation; benchmark history pruning.

## [2.1.0] - 2026-05-14

### Added

- `harness-engineering` skill: locked/editable surface model, durable threads, novelty gates, rollback, human approval boundaries.
- `researcher/` directory v1: source registry, content/skill/harness rubrics, source-evaluation JSON template, skill-proposal template, autonomous research loop runbook, PR readiness runbook.

## [2.0.0] - earlier

Baseline corpus of 13 skills distributed as a single Claude Code plugin.