Source from repo
A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems.
researcher/fixtures/skill-proposals/harness-engineering-proposal.md
# Skill Proposal: Harness Engineering From Autoresearch

## Source

- URL: https://github.com/karpathy/autoresearch/blob/master/program.md
- Title: Karpathy autoresearch program
- Author or organization: Andrej Karpathy
- Source type: code
- Retrieval status: retrieved
- Evaluation file: `researcher/fixtures/source-evaluations/approved-harness-source.json`
- Decision: HUMAN_REVIEW

## Mechanism

The source demonstrates a constrained autonomous experiment harness: the agent can edit one surface, the evaluator remains locked, each run receives fixed feedback, results are logged durably, and git rollback discards non-improving attempts. The transferable mechanism is not the nanoGPT task itself, but the boundary between editable and locked surfaces.

## Skill Target

- Target type: existing skill
- Target path: `skills/harness-engineering/SKILL.md`
- Activation scenario: designing an autonomous loop where editable artifacts are scored by locked evaluation surfaces
- Related skills: evaluation, filesystem-context, project-development
- Proposed location: SKILL.md

## Novelty Check

- Command: `python researcher/scripts/novelty_check.py --file researcher/fixtures/skill-proposals/harness-engineering-proposal.md --json`
- Verdict: pass
- Max mechanism overlap: 0.0866
- Top mechanism overlaps: `locked-editable-surfaces`, `structured-novelty-gate`
- Human-review rationale: Overlap is expected because the fixture seeded the harness pattern; the proposal predates the published mechanism registry and remains useful as a known-good example.

## Evidence

| Claim | Evidence | Source |
| --- | --- | --- |
| Autonomous loops need locked metrics | `prepare.py` owns evaluation while `train.py` is editable | Karpathy autoresearch |
| Durable result logs prevent repeated failures | `results.tsv` records commit, metric, memory, status, and description | Karpathy autoresearch |
| Rollback keeps failed attempts from polluting the frontier | Non-improving commits are reset | Karpathy autoresearch |

## Proposed Delta

```yaml
changes:
  - path: "skills/harness-engineering/SKILL.md"
    section: "Core Concepts"
    change_type: "add"
    summary: "Explain locked vs editable surfaces and fixed feedback loops."
```

## Quality Checks

- [x] Fits an existing activation scenario or justifies a new one.
- [x] Adds an operating rule, workflow, artifact, gotcha, or reference.
- [x] Records novelty-check verdict and top mechanism overlaps.
- [x] Avoids duplicating existing skill guidance or accepted mechanisms.
- [x] Keeps `SKILL.md` under 500 lines.
- [x] Uses progressive disclosure for detailed or volatile evidence.
- [x] Uses platform-agnostic wording.
- [x] Updates README, root `SKILL.md`, and manifests if publishing a new skill.

## Risks And Gaps

- Evidence limitations: one benchmark environment; should not imply every research task has one scalar metric.
- Possible duplication: overlaps with evaluation, but focuses on the control loop around evaluation.
- Volatile claims: avoid embedding star counts or time-sensitive popularity metrics.
- Required human review: O3 applies because evidence rigor is useful but narrow.

## Recommendation

`update-existing-skill`

Use the source as a core example for `harness-engineering`, with general wording that applies beyond nanoGPT.
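
The locked/editable boundary described under Mechanism can be illustrated with a small sketch. This is not the autoresearch implementation: the `Attempt` type, the `run_harness` function, the status labels (`keep`, `discard`, `rejected`), the memory values, and the assumption that a lower metric is better are all hypothetical. Only the file names `train.py` and `prepare.py` and the five log columns come from the Evidence table. Rollback is represented by simply not advancing the kept list, where a real harness would reset the non-improving commit in git.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, TextIO, Tuple

# Only this surface may change; everything else (e.g. prepare.py) is locked.
EDITABLE = frozenset({"train.py"})
# Columns the proposal lists for results.tsv.
FIELDS = ["commit", "metric", "memory", "status", "description"]


@dataclass
class Attempt:
    """One experiment run (hypothetical record type for this sketch)."""
    commit: str
    changed_paths: List[str]
    metric: float  # assumption: lower is better, e.g. a validation loss
    memory: float  # assumption: peak memory in arbitrary units
    description: str


def run_harness(
    attempts: List[Attempt], baseline: float, log: TextIO
) -> Tuple[float, List[str]]:
    """Log every attempt; keep only those that improve on the best metric."""
    writer = csv.DictWriter(
        log, fieldnames=FIELDS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    best = baseline
    kept: List[str] = []
    for attempt in attempts:
        if any(path not in EDITABLE for path in attempt.changed_paths):
            # Touching a locked surface invalidates the score entirely.
            status = "rejected"
        elif attempt.metric < best:
            status = "keep"
            best = attempt.metric
            kept.append(attempt.commit)
        else:
            # A real harness would roll this commit back in git here.
            status = "discard"
        writer.writerow({
            "commit": attempt.commit,
            "metric": attempt.metric,
            "memory": attempt.memory,
            "status": status,
            "description": attempt.description,
        })
    return best, kept


# Usage with made-up attempts; an in-memory buffer stands in for results.tsv.
log = io.StringIO()
attempts = [
    Attempt("a1", ["train.py"], 0.95, 10.0, "wider mlp"),
    Attempt("b2", ["train.py"], 0.97, 10.0, "higher learning rate"),
    Attempt("c3", ["prepare.py"], 0.10, 10.0, "edited the evaluator"),
]
best, kept = run_harness(attempts, baseline=1.0, log=log)
print(log.getvalue())
```

The third attempt reports the best metric but is rejected because it modified the evaluator, which is the point of keeping the evaluation surface locked: a score is only comparable if the thing that produced it did not change. Logging discarded and rejected attempts alongside kept ones is what lets a later run avoid repeating them.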