Source from repo
A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems.
researcher/rubrics/harness-change.md
# Harness Change Rubric

Use this rubric when a source proposes changes to a research loop, evaluation harness, agent operating procedure, or PR automation policy.

## Harness Definition

A harness is the control system around an agent: prompts, editable surfaces, tools, evals, logs, retry rules, rollback rules, and human approval gates. Harness engineering is separate from content authoring because agents can game or weaken their own harnesses if the boundary is not explicit.

## Locked vs Editable Surfaces

Every harness proposal must classify files or settings:

| Surface | Examples | Rule |
| --- | --- | --- |
| Locked | Rubrics, evaluation scripts, source registry, merge policy | Agent may propose changes, but cannot use changed version to score the same proposal |
| Editable | Skill draft, source evaluation, research thread, proposal text | Agent can modify during the loop |
| Append-only | Result logs, rejected source log, experiment history | Agent can append, not rewrite |
| Human-controlled | Merge decisions, credential setup, destructive actions | Agent cannot automate without explicit approval |

## Gates

| Gate | Pass | Fail |
| --- | --- | --- |
| H1 Objective Clarity | The harness has a measurable objective and stop/continue rule | Goal is vague or impossible to evaluate |
| H2 Metric Integrity | Metrics are locked, external, and resistant to gaming | Agent can change the metric or hidden answer key |
| H3 Artifact Durability | The loop writes durable logs, candidates, failures, and decisions | State only exists in chat context |
| H4 Recovery Path | Crashes, bad sources, and failed proposals have explicit handling | Failures are ignored or retried indefinitely |
| H5 Human Governance | PR, merge, and policy-changing boundaries are explicit | Agent can merge or change governance autonomously |

Any failed gate requires human review.

## Scoring

Score each dimension 0, 1, or 2.

| Dimension | Weight | Score 2 |
| --- | --- | --- |
| Feedback Quality | 25% | Fast, unambiguous feedback from locked metrics or rubrics |
| Search Discipline | 20% | Supports novelty, ablation, pruning, and upstream refresh |
| Auditability | 20% | Durable logs reconstruct what happened and why |
| Safety and Governance | 20% | Clear approval gates and rollback paths |
| Cost Control | 15% | Bounded token, compute, and review cost per loop |

Approve harness changes only when total >= 1.5 and all gates pass. Otherwise route to human review.

## Required Checks

Before accepting a harness change:

1. Run the old harness on at least one known artifact if possible.
2. Run the proposed harness on the same artifact and compare decisions.
3. Confirm that stricter checks do not block obviously valid examples.
4. Confirm that looser checks do not admit known bad examples.
5. Record whether the change affects future proposals only or invalidates earlier results.

## Common Anti-Patterns

1. **Self-scored harness edits**: The same agent proposes and approves a new rubric. Require independent review or old-rubric scoring.
2. **Mutable benchmark**: The loop optimizes a metric the agent can edit. Lock metrics outside the editable surface.
3. **No discard path**: Bad experiments accumulate and become implicit context. Log and isolate rejected attempts.
4. **No pruning round**: Agents stack components and rarely remove them. Require leave-one-out or simplification checks for complex proposals.
5. **Stale upstream view**: Long-running agents stop checking new sources. Schedule periodic source refresh before major decisions.
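
The approval rule from the Scoring and Gates sections can be sketched in Python as shown below. This is a minimal illustration, not part of the skill package. It assumes that "total" means the weighted sum of the five dimension scores (so it ranges from 0 to 2), and that a gate with no recorded result counts as failed. The names `WEIGHTS_PCT`, `Decision`, and `evaluate` are hypothetical.

```python
from dataclasses import dataclass

# Weights from the Scoring table, kept as integer percentages so the
# threshold comparison avoids floating-point rounding. Keys are hypothetical.
WEIGHTS_PCT = {
    "feedback_quality": 25,
    "search_discipline": 20,
    "auditability": 20,
    "safety_and_governance": 20,
    "cost_control": 15,
}
GATES = ("H1", "H2", "H3", "H4", "H5")
THRESHOLD_PCT = 150  # corresponds to "total >= 1.5"


@dataclass(frozen=True)
class Decision:
    total: float          # weighted total on the 0..2 scale
    failed_gates: tuple   # gate IDs that did not pass
    approved: bool        # False means "route to human review"


def evaluate(scores: dict, gates: dict) -> Decision:
    """Apply the rubric: approve only if all gates pass and total >= 1.5."""
    if set(scores) != set(WEIGHTS_PCT):
        raise ValueError("scores must cover exactly the five rubric dimensions")
    for name, score in scores.items():
        if score not in (0, 1, 2):
            raise ValueError(f"{name}: score must be 0, 1, or 2")

    # Assumption: a gate missing from `gates` is treated as failed.
    failed = tuple(g for g in GATES if not gates.get(g, False))

    # Assumption: "total" is the weighted sum of dimension scores.
    weighted_pct = sum(WEIGHTS_PCT[name] * scores[name] for name in WEIGHTS_PCT)

    approved = not failed and weighted_pct >= THRESHOLD_PCT
    return Decision(weighted_pct / 100, failed, approved)
```

Under these assumptions, a proposal that scores 2 on every dimension still routes to human review if any single gate fails, which matches the rule that any failed gate requires human review.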