Agent Skills for Context Engineering

A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems.

Source repo: muratcankoylan (GitHub)

Files: 339
Skill: n/a
Size: 4.3 MB
Entrypoint: SKILL.md
Format: git-repo

researcher/rubrics/harness-change.md

# Harness Change Rubric

Use this rubric when a source proposes changes to a research loop, evaluation harness, agent operating procedure, or PR automation policy.

## Harness Definition

A harness is the control system around an agent: prompts, editable surfaces, tools, evals, logs, retry rules, rollback rules, and human approval gates. Harness engineering is separate from content authoring because agents can game or weaken their own harnesses if the boundary is not explicit.

## Locked vs Editable Surfaces

Every harness proposal must classify each file or setting as one of these surfaces:

| Surface | Examples | Rule |
| --- | --- | --- |
| Locked | Rubrics, evaluation scripts, source registry, merge policy | Agent may propose changes but cannot use the changed version to score the same proposal |
| Editable | Skill draft, source evaluation, research thread, proposal text | Agent can modify during the loop |
| Append-only | Result logs, rejected source log, experiment history | Agent can append, not rewrite |
| Human-controlled | Merge decisions, credential setup, destructive actions | Agent cannot automate without explicit approval |
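
A minimal Python sketch of how this classification could be enforced inside a loop. The path patterns (`rubrics/*`, `evals/*`, `logs/*`, `drafts/*`), the action names, and the choice to treat unclassified paths as human-controlled are illustrative assumptions, not part of this rubric.

```python
from enum import Enum
from fnmatch import fnmatch


class Surface(Enum):
    LOCKED = "locked"
    EDITABLE = "editable"
    APPEND_ONLY = "append-only"
    HUMAN_CONTROLLED = "human-controlled"


# Hypothetical path patterns; a real repo would define its own.
# The first matching pattern wins.
SURFACE_PATTERNS = [
    ("rubrics/*", Surface.LOCKED),
    ("evals/*", Surface.LOCKED),
    ("logs/*", Surface.APPEND_ONLY),
    ("drafts/*", Surface.EDITABLE),
]

# Assumed action names: what the agent may do on each surface without approval.
ALLOWED_ACTIONS = {
    Surface.LOCKED: {"read", "propose"},
    Surface.EDITABLE: {"read", "propose", "write"},
    Surface.APPEND_ONLY: {"read", "append"},
    Surface.HUMAN_CONTROLLED: {"read"},
}


def classify(path: str) -> Surface:
    """Return the surface for a path; unknown paths default to human-controlled."""
    for pattern, surface in SURFACE_PATTERNS:
        if fnmatch(path, pattern):
            return surface
    return Surface.HUMAN_CONTROLLED


def is_allowed(path: str, action: str) -> bool:
    """True if the agent may perform the action on the path without approval."""
    return action in ALLOWED_ACTIONS[classify(path)]
```

Defaulting unknown paths to the strictest surface means a new file cannot become editable by omission. The sketch does not cover the locked-surface scoring rule: a proposal that changes a rubric still has to be scored with the unchanged version.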

## Gates

| Gate | Pass | Fail |
| --- | --- | --- |
| H1 Objective Clarity | The harness has a measurable objective and stop/continue rule | Goal is vague or impossible to evaluate |
| H2 Metric Integrity | Metrics are locked, external, and resistant to gaming | Agent can change the metric or the hidden answer key |
| H3 Artifact Durability | The loop writes durable logs, candidates, failures, and decisions | State exists only in chat context |
| H4 Recovery Path | Crashes, bad sources, and failed proposals have explicit handling | Failures are ignored or retried indefinitely |
| H5 Human Governance | PR, merge, and policy-changing boundaries are explicit | Agent can merge or change governance autonomously |

Any failed gate requires human review.
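
The gate rule can be sketched as a small check. The `GateResult` type and function names below are hypothetical, and treating an unevaluated gate as failed is an assumption that errs toward human review.

```python
from dataclasses import dataclass

GATES = ("H1", "H2", "H3", "H4", "H5")


@dataclass
class GateResult:
    gate: str        # one of GATES
    passed: bool
    note: str = ""   # reviewer's reason, kept for the audit log


def failed_gates(results: list[GateResult]) -> list[str]:
    """Return gate IDs that failed or were never evaluated."""
    status: dict[str, bool] = {}
    for r in results:
        # A gate passes only if every recorded result for it passed.
        status[r.gate] = status.get(r.gate, True) and r.passed
    return [g for g in GATES if not status.get(g, False)]


def needs_human_review(results: list[GateResult]) -> bool:
    """Any failed or missing gate routes the proposal to a human."""
    return bool(failed_gates(results))
```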

## Scoring

Score each dimension 0, 1, or 2.

| Dimension | Weight | Criteria for a score of 2 |
| --- | --- | --- |
| Feedback Quality | 25% | Fast, unambiguous feedback from locked metrics or rubrics |
| Search Discipline | 20% | Supports novelty, ablation, pruning, and upstream refresh |
| Auditability | 20% | Durable logs reconstruct what happened and why |
| Safety and Governance | 20% | Clear approval gates and rollback paths |
| Cost Control | 15% | Bounded token, compute, and review cost per loop |

Approve harness changes only when the weighted total (the sum of weight × score, on a 0 to 2 scale) is >= 1.5 and all gates pass. Otherwise route to human review.
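
A worked sketch of the scoring arithmetic, assuming the total is the weighted sum of the five dimension scores. The dictionary keys and function names are hypothetical. Weights are kept as integer percentages so the 1.5 threshold (150 points) is compared without floating-point rounding.

```python
# Weights from the table above, as integer percentages (they sum to 100).
WEIGHTS_PCT = {
    "feedback_quality": 25,
    "search_discipline": 20,
    "auditability": 20,
    "safety_and_governance": 20,
    "cost_control": 15,
}
THRESHOLD_POINTS = 150  # 1.5 on the 0-2 scale, multiplied by 100


def weighted_points(scores: dict[str, int]) -> int:
    """Sum of weight_pct * score; ranges from 0 to 200."""
    if set(scores) != set(WEIGHTS_PCT):
        raise ValueError("scores must cover exactly the five dimensions")
    for name, value in scores.items():
        if value not in (0, 1, 2):
            raise ValueError(f"{name}: score must be 0, 1, or 2")
    return sum(WEIGHTS_PCT[name] * scores[name] for name in WEIGHTS_PCT)


def decide(scores: dict[str, int], all_gates_pass: bool) -> str:
    """Approve only when every gate passes and the weighted total is >= 1.5."""
    approved = all_gates_pass and weighted_points(scores) >= THRESHOLD_POINTS
    return "approve" if approved else "human_review"
```

For example, scores of 2, 1, 2, 2, 0 in table order give 50 + 20 + 40 + 40 + 0 = 150 points, a weighted total of exactly 1.5, which meets the threshold only if all gates also pass.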

## Required Checks

Before accepting a harness change:

1. Run the old harness on at least one known artifact if possible.
2. Run the proposed harness on the same artifact and compare decisions.
3. Confirm that stricter checks do not block obviously valid examples.
4. Confirm that looser checks do not admit known bad examples.
5. Record whether the change affects future proposals only or invalidates earlier results.
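
The checks above can be sketched as a side-by-side comparison. This assumes each harness can be modeled as a function from an artifact to an accept/reject decision; `compare_harnesses` and the report keys are hypothetical names, and a changed decision is treated only as a signal that earlier results may need rescoring.

```python
from typing import Callable

# A harness is modeled as a function: True means the artifact is accepted.
Harness = Callable[[str], bool]


def compare_harnesses(
    old: Harness,
    new: Harness,
    known_good: list[str],
    known_bad: list[str],
) -> dict:
    """Run both harnesses on the same known artifacts and report differences."""
    artifacts = [*known_good, *known_bad]
    report = {
        # Check 3: valid examples the proposed harness would block.
        "blocked_valid": [a for a in known_good if not new(a)],
        # Check 4: known bad examples the proposed harness would admit.
        "admitted_bad": [a for a in known_bad if new(a)],
        # Checks 1 and 2: artifacts where the two harnesses disagree.
        "changed_decisions": [a for a in artifacts if old(a) != new(a)],
    }
    # Check 5: any changed decision suggests earlier results may be affected.
    report["may_invalidate_earlier_results"] = bool(report["changed_decisions"])
    return report
```

An empty `blocked_valid` and `admitted_bad` with a non-empty `changed_decisions` list is still worth recording, since it shows the change is not limited to future proposals.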

## Common Anti-Patterns

1. **Self-scored harness edits**: The same agent proposes and approves a new rubric. Require independent review or old-rubric scoring.
2. **Mutable benchmark**: The loop optimizes a metric the agent can edit. Lock metrics outside the editable surface.
3. **No discard path**: Bad experiments accumulate and become implicit context. Log and isolate rejected attempts.
4. **No pruning round**: Agents stack components and rarely remove them. Require leave-one-out or simplification checks for complex proposals.
5. **Stale upstream view**: Long-running agents stop checking new sources. Schedule periodic source refresh before major decisions.
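
The leave-one-out check named in anti-pattern 4 could look like the following sketch. The `evaluate` callback stands in for a locked metric and is an assumption, as are the function name and the `tolerance` parameter.

```python
from typing import Callable, Sequence


def leave_one_out(
    components: Sequence[str],
    evaluate: Callable[[list[str]], float],
    tolerance: float = 0.0,
) -> list[str]:
    """Return components whose removal does not lower the score beyond tolerance."""
    baseline = evaluate(list(components))
    removable = []
    for i, name in enumerate(components):
        # Re-evaluate the proposal with exactly one component removed.
        without = [c for j, c in enumerate(components) if j != i]
        if evaluate(without) >= baseline - tolerance:
            removable.append(name)
    return removable
```

Components returned here are candidates for pruning, not automatic deletions: each one is tested in isolation, so removing several at once needs its own evaluation.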