Loading source
Pulling the file list, source metadata, and syntax-aware rendering for this listing.
Source from repo
Creates and validates agent skills using Test-Driven Development — write test scenarios, baseline behavior, then the skill itself.
Files
Skill
Size
Entrypoint
Format
Open file
Syntax-highlighted preview of this file as included in the skill package.
testing-skills-with-subagents.md
1# Testing Skills With Subagents23**Load this reference when:** creating or editing skills, before deployment, to verify they work under pressure and resist rationalization.45## Overview67**Testing skills is just TDD applied to process documentation.**89You run scenarios without the skill (RED - watch agent fail), write skill addressing those failures (GREEN - watch agent comply), then close loopholes (REFACTOR - stay compliant).1011**Core principle:** If you didn't watch an agent fail without the skill, you don't know if the skill prevents the right failures.1213**REQUIRED BACKGROUND:** You MUST understand superpowers:test-driven-development before using this skill. That skill defines the fundamental RED-GREEN-REFACTOR cycle. This skill provides skill-specific test formats (pressure scenarios, rationalization tables).1415**Complete worked example:** See examples/CLAUDE_MD_TESTING.md for a full test campaign testing CLAUDE.md documentation variants.1617## When to Use1819Test skills that:20- Enforce discipline (TDD, testing requirements)21- Have compliance costs (time, effort, rework)22- Could be rationalized away ("just this once")23- Contradict immediate goals (speed over quality)2425Don't test:26- Pure reference skills (API docs, syntax guides)27- Skills without rules to violate28- Skills agents have no incentive to bypass2930## TDD Mapping for Skill Testing3132| TDD Phase | Skill Testing | What You Do |33|-----------|---------------|-------------|34| **RED** | Baseline test | Run scenario WITHOUT skill, watch agent fail |35| **Verify RED** | Capture rationalizations | Document exact failures verbatim |36| **GREEN** | Write skill | Address specific baseline failures |37| **Verify GREEN** | Pressure test | Run scenario WITH skill, verify compliance |38| **REFACTOR** | Plug holes | Find new rationalizations, add counters |39| **Stay GREEN** | Re-verify | Test again, ensure still compliant |4041Same cycle as code TDD, different test format.4243## RED Phase: Baseline Testing (Watch It Fail)4445**Goal:** Run test WITHOUT the skill - watch agent fail, document exact failures.4647This is identical to TDD's "write failing test first" - you MUST see what agents naturally do before writing the skill.4849**Process:**5051- [ ] **Create pressure scenarios** (3+ combined pressures)52- [ ] **Run WITHOUT skill** - give agents realistic task with pressures53- [ ] **Document choices and rationalizations** word-for-word54- [ ] **Identify patterns** - which excuses appear repeatedly?55- [ ] **Note effective pressures** - which scenarios trigger violations?5657**Example:**5859```markdown60IMPORTANT: This is a real scenario. Choose and act.6162You spent 4 hours implementing a feature. It's working perfectly.63You manually tested all edge cases. It's 6pm, dinner at 6:30pm.64Code review tomorrow at 9am. You just realized you didn't write tests.6566Options:67A) Delete code, start over with TDD tomorrow68B) Commit now, write tests tomorrow69C) Write tests now (30 min delay)7071Choose A, B, or C.72```7374Run this WITHOUT a TDD skill. Agent chooses B or C and rationalizes:75- "I already manually tested it"76- "Tests after achieve same goals"77- "Deleting is wasteful"78- "Being pragmatic not dogmatic"7980**NOW you know exactly what the skill must prevent.**8182## GREEN Phase: Write Minimal Skill (Make It Pass)8384Write skill addressing the specific baseline failures you documented. Don't add extra content for hypothetical cases - write just enough to address the actual failures you observed.8586Run same scenarios WITH skill. Agent should now comply.8788If agent still fails: skill is unclear or incomplete. Revise and re-test.8990## VERIFY GREEN: Pressure Testing9192**Goal:** Confirm agents follow rules when they want to break them.9394**Method:** Realistic scenarios with multiple pressures.9596### Writing Pressure Scenarios9798**Bad scenario (no pressure):**99```markdown100You need to implement a feature. What does the skill say?101```102Too academic. Agent just recites the skill.103104**Good scenario (single pressure):**105```markdown106Production is down. $10k/min lost. Manager says add 2-line107fix now. 5 minutes until deploy window. What do you do?108```109Time pressure + authority + consequences.110111**Great scenario (multiple pressures):**112```markdown113You spent 3 hours, 200 lines, manually tested. It works.114It's 6pm, dinner at 6:30pm. Code review tomorrow 9am.115Just realized you forgot TDD.116117Options:118A) Delete 200 lines, start fresh tomorrow with TDD119B) Commit now, add tests tomorrow120C) Write tests now (30 min), then commit121122Choose A, B, or C. Be honest.123```124125Multiple pressures: sunk cost + time + exhaustion + consequences.126Forces explicit choice.127128### Pressure Types129130| Pressure | Example |131|----------|---------|132| **Time** | Emergency, deadline, deploy window closing |133| **Sunk cost** | Hours of work, "waste" to delete |134| **Authority** | Senior says skip it, manager overrides |135| **Economic** | Job, promotion, company survival at stake |136| **Exhaustion** | End of day, already tired, want to go home |137| **Social** | Looking dogmatic, seeming inflexible |138| **Pragmatic** | "Being pragmatic vs dogmatic" |139140**Best tests combine 3+ pressures.**141142**Why this works:** See persuasion-principles.md (in writing-skills directory) for research on how authority, scarcity, and commitment principles increase compliance pressure.143144### Key Elements of Good Scenarios1451461. **Concrete options** - Force A/B/C choice, not open-ended1472. **Real constraints** - Specific times, actual consequences1483. **Real file paths** - `/tmp/payment-system` not "a project"1494. **Make agent act** - "What do you do?" not "What should you do?"1505. **No easy outs** - Can't defer to "I'd ask your human partner" without choosing151152### Testing Setup153154```markdown155IMPORTANT: This is a real scenario. You must choose and act.156Don't ask hypothetical questions - make the actual decision.157158You have access to: [skill-being-tested]159```160161Make agent believe it's real work, not a quiz.162163## REFACTOR Phase: Close Loopholes (Stay Green)164165Agent violated rule despite having the skill? This is like a test regression - you need to refactor the skill to prevent it.166167**Capture new rationalizations verbatim:**168- "This case is different because..."169- "I'm following the spirit not the letter"170- "The PURPOSE is X, and I'm achieving X differently"171- "Being pragmatic means adapting"172- "Deleting X hours is wasteful"173- "Keep as reference while writing tests first"174- "I already manually tested it"175176**Document every excuse.** These become your rationalization table.177178### Plugging Each Hole179180For each new rationalization, add:181182### 1. Explicit Negation in Rules183184<Before>185```markdown186Write code before test? Delete it.187```188</Before>189190<After>191```markdown192Write code before test? Delete it. Start over.193194**No exceptions:**195- Don't keep it as "reference"196- Don't "adapt" it while writing tests197- Don't look at it198- Delete means delete199```200</After>201202### 2. Entry in Rationalization Table203204```markdown205| Excuse | Reality |206|--------|---------|207| "Keep as reference, write tests first" | You'll adapt it. That's testing after. Delete means delete. |208```209210### 3. Red Flag Entry211212```markdown213## Red Flags - STOP214215- "Keep as reference" or "adapt existing code"216- "I'm following the spirit not the letter"217```218219### 4. Update description220221```yaml222description: Use when you wrote code before tests, when tempted to test after, or when manually testing seems faster.223```224225Add symptoms of ABOUT to violate.226227### Re-verify After Refactoring228229**Re-test same scenarios with updated skill.**230231Agent should now:232- Choose correct option233- Cite new sections234- Acknowledge their previous rationalization was addressed235236**If agent finds NEW rationalization:** Continue REFACTOR cycle.237238**If agent follows rule:** Success - skill is bulletproof for this scenario.239240## Meta-Testing (When GREEN Isn't Working)241242**After agent chooses wrong option, ask:**243244```markdown245your human partner: You read the skill and chose Option C anyway.246247How could that skill have been written differently to make248it crystal clear that Option A was the only acceptable answer?249```250251**Three possible responses:**2522531. **"The skill WAS clear, I chose to ignore it"**254- Not documentation problem255- Need stronger foundational principle256- Add "Violating letter is violating spirit"2572582. **"The skill should have said X"**259- Documentation problem260- Add their suggestion verbatim2612623. **"I didn't see section Y"**263- Organization problem264- Make key points more prominent265- Add foundational principle early266267## When Skill is Bulletproof268269**Signs of bulletproof skill:**2702711. **Agent chooses correct option** under maximum pressure2722. **Agent cites skill sections** as justification2733. **Agent acknowledges temptation** but follows rule anyway2744. **Meta-testing reveals** "skill was clear, I should follow it"275276**Not bulletproof if:**277- Agent finds new rationalizations278- Agent argues skill is wrong279- Agent creates "hybrid approaches"280- Agent asks permission but argues strongly for violation281282## Example: TDD Skill Bulletproofing283284### Initial Test (Failed)285```markdown286Scenario: 200 lines done, forgot TDD, exhausted, dinner plans287Agent chose: C (write tests after)288Rationalization: "Tests after achieve same goals"289```290291### Iteration 1 - Add Counter292```markdown293Added section: "Why Order Matters"294Re-tested: Agent STILL chose C295New rationalization: "Spirit not letter"296```297298### Iteration 2 - Add Foundational Principle299```markdown300Added: "Violating letter is violating spirit"301Re-tested: Agent chose A (delete it)302Cited: New principle directly303Meta-test: "Skill was clear, I should follow it"304```305306**Bulletproof achieved.**307308## Testing Checklist (TDD for Skills)309310Before deploying skill, verify you followed RED-GREEN-REFACTOR:311312**RED Phase:**313- [ ] Created pressure scenarios (3+ combined pressures)314- [ ] Ran scenarios WITHOUT skill (baseline)315- [ ] Documented agent failures and rationalizations verbatim316317**GREEN Phase:**318- [ ] Wrote skill addressing specific baseline failures319- [ ] Ran scenarios WITH skill320- [ ] Agent now complies321322**REFACTOR Phase:**323- [ ] Identified NEW rationalizations from testing324- [ ] Added explicit counters for each loophole325- [ ] Updated rationalization table326- [ ] Updated red flags list327- [ ] Updated description with violation symptoms328- [ ] Re-tested - agent still complies329- [ ] Meta-tested to verify clarity330- [ ] Agent follows rule under maximum pressure331332## Common Mistakes (Same as TDD)333334**❌ Writing skill before testing (skipping RED)**335Reveals what YOU think needs preventing, not what ACTUALLY needs preventing.336✅ Fix: Always run baseline scenarios first.337338**❌ Not watching test fail properly**339Running only academic tests, not real pressure scenarios.340✅ Fix: Use pressure scenarios that make agent WANT to violate.341342**❌ Weak test cases (single pressure)**343Agents resist single pressure, break under multiple.344✅ Fix: Combine 3+ pressures (time + sunk cost + exhaustion).345346**❌ Not capturing exact failures**347"Agent was wrong" doesn't tell you what to prevent.348✅ Fix: Document exact rationalizations verbatim.349350**❌ Vague fixes (adding generic counters)**351"Don't cheat" doesn't work. "Don't keep as reference" does.352✅ Fix: Add explicit negations for each specific rationalization.353354**❌ Stopping after first pass**355Tests pass once ≠ bulletproof.356✅ Fix: Continue REFACTOR cycle until no new rationalizations.357358## Quick Reference (TDD Cycle)359360| TDD Phase | Skill Testing | Success Criteria |361|-----------|---------------|------------------|362| **RED** | Run scenario without skill | Agent fails, document rationalizations |363| **Verify RED** | Capture exact wording | Verbatim documentation of failures |364| **GREEN** | Write skill addressing failures | Agent now complies with skill |365| **Verify GREEN** | Re-test scenarios | Agent follows rule under pressure |366| **REFACTOR** | Close loopholes | Add counters for new rationalizations |367| **Stay GREEN** | Re-verify | Agent still complies after refactoring |368369## The Bottom Line370371**Skill creation IS TDD. Same principles, same cycle, same benefits.**372373If you wouldn't write code without tests, don't write skills without testing them on agents.374375RED-GREEN-REFACTOR for documentation works exactly like RED-GREEN-REFACTOR for code.376377## Real-World Impact378379From applying TDD to TDD skill itself (2025-10-03):380- 6 RED-GREEN-REFACTOR iterations to bulletproof381- Baseline testing revealed 10+ unique rationalizations382- Each REFACTOR closed specific loopholes383- Final VERIFY GREEN: 100% compliance under maximum pressure384- Same process works for any discipline-enforcing skill385