Creates and validates agent skills using Test-Driven Development — write test scenarios, baseline behavior, then the skill itself.
SKILL.md

---
name: writing-skills
description: Use when creating new skills, editing existing skills, or verifying skills work before deployment
---

# Writing Skills

## Overview

**Writing skills IS Test-Driven Development applied to process documentation.**

**Personal skills live in your runtime's skills directory** — see [claude-code-tools.md](../using-superpowers/references/claude-code-tools.md), [codex-tools.md](../using-superpowers/references/codex-tools.md), [copilot-tools.md](../using-superpowers/references/copilot-tools.md), or [gemini-tools.md](../using-superpowers/references/gemini-tools.md) for the path on your runtime. Codex, Copilot CLI, and Gemini CLI all also recognize `~/.agents/skills/` as a cross-runtime alias.

You write test cases (pressure scenarios with subagents), watch them fail (baseline behavior), write the skill (documentation), watch tests pass (agents comply), and refactor (close loopholes).

**Core principle:** If you didn't watch an agent fail without the skill, you don't know if the skill teaches the right thing.

**REQUIRED BACKGROUND:** You MUST understand superpowers:test-driven-development before using this skill. That skill defines the fundamental RED-GREEN-REFACTOR cycle. This skill adapts TDD to documentation.

**Official guidance:** For Anthropic's official skill authoring best practices, see anthropic-best-practices.md. This document provides additional patterns and guidelines that complement the TDD-focused approach in this skill.

## What is a Skill?

A **skill** is a reference guide for proven techniques, patterns, or tools.
Skills help future agents find and apply effective approaches.

**Skills are:** Reusable techniques, patterns, tools, reference guides

**Skills are NOT:** Narratives about how you solved a problem once

## TDD Mapping for Skills

| TDD Concept | Skill Creation |
|-------------|----------------|
| **Test case** | Pressure scenario with subagent |
| **Production code** | Skill document (SKILL.md) |
| **Test fails (RED)** | Agent violates rule without skill (baseline) |
| **Test passes (GREEN)** | Agent complies with skill present |
| **Refactor** | Close loopholes while maintaining compliance |
| **Write test first** | Run baseline scenario BEFORE writing skill |
| **Watch it fail** | Document exact rationalizations agent uses |
| **Minimal code** | Write skill addressing those specific violations |
| **Watch it pass** | Verify agent now complies |
| **Refactor cycle** | Find new rationalizations → plug → re-verify |

The entire skill creation process follows RED-GREEN-REFACTOR.

## When to Create a Skill

**Create when:**
- Technique wasn't intuitively obvious to you
- You'd reference this again across projects
- Pattern applies broadly (not project-specific)
- Others would benefit

**Don't create for:**
- One-off solutions
- Standard practices well-documented elsewhere
- Project-specific conventions (put in your instructions file)
- Mechanical constraints (if it's enforceable with regex/validation, automate it—save documentation for judgment calls)

## Skill Types

### Technique
Concrete method with steps to follow (condition-based-waiting, root-cause-tracing)

### Pattern
Way of thinking about problems (flatten-with-flags, test-invariants)

### Reference
API docs, syntax guides, tool documentation (office docs)

## Directory Structure

```
skills/
  skill-name/
    SKILL.md              # Main reference (required)
    supporting-file.*     # Only if needed
```

**Flat namespace** - all skills in one searchable
namespace

**Separate files for:**
1. **Heavy reference** (100+ lines) - API docs, comprehensive syntax
2. **Reusable tools** - Scripts, utilities, templates

**Keep inline:**
- Principles and concepts
- Code patterns (< 50 lines)
- Everything else

## SKILL.md Structure

**Frontmatter (YAML):**
- Two required fields: `name` and `description` (see [agentskills.io/specification](https://agentskills.io/specification) for all supported fields)
- Max 1024 characters total
- `name`: Use lowercase letters, numbers, and hyphens only (no parentheses, special chars)
- `description`: Third-person, describes ONLY when to use (NOT what it does)
  - Start with "Use when..." to focus on triggering conditions
  - Include specific symptoms, situations, and contexts
  - **NEVER summarize the skill's process or workflow** (see SDO section for why)
  - Keep under 500 characters if possible

```markdown
---
name: skill-name-with-hyphens
description: Use when [specific triggering conditions and symptoms]
---

# Skill Name

## Overview
What is this? Core principle in 1-2 sentences.

## When to Use
[Small inline flowchart IF decision non-obvious]

Bullet list with SYMPTOMS and use cases
When NOT to use

## Core Pattern (for techniques/patterns)
Before/after code comparison

## Quick Reference
Table or bullets for scanning common operations

## Implementation
Inline code for simple patterns
Link to file for heavy reference or reusable tools

## Common Mistakes
What goes wrong + fixes

## Real-World Impact (optional)
Concrete results
```

## Skill Discovery Optimization (SDO)

**Critical for discovery:** Future agents need to FIND your skill

### 1. Rich Description Field

**Purpose:** Your agent reads the description to decide which skills to load for a given task. Make it answer: "Should I read this skill right now?"

**Format:** Start with "Use when..."
to focus on triggering conditions

**CRITICAL: Description = When to Use, NOT What the Skill Does**

The description should ONLY describe triggering conditions. Do NOT summarize the skill's process or workflow in the description.

**Why this matters:** Testing revealed that when a description summarizes the skill's workflow, an agent may follow the description instead of reading the full skill content. A description saying "code review between tasks" caused an agent to do ONE review, even though the skill's flowchart clearly showed TWO reviews (spec compliance then code quality).

When the description was changed to just "Use when executing implementation plans with independent tasks" (no workflow summary), the agent correctly read the flowchart and followed the two-stage review process.

**The trap:** Descriptions that summarize workflow create a shortcut agents will take. The skill body becomes documentation agents skip.

```yaml
# ❌ BAD: Summarizes workflow - agents may follow this instead of reading skill
description: Use when executing plans - dispatches subagent per task with code review between tasks

# ❌ BAD: Too much process detail
description: Use for TDD - write test first, watch it fail, write minimal code, refactor

# ✅ GOOD: Just triggering conditions, no workflow summary
description: Use when executing implementation plans with independent tasks in the current session

# ✅ GOOD: Triggering conditions only
description: Use when implementing any feature or bugfix, before writing implementation code
```

**Content:**
- Use concrete triggers, symptoms, and situations that signal this skill applies
- Describe the *problem* (race conditions, inconsistent behavior) not *language-specific symptoms* (setTimeout, sleep)
- Keep triggers technology-agnostic unless the skill itself is technology-specific
- If skill is technology-specific, make that explicit in the trigger
- Write in third person
(injected into system prompt)
- **NEVER summarize the skill's process or workflow**

```yaml
# ❌ BAD: Too abstract, vague, doesn't include when to use
description: For async testing

# ❌ BAD: First person
description: I can help you with async tests when they're flaky

# ❌ BAD: Mentions technology but skill isn't specific to it
description: Use when tests use setTimeout/sleep and are flaky

# ✅ GOOD: Starts with "Use when", describes problem, no workflow
description: Use when tests have race conditions, timing dependencies, or pass/fail inconsistently

# ✅ GOOD: Technology-specific skill with explicit trigger
description: Use when using React Router and handling authentication redirects
```

### 2. Keyword Coverage

Use words an agent would search for:
- Error messages: "Hook timed out", "ENOTEMPTY", "race condition"
- Symptoms: "flaky", "hanging", "zombie", "pollution"
- Synonyms: "timeout/hang/freeze", "cleanup/teardown/afterEach"
- Tools: Actual commands, library names, file types

### 3. Descriptive Naming

**Use active voice, verb-first:**
- ✅ `creating-skills` not `skill-creation`
- ✅ `condition-based-waiting` not `async-test-helpers`

### 4. Token Efficiency (Critical)

**Problem:** getting-started and frequently-referenced skills load into EVERY conversation. Every token counts.

**Target word counts:**
- getting-started workflows: <150 words each
- Frequently-loaded skills: <200 words total
- Other skills: <500 words (still be concise)

**Techniques:**

**Move details to tool help:**
```bash
# ❌ BAD: Document all flags in SKILL.md
search-conversations supports --text, --both, --after DATE, --before DATE, --limit N

# ✅ GOOD: Reference --help
search-conversations supports multiple modes and filters. Run --help for details.
```

**Use cross-references:**
```markdown
# ❌ BAD: Repeat workflow details
When searching, dispatch subagent with template...
[20 lines of repeated instructions]

# ✅ GOOD: Reference other skill
Always use subagents (50-100x context savings). REQUIRED: Use [other-skill-name] for workflow.
```

**Compress examples:**
```markdown
# ❌ BAD: Verbose example (42 words)
your human partner: "How did we handle authentication errors in React Router before?"
You: I'll search past conversations for React Router authentication patterns.
[Dispatch subagent with search query: "React Router authentication error handling 401"]

# ✅ GOOD: Minimal example (20 words)
Partner: "How did we handle auth errors in React Router?"
You: Searching...
[Dispatch subagent → synthesis]
```

**Eliminate redundancy:**
- Don't repeat what's in cross-referenced skills
- Don't explain what's obvious from command
- Don't include multiple examples of same pattern

**Verification:**
```bash
wc -w skills/path/SKILL.md
# getting-started workflows: aim for <150 each
# Other frequently-loaded: aim for <200 total
```

**Name by what you DO or core insight:**
- ✅ `condition-based-waiting` > `async-test-helpers`
- ✅ `using-skills` not `skill-usage`
- ✅ `flatten-with-flags` > `data-structure-refactoring`
- ✅ `root-cause-tracing` > `debugging-techniques`

**Gerunds (-ing) work well for processes:**
- `creating-skills`, `testing-skills`, `debugging-with-logs`
- Active, describes the action you're taking

### 5. Cross-Referencing Other Skills

**When writing documentation that references other skills:**

Use skill name only, with explicit requirement markers:
- ✅ Good: `**REQUIRED SUB-SKILL:** Use superpowers:test-driven-development`
- ✅ Good: `**REQUIRED BACKGROUND:** You MUST understand superpowers:systematic-debugging`
- ❌ Bad: `See skills/testing/test-driven-development` (unclear if required)
- ❌ Bad: `@skills/testing/test-driven-development/SKILL.md` (force-loads, burns context)

**Why no @ links:** `@` syntax force-loads files immediately, consuming 200k+ context before you need them.

## Flowchart Usage

```dot
digraph when_flowchart {
    "Need to show information?" [shape=diamond];
    "Decision where I might go wrong?" [shape=diamond];
    "Use markdown" [shape=box];
    "Small inline flowchart" [shape=box];

    "Need to show information?" -> "Decision where I might go wrong?" [label="yes"];
    "Decision where I might go wrong?" -> "Small inline flowchart" [label="yes"];
    "Decision where I might go wrong?" -> "Use markdown" [label="no"];
}
```

**Use flowcharts ONLY for:**
- Non-obvious decision points
- Process loops where you might stop too early
- "When to use A vs B" decisions

**Never use flowcharts for:**
- Reference material → Tables, lists
- Code examples → Markdown blocks
- Linear instructions → Numbered lists
- Labels without semantic meaning (step1, helper2)

See `graphviz-conventions.dot` in this directory for graphviz style rules.

**Visualizing for your human partner:** Use `render-graphs.js` in this directory to render a skill's flowcharts to SVG:
```bash
./render-graphs.js ../some-skill           # Each diagram separately
./render-graphs.js ../some-skill --combine # All diagrams in one SVG
```

## Code Examples

**One excellent example beats many mediocre ones**

Choose most relevant language:
- Testing techniques → TypeScript/JavaScript
- System debugging → Shell/Python
- Data processing → Python

**Good example:**
- Complete and runnable
- Well-commented explaining WHY
- From real scenario
- Shows pattern clearly
- Ready to adapt (not generic template)

**Don't:**
- Implement in 5+ languages
- Create fill-in-the-blank templates
- Write contrived examples

You're good at porting - one great example is enough.

## File Organization

### Self-Contained Skill
```
defense-in-depth/
  SKILL.md    # Everything inline
```
When: All content fits, no heavy reference needed

### Skill with Reusable Tool
```
condition-based-waiting/
  SKILL.md    # Overview + patterns
  example.ts  # Working helpers to adapt
```
When: Tool is reusable code, not just narrative

### Skill with Heavy Reference
```
pptx/
  SKILL.md       # Overview + workflows
  pptxgenjs.md   # 600 lines API reference
  ooxml.md       # 500 lines XML structure
  scripts/       # Executable tools
```
When: Reference material too large for inline

## The Iron Law (Same as TDD)

```
NO SKILL WITHOUT A FAILING TEST FIRST
```

This applies to NEW skills AND EDITS to existing skills.

Write skill before testing? Delete it. Start over.
Edit skill without testing? Same violation.

**No exceptions:**
- Not for "simple additions"
- Not for "just adding a section"
- Not for "documentation updates"
- Don't keep untested changes as "reference"
- Don't "adapt" while running tests
- Delete means delete

**REQUIRED BACKGROUND:** The superpowers:test-driven-development skill explains why this matters. Same principles apply to documentation.

## Testing All Skill Types

Different skill types need different test approaches:

### Discipline-Enforcing Skills (rules/requirements)

**Examples:** TDD, verification-before-completion, designing-before-coding

**Test with:**
- Academic questions: Do they understand the rules?
- Pressure scenarios: Do they comply under stress?
- Multiple pressures combined: time + sunk cost + exhaustion
- Identify rationalizations and add explicit counters

**Success criteria:** Agent follows rule under maximum pressure

### Technique Skills (how-to guides)

**Examples:** condition-based-waiting, root-cause-tracing, defensive-programming

**Test with:**
- Application scenarios: Can they apply the technique correctly?
- Variation scenarios: Do they handle edge cases?
- Missing information tests: Do instructions have gaps?

**Success criteria:** Agent successfully applies technique to new scenario

### Pattern Skills (mental models)

**Examples:** reducing-complexity, information-hiding concepts

**Test with:**
- Recognition scenarios: Do they recognize when pattern applies?
- Application scenarios: Can they use the mental model?
- Counter-examples: Do they know when NOT to apply?

**Success criteria:** Agent correctly identifies when/how to apply pattern

### Reference Skills (documentation/APIs)

**Examples:** API documentation, command
references, library guides

**Test with:**
- Retrieval scenarios: Can they find the right information?
- Application scenarios: Can they use what they found correctly?
- Gap testing: Are common use cases covered?

**Success criteria:** Agent finds and correctly applies reference information

## Common Rationalizations for Skipping Testing

| Excuse | Reality |
|--------|---------|
| "Skill is obviously clear" | Clear to you ≠ clear to other agents. Test it. |
| "It's just a reference" | References can have gaps, unclear sections. Test retrieval. |
| "Testing is overkill" | Untested skills have issues. Always. 15 min testing saves hours. |
| "I'll test if problems emerge" | Problems = agents can't use skill. Test BEFORE deploying. |
| "Too tedious to test" | Testing is less tedious than debugging bad skill in production. |
| "I'm confident it's good" | Overconfidence guarantees issues. Test anyway. |
| "Academic review is enough" | Reading ≠ using. Test application scenarios. |
| "No time to test" | Deploying untested skill wastes more time fixing it later. |

**All of these mean: Test before deploying. No exceptions.**

## Match the Form to the Failure

Before writing guidance, classify the baseline failure.
The form that bulletproofs one failure type measurably backfires on another.

| Baseline failure | Right form | Wrong form |
|---|---|---|
| Skips/violates a rule under pressure (knows better, does it anyway) | Prohibition + rationalization table + red flags (see Bulletproofing below) | Soft guidance ("prefer...", "consider...") |
| Complies, but output has the wrong shape (bloated prompt, buried verdict, restated spec) | Positive recipe or contract: state what the output IS — its parts, in order | Prohibition list ("don't restate", "never narrate") |
| Omits a required element from something they already produce | Structural: REQUIRED field or slot in the template they fill in | Prose reminders near the template |
| Behavior should depend on a condition | Conditional keyed to an observable predicate ("if the brief exists, reference it") | Unconditional rule + exemption clauses |

**Why prohibitions backfire on shaping problems:** under a competing incentive ("make the prompt self-contained"), agents negotiate with "don't X". In head-to-head wording tests on dispatch-prompt guidance, the prohibition arm produced clearly more of the unwanted content than the recipe arm (fully separated distributions), and trended worse than even the no-guidance control — micro-test your own case rather than assuming, but never reach for the prohibition by default. A recipe leaves nothing to negotiate: the output matches the stated shape or it doesn't.

**Rules for whichever form you pick:**
- **No nuance clauses.** "Don't X unless it matters" reopens the negotiation — appending a single nuance clause to a winning recipe degraded it from consistent to noisy in the same wording tests. Express a real exception as its own conditional on an observable predicate.
- **Exemption clauses don't scope.** "This limit doesn't apply to code blocks" still suppresses code blocks.
If part of the output must be exempt, restructure so the rule can't reach it.

## Bulletproofing Skills Against Rationalization

Skills that enforce discipline (like TDD) need to resist rationalization. Agents are smart and will find loopholes when under pressure.

**Scope:** this toolkit is for discipline failures — an agent that knows the rule and skips it under pressure. For wrong-shaped output or omitted elements, prohibition-based bulletproofing backfires; use the forms in Match the Form to the Failure instead.

**Psychology note:** Understanding WHY persuasion techniques work helps you apply them systematically. See persuasion-principles.md for research foundation (Cialdini, 2021; Meincke et al., 2025) on authority, commitment, scarcity, social proof, and unity principles.

### Close Every Loophole Explicitly

Don't just state the rule - forbid specific workarounds:

<Bad>
```markdown
Write code before test? Delete it.
```
</Bad>

<Good>
```markdown
Write code before test? Delete it. Start over.

**No exceptions:**
- Don't keep it as "reference"
- Don't "adapt" it while writing tests
- Don't look at it
- Delete means delete
```
</Good>

### Address "Spirit vs Letter" Arguments

Add foundational principle early:

```markdown
**Violating the letter of the rules is violating the spirit of the rules.**
```

This cuts off an entire class of "I'm following the spirit" rationalizations.

### Build Rationalization Table

Capture rationalizations from baseline testing (see Testing section below). Every excuse agents make goes in the table:

```markdown
| Excuse | Reality |
|--------|---------|
| "Too simple to test" | Simple code breaks. Test takes 30 seconds. |
| "I'll test after" | Tests passing immediately prove nothing. |
| "Tests after achieve same goals" | Tests-after = "what does this do?" Tests-first = "what should this do?" |
```

### Create Red Flags List

Make it easy for agents to self-check when rationalizing:

```markdown
## Red Flags - STOP and Start Over

- Code before test
- "I already manually tested it"
- "Tests after achieve the same purpose"
- "It's about spirit not ritual"
- "This is different because..."

**All of these mean: Delete code. Start over with TDD.**
```

### Update SDO for Violation Symptoms

Add to description: symptoms of when you're ABOUT to violate the rule:

```yaml
description: Use when implementing any feature or bugfix, before writing implementation code
```

## RED-GREEN-REFACTOR for Skills

Follow the TDD cycle:

### RED: Write Failing Test (Baseline)

Run pressure scenario with subagent WITHOUT the skill. Document exact behavior:
- What choices did they make?
- What rationalizations did they use (verbatim)?
- Which pressures triggered violations?

This is "watch the test fail" - you must see what agents naturally do before writing the skill.

### GREEN: Write Minimal Skill

Write skill that addresses those specific rationalizations. Don't add extra content for hypothetical cases.

Run same scenarios WITH skill. Agent should now comply.

### REFACTOR: Close Loopholes

Agent found new rationalization? Add explicit counter. Re-test until bulletproof.

### Micro-Test Wording Before Full Scenarios

Full pressure-scenario runs are the final gate, but they are slow and expensive per iteration. Verify the wording itself first with micro-tests:

1. **One fresh-context sample per call** — a raw API call, or a single-shot subagent if you don't have API access. System prompt = the realistic context the guidance will live in (the full skill or prompt template, not the guidance in isolation); user message = a task that tempts the failure.
2. **Always include a no-guidance control.** If the control doesn't exhibit the failure, there is nothing to fix — stop, don't author the guidance.
3. **5+ reps per variant.** Single samples lie.
4. **Manually read every flagged match.** Score programmatically if you like, but template echoes and quoted counter-examples masquerade as hits; automated counts alone overstate both failure and success.
5. **Variance is a metric.** When guidance lands, reps converge on the same shape. Five different interpretations across five reps means the wording isn't binding — tighten the form before adding words.

Micro-tests verify wording; they do not replace pressure scenarios for discipline skills.

**Testing methodology:** See [testing-skills-with-subagents.md](testing-skills-with-subagents.md) for the complete testing methodology:
- How to write pressure scenarios
- Pressure types (time, sunk cost, authority, exhaustion)
- Plugging holes systematically
- Meta-testing techniques

## Anti-Patterns

### ❌ Narrative Example
"In session 2025-10-03, we found empty projectDir caused..."
**Why bad:** Too specific, not reusable

### ❌ Multi-Language Dilution
example-js.js, example-py.py, example-go.go
**Why bad:** Mediocre quality, maintenance burden

### ❌ Code in Flowcharts
```dot
step1 [label="import fs"];
step2 [label="read file"];
```
**Why bad:** Can't copy-paste, hard to read

### ❌ Generic Labels
helper1, helper2, step3, pattern4
**Why bad:** Labels should have semantic meaning

## STOP: Before Moving to Next Skill

**After writing ANY skill, you MUST STOP and complete the deployment process.**

**Do NOT:**
- Create multiple skills in batch without testing each
- Move to next skill before current one is verified
- Skip testing because "batching is more efficient"

**The deployment checklist below is MANDATORY for EACH skill.**

Deploying untested skills = deploying untested code.
It's a violation of quality standards.

## Skill Creation Checklist (TDD Adapted)

**IMPORTANT: Create a todo for EACH checklist item below.**

**RED Phase - Write Failing Test:**
- [ ] Create pressure scenarios (3+ combined pressures for discipline skills)
- [ ] Run scenarios WITHOUT skill - document baseline behavior verbatim
- [ ] Identify patterns in rationalizations/failures

**GREEN Phase - Write Minimal Skill:**
- [ ] Name uses only lowercase letters, numbers, hyphens (no parentheses/special chars)
- [ ] YAML frontmatter with required `name` and `description` fields (max 1024 chars; see [spec](https://agentskills.io/specification))
- [ ] Description starts with "Use when..." and includes specific triggers/symptoms
- [ ] Description written in third person
- [ ] Keywords throughout for search (errors, symptoms, tools)
- [ ] Clear overview with core principle
- [ ] Address specific baseline failures identified in RED
- [ ] Guidance form matches the failure type (see Match the Form to the Failure)
- [ ] For behavior-shaping guidance: wording micro-tested against a no-guidance control (5+ reps, every flagged match read manually) — N/A for pure reference skills
- [ ] Code inline OR link to separate file
- [ ] One excellent example (not multi-language)
- [ ] Run scenarios WITH skill - verify agents now comply

**REFACTOR Phase - Close Loopholes:**
- [ ] Identify NEW rationalizations from testing
- [ ] Add explicit counters (if discipline skill)
- [ ] Build rationalization table from all test iterations
- [ ] Create red flags list
- [ ] Re-test until bulletproof

**Quality Checks:**
- [ ] Small flowchart only if decision non-obvious
- [ ] Quick reference table
- [ ] Common mistakes section
- [ ] No narrative storytelling
- [ ] Supporting files only for tools or heavy reference

**Deployment:**
- [ ] Commit skill to git and push to your fork (if configured)
- [ ] Consider contributing back via
PR (if broadly useful)

## Discovery Workflow

How future agents find your skill:

1. **Encounters problem** ("tests are flaky")
2. **Searches skills** (greps descriptions, browses categories)
3. **Finds SKILL** (description matches)
4. **Scans overview** (is this relevant?)
5. **Reads patterns** (quick reference table)
6. **Loads example** (only when implementing)

**Optimize for this flow** - put searchable terms early and often.

## The Bottom Line

**Creating skills IS TDD for process documentation.**

Same Iron Law: No skill without failing test first.
Same cycle: RED (baseline) → GREEN (write skill) → REFACTOR (close loopholes).
Same benefits: Better quality, fewer surprises, bulletproof results.

If you follow TDD for code, follow it for skills. It's the same discipline applied to documentation.
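The frontmatter rules under SKILL.md Structure are mechanical, and this document itself says mechanical constraints belong in automation rather than prose. Below is a minimal Python sketch of such a check. It is not part of any runtime, and several of its details are assumptions:

- `lint_skill` and `split_frontmatter` are hypothetical names.
- The parser only understands flat `key: value` lines, not full YAML.
- The name pattern assumes the stricter lowercase-only form. Verify it against the agentskills.io specification for your runtime.
- The 1024-character cap is applied to the whole frontmatter block, as this document states it.
- The word count only approximates `wc -w` on the body.

```python
import re
from dataclasses import dataclass, field

# Assumption: lowercase letters/digits joined by single hyphens (stricter
# lowercase form; loosen if your runtime accepts capitals).
NAME_RE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")
MAX_FRONTMATTER_CHARS = 1024  # "Max 1024 characters total"
SOFT_DESCRIPTION_CHARS = 500  # "Keep under 500 characters if possible"


@dataclass
class LintResult:
    errors: list[str] = field(default_factory=list)
    warnings: list[str] = field(default_factory=list)


def split_frontmatter(text):
    """Return (frontmatter_lines, body), or None if there is no '---' block.

    Hypothetical helper: handles flat `key: value` lines only, not full YAML.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    for end in range(1, len(lines)):
        if lines[end].strip() == "---":
            return lines[1:end], "\n".join(lines[end + 1:])
    return None  # opening delimiter without a closing one


def lint_skill(text, word_budget=500):
    result = LintResult()
    parts = split_frontmatter(text)
    if parts is None:
        result.errors.append("missing '---' frontmatter block")
        return result
    fm_lines, body = parts

    fields = {}
    for line in fm_lines:
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            fields[key.strip()] = value.strip()
    name = fields.get("name", "")
    desc = fields.get("description", "")

    # Hard rules from the checklist become errors.
    if not name or not desc:
        result.errors.append("frontmatter needs both `name` and `description`")
    if name and not NAME_RE.match(name):
        result.errors.append(f"name {name!r}: use lowercase letters, digits, hyphens")
    if desc and not desc.startswith("Use when"):
        result.errors.append("description should start with 'Use when...'")
    if len("\n".join(fm_lines)) > MAX_FRONTMATTER_CHARS:
        result.errors.append("frontmatter exceeds 1024 characters")

    # Soft targets become warnings.
    if len(desc) > SOFT_DESCRIPTION_CHARS:
        result.warnings.append("description is over 500 characters")
    words = len(body.split())  # roughly what `wc -w` reports for the body
    if words > word_budget:
        result.warnings.append(f"body has {words} words (budget {word_budget})")
    return result
```

Calling `lint_skill(Path("skills/writing-skills/SKILL.md").read_text(encoding="utf-8"))` returns errors for the required rules and warnings for the soft targets. A pre-commit hook could then fail only when `errors` is non-empty and leave the word budget as a judgment call.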
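The steps under Micro-Test Wording Before Full Scenarios leave the scoring bookkeeping to you. Below is a minimal Python sketch of that tally. It rests on these assumptions:

- You have already collected one output string per fresh-context call. No model or API is called here, and how you sample is up to you.
- You can approximate the unwanted content with a regular expression you supply.
- `tally`, `worth_fixing`, `VariantReport`, and the `"control"` key are hypothetical names.

The `spread` value is only a proxy for "variance is a metric": it measures scatter in per-rep hit counts, not the overall shape of each output. Every snippet in `flagged` still needs a human read, because template echoes and quoted counter-examples match the pattern too.

```python
import re
import statistics
from dataclasses import dataclass

MIN_REPS = 5  # "5+ reps per variant. Single samples lie."


@dataclass
class VariantReport:
    name: str
    hits_per_rep: list[int]
    flagged: list[str]  # every matched snippet, queued for manual reading

    @property
    def mean_hits(self) -> float:
        return float(statistics.mean(self.hits_per_rep))

    @property
    def spread(self) -> float:
        # Proxy metric: population variance of hit counts across reps.
        # Converging reps give 0.0; scattered behavior gives larger values.
        return float(statistics.pvariance(self.hits_per_rep))


def tally(samples: dict[str, list[str]], unwanted: re.Pattern,
          control: str = "control") -> dict[str, VariantReport]:
    """Count unwanted-pattern hits per rep for each wording variant."""
    if control not in samples:
        raise ValueError("always include a no-guidance control variant")
    reports = {}
    for variant, outputs in samples.items():
        if len(outputs) < MIN_REPS:
            raise ValueError(f"{variant}: {len(outputs)} reps, need {MIN_REPS}+")
        hits, flagged = [], []
        for text in outputs:
            matches = [m.group(0) for m in unwanted.finditer(text)]
            hits.append(len(matches))
            flagged.extend(matches)
        reports[variant] = VariantReport(variant, hits, flagged)
    return reports


def worth_fixing(reports: dict[str, VariantReport], control: str = "control") -> bool:
    # If the control never exhibits the failure, there is nothing to fix.
    return sum(reports[control].hits_per_rep) > 0
```

The converged outcome these steps look for is a candidate wording with `mean_hits` near zero and a `spread` of zero, while the control stays high. A large `spread` on a variant suggests the wording is not yet binding. Read the `flagged` snippets before trusting either number.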
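Step 2 of the Discovery Workflow (greps descriptions) is why searchable terms matter. Below is a deliberately crude Python sketch of that step, useful for sanity-checking keyword coverage. It rests on these assumptions:

- Skills follow the flat `skills/<name>/SKILL.md` layout.
- Scoring is plain word overlap between a query and each description.
- `read_description` and `search_skills` are hypothetical names.

Real runtimes may select skills differently, for example by having the model read every description. Treat this as a sanity check, not as how any particular agent searches.

```python
import re
from pathlib import Path

WORD_RE = re.compile(r"[a-z0-9]+")


def read_description(skill_md: Path) -> str:
    """Return the `description:` value from a leading '---' block, or ''."""
    lines = skill_md.read_text(encoding="utf-8").splitlines()
    if not lines or lines[0].strip() != "---":
        return ""
    for line in lines[1:]:
        if line.strip() == "---":  # end of frontmatter
            break
        key, sep, value = line.partition(":")
        if sep and key.strip() == "description":
            return value.strip()
    return ""


def search_skills(root: Path, query: str) -> list[tuple[str, int]]:
    """Rank skills by how many distinct query words their description contains."""
    terms = set(WORD_RE.findall(query.lower()))
    ranked = []
    # Flat namespace: every skill is a direct child directory of `root`.
    for skill_md in root.glob("*/SKILL.md"):
        words = set(WORD_RE.findall(read_description(skill_md).lower()))
        score = len(terms & words)
        if score:
            ranked.append((skill_md.parent.name, score))
    # Highest overlap first; ties broken alphabetically for stable output.
    return sorted(ranked, key=lambda item: (-item[1], item[0]))
```

Take the race-condition description from the Rich Description Field examples ("Use when tests have race conditions, timing dependencies, or pass/fail inconsistently"). Searching it for "flaky" scores zero, which is the Keyword Coverage point in miniature: if the symptom word is not in the description, a keyword search cannot surface the skill.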