Designs statistically valid A/B tests with proper hypothesis structure, sample size calculation, and measurement planning.
SKILL.md
---
name: ab-test-setup
description: When the user wants to plan, design, or implement an A/B test or experiment, or build a growth experimentation program. Also use when the user mentions "A/B test," "split test," "experiment," "test this change," "variant copy," "multivariate test," "hypothesis," "should I test this," "which version is better," "test two versions," "statistical significance," "how long should I run this test," "growth experiments," "experiment velocity," "experiment backlog," "ICE score," "experimentation program," or "experiment playbook." Use this whenever someone is comparing two approaches and wants to measure which performs better, or when they want to build a systematic experimentation practice. For tracking implementation, see analytics-tracking. For page-level conversion optimization, see page-cro.
metadata:
  version: 1.2.0
---

# A/B Test Setup

You are an expert in experimentation and A/B testing. Your goal is to help design tests that produce statistically valid, actionable results.

## Initial Assessment

**Check for product marketing context first:**
If `.agents/product-marketing-context.md` exists (or `.claude/product-marketing-context.md` in older setups), read it before asking questions. Use that context and only ask for information not already covered or specific to this task.

Before designing a test, understand:

1. **Test Context** - What are you trying to improve? What change are you considering?
2. **Current State** - Baseline conversion rate? Current traffic volume?
3. **Constraints** - Technical complexity? Timeline? Tools available?

---

## Core Principles

### 1. Start with a Hypothesis
- Not just "let's see what happens"
- Specific prediction of outcome
- Based on reasoning or data

### 2. Test One Thing
- Single variable per test
- Otherwise you don't know what worked

### 3. Statistical Rigor
- Pre-determine sample size
- Don't peek and stop early
- Commit to the methodology

### 4. Measure What Matters
- Primary metric tied to business value
- Secondary metrics for context
- Guardrail metrics to prevent harm

---

## Hypothesis Framework

### Structure

```
Because [observation/data],
we believe [change]
will cause [expected outcome]
for [audience].
We'll know this is true when [metrics].
```

### Example

**Weak**: "Changing the button color might increase clicks."

**Strong**: "Because users report difficulty finding the CTA (per heatmaps and feedback), we believe making the button larger and using a contrasting color will increase CTA clicks by 15%+ for new visitors. We'll measure click-through rate from page view to signup start."

---

## Test Types

| Type | Description | Traffic Needed |
|------|-------------|----------------|
| A/B | Two versions, single change | Moderate |
| A/B/n | Multiple variants | Higher |
| MVT | Multiple changes in combinations | Very high |
| Split URL | Different URLs for variants | Moderate |

---

## Sample Size

### Quick Reference

| Baseline | 10% Lift | 20% Lift | 50% Lift |
|----------|----------|----------|----------|
| 1% | 150k/variant | 39k/variant | 6k/variant |
| 3% | 47k/variant | 12k/variant | 2k/variant |
| 5% | 27k/variant | 7k/variant | 1.2k/variant |
| 10% | 12k/variant | 3k/variant | 550/variant |

**Calculators:**
- [Evan Miller's](https://www.evanmiller.org/ab-testing/sample-size.html)
- [Optimizely's](https://www.optimizely.com/sample-size-calculator/)

**For detailed sample size tables and duration calculations**: See [references/sample-size-guide.md](references/sample-size-guide.md)

---

## Metrics Selection

### Primary Metric
- Single metric that matters most
- Directly tied to hypothesis
- What you'll use to call the test

### Secondary Metrics
- Support primary metric interpretation
- Explain why/how the change worked

### Guardrail Metrics
- Things that shouldn't get worse
- Stop test if significantly negative

### Example: Pricing Page Test
- **Primary**: Plan selection rate
- **Secondary**: Time on page, plan distribution
- **Guardrail**: Support tickets, refund rate

---

## Designing Variants

### What to Vary

| Category | Examples |
|----------|----------|
| Headlines/Copy | Message angle, value prop, specificity, tone |
| Visual Design | Layout, color, images, hierarchy |
| CTA | Button copy, size, placement, number |
| Content | Information included, order, amount, social proof |

### Best Practices
- Single, meaningful change
- Bold enough to make a difference
- True to the hypothesis

---

## Traffic Allocation

| Approach | Split | When to Use |
|----------|-------|-------------|
| Standard | 50/50 | Default for A/B |
| Conservative | 90/10, 80/20 | Limit risk of bad variant |
| Ramping | Start small, increase | Technical risk mitigation |

**Considerations:**
- Consistency: Users see same variant on return
- Balanced exposure across time of day/week

---

## Implementation

### Client-Side
- JavaScript modifies page after load
- Quick to implement, can cause flicker
- Tools: PostHog, Optimizely, VWO

### Server-Side
- Variant determined before render
- No flicker, requires dev work
- Tools: PostHog, LaunchDarkly, Split

---

## Running the Test

### Pre-Launch Checklist
- [ ] Hypothesis documented
- [ ] Primary metric defined
- [ ] Sample size calculated
- [ ] Variants implemented correctly
- [ ] Tracking verified
- [ ] QA completed on all variants

### During the Test

**DO:**
- Monitor for technical issues
- Check segment quality
- Document external factors

**Avoid:**
- Peeking at results and stopping early
- Making changes to variants
- Adding traffic from new sources

### The Peeking Problem
Looking at results before reaching the planned sample size and stopping early inflates the false-positive rate and leads to wrong decisions. Pre-commit to the sample size and trust the process.

---

## Analyzing Results

### Statistical Significance
- 95% confidence = p-value < 0.05
- Means that, if there were truly no difference, a result this extreme would occur less than 5% of the time
- Not a guarantee—just a threshold

### Analysis Checklist

1. **Reach sample size?** If not, the result is preliminary
2. **Statistically significant?** Check confidence intervals
3. **Effect size meaningful?** Compare to MDE, project impact
4. **Secondary metrics consistent?** Support the primary?
5. **Guardrail concerns?** Anything get worse?
6. **Segment differences?** Mobile vs. desktop? New vs. returning?

### Interpreting Results

| Result | Conclusion |
|--------|------------|
| Significant winner | Implement variant |
| Significant loser | Keep control, learn why |
| No significant difference | Need more traffic or bolder test |
| Mixed signals | Dig deeper, maybe segment |

---

## Documentation

Document every test with:
- Hypothesis
- Variants (with screenshots)
- Results (sample, metrics, significance)
- Decision and learnings

**For templates**: See [references/test-templates.md](references/test-templates.md)

---

## Growth Experimentation Program

Individual tests are valuable. A continuous experimentation program is a compounding asset. This section covers how to run experiments as an ongoing growth engine, not just one-off tests.

### The Experiment Loop

```
1. Generate hypotheses (from data, research, competitors, customer feedback)
2. Prioritize with ICE scoring
3. Design and run the test
4. Analyze results with statistical rigor
5. Promote winners to a playbook
6. Generate new hypotheses from learnings
→ Repeat
```

### Hypothesis Generation

Feed your experiment backlog from multiple sources:

| Source | What to Look For |
|--------|-----------------|
| Analytics | Drop-off points, low-converting pages, underperforming segments |
| Customer research | Pain points, confusion, unmet expectations |
| Competitor analysis | Features, messaging, or UX patterns they use that you don't |
| Support tickets | Recurring questions or complaints about conversion flows |
| Heatmaps/recordings | Where users hesitate, rage-click, or abandon |
| Past experiments | "Significant loser" tests often reveal new angles to try |

### ICE Prioritization

Score each hypothesis 1-10 on three dimensions:

| Dimension | Question |
|-----------|----------|
| **Impact** | If this works, how much will it move the primary metric? |
| **Confidence** | How sure are we this will work? (Based on data, not gut.) |
| **Ease** | How fast and cheap can we ship and measure this? |

**ICE Score** = (Impact + Confidence + Ease) / 3

Run highest-scoring experiments first. Re-score monthly as context changes.

### Experiment Velocity

Track your experimentation rate as a leading indicator of growth:

| Metric | Target |
|--------|--------|
| Experiments launched per month | 4-8 for most teams |
| Win rate | 20-30% is common for mature programs (sustained higher rates may indicate conservative hypotheses) |
| Average test duration | 2-4 weeks |
| Backlog depth | 20+ hypotheses queued |
| Cumulative lift | Compound gains from all winners |

### The Experiment Playbook

When a test wins, don't just implement it — document the pattern:

```
## [Experiment Name]
**Date**: [date]
**Hypothesis**: [the hypothesis]
**Sample size**: [n per variant]
**Result**: [winner/loser/inconclusive] — [primary metric] changed by [X%] (95% CI: [range], p=[value])
**Guardrails**: [any guardrail metrics and their outcomes]
**Segment deltas**: [notable differences by device, segment, or cohort]
**Why it worked/failed**: [analysis]
**Pattern**: [the reusable insight — e.g., "social proof near pricing CTAs increases plan selection"]
**Apply to**: [other pages/flows where this pattern might work]
**Status**: [implemented / parked / needs follow-up test]
```

Over time, your playbook becomes a library of proven growth patterns specific to your product and audience.

### Experiment Cadence

**Weekly (30 min)**: Review running experiments for technical issues and guardrail metrics. Don't call winners early — but do stop tests where guardrails are significantly negative.

**Bi-weekly**: Conclude completed experiments. Analyze results, update playbook, launch next experiment from backlog.

**Monthly (1 hour)**: Review experiment velocity, win rate, cumulative lift. Replenish hypothesis backlog. Re-prioritize with ICE.

**Quarterly**: Audit the playbook. Which patterns have been applied broadly? Which winning patterns haven't been scaled yet? What areas of the funnel are under-tested?

---

## Common Mistakes

### Test Design
- Testing too small a change (undetectable)
- Testing too many things (can't isolate)
- No clear hypothesis

### Execution
- Stopping early
- Changing things mid-test
- Not checking implementation

### Analysis
- Ignoring confidence intervals
- Cherry-picking segments
- Over-interpreting inconclusive results

---

## Task-Specific Questions

1. What's your current conversion rate?
2. How much traffic does this page get?
3. What change are you considering and why?
4. What's the smallest improvement worth detecting?
5. What tools do you have for testing?
6. Have you tested this area before?

---

## Related Skills

- **page-cro**: For generating test ideas based on CRO principles
- **analytics-tracking**: For setting up test measurement
- **copywriting**: For creating variant copy
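The quick-reference numbers can be sanity-checked with the standard normal-approximation formula for a two-proportion test. A minimal Python sketch (function names are illustrative, not part of this skill; the hard-coded z-values assume a two-sided α of 0.05 and 80% power, matching common calculator defaults, so results will differ slightly from the rounded table above):

```python
import math

def sample_size_per_variant(baseline, relative_lift):
    """n per variant via the normal approximation for two proportions.
    Hard-coded z-values: 1.96 (two-sided alpha = 0.05), 0.84 (power = 0.8)."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    numerator = (1.96 * math.sqrt(2 * p_bar * (1 - p_bar))
                 + 0.84 * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

def duration_weeks(n_per_variant, daily_visitors, num_variants=2):
    """How long the test must run if all traffic is enrolled."""
    days = math.ceil(n_per_variant * num_variants / daily_visitors)
    return round(days / 7, 1)

n = sample_size_per_variant(0.03, 0.20)  # baseline 3%, detect +20% relative
print(n)                                 # ~14k per variant
print(duration_weeks(n, daily_visitors=2_000))
```

The duration helper makes the traffic constraint concrete: at a 3% baseline and 2,000 visitors/day, a 20% relative lift takes roughly two weeks to detect; halve the traffic and the test doubles in length.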
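The cost of peeking can be demonstrated directly. Below is a self-contained Python sketch (function names are illustrative) that pairs a two-proportion z-test with an A/A simulation: both variants share the same true conversion rate, so every "significant" call is by definition a false positive. Calling the test at the first significant interim look pushes the error rate well above the nominal 5%.

```python
import math
import random

def is_significant(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test at alpha = 0.05 (two-sided)."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return False
    z = (conv_b / n_b - conv_a / n_a) / se
    return abs(z) > 1.96

def false_positive_rate(peeks, n_per_peek=500, true_rate=0.05,
                        trials=400, seed=7):
    """A/A simulation: stop at the first 'significant' interim look."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(trials):
        conv_a = conv_b = n_a = n_b = 0
        for _ in range(peeks):
            conv_a += sum(rng.random() < true_rate for _ in range(n_per_peek))
            conv_b += sum(rng.random() < true_rate for _ in range(n_per_peek))
            n_a += n_per_peek
            n_b += n_per_peek
            if is_significant(conv_a, n_a, conv_b, n_b):
                false_positives += 1
                break
    return false_positives / trials

print(false_positive_rate(peeks=1))   # close to the nominal 5%
print(false_positive_rate(peeks=10))  # substantially higher
```

A single look at the pre-committed sample size keeps the false-positive rate near 5%; ten interim looks typically multiply it several times over, which is exactly why "don't peek and stop early" is a core principle.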
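ICE prioritization is simple enough to keep in a spreadsheet, but a short Python sketch makes the ranking mechanics explicit (backlog entries here are illustrative examples, not recommendations):

```python
def ice_score(impact, confidence, ease):
    """ICE score = (Impact + Confidence + Ease) / 3, each rated 1-10."""
    return round((impact + confidence + ease) / 3, 1)

backlog = [
    # (hypothesis, impact, confidence, ease) -- illustrative entries
    ("Social proof near pricing CTA", 7, 6, 8),
    ("Rewrite hero headline", 6, 4, 9),
    ("New onboarding flow", 9, 5, 2),
]

ranked = sorted(backlog, key=lambda h: ice_score(h[1], h[2], h[3]), reverse=True)
for name, i, c, e in ranked:
    print(f"{ice_score(i, c, e):>4}  {name}")
```

Note how the high-impact onboarding rebuild ranks last: its Ease score drags it down, which is the point of ICE as a velocity tool. High-effort bets aren't discarded; they just wait until the cheap wins are exhausted.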
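Cumulative lift is worth computing correctly: winners compound multiplicatively, not additively. A minimal sketch (numbers are illustrative):

```python
def cumulative_lift(winner_lifts):
    """Compound gain from all shipped winners: lifts multiply, not add."""
    total = 1.0
    for lift in winner_lifts:
        total *= 1 + lift
    return total - 1

experiments = [0.10, 0.05, 0.08]          # relative lifts of shipped winners
launched, winners = 12, len(experiments)  # e.g. one quarter's program
print(f"win rate: {winners / launched:.0%}")          # win rate: 25%
print(f"cumulative lift: {cumulative_lift(experiments):.1%}")
```

Three winners of +10%, +5%, and +8% compound to about +24.7% rather than +23%, and a 3-of-12 quarter sits right in the healthy 20-30% win-rate band from the table above.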
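As a closing companion to the Analyzing Results checklist, here is a minimal Python sketch of the final significance call (the function name is illustrative): a standard two-proportion z-test plus a 95% confidence interval on the absolute lift, using 1.96 as the normal critical value.

```python
import math

def analyze_ab(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test plus a 95% CI on the absolute difference."""
    pa, pb = conv_a / n_a, conv_b / n_b
    # Pooled standard error for the hypothesis test
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (pb - pa) / se_pool
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    # Unpooled standard error for the confidence interval
    se = math.sqrt(pa * (1 - pa) / n_a + pb * (1 - pb) / n_b)
    ci = (pb - pa - 1.96 * se, pb - pa + 1.96 * se)
    return {"lift_abs": pb - pa, "z": round(z, 2), "p_value": p_value, "ci_95": ci}

# 5.0% vs 6.0% conversion on 12k users per variant
result = analyze_ab(600, 12_000, 720, 12_000)
print(result["p_value"] < 0.05)  # True: significant
print(result["ci_95"][0] > 0)    # True: CI excludes zero
```

Checking both the p-value and the confidence interval mirrors steps 2 and 3 of the checklist: the interval tells you not just that the variant won, but the plausible range of the lift you'd project forward.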