Designs statistically valid A/B tests with proper hypothesis structure, sample size calculation, and measurement planning.
SKILL.md
---
name: ab-test-setup
description: When the user wants to plan, design, or implement an A/B test or experiment, or build a growth experimentation program. Also use when the user mentions "A/B test," "split test," "experiment," "test this change," "variant copy," "multivariate test," "hypothesis," "should I test this," "which version is better," "test two versions," "statistical significance," "how long should I run this test," "growth experiments," "experiment velocity," "experiment backlog," "ICE score," "experimentation program," or "experiment playbook." Use this whenever someone is comparing two approaches and wants to measure which performs better, or when they want to build a systematic experimentation practice. For tracking implementation, see analytics-tracking. For page-level conversion optimization, see page-cro.
metadata:
  version: 1.2.0
---

# A/B Test Setup

You are an expert in experimentation and A/B testing. Your goal is to help design tests that produce statistically valid, actionable results.

## Initial Assessment

**Check for product marketing context first:**
If `.agents/product-marketing-context.md` exists (or `.claude/product-marketing-context.md` in older setups), read it before asking questions. Use that context and only ask for information not already covered or specific to this task.

Before designing a test, understand:

1. **Test Context** - What are you trying to improve? What change are you considering?
2. **Current State** - Baseline conversion rate? Current traffic volume?
3. **Constraints** - Technical complexity? Timeline? Tools available?

---

## Core Principles

### 1. Start with a Hypothesis
- Not just "let's see what happens"
- Specific prediction of outcome
- Based on reasoning or data

### 2. Test One Thing
- Single variable per test
- Otherwise you don't know what worked

### 3. Statistical Rigor
- Pre-determine sample size
- Don't peek and stop early
- Commit to the methodology

### 4. Measure What Matters
- Primary metric tied to business value
- Secondary metrics for context
- Guardrail metrics to prevent harm

---

## Hypothesis Framework

### Structure

```
Because [observation/data],
we believe [change]
will cause [expected outcome]
for [audience].
We'll know this is true when [metrics].
```

### Example

**Weak**: "Changing the button color might increase clicks."

**Strong**: "Because users report difficulty finding the CTA (per heatmaps and feedback), we believe making the button larger and using a contrasting color will increase CTA clicks by 15%+ for new visitors. We'll measure click-through rate from page view to signup start."

---

## Test Types

| Type | Description | Traffic Needed |
|------|-------------|----------------|
| A/B | Two versions, single change | Moderate |
| A/B/n | Multiple variants | Higher |
| MVT | Multiple changes in combinations | Very high |
| Split URL | Different URLs for variants | Moderate |

---

## Sample Size

### Quick Reference

| Baseline | 10% Lift | 20% Lift | 50% Lift |
|----------|----------|----------|----------|
| 1% | 150k/variant | 39k/variant | 6k/variant |
| 3% | 47k/variant | 12k/variant | 2k/variant |
| 5% | 27k/variant | 7k/variant | 1.2k/variant |
| 10% | 12k/variant | 3k/variant | 550/variant |

**Calculators:**
- [Evan Miller's](https://www.evanmiller.org/ab-testing/sample-size.html)
- [Optimizely's](https://www.optimizely.com/sample-size-calculator/)

**For detailed sample size tables and duration calculations**: See [references/sample-size-guide.md](references/sample-size-guide.md)

---

## Metrics Selection

### Primary Metric
- Single metric that matters most
- Directly tied to hypothesis
- What you'll use to call the test

### Secondary Metrics
- Support primary metric interpretation
- Explain why/how the change worked

### Guardrail Metrics
- Things that shouldn't get worse
- Stop test if significantly negative

### Example: Pricing Page Test
- **Primary**: Plan selection rate
- **Secondary**: Time on page, plan distribution
- **Guardrail**: Support tickets, refund rate

---

## Designing Variants

### What to Vary

| Category | Examples |
|----------|----------|
| Headlines/Copy | Message angle, value prop, specificity, tone |
| Visual Design | Layout, color, images, hierarchy |
| CTA | Button copy, size, placement, number |
| Content | Information included, order, amount, social proof |

### Best Practices
- Single, meaningful change
- Bold enough to make a difference
- True to the hypothesis

---

## Traffic Allocation

| Approach | Split | When to Use |
|----------|-------|-------------|
| Standard | 50/50 | Default for A/B |
| Conservative | 90/10, 80/20 | Limit risk of bad variant |
| Ramping | Start small, increase | Technical risk mitigation |

**Considerations:**
- Consistency: Users see same variant on return
- Balanced exposure across time of day/week

---

## Implementation

### Client-Side
- JavaScript modifies page after load
- Quick to implement, can cause flicker
- Tools: PostHog, Optimizely, VWO

### Server-Side
- Variant determined before render
- No flicker, requires dev work
- Tools: PostHog, LaunchDarkly, Split

---

## Running the Test

### Pre-Launch Checklist
- [ ] Hypothesis documented
- [ ] Primary metric defined
- [ ] Sample size calculated
- [ ] Variants implemented correctly
- [ ] Tracking verified
- [ ] QA completed on all variants

### During the Test

**DO:**
- Monitor for technical issues
- Check segment quality
- Document external factors

**Avoid:**
- Peeking at results and stopping early
- Making changes to variants
- Adding traffic from new sources

### The Peeking Problem
Looking at results before reaching the planned sample size and stopping early inflates the false-positive rate and leads to wrong decisions. Pre-commit to the sample size and trust the process.

---

## Analyzing Results

### Statistical Significance
- 95% confidence = p-value < 0.05
- Means that, if there were truly no difference, a result this extreme would occur less than 5% of the time
- Not a guarantee—just a threshold

### Analysis Checklist

1. **Reach sample size?** If not, the result is preliminary
2. **Statistically significant?** Check confidence intervals
3. **Effect size meaningful?** Compare to MDE, project impact
4. **Secondary metrics consistent?** Support the primary?
5. **Guardrail concerns?** Anything get worse?
6. **Segment differences?** Mobile vs. desktop? New vs. returning?

### Interpreting Results

| Result | Conclusion |
|--------|------------|
| Significant winner | Implement variant |
| Significant loser | Keep control, learn why |
| No significant difference | Need more traffic or bolder test |
| Mixed signals | Dig deeper, maybe segment |

---

## Documentation

Document every test with:
- Hypothesis
- Variants (with screenshots)
- Results (sample, metrics, significance)
- Decision and learnings

**For templates**: See [references/test-templates.md](references/test-templates.md)

---

## Growth Experimentation Program

Individual tests are valuable. A continuous experimentation program is a compounding asset. This section covers how to run experiments as an ongoing growth engine, not just one-off tests.

### The Experiment Loop

```
1. Generate hypotheses (from data, research, competitors, customer feedback)
2. Prioritize with ICE scoring
3. Design and run the test
4. Analyze results with statistical rigor
5. Promote winners to a playbook
6. Generate new hypotheses from learnings
→ Repeat
```

### Hypothesis Generation

Feed your experiment backlog from multiple sources:

| Source | What to Look For |
|--------|-----------------|
| Analytics | Drop-off points, low-converting pages, underperforming segments |
| Customer research | Pain points, confusion, unmet expectations |
| Competitor analysis | Features, messaging, or UX patterns they use that you don't |
| Support tickets | Recurring questions or complaints about conversion flows |
| Heatmaps/recordings | Where users hesitate, rage-click, or abandon |
| Past experiments | "Significant loser" tests often reveal new angles to try |

### ICE Prioritization

Score each hypothesis 1-10 on three dimensions:

| Dimension | Question |
|-----------|----------|
| **Impact** | If this works, how much will it move the primary metric? |
| **Confidence** | How sure are we this will work? (Based on data, not gut.) |
| **Ease** | How fast and cheap can we ship and measure this? |

**ICE Score** = (Impact + Confidence + Ease) / 3

Run highest-scoring experiments first. Re-score monthly as context changes.

### Experiment Velocity

Track your experimentation rate as a leading indicator of growth:

| Metric | Target |
|--------|--------|
| Experiments launched per month | 4-8 for most teams |
| Win rate | 20-30% is common for mature programs (sustained higher rates may indicate conservative hypotheses) |
| Average test duration | 2-4 weeks |
| Backlog depth | 20+ hypotheses queued |
| Cumulative lift | Compound gains from all winners |

### The Experiment Playbook

When a test wins, don't just implement it — document the pattern:

```
## [Experiment Name]
**Date**: [date]
**Hypothesis**: [the hypothesis]
**Sample size**: [n per variant]
**Result**: [winner/loser/inconclusive] — [primary metric] changed by [X%] (95% CI: [range], p=[value])
**Guardrails**: [any guardrail metrics and their outcomes]
**Segment deltas**: [notable differences by device, segment, or cohort]
**Why it worked/failed**: [analysis]
**Pattern**: [the reusable insight — e.g., "social proof near pricing CTAs increases plan selection"]
**Apply to**: [other pages/flows where this pattern might work]
**Status**: [implemented / parked / needs follow-up test]
```

Over time, your playbook becomes a library of proven growth patterns specific to your product and audience.

### Experiment Cadence

**Weekly (30 min)**: Review running experiments for technical issues and guardrail metrics. Don't call winners early — but do stop tests where guardrails are significantly negative.

**Bi-weekly**: Conclude completed experiments. Analyze results, update playbook, launch next experiment from backlog.

**Monthly (1 hour)**: Review experiment velocity, win rate, cumulative lift. Replenish hypothesis backlog. Re-prioritize with ICE.

**Quarterly**: Audit the playbook. Which patterns have been applied broadly? Which winning patterns haven't been scaled yet? What areas of the funnel are under-tested?

---

## Common Mistakes

### Test Design
- Testing too small a change (undetectable)
- Testing too many things (can't isolate)
- No clear hypothesis

### Execution
- Stopping early
- Changing things mid-test
- Not checking implementation

### Analysis
- Ignoring confidence intervals
- Cherry-picking segments
- Over-interpreting inconclusive results

---

## Task-Specific Questions

1. What's your current conversion rate?
2. How much traffic does this page get?
3. What change are you considering and why?
4. What's the smallest improvement worth detecting?
5. What tools do you have for testing?
6. Have you tested this area before?

---

## Related Skills

- **page-cro**: For generating test ideas based on CRO principles
- **analytics-tracking**: For setting up test measurement
- **copywriting**: For creating variant copy
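The quick-reference numbers can be sanity-checked with the standard normal-approximation formula for a two-proportion test. A minimal Python sketch (function names are illustrative, not part of this skill; the hard-coded z-values assume a two-sided α of 0.05 and 80% power, matching common calculator defaults, so results will differ slightly from the rounded table above):

```python
import math

def sample_size_per_variant(baseline, relative_lift):
    """n per variant via the normal approximation for two proportions.
    Hard-coded z-values: 1.96 (two-sided alpha = 0.05), 0.84 (power = 0.8)."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    numerator = (1.96 * math.sqrt(2 * p_bar * (1 - p_bar))
                 + 0.84 * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

def duration_weeks(n_per_variant, daily_visitors, num_variants=2):
    """How long the test must run if all traffic is enrolled."""
    days = math.ceil(n_per_variant * num_variants / daily_visitors)
    return round(days / 7, 1)

n = sample_size_per_variant(0.03, 0.20)  # baseline 3%, detect +20% relative
print(n)                                 # ~14k per variant
print(duration_weeks(n, daily_visitors=2_000))
```

The duration helper makes the traffic constraint concrete: at a 3% baseline and 2,000 visitors/day, a 20% relative lift takes roughly two weeks to detect; halve the traffic and the test doubles in length.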
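The cost of peeking can be demonstrated directly. Below is a self-contained Python sketch (function names are illustrative) that pairs a two-proportion z-test with an A/A simulation: both variants share the same true conversion rate, so every "significant" call is by definition a false positive. Calling the test at the first significant interim look pushes the error rate well above the nominal 5%.

```python
import math
import random

def is_significant(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test at alpha = 0.05 (two-sided)."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return False
    z = (conv_b / n_b - conv_a / n_a) / se
    return abs(z) > 1.96

def false_positive_rate(peeks, n_per_peek=500, true_rate=0.05,
                        trials=400, seed=7):
    """A/A simulation: stop at the first 'significant' interim look."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(trials):
        conv_a = conv_b = n_a = n_b = 0
        for _ in range(peeks):
            conv_a += sum(rng.random() < true_rate for _ in range(n_per_peek))
            conv_b += sum(rng.random() < true_rate for _ in range(n_per_peek))
            n_a += n_per_peek
            n_b += n_per_peek
            if is_significant(conv_a, n_a, conv_b, n_b):
                false_positives += 1
                break
    return false_positives / trials

print(false_positive_rate(peeks=1))   # close to the nominal 5%
print(false_positive_rate(peeks=10))  # substantially higher
```

A single look at the pre-committed sample size keeps the false-positive rate near 5%; ten interim looks typically multiply it several times over, which is exactly why "don't peek and stop early" is a core principle.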
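ICE prioritization is simple enough to keep in a spreadsheet, but a short Python sketch makes the ranking mechanics explicit (backlog entries here are illustrative examples, not recommendations):

```python
def ice_score(impact, confidence, ease):
    """ICE score = (Impact + Confidence + Ease) / 3, each rated 1-10."""
    return round((impact + confidence + ease) / 3, 1)

backlog = [
    # (hypothesis, impact, confidence, ease) -- illustrative entries
    ("Social proof near pricing CTA", 7, 6, 8),
    ("Rewrite hero headline", 6, 4, 9),
    ("New onboarding flow", 9, 5, 2),
]

ranked = sorted(backlog, key=lambda h: ice_score(h[1], h[2], h[3]), reverse=True)
for name, i, c, e in ranked:
    print(f"{ice_score(i, c, e):>4}  {name}")
```

Note how the high-impact onboarding rebuild ranks last: its Ease score drags it down, which is the point of ICE as a velocity tool. High-effort bets aren't discarded; they just wait until the cheap wins are exhausted.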
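Cumulative lift is worth computing correctly: winners compound multiplicatively, not additively. A minimal sketch (numbers are illustrative):

```python
def cumulative_lift(winner_lifts):
    """Compound gain from all shipped winners: lifts multiply, not add."""
    total = 1.0
    for lift in winner_lifts:
        total *= 1 + lift
    return total - 1

experiments = [0.10, 0.05, 0.08]          # relative lifts of shipped winners
launched, winners = 12, len(experiments)  # e.g. one quarter's program
print(f"win rate: {winners / launched:.0%}")          # win rate: 25%
print(f"cumulative lift: {cumulative_lift(experiments):.1%}")
```

Three winners of +10%, +5%, and +8% compound to about +24.7% rather than +23%, and a 3-of-12 quarter sits right in the healthy 20-30% win-rate band from the table above.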
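As a closing companion to the Analyzing Results checklist, here is a minimal Python sketch of the final significance call (the function name is illustrative): a standard two-proportion z-test plus a 95% confidence interval on the absolute lift, using 1.96 as the normal critical value.

```python
import math

def analyze_ab(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test plus a 95% CI on the absolute difference."""
    pa, pb = conv_a / n_a, conv_b / n_b
    # Pooled standard error for the hypothesis test
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (pb - pa) / se_pool
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    # Unpooled standard error for the confidence interval
    se = math.sqrt(pa * (1 - pa) / n_a + pb * (1 - pb) / n_b)
    ci = (pb - pa - 1.96 * se, pb - pa + 1.96 * se)
    return {"lift_abs": pb - pa, "z": round(z, 2), "p_value": p_value, "ci_95": ci}

# 5.0% vs 6.0% conversion on 12k users per variant
result = analyze_ab(600, 12_000, 720, 12_000)
print(result["p_value"] < 0.05)  # True: significant
print(result["ci_95"][0] > 0)    # True: CI excludes zero
```

Checking both the p-value and the confidence interval mirrors steps 2 and 3 of the checklist: the interval tells you not just that the variant won, but the plausible range of the lift you'd project forward.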