
19-ab-test-setup-global

# Claude Code Skill: A/B Test Setup (Global)

This skill guides users through designing statistically valid A/B tests for marketing experiments by establishing clear hypotheses, calculating required sample sizes, and determining statistical significance thresholds. Use it when planning split tests, multivariate experiments, or any campaign variant comparison where you need to distinguish real improvements from noise rather than making decisions on weak or rushed data.

Install in Claude Code
git clone --depth 1 https://github.com/minhnv0807/ai-business-skills /tmp/19-ab-test-setup-global && cp -r /tmp/19-ab-test-setup-global/skills/en/19-ab-test-setup-global ~/.claude/skills/19-ab-test-setup-global
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# A/B Test Setup (Global)

> Run experiments that produce decisions, not noise. Most "A/B tests" in marketing are underpowered, peeked-at, and badly hypothesized — meaning the team learns nothing and ships the louder variant.

---

## For Newbies

A valid A/B test answers one question: "Did this change cause a real improvement, or am I seeing noise?"

To answer it credibly you need four things:
1. A **specific hypothesis** with a numeric prediction
2. **One variable changed** (everything else identical)
3. **Enough sample** to detect the effect you care about
4. **Statistical significance** before you call a winner (typically p < 0.05)

If any one of these is missing, you don't have an A/B test — you have a coin flip with extra steps.

**Common newbie mistake:** running a test for 3 days, seeing variant B 40% higher, declaring victory, and shipping. Three days is too short to absorb day-of-week effects, and small samples produce wild swings. Variant B may revert (or reverse) by day 14.

---

## Step 0 — Read Context

Read `.agents/product-marketing-context.md` if it exists. Audience size, average traffic, and current conversion rate determine whether a test is even feasible.

---

## Step 1 — Information Gathering

Ask up to 4 questions:

1. **What are you testing?** (Ad headline / Landing page section / Email subject / Pricing display / CTA button / Creative video)
2. **Primary metric?** (CTR / Conversion rate / CPM / CPA / Revenue / Open rate / Reply rate)
3. **Daily traffic to the test surface?** (Needed for sample size and duration)
4. **Goal of the test?** (Lift X% on primary metric / Pick a winner among N candidates / Validate a strategic hypothesis)

---

## The 7 Principles of a Valid A/B Test

### 1. Test exactly one variable

The cardinal rule. Change two things at once and you cannot attribute the result.

- **Bad:** "I changed the headline, the hero image, and the CTA color." → You learn nothing about which element drove the lift.
- **Good:** Change only the headline. Image, CTA, layout, traffic source, and audience targeting are identical.

If you must test multiple changes, use a **multivariate test (MVT)** — but those need much more traffic (often 4×–8× a single A/B).
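
Counting cells shows why multivariate tests need so much more traffic. The sketch below enumerates a full-factorial design; the element names and options are hypothetical, and it assumes every cell needs the same sample as one arm of a simple A/B test.

```python
from itertools import product

# Hypothetical page elements and the options under test for each.
elements = {
    "headline": ["control", "benefit-led"],
    "hero_image": ["control", "lifestyle"],
    "cta_color": ["control", "green"],
}

# A full-factorial MVT shows every combination of options.
cells = list(product(*elements.values()))

# A simple A/B test has 2 cells, so compare traffic needs cell for cell.
traffic_multiplier = len(cells) / 2

print(len(cells), traffic_multiplier)  # 8 4.0
```

Three two-option elements already land at the low end of the 4× to 8× range; adding a fourth doubles the cell count again.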

### 2. Hypothesize with a number

**Format:** "If we [change X], [metric Y] will increase by [Z%] because [reason]."

- **Good:** "If we change the CTA from 'Sign up' to 'Get my free demo,' conversion rate will increase by 15% because action-specific language reduces ambiguity."
- **Bad:** "The new copy will be better." (No metric, no number, no causal reasoning — un-testable.)

The "because" matters: if your hypothesis is wrong but the reasoning was sound, you've still learned something generalizable.
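
The number in a hypothesis is usually a relative lift, while sample-size math works with an absolute difference. A minimal sketch of the conversion, assuming a hypothetical 4% baseline conversion rate for the CTA example above:

```python
def absolute_effect(baseline_rate: float, relative_lift: float) -> float:
    """Convert a relative lift (0.15 = +15%) into an absolute rate change."""
    return baseline_rate * relative_lift

baseline = 0.04        # assumed baseline conversion rate (hypothetical)
predicted_lift = 0.15  # the "+15%" from the hypothesis

delta = absolute_effect(baseline, predicted_lift)
target = baseline + delta
print(f"target rate: {target:.4f}, absolute change: {delta:.4f}")
```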

### 3. Sufficient sample size

Don't stop early. Statistical tests need adequate data to distinguish signal from noise.

- **Minimum rule of thumb:** 100 conversions per variant (not 100 visitors)
- **Better:** Calculate sample size up front based on baseline conversion rate and minimum detectable effect (formula below)
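
The rule of thumb counts conversions, not visitors, so the traffic it implies depends heavily on the baseline rate. A rough sketch of that translation (the function name and the sample rates are illustrative, not part of the skill):

```python
import math

def visitors_for_conversions(baseline_rate: float, conversions: int = 100) -> int:
    """Visitors per variant needed to expect `conversions` conversions."""
    if not 0 < baseline_rate <= 1:
        raise ValueError("baseline_rate must be in (0, 1]")
    # round() guards against floating-point noise before rounding up.
    return math.ceil(round(conversions / baseline_rate, 6))

for rate in (0.01, 0.03, 0.10):
    visitors = visitors_for_conversions(rate)
    print(f"{rate:.0%} baseline -> {visitors:,} visitors per variant")
```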

### 4. Sufficient duration

Run for **whole weeks**, not 3 days, not 10 days. Different weekdays produce different audience behavior — Monday B2B traffic is not Saturday DTC traffic.

- **Minimum:** 7 days
- **Recommended:** 14 days
- **Watch for:** holidays, paydays, monthly billing cycles, ad spend ramp-ups
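
One way to apply the whole-weeks rule is to compute the days needed to reach the sample and then round up to the next multiple of seven. A small sketch, with a hypothetical helper name and made-up traffic figures:

```python
import math

def duration_in_whole_weeks(total_sample: int, daily_traffic: int, min_weeks: int = 1) -> int:
    """Days to run: enough to reach the sample, rounded up to whole weeks."""
    if daily_traffic <= 0:
        raise ValueError("daily_traffic must be positive")
    days_for_sample = math.ceil(total_sample / daily_traffic)
    weeks = max(min_weeks, math.ceil(days_for_sample / 7))
    return weeks * 7

print(duration_in_whole_weeks(6_000, 1_000))                # 6 days of traffic -> 7
print(duration_in_whole_weeks(10_000, 1_000))               # 10 days of traffic -> 14
print(duration_in_whole_weeks(6_000, 1_000, min_weeks=2))   # recommended floor -> 14
```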

### 5. Don't peek

Looking at results every hour and stopping when "B looks good" is the most common error in marketing experimentation. Each look is another chance for noise to cross the significance threshold, so peeking combined with early stopping pushes the false positive rate well above the nominal 5%.

- Define the end date in advance. Honor it.
- If you must monitor, use **sequential testing** methods designed for continuous monitoring (for example, Optimizely's Stats Engine, or other platforms with built-in sequential controls).
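
A quick simulation makes the cost of peeking concrete. The sketch below runs A/A tests (both arms share the same true rate, so any "winner" is a false positive) and compares a single check at the end with a check at ten interim looks. The traffic numbers, look schedule, and seed are arbitrary assumptions, and the pooled z-test is just one common choice of test.

```python
import math
import random

def z_stat(conv_a: int, conv_b: int, n: int) -> float:
    """Two-proportion z statistic for equal-sized arms of n visitors each."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    return 0.0 if se == 0 else (conv_b / n - conv_a / n) / se

def simulate(runs=400, n=2000, looks=10, rate=0.10, seed=7):
    """A/A test: both arms share the same true rate, so every 'win' is false."""
    rng = random.Random(seed)
    step = n // looks
    fixed_hits = peek_hits = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        crossed = False
        for i in range(1, n + 1):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if i % step == 0 and abs(z_stat(conv_a, conv_b, i)) > 1.96:
                crossed = True  # a peeker would have stopped and shipped here
        peek_hits += crossed
        fixed_hits += abs(z_stat(conv_a, conv_b, n)) > 1.96
    return fixed_hits / runs, peek_hits / runs

fixed_rate, peek_rate = simulate()
print(f"fixed horizon: {fixed_rate:.1%}  stop at first significant look: {peek_rate:.1%}")
```

The fixed-horizon rate should land near the nominal 5%, while the stop-at-first-significant-look rate comes out noticeably higher, because the final look is only one of ten chances to cross the threshold.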

### 6. Statistical significance: p < 0.05

Most marketing teams use **95% confidence (p-value < 0.05)** as the bar.

- p-value < 0.05 → if there were truly no difference, a gap at least this large would appear less than 5% of the time
- p-value 0.05–0.10 → suggestive but inconclusive; plan a follow-up test with a larger pre-set sample rather than extending this one ad hoc
- p-value > 0.10 → no evidence of an effect — keep control or test something else

For high-stakes tests (pricing, branding) consider 99% confidence (p < 0.01).
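
For conversion-style metrics, the p-value is often computed with a pooled two-proportion z-test. The sketch below uses only the standard library; the visitor and conversion counts are hypothetical, and your testing platform may use a different test.

```python
import math

def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion rates."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal.
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical counts: 3.0% vs 3.6% on 10,000 visitors per arm.
p = two_proportion_p_value(300, 10_000, 360, 10_000)
verdict = "significant at 0.05" if p < 0.05 else "not significant at 0.05"
print(f"p = {p:.4f} -> {verdict}")
```

With these counts the result clears the 0.05 bar but not the stricter 0.01 bar suggested for high-stakes tests.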

### 7. Document everything

Write down:
- Hypothesis (with number)
- Start date / end date
- Sample size achieved
- Primary metric, secondary metrics
- Result + p-value
- Decision + reasoning
- What you'd test next

A documented test history prevents your team from re-testing things that already failed and from forgetting why you made past decisions.
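
A structured record makes the checklist above easier to keep consistent across tests. This is one possible shape, not a format the skill prescribes; the class name, field names, and every value in the example are hypothetical.

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class ExperimentRecord:
    hypothesis: str
    start_date: str               # ISO dates keep the log sortable
    end_date: str
    sample_size_per_variant: int
    primary_metric: str
    result: str
    p_value: float
    decision: str
    next_test: str
    secondary_metrics: list[str] = field(default_factory=list)

# Illustrative entry only; none of these numbers come from a real test.
record = ExperimentRecord(
    hypothesis="Changing the CTA to 'Get my free demo' lifts conversion rate by 15%",
    start_date="2025-03-03",
    end_date="2025-03-17",
    sample_size_per_variant=13_000,
    primary_metric="conversion rate",
    result="variant +12% relative",
    p_value=0.018,
    decision="ship variant; lift is below prediction but significant",
    next_test="test action-specific language on the pricing page CTA",
    secondary_metrics=["bounce rate", "demo show rate"],
)

print(json.dumps(asdict(record), indent=2))
```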

---

## Sample Size Calculation

### Quick formula

```
Sample size per variant ≈ 16 × p × (1 − p) / MDE²

where:
  p   = baseline conversion rate (e.g. 0.03 = 3%)
  MDE = minimum detectable effect, in absolute terms
        (e.g. 0.006 = lift from 3% to 3.6%)
```

This gives the per-variant sample size for **80% power, 95% confidence (two-sided), and a 50/50 split**, which are sensible defaults for most marketing tests.
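
The constant 16 is a rounded form of 2 × (z for the confidence level + z for the power)². The sketch below wraps the quick formula in a function and recomputes that constant with `statistics.NormalDist` (Python 3.8+). It assumes both variants share the baseline variance p × (1 − p), which is the same approximation the quick formula makes; the function names are illustrative.

```python
import math
from statistics import NormalDist

def quick_sample_size(p: float, mde: float) -> int:
    """Rule-of-thumb sample size per variant: 16 * p * (1 - p) / MDE^2."""
    # round() guards against floating-point noise before rounding up.
    return math.ceil(round(16 * p * (1 - p) / mde**2, 6))

def multiplier(alpha: float = 0.05, power: float = 0.80) -> float:
    """2 * (z_{1 - alpha/2} + z_{power})^2, the constant the rule rounds to 16."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return 2 * (z_alpha + z_power) ** 2

print(round(multiplier(), 2))            # just under 16 for the defaults
print(round(multiplier(alpha=0.01), 2))  # a 99% bar needs a larger constant
print(quick_sample_size(0.03, 0.006))
```

Moving to 99% confidence raises the constant by roughly half, so high-stakes tests need a correspondingly larger sample.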

### Worked example A — landing page CRO

Current conversion rate is 3%. You want to detect a 20% relative lift (from 3% to 3.6%).

- p = 0.03
- MDE (absolute) = 0.20 × 0.03 = 0.006
- Sample size per variant = 16 × 0.03 × 0.97 / 0.006² = **12,933 visitors**
- Total: ~25,866 visitors. At 500 visitors/day → ~52 days.

That's slow. Either run it (if the change matters), test something with a bigger expected lift, or get more traffic on the test surface.
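
To weigh those options, it helps to see how the duration moves with the lift you target and the traffic you have. The sketch below reuses the quick formula; the helper name is hypothetical, and it rounds each variant up to a whole visitor, so it reports one more visitor per variant than the rounded figure above.

```python
import math

def days_needed(baseline: float, relative_lift: float, daily_visitors: int) -> int:
    """Days to collect both variants' samples under the 16 * p * (1 - p) / MDE^2 rule."""
    mde = baseline * relative_lift  # absolute effect
    per_variant = math.ceil(round(16 * baseline * (1 - baseline) / mde**2, 6))
    return math.ceil(2 * per_variant / daily_visitors)

for lift in (0.10, 0.20, 0.40):
    print(f"{lift:.0%} relative lift at 500/day -> {days_needed(0.03, lift, 500)} days")

print(f"20% relative lift at 1,000/day -> {days_needed(0.03, 0.20, 1_000)} days")
```

Because the sample scales with 1 / MDE², halving the detectable lift roughly quadruples the duration, while doubling traffic only halves it.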

### Worked example B — email subject line

Current open rate is 25%. You want to detect a 10% relative lift (to 27.5%).

- p = 0.25
- MDE = 0.025
- Sample size per variant = 16 × 0.25 × 0.75 / 0.025² = **4,800 sends**
- Total: 9,600 sends across both variants, usually achievable in one campaign.
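
When the list size is fixed, the same formula can be inverted to ask what lift a campaign is able to detect. A minimal sketch under the same 80% power / 95% confidence assumptions; the helper name and the list sizes are illustrative.

```python
import math

def detectable_effect(p: float, n_per_variant: int) -> float:
    """Smallest absolute effect detectable with n per variant: sqrt(16 * p * (1 - p) / n)."""
    return math.sqrt(16 * p * (1 - p) / n_per_variant)

open_rate = 0.25
for list_size in (2_000, 9_600, 40_000):
    mde = detectable_effect(open_rate, list_size // 2)  # 50/50 split
    print(f"{list_size:>6,} sends -> {mde:.3f} absolute ({mde / open_rate:.0%} relative)")
```

A 9,600-send list recovers the 0.025 absolute effect from the example above; a much smaller list can only detect correspondingly larger differences.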

### Feasibility quick-reference

| Daily volume | Conv. rate | Days needed | Test feasibility |
|--------------|-----------|------------|------------------|
| < 100        | any        | 2+ months  | Skip — focus on traffic first |
| 100–500      | 2–5%       | 3–6 weeks  | Yes, but be pa

PULL_REQUEST_TEMPLATE (Skill)

channel-operator (Subagent)

Channel operations agent: channel setup, landing page briefs, email marketing, social listening

content-producer (Subagent)

Content production agent: script writing, copy, creator briefs, content scheduling

mkt-strategist (Subagent)

Marketing strategy agent: planning, market research, competitor analysis, brand strategy development

performance-analyst (Subagent)

Performance analysis agent: reading data, evaluating campaigns, calculating KPIs, reporting

personal-brand-builder (Subagent)

Personal brand building agent with AI Avatar: strategy, content engine, monetization, and community for founders/coaches/creators

29-dropshipping-mastery-global (Skill)

Full dropshipping pipeline for US/EU/global markets — product research (winning criteria, Minea, PiPiAds), supplier sourcing (AliExpress, CJ Dropshipping, Spocket, Zendrop), Shopify store setup (themes, apps), ad creative pipeline (10 ads/week methodology, UGC pattern), audience targeting (interest stacking, lookalike, broad), pricing math (3-5x markup, BE-ROAS), customer service (long shipping, refunds), scaling playbook (CBO, vertical), compliance (FTC, EU CHRD). Trigger: 'dropshipping', 'shopify store', 'AliExpress', 'winning product', 'Facebook ads dropship', 'TikTok ads dropship', 'Shopify conversion'.

22-personal-brand-context-global (Skill)

Foundation skill for global personal brand cluster. Creates `.agents/personal-brand-context-global.md` with region-specific personal brand context. 4 region variants (US/EU/SEA/LATAM); each covers founder/coach/creator inside. Reads BEFORE other PB skills (23-28 global). Trigger: 'global personal brand', 'international personal brand', 'US founder brand', 'EU coach brand', 'creator economy global'.