Skip to main content
ClaudeWave
Skill963 repo starsupdated 3d ago

ab-test-planner

The ab-test-planner skill designs statistically rigorous A/B tests for product features, UI changes, onboarding flows, and pricing experiments. Use it when setting up an experiment, calculating sample size, or interpreting test results. It produces complete test plans including hypothesis formulation, variant definitions, sample size calculations, duration estimates, guardrail metrics, and results interpretation guidance to ensure experiments yield trustworthy conclusions rather than directional signals.

Install in Claude Code
Copy
git clone --depth 1 https://github.com/mohitagw15856/pm-claude-skills /tmp/ab-test-planner && cp -r /tmp/ab-test-planner/plugins/pm-delivery/skills/ab-test-planner ~/.claude/skills/ab-test-planner
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# A/B Test Planner Skill

Design experiments that produce trustworthy results — not just directional signals. Every test output includes hypothesis, success metrics, sample size, duration, and a results interpretation guide.

## Required Inputs

Ask the user for these if not provided:
- **What is being tested** (feature, UI change, copy, pricing, onboarding step)
- **Hypothesis** (or ask to help formulate one)
- **Primary metric** (conversion rate, click-through, completion rate, etc.)
- **Baseline rate** and **minimum detectable effect** (MDE)
- **Daily eligible users** (to calculate duration)

## Experiment Design Checklist

Before running any test, confirm:
- [ ] Clear hypothesis with predicted direction
- [ ] Single primary metric (plus up to 2 guardrail metrics)
- [ ] Minimum detectable effect (MDE) defined
- [ ] Sample size calculated
- [ ] Test duration estimated
- [ ] Segment isolated (no overlap with other running tests)
- [ ] Rollback plan defined

## Hypothesis Template

> "We believe that [change] will cause [primary metric] to [increase/decrease] by [X%] for [user segment], because [rationale based on data or insight]."

Never run a test without a directional hypothesis. "Let's just see what happens" is not a hypothesis.

## Sample Size Calculator Logic

Use this formula (provide the output, not the formula, to the user):

- **Baseline conversion rate:** Current rate of primary metric
- **MDE:** Smallest change worth detecting (recommend 10–20% relative lift for most features)
- **Statistical power:** 80% (standard)
- **Significance level:** 95% (p < 0.05)

For common scenarios, provide pre-calculated estimates:

| Baseline Rate | MDE (Relative) | Required Sample per Variant |
|---|---|---|
| 5% | 20% | ~19,000 |
| 10% | 15% | ~14,000 |
| 20% | 10% | ~15,000 |
| 40% | 10% | ~9,500 |
| 60% | 5% | ~42,000 |

Always warn: "These are estimates. Use a tool like Evan Miller's calculator or Statsig for precision."

## Test Duration Guidance

Minimum: 2 full weeks (to capture weekly seasonality)
Maximum: 4 weeks (novelty effect distorts results beyond this)

`Duration = Required sample ÷ (Daily traffic × % exposed)`

Flag if traffic is too low to reach significance in under 8 weeks — recommend a different approach (e.g., holdout test, qualitative research).

## Output Format

### A/B Test Plan — [Test Name] — [Date]

**Hypothesis:**
> [Filled hypothesis template]

**Variants:**
- Control (A): [Current experience]
- Treatment (B): [Changed experience — be specific]

**Primary Metric:** [Metric name + how measured]
**Guardrail Metrics:** [Metrics that must not degrade]

**Target Segment:** [Who sees the test — % of traffic, user type]
**Traffic Split:** [50/50 recommended unless ramp-up needed]

**Sample Size Required:** ~[N] users per variant
**Estimated Duration:** [X] weeks (based on [Y] daily eligible users)
**Significance Threshold:** 95% confidence, 80% power

**Exclusions:** [Any user segments to exclude and why]

**Rollback Trigger:** If [guardrail metric] degrades by [X%], stop the test immediately.

**Results Interpretation Guide:**
- ✅ Ship if: Treatment shows [X%]+ lift on primary metric at 95% confidence AND guardrail metrics are stable
- 🔄 Iterate if: Direction is positive but not significant — consider extending or redesigning
- ❌ Reject if: No lift or negative direction at significance
- ⚠️ Inconclusive: Do not ship. Do not call it a win.

---

## Guidelines

- Always recommend against peeking at results before the test reaches planned sample size — explain p-hacking risk
- If user wants to test multiple variants, explain the multiple comparisons problem and recommend a Bonferroni correction or a Bayesian approach
- If traffic is very low (<1,000 users/day), recommend qualitative alternatives: moderated testing, 5-second tests, or user interviews
- Never approve a test with no guardrail metrics — always protect revenue, retention, or core engagement

## Anti-Patterns

- [ ] Do not run a test without a directional hypothesis — "let's see what happens" produces uninterpretable results
- [ ] Do not declare a winner before reaching the pre-planned sample size — peeking at results inflates false positive rates
- [ ] Do not test multiple independent changes in a single variant — you won't know which change caused the result
- [ ] Do not use engagement metrics (clicks, time-on-page) as the primary metric when the goal is revenue or retention — proxy metrics mislead
- [ ] Do not ignore guardrail metrics — a conversion lift that causes a support ticket spike is not a win

## Quality Checks

- [ ] Hypothesis is directional (predicts a specific direction and magnitude, not "let's see")
- [ ] Primary metric is singular (guardrail metrics are secondary)
- [ ] Sample size is calculated from actual MDE and baseline (not guessed)
- [ ] Test duration accounts for weekly seasonality (minimum 2 weeks)
- [ ] Guardrail metrics are defined (at least one to protect revenue or core engagement)
- [ ] Rollback trigger is specified with a concrete threshold
ai-ethics-reviewSkill

Conduct a structured ethical review of an AI or ML feature, model, or product. Use when preparing to deploy an AI system, assessing algorithmic risk, auditing a model for bias, or producing a responsible AI impact assessment. Produces a structured ethics review covering fairness, transparency, privacy, safety, accountability, and societal impact with a risk tier score, pre-deployment checklist, and prioritised mitigations.

ai-product-canvasSkill

Structure AI and ML product decisions with the rigour of any product decision. Use when building AI-powered features, evaluating LLM integrations, designing AI products, or assessing AI readiness. Produces a complete AI product canvas covering problem definition, model approach, data requirements, evaluation framework, UX design, responsible AI checklist, and launch monitoring plan.

design-handoff-briefSkill

Transform feature briefs into structured design briefs that give designers the context they need before opening Figma. Use when asked to write a design brief, create a design handoff, brief a designer on a new feature, or translate a PRD into design requirements. Produces a brief with user goal, emotional context, success criteria, constraints, edge cases, and out-of-scope boundaries.

experiment-designerSkill

Design statistically rigorous A/B tests and interpret experiment results. Use when asked to design an experiment, run an A/B test, calculate sample size, interpret test results, or assess whether an experiment was successful. Produces a complete experiment design with hypothesis, sample size, run time, success criteria, and risk flags — or a results interpretation with ship/iterate/kill recommendation.

multi-source-signal-synthesiserSkill

Synthesises user signals from multiple research sources into a unified, weighted insight brief. Use when you have data from interviews, support tickets, NPS verbatims, app reviews, or sales calls and need to reconcile contradictions, surface the underlying need behind requests, or answer 'what are users really telling us'. Produces ranked insights with confidence ratings, source weighting rationale, divergent signal analysis by user segment, and a research gap identification section.

data-analysis-standardSkill

Structure a product data analysis, metric deep-dive, funnel analysis, or cohort study. Use when asked to analyse product metrics, investigate a drop in conversion, explain a data change to stakeholders, or find the root cause of a metric movement. Produces a structured analysis with question, root cause, confidence level, and recommended action.

product-health-analysisSkill

Interpret product metrics against goals and surface actionable signals. Use when asked to analyse product health, review key metrics, investigate a performance issue, produce a health report, or assess product-market fit signals. Produces a structured health report with RAG status, trend analysis, root cause hypotheses, and prioritised actions.

retention-analysisSkill

Structure a retention analysis, churn investigation, or engagement deep-dive for any product team. Use when asked to analyse user retention, investigate churn, measure DAU/MAU, or build a retention improvement plan. Produces a retention snapshot with root cause hypotheses, aha-moment correlation, and prioritised interventions.