ClaudeWave
Skill, 333 repo stars, updated today

experiment-design

This Claude Code skill provides a structured playbook for designing and running trustworthy product experiments, covering the full lifecycle from hypothesis formation through result interpretation and decision-making. Use it before designing or analyzing any A/B test, multivariate test, or holdout experiment to avoid common pitfalls like vague hypotheses, premature result checking, and shipping decisions based on measurement error rather than genuine product impact.

Install in Claude Code
git clone --depth 1 https://github.com/rampstackco/claude-skills /tmp/experiment-design && cp -r /tmp/experiment-design/dist/pi/.agents/skills/experiment-design ~/.claude/skills/experiment-design
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# Experiment Design

A senior product manager's playbook for running experiments that produce trustworthy decisions.

The default state of experimentation in most companies is sloppy. PMs run tests against vague hypotheses, look at results too early, ignore guardrails, stratify into noise, and ship features whose lift is mostly measurement error. The cost is real: ship the wrong thing, kill the right thing, learn the wrong lesson, repeat.

This skill is the discipline that prevents most of those mistakes. It assumes you have a working experimentation platform (Statsig, PostHog, GrowthBook, Optimizely, Amplitude, Eppo, Kameleoon; the platform does not matter for the principles). It assumes you have product-design and engineering pipelines that can deliver real treatment changes. The hard part is the thinking, and the thinking is what this skill covers.

When to use this skill: any time you are about to design or interpret an experiment. Read the relevant section before you start, not after the test is running.

---

## What this skill covers

The skill spans the full experiment lifecycle. Pre-experiment readiness (is this thing even worth testing). Hypothesis design (cause, effect, magnitude, mechanism). Sample size and minimum detectable effect (do you have enough traffic to learn anything). Duration (how long is long enough, when does the cycle bias the result). Running discipline (no peeking, guardrails, sequential testing). Interpretation (the three buckets and the inconclusive case). Decision-making (matching the result to a pre-committed rule).

The skill does not cover feature flag operational mechanics; those live in the `feature-flagging` skill, which handles flag taxonomy, environment management, and stale-flag cleanup as a separate discipline. The skill does not cover statistical analysis depth; for delta methods, variance reduction techniques like CUPED, and Bayesian alternatives, see the `experimentation-analytics` skill. The skill does not cover platform-specific tooling; for MCP commands, auth models, and platform-specific configuration, consult the chosen platform's official documentation. This skill produces the experiment design; the platform implements it.

For the orchestration layer above (which experiments to run, in what order, with what cadence), see the forthcoming `experimentation-platform-orchestrator` skill. That skill schedules; this skill designs.

---

## The framework: 12 considerations for trustworthy experiment results

A defensible experiment design sits at the intersection of twelve considerations. Each is covered in detail in its own section below.

1. **Hypothesis discipline.** Cause, effect, magnitude, and mechanism. The hypothesis names what is being tested, what should move, by how much, and why.
2. **Sample size and minimum detectable effect (MDE).** Whether the test has enough traffic to detect the effect at the chosen power. Refuse to run underpowered tests.
3. **Test duration.** Run for the longer of the time needed to reach the target sample size and one full weekly cycle. UI/UX changes need at least 14 days regardless.
4. **What NOT to A/B test.** UX bugs, legal-required changes, brand-philosophy questions, decisions already made, designs whose randomization cannot be clean.
5. **Segment analysis.** Pre-registered segments are evidence; post-hoc segments are noise mining. The multiple comparisons problem is real.
6. **Interaction effects.** Concurrent tests on the same surface can interfere. Mutex enforcement or coordination required.
7. **Ratio metrics and variance estimation.** Naive variance estimators on ratios understate uncertainty. Confirm the platform uses a ratio-aware estimator.
8. **Network effects and two-sided markets.** Treatment can leak into control via interference. Cluster randomization, switchback, or geographic isolation when needed.
9. **Sequential testing and the peeking problem.** Daily peeking inflates false positive rates. Use sequential testing methods when available; pre-commit otherwise.
10. **Pre-commitment vs p-hacking.** Write down the primary metric, MDE, duration, segments, and decision rule before launch. Apply mechanically when results come in.
11. **Reading results and making the call.** Three buckets: clear win, clear loss, inconclusive. The inconclusive bucket exists for a reason; resist the pull to ship anyway.
12. **Common failures and fixes.** A short rapid-fire pattern catalog, expanded in [`references/common-failures.md`](references/common-failures.md).

The sections below cover each consideration in turn. Read the relevant section before running the experiment, not after.
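
As a rough illustration of considerations 2 and 3, the sketch below estimates a per-arm sample size for a conversion-rate test and turns it into a minimum duration. It assumes a two-sided test on two proportions under the standard normal approximation, which may differ from what your platform computes. The function names, the default alpha and power, and the traffic figures are hypothetical choices for the example, not part of the skill.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per arm to detect a relative lift.

    Assumption: two-sided z-test on two proportions (normal approximation).
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


def duration_days(n_per_arm, daily_users_per_arm, is_ui_change):
    """Longer of the sample-size duration and the weekly-cycle floor."""
    days_for_sample = math.ceil(n_per_arm / daily_users_per_arm)
    # One full weekly cycle at minimum; 14 days for UI/UX changes.
    floor_days = 14 if is_ui_change else 7
    return max(days_for_sample, floor_days)


# Illustrative inputs: 10% baseline, 10% relative MDE, 2,000 users/day/arm.
n = sample_size_per_arm(baseline=0.10, relative_mde=0.10)
print("users per arm:", n)
print("minimum days:", duration_days(n, 2000, is_ui_change=True))
```

If the resulting duration is longer than the team is willing to wait, the test is underpowered for that MDE. The honest options are a larger MDE, more traffic, or not running the test.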

---

## Hypothesis discipline

The most important section in the skill. Most experiment failures trace back to a vague hypothesis.

A real hypothesis has four parts: cause, effect, magnitude, mechanism. Cause is the change you are making. Effect is the metric you expect to move. Magnitude is how much you expect it to move and from what baseline. Mechanism is why you expect this change to produce this effect.

Bad hypothesis, common shape: "We think the new pricing page will increase conversions." What is wrong with it: no magnitude (how much), no mechanism (why), and the metric is "conversions" rather than a specific event with a clear definition. The team will run this test, look at the result, and argue about what counts as a win. Pre-commitment is impossible because nothing was committed.
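
One way to make the four-part check mechanical is to hold the hypothesis in a small structure and flag whichever parts are missing. The sketch below is only an illustration: the `Hypothesis` class, its field names, and the `missing_parts` rule are hypothetical and not part of the skill or any platform API. It also does not judge whether the metric is specific enough, which still takes a human.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """Hypothetical container for the four parts of a hypothesis."""

    cause: str            # the change being made
    effect_metric: str    # the metric expected to move
    baseline_rate: float  # current value of the metric, as a fraction
    relative_lift: float  # expected relative change, as a fraction
    mechanism: str        # why the change should produce the effect

    def target_rate(self) -> float:
        """Absolute rate implied by the baseline and the relative lift."""
        return self.baseline_rate * (1 + self.relative_lift)

    def missing_parts(self) -> list[str]:
        """Names of the parts that are blank or carry no magnitude."""
        missing = []
        if not self.cause.strip():
            missing.append("cause")
        if not self.effect_metric.strip():
            missing.append("effect")
        if self.baseline_rate <= 0 or self.relative_lift == 0:
            missing.append("magnitude")
        if not self.mechanism.strip():
            missing.append("mechanism")
        return missing


# The vague pricing-page hypothesis: no magnitude, no mechanism.
vague = Hypothesis(
    cause="new pricing page",
    effect_metric="conversions",
    baseline_rate=0.0,
    relative_lift=0.0,
    mechanism="",
)
print(vague.missing_parts())  # → ['magnitude', 'mechanism']
```

A design review could refuse to schedule any experiment whose `missing_parts()` list is non-empty.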

Good hypothesis, same domain: "Replacing the three-tier pricing comparison with a single recommended tier will increase signup-to-paid conversion by 8 percent relative (currently 12 percent, target roughly 13 percent) by reducing decision friction for users who already know they want to subscribe." Cause is the tier replacement. Effect is signup-to-paid conversion, defined as a user reaching the paywall and completing payment within seven days. Magnitude is an 8 percent relative lift, taking the rate from 12 percent to roughly 13 percent absolute. Mechanism is decision friction reduction. Now the team has something to test, a

accessibility-audit (Skill)

Run a comprehensive WCAG accessibility audit covering perceivable, operable, understandable, and robust principles. Use this skill whenever the user wants to audit accessibility, review WCAG compliance, fix accessibility issues, prepare for accessibility certification, address an accessibility lawsuit risk, or systematically improve a site's accessibility. Triggers on accessibility audit, WCAG audit, a11y audit, accessibility compliance, ADA compliance, screen reader test, keyboard navigation, accessibility report, fix accessibility, axe scan. Also triggers when accessibility issues have been reported and need systematic remediation.

ads-creative-development (Skill)

How to produce ad creative that converts at performance scale. Hook patterns, format selection, video pacing, variation systems, sequential testing methodology, fatigue detection, brand-voice alignment without conversion dilution, and platform-specific creative norms. Triggers on ad creative, ad design, hook patterns, ad video pacing, creative testing, ad variations, creative refresh, creative fatigue, refresh ad creative, video ads for Meta, TikTok creative, LinkedIn ad creative, ad asset library. Also triggers when a team is producing creative at scale, planning a creative test cycle, or auditing why creative is not converting.

ads-performance-analytics (Skill)

How to read paid media dashboards without fooling yourself. Attribution models, platform reporting quirks, multi-platform reconciliation, ROAS vs LTV horizon traps, statistical noise in performance metrics, incrementality testing, and the failure modes that produce expensive lessons. Triggers on read paid media dashboard, attribution analysis, ROAS vs LTV, multi-platform reconciliation, ad incrementality, geo holdout, conversion lift study, ghost bidding, paid media reporting, board-deck paid media metrics, blended CAC, MMM, MTA, last-click attribution. Also triggers when a marketer is about to scale, kill, or rebudget a campaign based on platform metrics, or when reconciling platform reports against warehouse revenue.

after-action-report (Skill)

Run a structured after-action review (postmortem, retrospective) on a launch, incident, or completed project to capture timeline, root cause analysis, contributing factors, and actionable lessons. Use this skill whenever the user wants to run a postmortem, retrospective, AAR, or after-action review on any past event. Triggers on after-action report, AAR, postmortem, retrospective, retro, post-incident review, what went well what didn't, lessons learned, blameless postmortem, root cause analysis, RCA, five whys. Also triggers when the user has just shipped something or just resolved an incident and wants to capture learnings.

ai-content-collaboration (Skill)

How humans and AI compose in content workflows. Where AI legitimately participates, where humans must own, hybrid workflow patterns, voice ownership preservation, the AI slop problem, disclosure and transparency, team calibration, and the ethics of intellectually honest AI-assisted content production. Triggers on AI content workflow, AI-assisted writing, hybrid content production, AI in editorial, AI slop, AI disclosure, AI usage policy, AI content ethics, voice preservation with AI, team AI calibration. Also triggers when content feels generic despite quality tools, when team AI usage has drifted into inconsistency, or when a regulated or trust-sensitive context requires explicit AI policy.

analytics-strategy (Skill)

Design measurement frameworks including event taxonomy, KPI hierarchy, dashboard architecture, attribution models, and analytics implementation strategy. Use this skill whenever the user wants to plan analytics, design dashboards, build event taxonomies, define KPIs, set up tracking, or audit existing measurement. Triggers on analytics strategy, measurement plan, event taxonomy, tracking plan, KPI framework, dashboard design, north star metric, attribution model, conversion tracking, GA4 setup, Mixpanel setup, analytics audit. Also triggers when the user has data but no clear way to use it, or wants to make decisions but doesn't know what to track.

art-direction (Skill)

Direct visual and creative work for campaigns, photography, illustration, video, and branded experiences. Use this skill whenever the user wants to brief a photographer, direct illustrators, plan a creative campaign, develop visual concepts, write a creative direction document, or evaluate creative work for fit. Triggers on art direction, photo brief, photography brief, illustration brief, campaign concept, creative concept, visual direction, mood board, look and feel, visual treatment, video direction. Also triggers when the user has approved brand identity but needs to extend it into specific creative deliverables.

backup-and-disaster-recovery (Skill)

Plan and run backups, set recovery objectives, and run disaster recovery drills. Use this skill when defining RPO/RTO targets, designing backup architecture, deciding what to back up and how often, planning for full-region or platform outages, or running a restoration drill. Triggers on backup, restore, RPO, RTO, disaster recovery, DR, business continuity, what if the database is gone, what if our hosting goes down, recovery drill, ransomware planning. Also triggers when an incident reveals a gap in restoration capability.