ClaudeWave
Skill · 963 repo stars · updated 3d ago

experiment-designer

# Experiment Designer

The experiment-designer skill produces statistically rigorous A/B test designs and interprets results to guide ship/iterate/kill decisions. Use it when designing experiments from product hypotheses, calculating required sample sizes, setting success criteria before testing, or analyzing completed test results for both statistical and practical significance. It enforces pre-registration discipline, flags design risks like novelty effects and multiple testing problems, and validates that tests ran for their full planned duration to avoid early-stopping bias.

Install in Claude Code
git clone --depth 1 https://github.com/mohitagw15856/pm-claude-skills /tmp/experiment-designer && cp -r /tmp/experiment-designer/plugins/pm-advanced/skills/experiment-designer ~/.claude/skills/experiment-designer
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# Experiment Designer Skill

Produce rigorous experiment designs from product hypotheses, and interpret results with statistical and practical significance — so you can defend every decision to a sceptical engineering lead or data scientist.

## Required Inputs

Ask the user for these if not provided:

**For experiment design:**
- Hypothesis (what change, what metric, what expected movement)
- Current baseline metric value
- Minimum detectable effect (MDE) — the smallest lift worth caring about
- Available daily sample size

**For results interpretation:**
- Control and variant results (raw numbers or percentages)
- P-value or confidence interval
- Run duration (days)
- Any anomalies observed during the test

## Two-Phase Process

### Phase 1: Experiment Design
1. Restate hypothesis as: "If we [change], we expect [metric] to [move by X%] because [reason]"
2. Define control and variant clearly
3. Select primary metric (one only) and secondary guardrail metrics (2-3 max)
4. Calculate required sample size from MDE and baseline
5. Estimate run time in days
6. Set pre-defined success criteria before the test runs — no moving goalposts
7. Flag design risks: novelty effects, seasonal confounds, multiple testing issues, network effects, sample ratio mismatch
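
Steps 4 and 5 can be sketched with the standard normal-approximation formula for comparing two proportions. This is a minimal illustration, not the skill's own implementation: the function names, the relative-MDE convention, the 5% significance level, the 80% power default, and the even traffic split are all assumptions made for the example.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-sided two-proportion z-test.

    baseline is the current conversion rate (e.g. 0.10) and relative_mde is the
    smallest relative lift worth detecting (e.g. 0.10 for +10%). Both the
    alpha and power defaults are assumptions, not values from the skill.
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)  # variant rate if the MDE is exactly met
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


def run_time_days(n_per_variant, daily_users, variants=2):
    """Days needed when daily_users are split evenly across all variants."""
    return math.ceil(n_per_variant * variants / daily_users)


if __name__ == "__main__":
    n = sample_size_per_variant(baseline=0.10, relative_mde=0.10)
    print(n, run_time_days(n, daily_users=5000))
```

Under these defaults, a 10% baseline with a 10% relative MDE needs roughly 14,700 users per variant. Halving the MDE roughly quadruples the requirement, which is why the MDE should be agreed before estimating run time.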

### Phase 2: Results Interpretation
1. Assess statistical significance (p < 0.05 threshold)
2. Assess practical significance: was the lift large enough to matter for the business, not just statistically detectable?
3. Interpret confidence intervals
4. Investigate confounding factors
5. Recommend: Ship / Iterate / Kill / Run follow-up test
6. **Validate** — Confirm the test ran for the full planned duration. Flag if it was stopped early (peeking problem). Confirm sample ratio mismatch did not occur.
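
When the user supplies raw counts rather than a p-value, steps 1 to 3 can be approximated with a two-proportion z-test. The sketch below is one common way to do it, assuming conversion-style (binary) metrics and large samples; the function names and the `min_lift` business threshold are hypothetical and are not defined by the skill.

```python
import math
from statistics import NormalDist


def two_proportion_test(conv_c, n_c, conv_v, n_v, alpha=0.05):
    """Two-sided z-test on conversion counts, plus a CI on the absolute lift."""
    p_c, p_v = conv_c / n_c, conv_v / n_v
    diff = p_v - p_c
    # Pooled standard error under the null hypothesis of no difference
    p_pool = (conv_c + conv_v) / (n_c + n_v)
    se_null = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_v))
    z = diff / se_null
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled standard error for the confidence interval
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return {
        "diff": diff,
        "p_value": p_value,
        "ci": (diff - z_crit * se, diff + z_crit * se),
    }


def classify(result, min_lift, alpha=0.05):
    """Keep statistical and practical significance as separate questions.

    min_lift is the smallest absolute lift the business cares about
    (a hypothetical, pre-registered threshold).
    """
    significant = result["p_value"] < alpha
    low, _high = result["ci"]
    if significant and low >= min_lift:
        return "statistically and practically significant"
    if significant and result["diff"] > 0:
        return "statistically significant, practical value uncertain"
    if significant:
        return "statistically significant harm"
    return "not statistically significant"


if __name__ == "__main__":
    res = two_proportion_test(1000, 10000, 1100, 10000)
    print(res, classify(res, min_lift=0.005))
```

The labels returned by `classify` are deliberately descriptive rather than Ship / Iterate / Kill verdicts: the final recommendation still depends on confounders, guardrail metrics, and the validation in step 6.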

## Output Structure

**[Design or Results header based on phase]**

*Hypothesis:* "If we [change], we expect [metric] to [move by X%] because [reason]"

*Primary metric:* [One metric only]
*Guardrail metrics:* [2-3 max]
*Required sample size:* [n per variant]
*Estimated run time:* [days]
*Pre-defined success threshold:* [specific number]
*Design risk flags:* [any concerns]

**Results (Phase 2 only):**
*Statistical significance:* [p-value and conclusion]
*Practical significance:* [lift size vs. business threshold]
*Recommendation:* Ship / Iterate / Kill / Follow-up — [rationale]

## Quality Checks

- [ ] Hypothesis specifies the change, the metric, the direction, and the reason
- [ ] Primary metric is singular — guardrail metrics are secondary
- [ ] Success criteria are defined before the test launches (not after seeing results)
- [ ] Test was not stopped early (or flagged clearly if it was)
- [ ] Practical significance assessed separately from statistical significance
- [ ] Sample ratio mismatch is checked in results interpretation
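
The sample ratio mismatch check in the last item can be sketched as a chi-square goodness-of-fit test on the assignment counts. This is an illustrative version for two arms only; the function names are hypothetical, and the 0.001 alert threshold is a commonly used convention assumed here, not a value stated by the skill.

```python
import math


def srm_p_value(n_control, n_variant, expected_control_share=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for a two-arm split."""
    total = n_control + n_variant
    exp_c = total * expected_control_share
    exp_v = total - exp_c
    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_variant - exp_v) ** 2 / exp_v
    # With 1 degree of freedom, the chi-square survival function is erfc(sqrt(chi2 / 2))
    return math.erfc(math.sqrt(chi2 / 2))


def has_srm(n_control, n_variant, expected_control_share=0.5, threshold=0.001):
    """Flag a mismatch when the observed split is very unlikely under the planned ratio."""
    return srm_p_value(n_control, n_variant, expected_control_share) < threshold


if __name__ == "__main__":
    print(has_srm(10000, 10100))  # small imbalance
    print(has_srm(10000, 10600))  # large imbalance
```

A flagged mismatch usually points to an assignment or logging problem, so the metric comparison should be treated as unreliable until the cause is found.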

## Anti-Patterns

- [ ] Do not define success criteria after seeing preliminary results: post-hoc success definitions are HARKing (Hypothesising After the Results are Known) and invalidate the experiment
- [ ] Do not stop a test early because the result looks significant: repeatedly checking and stopping at the first significant result pushes the false positive rate well above the nominal 5%; the test must run to the planned sample size
- [ ] Do not treat statistical significance as the same as practical significance: a p < 0.05 result with a 0.1% lift may be statistically detectable yet still not worth shipping
- [ ] Do not run the same experiment on the same population multiple times without correction: the chance of at least one false positive grows with every additional test (1 − (1 − α)^k for k independent tests at level α)
- [ ] Do not use more than one primary metric: multiple primary metrics require multiple-comparison corrections and make the ship/kill decision ambiguous
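
To make the multiple-testing items concrete, the short sketch below computes the chance of at least one false positive across several independent tests and the Bonferroni-adjusted per-test threshold. Bonferroni is only one possible correction and is an assumption here (the skill does not name a method); the function names are hypothetical.

```python
def family_wise_error_rate(alpha, k):
    """Chance of at least one false positive across k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k


def bonferroni_alpha(alpha, k):
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / k


if __name__ == "__main__":
    for k in (1, 2, 5, 10):
        print(k, round(family_wise_error_rate(0.05, k), 3), bonferroni_alpha(0.05, k))
```

With alpha = 0.05 and five tests, the chance of at least one false positive is roughly 23%, which is why repeated runs or several primary metrics need a correction or a single pre-registered metric.
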

ai-ethics-review (Skill)

Conduct a structured ethical review of an AI or ML feature, model, or product. Use when preparing to deploy an AI system, assessing algorithmic risk, auditing a model for bias, or producing a responsible AI impact assessment. Produces a structured ethics review covering fairness, transparency, privacy, safety, accountability, and societal impact with a risk tier score, pre-deployment checklist, and prioritised mitigations.

ai-product-canvas (Skill)

Structure AI and ML product decisions with the rigour of any product decision. Use when building AI-powered features, evaluating LLM integrations, designing AI products, or assessing AI readiness. Produces a complete AI product canvas covering problem definition, model approach, data requirements, evaluation framework, UX design, responsible AI checklist, and launch monitoring plan.

design-handoff-brief (Skill)

Transform feature briefs into structured design briefs that give designers the context they need before opening Figma. Use when asked to write a design brief, create a design handoff, brief a designer on a new feature, or translate a PRD into design requirements. Produces a brief with user goal, emotional context, success criteria, constraints, edge cases, and out-of-scope boundaries.

multi-source-signal-synthesiser (Skill)

Synthesises user signals from multiple research sources into a unified, weighted insight brief. Use when you have data from interviews, support tickets, NPS verbatims, app reviews, or sales calls and need to reconcile contradictions, surface the underlying need behind requests, or answer 'what are users really telling us'. Produces ranked insights with confidence ratings, source weighting rationale, divergent signal analysis by user segment, and a research gap identification section.

data-analysis-standard (Skill)

Structure a product data analysis, metric deep-dive, funnel analysis, or cohort study. Use when asked to analyse product metrics, investigate a drop in conversion, explain a data change to stakeholders, or find the root cause of a metric movement. Produces a structured analysis with question, root cause, confidence level, and recommended action.

product-health-analysis (Skill)

Interpret product metrics against goals and surface actionable signals. Use when asked to analyse product health, review key metrics, investigate a performance issue, produce a health report, or assess product-market fit signals. Produces a structured health report with RAG status, trend analysis, root cause hypotheses, and prioritised actions.

retention-analysis (Skill)

Structure a retention analysis, churn investigation, or engagement deep-dive for any product team. Use when asked to analyse user retention, investigate churn, measure DAU/MAU, or build a retention improvement plan. Produces a retention snapshot with root cause hypotheses, aha-moment correlation, and prioritised interventions.

board-deck-narrative (Skill)

Build the storyline and slide structure for a board presentation. Use when asked to create a board deck, board presentation narrative, board meeting slides, or quarterly board update. Produces a complete slide-by-slide structure with narrative beats, talking points, and slide content guidance.