ab-test-analysis
This Claude Code skill analyzes A/B test results by calculating statistical significance, validating sample sizes, computing confidence intervals, and checking guardrail metrics to determine whether to ship, extend, or stop an experiment. Use it when you need to interpret split test data, verify statistical rigor, assess practical significance beyond p-values, or make data-driven product decisions about rolling out a variant.
git clone --depth 1 https://github.com/phuryn/pm-skills /tmp/ab-test-analysis && cp -r /tmp/ab-test-analysis/pm-data-analytics/skills/ab-test-analysis ~/.claude/skills/ab-test-analysisSKILL.md
## A/B Test Analysis
Evaluate A/B test results with statistical rigor and translate findings into clear product decisions.
### Context
You are analyzing A/B test results for **$ARGUMENTS**.
If the user provides data files (CSV, Excel, or analytics exports), read and analyze them directly. Generate Python scripts for statistical calculations when needed.
### Instructions
1. **Understand the experiment**:
- What was the hypothesis?
- What was changed (the variant)?
- What is the primary metric? Any guardrail metrics?
- How long did the test run?
- What is the traffic split?
2. **Validate the test setup**:
- **Sample size**: Is the sample large enough for the expected effect size?
- Use the formula: n = (Z²α/2 × 2 × p × (1-p)) / MDE²
- Flag if the test is underpowered (<80% power)
- **Duration**: Did the test run for at least 1-2 full business cycles?
- **Randomization**: Any evidence of sample ratio mismatch (SRM)?
- **Novelty/primacy effects**: Was there enough time to wash out initial behavior changes?
3. **Calculate statistical significance**:
- **Conversion rate** for control and variant
- **Relative lift**: (variant - control) / control × 100
- **p-value**: Using a two-tailed z-test or chi-squared test
- **Confidence interval**: 95% CI for the difference
- **Statistical significance**: Is p < 0.05?
- **Practical significance**: Is the lift meaningful for the business?
If the user provides raw data, generate and run a Python script to calculate these.
4. **Check guardrail metrics**:
- Did any guardrail metrics (revenue, engagement, page load time) degrade?
- A winning primary metric with degraded guardrails may not be a true win
5. **Interpret results**:
| Outcome | Recommendation |
|---|---|
| Significant positive lift, no guardrail issues | **Ship it** — roll out to 100% |
| Significant positive lift, guardrail concerns | **Investigate** — understand trade-offs before shipping |
| Not significant, positive trend | **Extend the test** — need more data or larger effect |
| Not significant, flat | **Stop the test** — no meaningful difference detected |
| Significant negative lift | **Don't ship** — revert to control, analyze why |
6. **Provide the analysis summary**:
```
## A/B Test Results: [Test Name]
**Hypothesis**: [What we expected]
**Duration**: [X days] | **Sample**: [N control / M variant]
| Metric | Control | Variant | Lift | p-value | Significant? |
|---|---|---|---|---|---|
| [Primary] | X% | Y% | +Z% | 0.0X | Yes/No |
| [Guardrail] | ... | ... | ... | ... | ... |
**Recommendation**: [Ship / Extend / Stop / Investigate]
**Reasoning**: [Why]
**Next steps**: [What to do]
```
Think step by step. Save as markdown. Generate Python scripts for calculations if raw data is provided.
---
### Further Reading
- [A/B Testing 101 + Examples](https://www.productcompass.pm/p/ab-testing-101-for-pms)
- [Testing Product Ideas: The Ultimate Validation Experiments Library](https://www.productcompass.pm/p/the-ultimate-experiments-library)
- [Are You Tracking the Right Metrics?](https://www.productcompass.pm/p/are-you-tracking-the-right-metrics)The method for finding the gap between what a system is supposed to do and what the code actually does — the class of bug generic scanners miss because they have no model of intent. Defines what counts as documented intent, what counts as implementation evidence, which mismatches matter, and how to avoid hand-wavy findings. Use when auditing AI-built code, reviewing access control against documented permissions, or checking whether a codebase matches its own documentation.
The durable documentation set that makes an AI-built (vibe-coded) app reviewable before shipping. A small core every app needs — architecture, user/permission flows, permissions, variables/secrets, and a test-coverage map — plus conditional docs added only when they apply: emails, scheduled work, SEO, and embedded agents/automation. Defines what each doc must capture and how a reviewer or auditor uses it. Use when documenting a codebase for handoff, mapping user journeys and trust-boundary crossings, planning test coverage, or preparing for a security or performance audit.
Perform cohort analysis on user engagement data — retention curves, feature adoption trends, and segment-level insights. Use when analyzing user retention by cohort, studying feature adoption over time, investigating churn patterns, or identifying engagement trends.
Generate SQL queries from natural language descriptions. Supports BigQuery, PostgreSQL, MySQL, and other dialects. Reads database schemas from uploaded diagrams or documentation. Use when writing SQL, building data reports, exploring databases, or translating business questions into queries.
Brainstorm team-level OKRs aligned with company objectives — qualitative objectives with measurable key results. Use when setting quarterly OKRs, aligning team goals with company strategy, drafting objectives, or learning how to write effective OKRs.
Create a Product Requirements Document using a comprehensive 8-section template covering problem, objectives, segments, value propositions, solution, and release planning. Use when writing a PRD, documenting product requirements, preparing a feature spec, or reviewing an existing PRD.
Generate realistic dummy datasets for testing with customizable columns, constraints, and output formats (CSV, JSON, SQL, Python script). Use when creating test data, building mock datasets, or generating sample data for development and demos.
Create job stories using the 'When [situation], I want to [motivation], so I can [outcome]' format with detailed acceptance criteria. Use when writing job stories, creating JTBD-style backlog items, or expressing user situations and motivations.