experimentation-analytics
The experimentation-analytics skill teaches how to interpret experiment result panels correctly to avoid shipping wrong decisions. It covers confidence intervals, p-values, multiple testing corrections, sequential testing, CUPED variance reduction, heterogeneous treatment effects, ratio metrics, network effects, and dashboard reconciliation. Use this when reading experiment results before making ship, kill, or iterate decisions.
git clone --depth 1 https://github.com/rampstackco/claude-skills /tmp/experimentation-analytics && cp -r /tmp/experimentation-analytics/dist/pi/.agents/skills/experimentation-analytics ~/.claude/skills/experimentation-analytics

SKILL.md
# Experimentation Analytics

A data-team-mentor's playbook for interpreting experiment results without fooling yourself.

The result panel is the moment of truth for an experiment. The numbers on it determine whether you ship, kill, or iterate. They also expose every shortcut taken in the design phase: an underpowered test produces wide confidence intervals; a peeked test produces a p-value that is too small; a ratio metric without delta-method correction produces overconfident lift estimates. Most ship-the-wrong-thing decisions trace back to misreading the result panel.

This skill is the discipline that prevents misreading. It assumes the experiment was designed well (see the `experiment-design` skill). It assumes the platform's results panel is technically correct (most modern platforms are; some older ones are not). It assumes you can read a number off a screen. The hard part is knowing what each number actually means and what it does not, and that is what this skill covers.

When to use this skill: any time you are reading an experiment result panel and about to make a ship, kill, or iterate decision.

---

## What this skill is for

This skill covers result interpretation, the statistical concepts that make the numbers trustworthy, and the dashboard reconciliation work that prevents executive-level confusion when the experiment number does not match the BI number. The audience is product managers and data analysts who read experiment results together and need a shared vocabulary that does not paper over the dangerous parts of statistics.

Companion skills cover the adjacent territory. The `experiment-design` skill covers pre-experiment thinking: hypothesis, sample size, MDE, segments, what NOT to test. Read it before designing the test; read this skill when reading the result. The `feature-flagging` skill covers the operational mechanics of flag management, environment promotion, and stale-flag cleanup.
Together the three skills span the experimentation lifecycle from intent through interpretation. For platform-specific MCP commands, consult the chosen platform's docs; Statsig, PostHog, Optimizely, GrowthBook, Eppo, Amplitude, and Kameleoon all expose rich analytics surfaces that this skill helps you read.

---

## The result panel: what every modern platform should expose

A result panel that omits any of the following is a black box. Treat results from black-box platforms with extra skepticism, and consider exporting raw assignment and event data into a notebook where you can compute the missing pieces yourself.

What a competent platform exposes:

- Variants and traffic allocation (e.g., 50/50, 33/33/33). Allocation drift across the test window indicates assignment bugs.
- Per-variant primary metric: point estimate, confidence interval (or credible interval for Bayesian), sample size at the variant level.
- Lift: variant minus control, expressed as both absolute change and relative percent. Both numbers matter; relative is intuitive, absolute is what shows up in revenue calculations.
- Statistical significance: p-value (frequentist) or probability of being best (Bayesian). The methodology should be labeled clearly so you know which interpretation rules apply.
- Variance reduction technique applied: CUPED, post-stratification, regression adjustment. If the platform applies these silently, ask which.
- Guardrail metric statuses: each guardrail labeled green, amber, or red against its tolerance. The tolerance was set at design time; the panel just enforces it.
- Per-segment results for pre-registered segments only. Post-hoc segment slicers are tempting and dangerous.
- Test status: running, ended, decision filed.
- A time series of the lift across the test window. This is where novelty effects, primacy effects, and assignment bugs become visible.

If you are looking at a result panel that hides any of these, the first move is to surface them, not to ship.
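
If you do export raw counts into a notebook, two of the pieces above are simple to recompute. The following is a minimal sketch, not any platform's actual method: it assumes a binary conversion metric measured per randomized user, a two-variant test, and a plain normal-approximation (Wald) interval. The function names `lift_summary` and `srm_p_value` and the example counts are hypothetical.

```python
import math
from statistics import NormalDist


def lift_summary(control_conv, control_n, variant_conv, variant_n, confidence=0.95):
    """Absolute and relative lift for a conversion rate, plus a Wald CI on the absolute lift.

    Assumption: one binary outcome per randomized user. Ratio metrics need a
    delta-method correction that this sketch does not implement.
    """
    p_c = control_conv / control_n
    p_v = variant_conv / variant_n
    abs_lift = p_v - p_c                      # variant minus control
    rel_lift = abs_lift / p_c                 # relative to the control rate
    se = math.sqrt(p_c * (1 - p_c) / control_n + p_v * (1 - p_v) / variant_n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return {
        "abs_lift": abs_lift,
        "rel_lift": rel_lift,
        "ci_low": abs_lift - z * se,
        "ci_high": abs_lift + z * se,
    }


def srm_p_value(control_n, variant_n, expected_control_share=0.5):
    """Chi-square test (1 degree of freedom) for allocation drift between two variants.

    A very small p-value suggests observed traffic does not match the intended split.
    """
    total = control_n + variant_n
    exp_c = total * expected_control_share
    exp_v = total - exp_c
    chi2 = (control_n - exp_c) ** 2 / exp_c + (variant_n - exp_v) ** 2 / exp_v
    # Survival function of chi-square with 1 df, written with the stdlib only.
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical counts, for illustration only.
summary = lift_summary(control_conv=1000, control_n=10000,
                       variant_conv=1100, variant_n=10000)
print(summary)
print(srm_p_value(control_n=5300, variant_n=4700))
```

The chi-square shortcut here only covers two variants; a test with three or more arms would need a general chi-square survival function from a statistics library.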
---

## Confidence intervals: the most important number

The single most important number on the result panel is the confidence interval (CI) on the lift. More important than the point estimate. More important than the p-value. The CI tells you what you actually know.

What a 95% CI of [+2%, +6%] means: if you repeated the experiment many times and built the interval the same way each time, about 95% of those intervals would contain the true effect. Values near the middle are the most compatible with the data, but the extremes are entirely consistent with it too.

What it does not mean: it does not literally mean "there is a 95% chance the true effect is between +2% and +6%." That statement describes a Bayesian credible interval, which often gives similar numerical answers but is conceptually different. PMs can usually live with the loose intuition; analysts should know the precise version when defending a number to a skeptic.

The width of the CI matters more than the center for most ship decisions. A wide CI means you do not know much yet. A narrow CI means you know with precision. The point estimate is your best guess; the width is your humility.

Practical decision rules, in order of importance:

1. If the CI includes zero AND a meaningful positive number (say [-1%, +5%]), you do not have enough data to ship. Period. The point estimate may look favorable, but the data is consistent with no effect and consistent with a meaningful win. You cannot tell which.
2. If the CI is all-positive (lower bound greater than zero, e.g., [+1%, +4%]), the data supports a real positive effect. Now evaluate magnitude: is the lower bound large enough to be worth the implementation cost?
3. If the CI is all-negative (upper bound less than zero, e.g., [-5%, -1%]), the data supports real harm. Kill the test.
4. If the CI straddles zero but is narrow (e.g., [-0.5%, +0.5%]), this is a real null result. The effect is small enough to call essentially zero. Useful information; do not ship the change for "lift" reasons (you found none) but do not panic about harm either.
5. If the CI straddles zero and is wide (e.g., [-5%, +8%]), the test is inconclusive. The data is consi
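
The decision rules above can be sketched as a small classifier. This is an illustration under stated assumptions, not part of the skill itself: the function name `classify_ci` and the `negligible` threshold (how narrow counts as "essentially zero") are hypothetical, and the threshold should come from the experiment's design rather than from this default. Rules 1 and 5 both describe an interval that straddles zero and is not narrow, so the sketch returns the same label for both.

```python
def classify_ci(lower, upper, negligible=0.5):
    """Map a CI on relative lift (in percent) to one of four readings.

    negligible is a hypothetical tolerance, in the same units as the bounds:
    an interval lying entirely within [-negligible, +negligible] is treated
    as a real null result.
    """
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    if lower > 0:
        return "positive"      # rule 2: real effect, now evaluate magnitude
    if upper < 0:
        return "negative"      # rule 3: real harm
    if -negligible <= lower and upper <= negligible:
        return "null"          # rule 4: effect is essentially zero
    return "inconclusive"      # rules 1 and 5: not enough data to decide


# The example intervals from the rules above.
for bounds in [(-1, 5), (1, 4), (-5, -1), (-0.5, 0.5), (-5, 8)]:
    print(bounds, classify_ci(*bounds))
```

The labels only describe what the interval supports; whether a "positive" result is worth shipping still depends on comparing the lower bound with the implementation cost.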