ab-test-setup
The ab-test-setup skill provides a structured framework for designing rigorous A/B tests by enforcing mandatory gates at critical stages: hypothesis validation, metrics definition, statistical power calculation, and assumptions review. Use this skill before launching any A/B test to prevent common pitfalls like peeking at results, underpowered studies, or invalid hypotheses that could lead to misleading conclusions.
git clone --depth 1 https://github.com/sickn33/antigravity-awesome-skills /tmp/ab-test-setup && cp -r /tmp/ab-test-setup/plugins/antigravity-awesome-skills-claude/skills/ab-test-setup ~/.claude/skills/ab-test-setup
SKILL.md
# A/B Test Setup

## 1️⃣ Purpose & Scope

Ensure every A/B test is **valid, rigorous, and safe** before a single line of code is written.

- Prevents "peeking"
- Enforces statistical power
- Blocks invalid hypotheses

---

## 2️⃣ Pre-Requisites

You must have:

- A clear user problem
- Access to an analytics source
- Roughly estimated traffic volume

### Hypothesis Quality Checklist

A valid hypothesis includes:

- Observation or evidence
- Single, specific change
- Directional expectation
- Defined audience
- Measurable success criteria

---

### 3️⃣ Hypothesis Lock (Hard Gate)

Before designing variants or metrics, you MUST:

- Present the **final hypothesis**
- Specify:
  - Target audience
  - Primary metric
  - Expected direction of effect
  - Minimum Detectable Effect (MDE)

Ask explicitly:

> “Is this the final hypothesis we are committing to for this test?”

**Do NOT proceed until confirmed.**

---

### 4️⃣ Assumptions & Validity Check (Mandatory)

Explicitly list assumptions about:

- Traffic stability
- User independence
- Metric reliability
- Randomization quality
- External factors (seasonality, campaigns, releases)

If assumptions are weak or violated:

- Warn the user
- Recommend delaying or redesigning the test

---

### 5️⃣ Test Type Selection

Choose the simplest valid test:

- **A/B Test** – single change, two variants
- **A/B/n Test** – multiple variants, higher traffic required
- **Multivariate Test (MVT)** – interaction effects, very high traffic
- **Split URL Test** – major structural changes

Default to **A/B** unless there is a clear reason otherwise.


---

### 6️⃣ Metrics Definition

#### Primary Metric (Mandatory)

- Single metric used to evaluate success
- Directly tied to the hypothesis
- Pre-defined and frozen before launch

#### Secondary Metrics

- Provide context
- Explain _why_ results occurred
- Must not override the primary metric

#### Guardrail Metrics

- Metrics that must not degrade
- Used to prevent harmful wins
- Trigger test stop if significantly negative

---

### 7️⃣ Sample Size & Duration

Define upfront:

- Baseline rate
- MDE
- Significance level (typically α = 0.05, i.e. 95% confidence)
- Statistical power (typically 80%)

Estimate:

- Required sample size per variant
- Expected test duration

**Do NOT proceed without a realistic sample size estimate.**

---

### 8️⃣ Execution Readiness Gate (Hard Stop)

You may proceed to implementation **only if all are true**:

- Hypothesis is locked
- Primary metric is frozen
- Sample size is calculated
- Test duration is defined
- Guardrails are set
- Tracking is verified

If any item is missing, stop and resolve it.

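
As an illustration of the sample size and duration estimate that section 7️⃣ requires, here is a minimal sketch for a conversion-rate (binary) primary metric. It uses the standard normal-approximation formula for comparing two proportions with a two-sided test. The function names, the reading of the MDE as a *relative* lift, the equal traffic split, and the example traffic figure are assumptions for illustration and are not part of the skill itself.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, relative_mde, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant for a two-sided two-proportion test.

    Assumption: relative_mde is a relative lift (0.10 means +10% over baseline).
    """
    if not 0 < baseline_rate < 1:
        raise ValueError("baseline_rate must be strictly between 0 and 1")
    if relative_mde <= 0:
        raise ValueError("relative_mde must be positive")

    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)
    if p2 >= 1:
        raise ValueError("baseline_rate * (1 + relative_mde) must stay below 1")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power

    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


def estimated_duration_days(n_per_variant, daily_visitors, variants=2):
    """Days needed to fill every variant, assuming an even split of stable traffic."""
    if daily_visitors <= 0:
        raise ValueError("daily_visitors must be positive")
    return math.ceil(n_per_variant * variants / daily_visitors)


if __name__ == "__main__":
    # Hypothetical inputs: 5% baseline conversion, 10% relative MDE, 5,000 visitors/day.
    n = sample_size_per_variant(baseline_rate=0.05, relative_mde=0.10)
    days = estimated_duration_days(n, daily_visitors=5000)
    print(f"{n} visitors per variant, about {days} days")
```

If the estimated duration is unrealistic for the available traffic, that is the situation the refusal conditions describe: reconsider the MDE or the test design rather than launching an underpowered test. Many teams also round the duration up to whole weeks to cover weekly seasonality, which this sketch does not do.
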

---

## Running the Test

### During the Test

**DO:**

- Monitor technical health
- Document external factors

**DO NOT:**

- Stop early due to “good-looking” results
- Change variants mid-test
- Add new traffic sources
- Redefine success criteria

---

## Analyzing Results

### Analysis Discipline

When interpreting results:

- Do NOT generalize beyond the tested population
- Do NOT claim causality beyond the tested change
- Do NOT override guardrail failures
- Separate statistical significance from business judgment

### Interpretation Outcomes

| Result               | Action                                 |
| -------------------- | -------------------------------------- |
| Significant positive | Consider rollout                       |
| Significant negative | Reject variant, document learning      |
| Inconclusive         | Consider more traffic or bolder change |
| Guardrail failure    | Do not ship, even if primary wins      |

---

## Documentation & Learning

### Test Record (Mandatory)

Document:

- Hypothesis
- Variants
- Metrics
- Sample size vs achieved
- Results
- Decision
- Learnings
- Follow-up ideas

Store records in a shared, searchable location to avoid repeated failures.

---

## Refusal Conditions (Safety)

Refuse to proceed if:

- Baseline rate is unknown and cannot be estimated
- Traffic is insufficient to detect the MDE
- Primary metric is undefined
- Multiple variables are changed without proper design
- Hypothesis cannot be clearly stated

Explain why and recommend next steps.

---

## Key Principles (Non-Negotiable)

- One hypothesis per test
- One primary metric
- Commit before launch
- No peeking
- Learning over winning
- Statistical rigor first

---

## Final Reminder

A/B testing is not about proving ideas right. It is about **learning the truth with confidence**.

If you feel tempted to rush, simplify, or “just try it”, that is the signal to **slow down and re-check the design**.

## When to Use

Use this skill before launching any A/B test, while the hypothesis, metrics, and sample size can still be changed.


## Limitations

- Use this skill only when the task clearly matches the scope described above.
- Do not treat the output as a substitute for environment-specific validation, testing, or expert review.
- Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.