
data-warehouse-experimentation

This skill provides a framework for running A/B experiments directly within data warehouses like BigQuery or Snowflake, using dbt for metric definitions and SQL or Python for statistical analysis instead of relying on dedicated experimentation platforms. Use it when deciding between warehouse-native and platform-based experimentation, building internal experiment infrastructure, or handling custom metrics that existing platforms cannot support.

Repository: claude-skills

Install in Claude Code

git clone --depth 1 https://github.com/rampstackco/claude-skills /tmp/data-warehouse-experimentation && cp -r /tmp/data-warehouse-experimentation/dist/pi/.agents/skills/data-warehouse-experimentation ~/.claude/skills/data-warehouse-experimentation

Then open a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Data Warehouse Experimentation

A senior data scientist's playbook for running experiments natively out of BigQuery, Snowflake, or any modern data warehouse, with metric definitions in dbt and statistical analysis in SQL or Python.

Most companies that run experiments at scale use a dedicated platform: Statsig, Optimizely, LaunchDarkly with experimentation, PostHog, or Amplitude Experiment. The platforms are good. They handle assignment, instrumentation, and analysis in one product, and the SQL-savvy data team does not have to reinvent the variance reduction wheel.

There is a different operational model that mature data teams increasingly choose: warehouse-native experimentation. Assignment happens in code or via feature flags. Exposure events fire to the warehouse like any other event. Metrics are defined as dbt models. Statistical analysis runs as SQL or in a Python notebook against warehouse data. The "experiment platform" is just your existing data stack.

This skill covers when warehouse-native is the right call, the architecture, and the specific techniques that make it work: assignment patterns, exposure logging discipline, metric definitions in dbt, t-tests and CUPED in SQL, sequential testing, and the pitfalls that take down homegrown setups.
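As a rough illustration of two of the techniques named above, here is a minimal Python sketch of a CUPED adjustment followed by a Welch-style comparison of means. It assumes one pre-experiment covariate value and one in-experiment outcome value per user. The function names `cuped_adjust` and `welch_diff` are hypothetical, and the confidence interval uses a normal approximation that only suits large samples. The same arithmetic can be written with SQL aggregates; this is a sketch under those assumptions, not a reference implementation.

```python
import math
from statistics import NormalDist, mean, variance


def cuped_adjust(post, pre):
    """Return CUPED-adjusted outcomes: y - theta * (x - mean(x)).

    `post` and `pre` are per-user lists pooled across ALL variants, so theta
    is estimated once and applied identically to control and treatment.
    """
    mean_pre, mean_post = mean(pre), mean(post)
    # Sample covariance between the pre-period covariate and the outcome.
    cov = sum((x - mean_pre) * (y - mean_post) for x, y in zip(pre, post)) / (len(pre) - 1)
    theta = cov / variance(pre)
    return [y - theta * (x - mean_pre) for x, y in zip(pre, post)]


def welch_diff(control, treatment, alpha=0.05):
    """Return (diff, standard error, Welch-Satterthwaite df, (ci_low, ci_high)).

    The interval uses a normal quantile, which is an approximation that is
    reasonable only when both groups are large.
    """
    var_c = variance(control) / len(control)
    var_t = variance(treatment) / len(treatment)
    diff = mean(treatment) - mean(control)
    se = math.sqrt(var_c + var_t)
    df = (var_c + var_t) ** 2 / (
        var_c ** 2 / (len(control) - 1) + var_t ** 2 / (len(treatment) - 1)
    )
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return diff, se, df, (diff - z * se, diff + z * se)
```

In use, the adjustment would run once over all exposed users, after which the adjusted values are split by variant and passed to `welch_diff`. Because theta is the pooled regression slope, the adjusted outcomes keep the same pooled mean and cannot have higher pooled variance than the raw outcomes.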

When to use this skill: deciding between platform vs warehouse-native, building a warehouse-native experiment infrastructure, auditing an existing one, or running a specific experiment when the platform of record cannot handle a custom metric or segmentation.

---

## What this skill is for

This skill spans the operational execution model for warehouse-native experimentation. It does not replace the methodology and interpretation skills; it composes with them.

- `experiment-design` covers methodology: hypotheses, sample size, randomization unit, primary metric. Tool-agnostic. Read it first to design the experiment correctly regardless of where it runs.
- `experimentation-analytics` covers interpretation: confidence intervals, p-values, effect size, decision frameworks. Tool-agnostic. Read it when results land.
- `experimentation-platform-orchestrator` covers the platform-vs-warehouse decision in detail. Read it to decide whether to use a platform or this skill.
- `feature-flagging` covers assignment infrastructure when not running through a platform. Read it for the flag-management discipline that this skill assumes.
- This skill (`data-warehouse-experimentation`) covers the operational execution: SQL-based assignment, exposure logging, metric definitions in dbt, statistical analysis in SQL or Python, variance reduction, sequential testing.

The distinction is between "what to do" (the methodology and interpretation skills) and "how to do it without a vendor platform" (this skill). Read this skill after you have decided warehouse-native is the right call. If you are still deciding, start with `experimentation-platform-orchestrator`.

---

## When warehouse-native is the right call

Six factors push the decision toward warehouse-native.

1. **Cost at volume.** Platforms charge per MAU or per event. At 10K MAU the platform is cheap; at 1M MAU the bill becomes a real budget item. Warehouse-native runs on infrastructure you already pay for.
2. **Custom metrics.** If your primary metric is a complex business metric (revenue with refund-aware logic, cohort LTV, retention bracket, multi-event composites), platforms can struggle. Warehouse-native expresses any metric you can write in SQL.
3. **Custom segmentation.** Enterprise customers, account-tier crosscuts, complex behavioral segments. Platforms have segmentation features; the depth varies. dbt models compose without limit.
4. **Trust requirements.** Regulated industries (healthcare, finance, government) need full transparency into the math. Warehouse-native gives you every step of the calculation auditable in SQL.
5. **Existing data team strength.** If you have data engineers and data scientists, you have most of the infrastructure. Adding experimentation discipline on top costs less than adopting a new platform.
6. **Iteration on metric definitions.** Platforms ship metric updates on their own cadence. Warehouse-native iterates as fast as your dbt deployments.

Five factors push toward platform.

1. **Frontend visual experiments.** Optimizely's bread and butter. Variant code injected via a script tag, with WYSIWYG editing.
2. **Sub-week iteration speed.** Some platforms set up an experiment in 30 minutes; warehouse-native often takes a day or more for the first run of a new metric pattern.
3. **Teams without strong data infrastructure.** If you do not have a warehouse, dbt, and analysts, do not start with warehouse-native. The platform is the right call.
4. **Mobile experimentation.** SDK-based assignment with offline support is the platform's job, not the warehouse's.
5. **Out-of-the-box sequential testing with strict guarantees.** Platforms such as Statsig and Eppo ship sequential testing (mSPRT-style methods) with calibrated error control. Building this in-house is real work.

Details and a decision tree are in [`references/warehouse-vs-platform-decision.md`](references/warehouse-vs-platform-decision.md). Many mature teams use both: warehouse-native for the hard cases, platform for fast iteration on standard experiments.

---

## The architecture

Four components, in order of data flow.

1. **Assignment.** How users get bucketed into variants. Hash function, feature flag, or randomized assignment table.
2. **Exposure logging.** A discrete event fired the first time a user is exposed to the experiment, written to the warehouse like any other event.
3. **Metric definitions.** SQL queries (or dbt models) that compute the primary and secondary metrics from warehouse events.
4. **Analysis.** Statistical computation in SQL or Python that joins exposure to metrics and produces effect estimates with confidence intervals.
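The first two components can be sketched in a few lines of Python. This is a minimal sketch, assuming hash-based assignment salted with an experiment key and exposure events shaped as dictionaries. The names `assign_variant` and `first_exposures` and the fields `experiment_key`, `user_id`, and `ts` are hypothetical, not part of any library or of this skill's own code.

```python
import hashlib


def assign_variant(user_id, experiment_key, variants=("control", "treatment")):
    """Deterministically bucket a user: the same inputs always give the same variant."""
    # Salting with the experiment key keeps buckets independent across experiments.
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode("utf-8")).hexdigest()
    # Interpret the first 8 hex characters (32 bits) as an integer and map it to a
    # variant. With a variant count that does not divide 2**32 there is a tiny
    # modulo bias, which is negligible at this width.
    bucket = int(digest[:8], 16) % len(variants)
    return variants[bucket]


def first_exposures(events):
    """Keep only the earliest exposure event per (experiment_key, user_id).

    Each event is assumed to be a dict with `experiment_key`, `user_id`,
    and a comparable timestamp `ts`.
    """
    earliest = {}
    for event in events:
        key = (event["experiment_key"], event["user_id"])
        if key not in earliest or event["ts"] < earliest[key]["ts"]:
            earliest[key] = event
    return list(earliest.values())
```

Hashing makes assignment stateless: any service can recompute a user's bucket without a lookup table, and the warehouse can reproduce it later for auditing. Deduplicating to the first exposure mirrors the "first time a user is exposed" rule in component 2, so repeat visits do not inflate the exposed population.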

The flow. User visits the product. Assignment determines the bucket (control or treatment). If the user is exposed to the variant
