data-pipeline-spec
# data-pipeline-spec This Claude Code skill generates comprehensive data pipeline specifications for ETL and ELT processes, covering sources, transformations, destinations, SLAs, error handling, and data quality rules. Use it when designing data pipelines, documenting data ingestion workflows, planning data integrations, or preparing architecture reviews for engineering teams.
git clone --depth 1 https://github.com/mohitagw15856/pm-claude-skills /tmp/data-pipeline-spec && cp -r /tmp/data-pipeline-spec/plugins/pm-data/skills/data-pipeline-spec ~/.claude/skills/data-pipeline-specSKILL.md
# Data Pipeline Spec Skill This skill produces a complete data pipeline specification covering sources, transformations, destinations, scheduling, SLAs, error handling, data quality checks, and monitoring requirements. Output is ready for engineering handoff or architecture review. ## Required Inputs Ask the user for these if not provided: - **Pipeline purpose** — what business question or workflow does this pipeline serve? - **Source systems** — where does data come from? (databases, APIs, files, event streams) - **Destination** — where does data land? (data warehouse, data lake, downstream DB, reporting tool) - **Transformation type** — ETL (transform before loading) or ELT (load raw, transform in warehouse)? - **Frequency / SLA** — how often must data be fresh? (real-time / hourly / daily / weekly) - **Volume estimate** — approximate rows/events per run - **Data quality requirements** — completeness, deduplication, freshness, schema enforcement - **Team or stack** — any specific tools in use? (Airflow, dbt, Fivetran, Spark, Kafka, etc.) ## Output Structure --- # Data Pipeline Spec: [Pipeline Name] **Purpose:** [One sentence — what decision or workflow does this pipeline enable?] **Type:** [ETL / ELT / Streaming / Batch] **Owner:** [Team or individual] **Version:** [1.0] **Date:** [Date] **Status:** [Draft / Under Review / Approved] --- ## 1. Overview [2–3 sentences describing the pipeline end-to-end: what data moves, from where to where, at what cadence, and why.] **Architecture diagram (text):** ``` [Source A] ──┐ [Source B] ──┤──► [Ingestion Layer] ──► [Transform Layer] ──► [Destination] ──► [Consumers] [Source C] ──┘ ``` --- ## 2. Sources | Source | System | Connection type | Data format | Update pattern | Volume | |---|---|---|---|---|---| | [Source 1] | [PostgreSQL / Salesforce / S3 / Kafka] | [JDBC / REST API / SDK / Webhook] | [JSON / CSV / Parquet / CDC] | [Append / Full refresh / Incremental] | [X rows/day] | | [Source 2] | [...] | [...] | [...] | [...] | [...] | **Incremental key (if applicable):** [The column used to identify new or changed records — e.g. `updated_at`, `event_id`] **Authentication:** [API key / OAuth / IAM role / connection string — note where credentials are stored] --- ## 3. Ingestion Layer **Tool:** [Fivetran / Airbyte / Kafka Connect / custom script / dbt source] **Ingestion method:** - [ ] Full extract (full table refresh each run) - [ ] Incremental extract (only new/changed rows since last run) - [ ] CDC (change data capture from database transaction log) - [ ] Event streaming (continuous ingestion from Kafka/Kinesis) **Raw landing zone:** [Where raw data lands before transformation — e.g. `raw.salesforce_opportunities` in Snowflake, S3 bucket `s3://data-raw/crm/`] **Schema handling:** [Strict schema enforcement / Schema evolution allowed / Union schema] --- ## 4. Transformation Logic List each transformation in execution order. For ELT pipelines, this is the dbt model or SQL layer. | Step | Name | Description | Input | Output | Tool | |---|---|---|---|---|---| | 1 | [Deduplicate events] | [Remove duplicate event rows based on event_id] | `raw.events` | `staging.events_deduped` | [dbt / SQL / Spark] | | 2 | [Join user profile] | [Enrich events with user attributes from CRM] | `staging.events_deduped`, `raw.users` | `staging.events_enriched` | [...] | | 3 | [Aggregate to daily] | [Roll up to user×day grain] | `staging.events_enriched` | `mart.user_daily_activity` | [...] | **Business logic rules:** - [e.g. Revenue is recognised on `payment_confirmed_at`, not `payment_initiated_at`] - [e.g. Users in the `internal@company.com` domain are excluded from all metrics] - [e.g. Currency conversion uses the ECB rate from the first business day of each month] **Slowly Changing Dimensions (SCD) — if applicable:** - [e.g. `users.plan_tier` is SCD Type 2 — keep history of plan changes with `valid_from` / `valid_to`] --- ## 5. Destination | Destination | System | Schema / Table | Write mode | Consumers | |---|---|---|---|---| | [Primary] | [Snowflake / BigQuery / Redshift / PostgreSQL] | [`analytics.mart_user_activity`] | [Append / Upsert / Full replace] | [Looker / Metabase / downstream pipeline] | | [Secondary] | [...] | [...] | [...] | [...] | **Partitioning / Clustering:** [e.g. Partitioned by `event_date`, clustered by `user_id` — reduces query cost for time-range scans] **Retention policy:** [e.g. Raw data retained for 90 days; mart tables retained indefinitely] --- ## 6. Scheduling & SLAs | SLA | Target | Breach action | |---|---|---| | **Data freshness** | [Data must be ≤ X hours old by HH:MM UTC] | [Page on-call / alert Slack channel] | | **Pipeline completion** | [Must complete within X minutes of trigger] | [Alert and auto-retry] | | **Availability** | [Pipeline must run successfully X% of days per month] | [Incident review] | **Schedule:** [Cron expression and human description — e.g. `0 6 * * *` — daily at 06:00 UTC] **Trigger type:** - [ ] Time-based (cron) - [ ] Event-based (triggered by upstream pipeline success / file arrival / Kafka lag) - [ ] Manual (ad hoc runs only) **Backfill strategy:** [How to reprocess historical data if the pipeline fails or logic changes — e.g. parameterised date range, full drop-and-reload] --- ## 7. Data Quality Rules | Check | Table | Rule | Failure action | |---|---|---|---| | Completeness | `staging.events` | `event_id IS NOT NULL` — 100% of rows | Block load / Alert | | Uniqueness | `mart.user_daily_activity` | `(user_id, date)` must be unique | Block load | | Freshness | `mart.user_daily_activity` | `max(event_date) >= CURRENT_DATE - 1` | Alert | | Volume | `staging.events` | Row count within ±20% of 7-day average | Alert | | Referential integrity | `staging.events` | All `user_id` values exist in `users` table | Alert | **DQ tool:** [dbt tests / Great Expectations / Monte Carlo / custom SQL assertions] --- ## 8. Error Handling & Recovery **Retry policy:** [e.g. 3 retr
Conduct a structured ethical review of an AI or ML feature, model, or product. Use when preparing to deploy an AI system, assessing algorithmic risk, auditing a model for bias, or producing a responsible AI impact assessment. Produces a structured ethics review covering fairness, transparency, privacy, safety, accountability, and societal impact with a risk tier score, pre-deployment checklist, and prioritised mitigations.
Structure AI and ML product decisions with the rigour of any product decision. Use when building AI-powered features, evaluating LLM integrations, designing AI products, or assessing AI readiness. Produces a complete AI product canvas covering problem definition, model approach, data requirements, evaluation framework, UX design, responsible AI checklist, and launch monitoring plan.
Transform feature briefs into structured design briefs that give designers the context they need before opening Figma. Use when asked to write a design brief, create a design handoff, brief a designer on a new feature, or translate a PRD into design requirements. Produces a brief with user goal, emotional context, success criteria, constraints, edge cases, and out-of-scope boundaries.
Design statistically rigorous A/B tests and interpret experiment results. Use when asked to design an experiment, run an A/B test, calculate sample size, interpret test results, or assess whether an experiment was successful. Produces a complete experiment design with hypothesis, sample size, run time, success criteria, and risk flags — or a results interpretation with ship/iterate/kill recommendation.
Synthesises user signals from multiple research sources into a unified, weighted insight brief. Use when you have data from interviews, support tickets, NPS verbatims, app reviews, or sales calls and need to reconcile contradictions, surface the underlying need behind requests, or answer 'what are users really telling us'. Produces ranked insights with confidence ratings, source weighting rationale, divergent signal analysis by user segment, and a research gap identification section.
Structure a product data analysis, metric deep-dive, funnel analysis, or cohort study. Use when asked to analyse product metrics, investigate a drop in conversion, explain a data change to stakeholders, or find the root cause of a metric movement. Produces a structured analysis with question, root cause, confidence level, and recommended action.
Interpret product metrics against goals and surface actionable signals. Use when asked to analyse product health, review key metrics, investigate a performance issue, produce a health report, or assess product-market fit signals. Produces a structured health report with RAG status, trend analysis, root cause hypotheses, and prioritised actions.
Structure a retention analysis, churn investigation, or engagement deep-dive for any product team. Use when asked to analyse user retention, investigate churn, measure DAU/MAU, or build a retention improvement plan. Produces a retention snapshot with root cause hypotheses, aha-moment correlation, and prioritised interventions.