Skill229 repo starsupdated 12d ago

data-pipeline-builder

The data-pipeline-builder Claude Code skill designs and implements production-ready ETL/ELT data pipelines by accepting data sources, transformation requirements, and destination specifications, then generating a complete pipeline specification document alongside Python/SQL implementation files, scheduling configurations, error handling logic, data quality checks, and monitoring setup. Use this skill when building enterprise data workflows that require comprehensive architecture planning, multi-stage transformations, and operational infrastructure.

View source Repository: claude-skills

Install in Claude Code

Copy

git clone --depth 1 https://github.com/OneWave-AI/claude-skills /tmp/data-pipeline-builder && cp -r /tmp/data-pipeline-builder/data-pipeline-builder ~/.claude/skills/data-pipeline-builder

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Data Pipeline Builder

Design and implement production-grade ETL/ELT data pipelines: take data sources, a destination, and transformation requirements, then produce a complete pipeline specification plus all implementation files needed to run it.

## Contents

- `references/project-structure.md` -- output file layout, architecture pattern selection, component selection.
- `references/python-patterns.md` -- Python code standards and base extractor/transformer/loader/retry patterns.
- `references/quality-checks.md` -- composable data quality check framework and built-in checks.
- `references/orchestration-config.md` -- Airflow DAG, pipeline config YAML, and monitoring/alerting patterns.
- `references/spec-template.md` -- the `data-pipeline-spec.md` output template.

## Workflow

1. Gather requirements. If the user gave clear requirements, proceed to design. Otherwise ask targeted questions: data sources (databases, APIs, files, streams); destination (warehouse, lake, database); transformations (joins, aggregations, filters, business rules); freshness requirement (real-time, hourly, daily); technology preferences (Airflow, dbt, Spark, cloud provider); data quality and compliance requirements.

2. Analyze and design. Catalog each source (connection type, auth, schema, volume, CDC availability, rate limits). Define the destination (platform, schema design, partitioning, clustering, access patterns). Map transformations (field mappings, business logic, type conversions, joins, aggregations, deduplication, SCD handling, derived fields). Establish non-functional requirements (freshness SLA, processing window, failure tolerance, retention, compliance). Select an architecture pattern and components per `references/project-structure.md`.

3. Present the design before generating code. Confirm architecture, sources, destination, schedule, key transformations, and quality gates with the user, then proceed on approval.

4. Generate implementation. Produce all files following the layout in `references/project-structure.md`, customized to the specific pipeline with no placeholder code requiring manual editing:
- For each source, generate a concrete extractor inheriting from `BaseExtractor` (see `references/python-patterns.md`).
- For each transformation, generate a concrete transformer class or SQL file.
- For each destination, generate a concrete loader inheriting from `BaseLoader`.
- Generate the Airflow DAG with all task dependencies wired up and the pipeline config YAML (see `references/orchestration-config.md`).
- Generate quality checks tailored to the data and monitoring config with appropriate alert thresholds (see `references/quality-checks.md` and `references/orchestration-config.md`).
- Generate tests for all custom business logic.

5. Generate the specification last. Produce `data-pipeline-spec.md` using `references/spec-template.md`, referencing all implementation files and incorporating design decisions made during the process.

## Operating Rules

- Design for idempotency -- make every step safely re-runnable.
- Include watermark/checkpoint tracking for incremental pipelines.
- Include dead letter handling for records that fail processing.
- Include schema evolution handling -- sources will change their schemas.
- Never hardcode credentials -- use environment variables or secret managers.
- Never skip quality checks -- they are the first line of defense against bad data.
- Prefer SQL for transformations expressible in SQL; use Python for complex logic that does not map cleanly to SQL.
- Include a backfill strategy and an operational runbook covering common failure scenarios in the spec.
- Use structured logging throughout and track data lineage at every transformation step.