python-pipeline

The python-pipeline skill provides patterns for building production Python data processing workflows using modern frameworks like Polars, DuckDB, and pandas. Use it when designing multi-stage content pipelines, implementing batch processing systems, integrating external APIs like Google Sheets, or optimizing data transformations across gigabyte-scale datasets. The skill covers dispatcher patterns, async task management, and zero-copy DataFrame interoperability across tools.

Repository: claude-skills-journalism

Install in Claude Code

git clone --depth 1 https://github.com/jamditis/claude-skills-journalism /tmp/python-pipeline && cp -r /tmp/python-pipeline/dev-toolkit/skills/python-pipeline ~/.claude/skills/python-pipeline

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Python data pipeline development

Patterns for building production-quality data processing pipelines with Python.

**Targeted at Python 3.11+** for `asyncio.TaskGroup` and exception groups; Python 3.12+ for the lighter `type X = ...` syntax. Pin a 3.13+ runtime if you want the JIT or experimental free-threading; the patterns here don't depend on either.

## Choosing a DataFrame engine: pandas vs polars vs DuckDB

For a long time pandas was the default for any tabular work in Python. As of 2026 the default has shifted: **polars** is the right pick for multi-GB pipelines on a single machine, **DuckDB** is the right pick when SQL or larger-than-RAM scans are involved, and **pandas** stays useful for small data and the ML/notebook ecosystem (scikit-learn, statsmodels, plotnine all speak it natively).

| Tool | When | Why |
|---|---|---|
| pandas | < ~1 GB data, ML interop, single-threaded familiarity | Mature, ubiquitous, eager DataFrame model. Slowest in benchmarks but most ecosystem support. |
| polars | 1 GB - tens of GB on one box, performance-critical pipelines | Multithreaded by default, lazy query engine, Arrow-native. ~5x speedup over pandas on filter / aggregate at 100M rows. |
| DuckDB | SQL workflows, larger-than-RAM, parquet/CSV scanning, joins across many files | Vectorized + pipelined execution, cost-based optimizer, streaming scans. Works great as a thin wrapper over a directory of parquet files. |

All three speak Apache Arrow, so handing data from one to another is cheap and often zero-copy. Mixing them is the pragmatic answer most of the time:

```python
import polars as pl
import duckdb

# Polars: read a directory of CSVs, filter, group
df = (
    pl.scan_csv('data/articles_*.csv')
      # published_at is read as a string here; ISO-8601 dates compare correctly as text
      .filter(pl.col('published_at') >= '2026-01-01')
      .group_by('source')
      .agg(pl.len().alias('count'), pl.col('word_count').mean().alias('avg_wc'))
      .collect()
)

# DuckDB: same shape with SQL, no intermediate copy
con = duckdb.connect()
df = con.execute("""
    SELECT source, COUNT(*) AS count, AVG(word_count) AS avg_wc
    FROM 'data/articles_*.csv'
    WHERE published_at >= '2026-01-01'
    GROUP BY source
""").pl()  # returns a Polars DataFrame; use .df() for pandas

# Hand off to pandas only at the boundary that needs it (e.g. scikit-learn)
import pandas as pd
pdf = df.to_pandas()
```

If your pipeline already uses pandas everywhere, don't rewrite it pre-emptively. Migrate the bottleneck stages first, typically the CSV load and filter step.

## Architecture patterns

### Modular processor architecture
```
src/
├── workflow.py              # Main orchestrator
├── dispatcher.py            # Content-type router
├── processors/
│   ├── __init__.py
│   ├── base.py             # Abstract base class
│   ├── article_processor.py
│   ├── video_processor.py
│   └── audio_processor.py
├── services/
│   ├── sheets_service.py   # Google Sheets integration
│   ├── drive_service.py    # Google Drive integration
│   └── ai_service.py       # Gemini API wrapper
├── utils/
│   ├── logger.py
│   └── rate_limiter.py
└── config.py               # Environment configuration
```
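
The tree lists `utils/rate_limiter.py` without showing its contents. One minimal way to fill that slot, assuming a simple minimum-interval policy, is sketched here. The class name, its parameters, and the injectable `clock` and `sleep` hooks are illustrative choices, not taken from the repository.

```python
import time
from typing import Callable

class RateLimiter:
    """Enforce a minimum interval between calls (hypothetical sketch)."""

    def __init__(
        self,
        calls_per_second: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ):
        if calls_per_second <= 0:
            raise ValueError("calls_per_second must be positive")
        self.min_interval = 1.0 / calls_per_second
        self._clock = clock
        self._sleep = sleep
        self._last_call: float | None = None

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            delay = max(0.0, self.min_interval - (now - self._last_call))
        if delay > 0:
            self._sleep(delay)
        # Record when this call was actually allowed to proceed.
        self._last_call = now + delay
        return delay
```

A service wrapper would call `limiter.wait()` before each outbound request. Passing `clock` and `sleep` in keeps the class testable without real delays.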

### Dispatcher pattern

```python
from typing import Protocol
from urllib.parse import urlparse

class Processor(Protocol):
    def can_process(self, url: str) -> bool: ...
    def process(self, url: str, metadata: dict) -> dict: ...

class Dispatcher:
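    def __init__(self):
        # Checked in order, so the first processor whose can_process() returns
        # True wins. The Video/Audio/Social processors are assumed to follow
        # the same shape as ArticleProcessor below.
        self.processors: list[Processor] = [
            ArticleProcessor(),
            VideoProcessor(),
            AudioProcessor(),
            SocialProcessor(),
        ]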

    def dispatch(self, url: str, metadata: dict) -> dict:
        for processor in self.processors:
            if processor.can_process(url):
                return processor.process(url, metadata)
        raise ValueError(f"No processor found for URL: {url}")

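# Pattern-based routing
class ArticleProcessor:
    DOMAINS = ['nytimes.com', 'washingtonpost.com', 'medium.com']

    def can_process(self, url: str) -> bool:
        # Match the host exactly or as a subdomain; a substring test would
        # also accept look-alikes such as 'notmedium.com'.
        host = (urlparse(url).hostname or '').removeprefix('www.')
        return any(host == d or host.endswith('.' + d) for d in self.DOMAINS)

    def process(self, url: str, metadata: dict) -> dict:
        raise NotImplementedError  # extraction logic omitted in this pattern sketch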
```
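
To see the first-match routing end to end, here is a self-contained sketch. It assumes a hypothetical `DomainProcessor` stub in place of the real processors and passes the processor list into the dispatcher rather than hard-coding it, which is a testing convenience and not necessarily how the repository wires things up.

```python
from typing import Protocol
from urllib.parse import urlparse

class Processor(Protocol):
    def can_process(self, url: str) -> bool: ...
    def process(self, url: str, metadata: dict) -> dict: ...

class DomainProcessor:
    """Hypothetical stub: claims URLs by host and tags the result with a kind."""

    def __init__(self, kind: str, domains: list[str]):
        self.kind = kind
        self.domains = domains

    def can_process(self, url: str) -> bool:
        host = (urlparse(url).hostname or "").removeprefix("www.")
        return any(host == d or host.endswith("." + d) for d in self.domains)

    def process(self, url: str, metadata: dict) -> dict:
        return {**metadata, "url": url, "kind": self.kind}

class Dispatcher:
    def __init__(self, processors: list[Processor]):
        self.processors = processors  # checked in order; first match wins

    def dispatch(self, url: str, metadata: dict) -> dict:
        for processor in self.processors:
            if processor.can_process(url):
                return processor.process(url, metadata)
        raise ValueError(f"No processor found for URL: {url}")

dispatcher = Dispatcher([
    DomainProcessor("article", ["nytimes.com", "medium.com"]),
    DomainProcessor("video", ["youtube.com"]),
])
result = dispatcher.dispatch("https://www.nytimes.com/2026/story", {"id": "1"})
```

Because `Processor` is a `Protocol`, the stub needs no inheritance; any object with matching `can_process` and `process` methods can be added to the list.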

### CSV-based pipeline workflow

```python
import csv
from pathlib import Path
from dataclasses import dataclass, asdict
from typing import Iterator

@dataclass
class Record:
    id: str
    url: str
    title: str | None = None
    content: str | None = None
    status: str = 'pending'

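def read_input(path: Path) -> Iterator[Record]:
    # newline='' lets the csv module handle line endings inside quoted fields
    with open(path, 'r', encoding='utf-8', newline='') as f:
        reader = csv.DictReader(f)
        for row in reader:
            # Ignore columns the dataclass doesn't know about
            yield Record(**{k: v for k, v in row.items() if k in Record.__annotations__})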

def write_output(records: list[Record], path: Path):
    with open(path, 'w', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=list(Record.__annotations__.keys()))
        writer.writeheader()
        writer.writerows(asdict(r) for r in records)

def process_batch(input_path: Path, output_path: Path):
    dispatcher = Dispatcher()
    results = []

    for record in read_input(input_path):
        try:
            processed = dispatcher.dispatch(record.url, asdict(record))
            record.status = 'completed'
            record.title = processed.get('title')
            record.content = processed.get('content')
        except Exception as e:
            record.status = f'failed: {e}'
        results.append(record)

    write_output(results, output_path)
```

## Google Sheets integration

```python
import gspread
from google.oauth2.service_account import Credentials

SCOPES = [
    'https://www.googleapis.com/auth/spreadsheets',
    'https://www.googleapis.com/auth/drive'
]

class SheetsService:
    def __init__(self, credentials_path: str):
        creds = Credentials.from_service_account_file(credentials_path, scopes=SCOPES)
        self.client = gspread.authorize(creds)

    def get_worksheet(self, spreadsheet_id: str, sheet_name: str):
        spreadsheet = self.client.open_by_key(spreadsheet_id)
        return spreadsheet.worksheet(shee