ClaudeWave
Skill · 374 repo stars · updated 6 months ago

architecting-data

This skill provides strategic guidance for designing modern data platforms, covering storage paradigms like data lake and lakehouse, data modeling approaches including dimensional and data vault patterns, and architectural frameworks such as medallion architecture and data mesh principles. Use it when architecting new data platforms, selecting between centralized versus decentralized patterns, evaluating table formats like Iceberg and Delta Lake, or designing data governance frameworks.

Install in Claude Code
git clone --depth 1 https://github.com/ancoleman/ai-design-components /tmp/architecting-data && cp -r /tmp/architecting-data/skills/architecting-data ~/.claude/skills/architecting-data
Then open a new Claude Code session; the skill loads automatically.

SKILL.md

# Data Architecture

## Purpose

Guide architects and platform engineers through strategic data architecture decisions for modern cloud-native data platforms.

## When to Use This Skill

Invoke this skill when:
- Designing a new data platform or modernizing legacy systems
- Choosing between data lake, data warehouse, or data lakehouse
- Deciding on data modeling approaches (dimensional, normalized, data vault, wide tables)
- Evaluating centralized vs data mesh architecture
- Selecting open table formats (Apache Iceberg, Delta Lake, Apache Hudi)
- Designing medallion architecture (bronze, silver, gold layers)
- Implementing data governance and cataloging

## Core Concepts

### 1. Storage Paradigms

Three primary patterns for analytical data storage:

**Data Lake:** Centralized repository for raw data at scale
- Schema-on-read, cost-optimized (roughly $0.02-0.03/GB/month on cloud object storage)
- Use when: Diverse data sources, exploratory analytics, ML/AI training data

**Data Warehouse:** Structured repository optimized for BI
- Schema-on-write, ACID transactions, fast queries
- Use when: Known BI requirements, strong governance needed

**Data Lakehouse:** Hybrid combining lake flexibility with warehouse reliability
- Open table formats (Iceberg, Delta Lake), ACID on object storage
- Use when: Mixed BI + ML workloads, cost optimization (often 60-80% cheaper than a warehouse, depending on workload)

**Decision Framework:**
- BI/Reporting only + Known queries → Data Warehouse
- ML/AI primary + Raw data needed → Data Lake or Lakehouse
- Mixed BI + ML + Cost optimization → Data Lakehouse (recommended)
- Exploratory/Unknown use cases → Data Lake
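
The decision framework above can be sketched as a small lookup function. This is an illustrative encoding only: the function name `recommend_storage` and its three boolean inputs are assumptions made for this example, not part of the skill.

```python
WAREHOUSE = "Data Warehouse"
LAKE = "Data Lake"
LAKEHOUSE = "Data Lakehouse"


def recommend_storage(needs_bi: bool, needs_ml: bool, queries_known: bool) -> list:
    """Return candidate storage paradigms for a workload (hypothetical helper)."""
    if needs_bi and needs_ml:
        # Mixed BI + ML + cost optimization -> lakehouse (recommended)
        return [LAKEHOUSE]
    if needs_bi and queries_known:
        # BI/reporting only with known queries -> warehouse
        return [WAREHOUSE]
    if needs_ml:
        # ML/AI primary with raw data needed -> lake or lakehouse
        return [LAKE, LAKEHOUSE]
    # Exploratory or unknown use cases -> lake
    return [LAKE]
```

Returning a list keeps the "Data Lake or Lakehouse" case explicit instead of silently picking one option.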

For detailed comparison, see [references/storage-paradigms.md](references/storage-paradigms.md).

### 2. Data Modeling Approaches

Four primary modeling patterns:

**Dimensional (Kimball):** Star/snowflake schemas for BI
- Use when: Known query patterns, BI dashboards, trend analysis

**Normalized (3NF):** Eliminate redundancy for transactional systems
- Use when: OLTP systems, frequent updates, strong consistency

**Data Vault 2.0:** Flexible model with complete audit trail
- Use when: Compliance requirements, multiple sources, agile warehousing

**Wide Tables:** Denormalized, optimized for columnar storage
- Use when: ML feature stores, data science notebooks, high-performance dashboards

**Decision Framework:**
- Analytical (BI) + Known queries → Dimensional (Star Schema)
- Transactional (OLTP) → Normalized (3NF)
- Compliance/Audit → Data Vault 2.0
- Data Science/ML → Wide Tables
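
As a minimal sketch of the dimensional pattern, the example below builds a tiny star schema in an in-memory SQLite database and runs a typical BI aggregate. The table names (`fact_sales`, `dim_product`, `dim_date`), the sample rows, and the helper names are all hypothetical and chosen only for illustration.

```python
import sqlite3


def build_star_schema() -> sqlite3.Connection:
    """Create one fact table and two dimension tables with sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
        CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
        CREATE TABLE fact_sales (
            product_key INTEGER REFERENCES dim_product(product_key),
            date_key INTEGER REFERENCES dim_date(date_key),
            amount REAL
        );
        INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
        INSERT INTO dim_date VALUES (20240101, 2024, 1), (20240201, 2024, 2);
        INSERT INTO fact_sales VALUES
            (1, 20240101, 10.0), (1, 20240201, 15.0), (2, 20240101, 30.0);
    """)
    return conn


def sales_by_category(conn: sqlite3.Connection) -> list:
    """Join the fact table to a dimension and aggregate, as a BI query would."""
    return conn.execute("""
        SELECT p.category, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_key = f.product_key
        GROUP BY p.category
        ORDER BY p.category
    """).fetchall()
```

A wide-table design would instead store `category`, `year`, and `month` directly on each sales row, trading redundancy for join-free reads.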

For detailed patterns, see [references/modeling-approaches.md](references/modeling-approaches.md).

### 3. Data Mesh Principles

Decentralized architecture for large organizations (>500 people).

**Four Core Principles:**
1. Domain-oriented decentralized data ownership
2. Data as a product (SLAs, quality, documentation)
3. Self-serve data infrastructure
4. Federated computational governance

**Readiness Assessment (Score 1-5 each):**
1. Domain clarity
2. Team maturity
3. Platform capability
4. Governance maturity
5. Scale need
6. Organizational buy-in

**Scoring (sum of the six scores, range 6-30):** 24-30: Strong candidate | 18-23: Hybrid | 12-17: Build foundation first | 6-11: Centralized
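
The readiness assessment reduces to summing six scores and mapping the total to a band. The sketch below assumes the six factor names as dictionary keys; the identifiers `FACTORS` and `assess_readiness` are hypothetical.

```python
FACTORS = (
    "domain_clarity",
    "team_maturity",
    "platform_capability",
    "governance_maturity",
    "scale_need",
    "organizational_buy_in",
)


def assess_readiness(scores: dict) -> tuple:
    """Sum six 1-5 scores and return (total, recommendation)."""
    missing = set(FACTORS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    for factor in FACTORS:
        if not 1 <= scores[factor] <= 5:
            raise ValueError(f"{factor} must be between 1 and 5")

    total = sum(scores[factor] for factor in FACTORS)
    if total >= 24:
        band = "Strong candidate"
    elif total >= 18:
        band = "Hybrid"
    elif total >= 12:
        band = "Build foundation first"
    else:
        band = "Centralized"
    return total, band
```

For example, an organization scoring 3 on every factor totals 18 and lands in the hybrid band.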

**Red Flags:** Small org (<100 people), unclear domains, no platform team, weak governance

For full guide, see [references/data-mesh-guide.md](references/data-mesh-guide.md).

### 4. Medallion Architecture

Standard lakehouse pattern: Bronze (raw) → Silver (cleaned) → Gold (business-level)

**Bronze Layer:** Exact copy of source data, immutable, append-only

**Silver Layer:** Validated, deduplicated, typed data

**Gold Layer:** Business logic, aggregates, dimensional models, ML features

**Data Quality by Layer:**
- Bronze → Silver: Schema validation, type checks, deduplication
- Silver → Gold: Business rule validation, referential integrity
- Gold: Anomaly detection, statistical checks
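
To make the layer transitions concrete, here is a minimal in-memory sketch of the bronze → silver → gold flow in plain Python. The record shape (`order_id`, `customer`, `amount`), the non-negative-amount business rule, and the function names are assumptions for illustration; a real pipeline would typically run on Spark, dbt, or a similar engine.

```python
from collections import defaultdict

REQUIRED_FIELDS = {"order_id", "customer", "amount"}


def to_silver(bronze_rows: list) -> list:
    """Bronze -> Silver: schema validation, type checks, deduplication."""
    seen_ids = set()
    silver = []
    for row in bronze_rows:
        if not REQUIRED_FIELDS <= row.keys():
            continue  # schema validation: drop rows missing required fields
        try:
            amount = float(row["amount"])  # type check: amount must be numeric
        except (TypeError, ValueError):
            continue
        if row["order_id"] in seen_ids:
            continue  # deduplication on the order key
        seen_ids.add(row["order_id"])
        silver.append(
            {"order_id": row["order_id"], "customer": row["customer"], "amount": amount}
        )
    return silver


def to_gold(silver_rows: list) -> dict:
    """Silver -> Gold: apply a business rule, then aggregate revenue per customer."""
    revenue = defaultdict(float)
    for row in silver_rows:
        if row["amount"] < 0:
            continue  # hypothetical business rule: ignore negative amounts
        revenue[row["customer"]] += row["amount"]
    return dict(revenue)


# Bronze stays an untouched, append-only copy of the source records.
BRONZE = [
    {"order_id": 1, "customer": "acme", "amount": "10.50"},
    {"order_id": 1, "customer": "acme", "amount": "10.50"},  # duplicate
    {"order_id": 2, "customer": "acme", "amount": "oops"},   # fails type check
    {"order_id": 3, "customer": "globex", "amount": "5"},
    {"customer": "globex", "amount": "7"},                   # missing order_id
]
```

Calling `to_gold(to_silver(BRONZE))` leaves `BRONZE` unchanged, which mirrors the immutability of the bronze layer.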

For patterns, see [references/medallion-pattern.md](references/medallion-pattern.md).

### 5. Open Table Formats

Enable ACID transactions on data lakes:

**Apache Iceberg:** Multi-engine, vendor-neutral (Context7: 79.7 score)
- Use when: Avoid vendor lock-in, multi-engine flexibility

**Delta Lake:** Databricks ecosystem, Spark-optimized
- Use when: Committed to Databricks

**Apache Hudi:** Optimized for CDC and frequent upserts
- Use when: CDC-heavy workloads

**Recommendation:** Apache Iceberg for new projects (vendor-neutral, broadest support)

For comparison, see [references/table-formats.md](references/table-formats.md).

### 6. Modern Data Stack

**Standard Layers:**
- Ingestion: Fivetran, Airbyte, Kafka
- Storage: Snowflake, Databricks, BigQuery
- Transformation: dbt (Context7: 87.0 score), Spark
- Orchestration: Airflow, Dagster, Prefect
- Visualization: Tableau, Looker, Power BI
- Governance: DataHub, Alation, Great Expectations

**Tool Selection:**
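- Fivetran vs Airbyte: Fivetran for managed, pre-built connectors; Airbyte (open source) when cost-sensitive
- Snowflake vs Databricks: Snowflake for BI-focused workloads; Databricks for ML-focused workloads
- dbt vs Spark: dbt for SQL-based transformations; Spark for large-scale processing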

For detailed recommendations, see [references/tool-recommendations.md](references/tool-recommendations.md) and [references/modern-data-stack.md](references/modern-data-stack.md).

### 7. Data Governance

**Data Catalog:** Searchable inventory (DataHub, Alation, Collibra)

**Data Lineage:** Track data flow (OpenLineage, Marquez)

**Data Quality:** Validation and testing (Great Expectations, Soda, dbt tests)

**Access Control:**
- RBAC: Role-based (e.g., a `sales_analyst` role)
- ABAC: Attribute-based (e.g., row-level security)
- Column-level: Dynamic data masking for PII
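
The three access-control layers can be combined in one read path. The following is a simplified sketch, not a production policy engine: the role map, the `region` attribute, the `pii_cleared` flag, and all function names are hypothetical.

```python
# RBAC: which tables each role may read (hypothetical roles and tables).
ROLE_PERMISSIONS = {
    "sales_analyst": {"sales.orders"},
    "data_engineer": {"sales.orders", "hr.salaries"},
}
PII_COLUMNS = {"email"}


def can_read(role: str, table: str) -> bool:
    """RBAC check: the role must be granted the table."""
    return table in ROLE_PERMISSIONS.get(role, set())


def visible_rows(user: dict, rows: list) -> list:
    """ABAC row-level security: users only see rows for their own region."""
    return [row for row in rows if row["region"] == user["region"]]


def mask_pii(user: dict, row: dict) -> dict:
    """Column-level dynamic masking: hide PII unless the user is cleared."""
    if user.get("pii_cleared"):
        return dict(row)
    return {k: ("***" if k in PII_COLUMNS else v) for k, v in row.items()}


def read_table(user: dict, table: str, rows: list) -> list:
    """Apply RBAC, then row-level filtering, then column masking."""
    if not can_read(user["role"], table):
        raise PermissionError(f"{user['role']} may not read {table}")
    return [mask_pii(user, row) for row in visible_rows(user, rows)]


SAMPLE_ROWS = [
    {"region": "EU", "email": "a@example.com", "total": 10},
    {"region": "US", "email": "b@example.com", "total": 20},
]
```

Checking the role first means a denied request never touches the rows, and masking last ensures PII is hidden even in rows the user is otherwise allowed to see.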

For governance patterns, see [references/governance-patterns.md](references/governance-patterns.md).

## Decision Frameworks

### Framework 1: Storage Paradigm Selection

**Step 1: Identify Primary Use Case**
- BI/Reporting only → Data Warehouse
- ML/AI primary → Data Lake or Lakehouse
- Mixed BI + ML → Data Lakehouse
- Exploratory → Data Lake

**Step 2: Evaluate Budget**
- High budget, known queries → Data Warehouse
- Cost-sensitive, f