ClaudeWave
Skill · 374 repo stars · updated 6mo ago

ai-data-engineering

This skill provides architecture patterns and implementation guidance for building AI/ML data infrastructure including RAG pipelines, feature stores, and embedding generation systems. Use it when constructing retrieval-augmented generation systems, setting up semantic search with vector databases, implementing ML feature stores for real-time serving, orchestrating data workflows with tools like Dagster or Airflow, or evaluating RAG quality using metrics frameworks. Skip this skill for traditional CRUD applications or simple key-value storage needs.

Install in Claude Code
git clone --depth 1 https://github.com/ancoleman/ai-design-components /tmp/ai-data-engineering && cp -r /tmp/ai-data-engineering/skills/ai-data-engineering ~/.claude/skills/ai-data-engineering
Then open a new Claude Code session; the skill loads automatically.

SKILL.md

# AI Data Engineering

## Purpose

Build data infrastructure for AI/ML systems including RAG pipelines, feature stores, and embedding generation. Provides architecture patterns, orchestration workflows, and evaluation metrics for production AI applications.

## When to Use

**Use this skill when:**
- Building RAG (Retrieval-Augmented Generation) pipelines
- Implementing semantic search or vector databases
- Setting up ML feature stores for real-time serving
- Creating embedding generation pipelines
- Evaluating RAG quality with RAGAS metrics
- Orchestrating data workflows for AI systems
- Integrating with frontend skills (ai-chat, search-filter)

**Skip this skill if:**
- Building traditional CRUD applications (use databases-relational)
- Simple key-value storage (use databases-nosql)
- No AI/ML components in the application

## RAG Pipeline Architecture

RAG pipelines have 5 distinct stages. Understanding this architecture is critical for production implementations.

```
┌─────────────────────────────────────────────────────────────┐
│                    RAG Pipeline (5 Stages)                   │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  1. INGESTION → Load documents (PDF, DOCX, Markdown)        │
│  2. INDEXING → Chunk (512 tokens) + Embed + Store           │
│  3. RETRIEVAL → Query embedding + Vector search + Filters   │
│  4. GENERATION → Context injection + LLM streaming          │
│  5. EVALUATION → RAGAS metrics (faithfulness, relevancy)    │
│                                                              │
└─────────────────────────────────────────────────────────────┘
```

**For complete RAG architecture with implementation patterns, see:**
- `references/rag-architecture.md` - Detailed 5-stage breakdown
- `examples/langchain-rag/basic_rag.py` - Working implementation
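
To make the five stages concrete, here is a minimal, dependency-free sketch. It is not the implementation in `examples/langchain-rag/basic_rag.py`: the bag-of-words `embed` function stands in for a real embedding model, the in-memory `index` list stands in for a vector database, sentence splitting stands in for 512-token chunking, and the sample documents and all helper names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Toy embedding: a term-count vector (stand-in for a real embedding model)."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# 1. INGESTION: documents already loaded as plain text (hypothetical samples)
documents = {
    "geo.md": "Paris is the capital of France. Berlin is the capital of Germany.",
    "food.md": "Sourdough bread needs flour, water, salt, and a starter.",
}

# 2. INDEXING: chunk (one sentence per chunk here), embed, store
index = []
for source, text in documents.items():
    for sentence in re.split(r"(?<=\.)\s+", text):
        index.append({"source": source, "text": sentence, "vector": embed(sentence)})


# 3. RETRIEVAL: embed the query, apply an optional metadata filter, rank by similarity
def retrieve(query, k=2, source=None):
    query_vector = embed(query)
    candidates = [c for c in index if source is None or c["source"] == source]
    ranked = sorted(candidates, key=lambda c: cosine(query_vector, c["vector"]), reverse=True)
    return ranked[:k]


# 4. GENERATION: inject retrieved context into the prompt that would go to the LLM
def build_prompt(query, chunks):
    context = "\n".join(f"- {c['text']}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


query = "What is the capital of France?"
top = retrieve(query, k=1)
prompt = build_prompt(query, top)

# 5. EVALUATION: crude retrieval check (a stand-in for RAGAS metrics)
retrieval_hit = any("Paris" in c["text"] for c in top)

print(prompt)
print("retrieval hit:", retrieval_hit)
```

In a real pipeline each stage would be swapped for a production component, but the data flow between stages stays the same.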

## Chunking Strategies

Chunking is the most critical decision for RAG quality. Poor chunking breaks retrieval.

**Default Recommendation:**
- **Size:** 512 tokens
- **Overlap:** 50-100 tokens
- **Method:** Fixed token-based

**Why these values:**
- Too small (<256 tokens): Loses context, requires many retrievals
- Too large (>1024 tokens): Includes irrelevant content, hits token limits
- Overlap prevents information loss at chunk boundaries
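
The default above (fixed-size chunks with overlap) can be sketched as a sliding window over a token list. This is an illustrative sketch, not the skill's own code: `chunk_tokens` is a hypothetical helper, and the synthetic `tok0 ... tok1199` list stands in for the output of a real tokenizer.

```python
def chunk_tokens(tokens, size=512, overlap=50):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the window already reached the end of the document
    return chunks


# Stand-in for real tokenizer output: 1200 synthetic tokens
tokens = [f"tok{i}" for i in range(1200)]
chunks = chunk_tokens(tokens, size=512, overlap=50)

print([len(c) for c in chunks])           # [512, 512, 276]
print(chunks[0][-50:] == chunks[1][:50])  # True: the boundary tokens appear in both chunks
```

The last 50 tokens of each chunk are repeated at the start of the next one, so a sentence that straddles a boundary is still intact in at least one chunk, provided it is shorter than the overlap.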

**Alternative strategies for special cases:**

```python
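# Code-aware chunking (prefers splitting at function/class boundaries).
# Note: chunk_size and chunk_overlap count characters here, not tokens.
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=512,
    chunk_overlap=50
)

# Semantic chunking (splits on meaning, not tokens).
# SemanticChunker ships in the langchain-experimental package.
# `embeddings` is any LangChain embeddings object, such as the ones
# created in the "Embedding Generation" section.
from langchain_experimental.text_splitter import SemanticChunker

semantic_splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile"  # Split at semantic boundaries
)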
```

**See:** `references/chunking-strategies.md` for complete decision framework

## Embedding Generation

Embedding quality directly impacts retrieval accuracy. This skill treats Voyage AI as the best-in-class provider, based on the MTEB scores listed below.

**Primary Recommendation: Voyage AI voyage-3**
- Dimensions: 1024
- MTEB Score: 69.0 (highest as of Dec 2025)
- Cost: $$$ but 9.74% better than OpenAI
- Use for: Production systems requiring best retrieval quality

**Cost-Effective Alternative: OpenAI text-embedding-3-small**
- Dimensions: 1536
- MTEB Score: 62.3
- Cost: $ (5x cheaper than voyage-3)
- Use for: Development, prototyping, cost-sensitive applications
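
The dimension counts above also affect index size. The following back-of-the-envelope sketch compares raw vector storage for the two models. It assumes vectors are stored as uncompressed float32 values and ignores index overhead and metadata; `index_size_gib` and the one-million-vector corpus are hypothetical.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors stored as uncompressed float32


def index_size_gib(num_vectors, dimensions):
    """Raw vector storage in GiB, excluding index structures and metadata."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32 / 2**30


for name, dims in [("voyage-3", 1024), ("text-embedding-3-small", 1536)]:
    size = index_size_gib(1_000_000, dims)
    print(f"{name}: {size:.2f} GiB per million vectors")
# voyage-3: 3.81 GiB per million vectors
# text-embedding-3-small: 5.72 GiB per million vectors
```

Under these assumptions the 1536-dimension model needs 1.5x the raw storage of the 1024-dimension one, which is worth weighing alongside the per-token API cost.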

**Implementation:**

```python
from langchain_voyageai import VoyageAIEmbeddings
from langchain_openai import OpenAIEmbeddings

# Production (best quality)
embeddings = VoyageAIEmbeddings(
    model="voyage-3",
    voyage_api_key="your-api-key"
)

# Development (cost-effective)
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    openai_api_key="your-api-key"
)
```

**See:** `references/embedding-strategies.md` for complete provider comparison

## RAGAS Evaluation Metrics

Traditional metrics (BLEU, ROUGE) don't measure RAG quality. RAGAS provides LLM-as-judge evaluation.

**4 Core Metrics:**

| Metric | Measures | Good Score |
|--------|----------|------------|
| **Faithfulness** | Factual consistency with retrieved context | > 0.8 |
| **Answer Relevancy** | Does answer address the user's question? | > 0.7 |
| **Context Precision** | Are retrieved chunks actually relevant? | > 0.6 |
| **Context Recall** | Were all necessary chunks retrieved? | > 0.7 |
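
The "Good Score" column can be turned into a simple quality gate, for example in CI. This is a sketch only: `failing_metrics` is a hypothetical helper, the metric keys are assumed to match the names used in the code further down, and the sample scores are made up for illustration.

```python
# Thresholds copied from the "Good Score" column of the table above
THRESHOLDS = {
    "faithfulness": 0.8,
    "answer_relevancy": 0.7,
    "context_precision": 0.6,
    "context_recall": 0.7,
}


def failing_metrics(scores, thresholds=THRESHOLDS):
    """Return metrics that are missing or not strictly above their threshold."""
    return {
        name: scores.get(name)
        for name, minimum in thresholds.items()
        if scores.get(name) is None or scores[name] <= minimum
    }


# Made-up scores for illustration
scores = {
    "faithfulness": 0.91,
    "answer_relevancy": 0.74,
    "context_precision": 0.55,
    "context_recall": 0.82,
}

print(failing_metrics(scores))  # {'context_precision': 0.55}
```

A low context precision with otherwise healthy scores, as in this made-up run, would point at the retrieval stage (chunking, filters, or top-k) rather than at generation.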

**Quick evaluation script:**

```bash
# Run RAGAS evaluation (TOKEN-FREE script execution)
python scripts/evaluate_rag.py --dataset eval_data.json --output results.json
```

**Manual implementation:**

```python
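from datasets import Dataset  # Hugging Face `datasets` package
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# evaluate() expects a Dataset rather than a plain dict.
# Column names follow the ragas 0.1-style schema; newer releases may differ.
dataset = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital of France."],
    "contexts": [["France's capital is Paris."]],
    "ground_truth": ["Paris"]
})

# Requires credentials for the judge LLM (OpenAI by default).
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
print(f"Faithfulness: {result['faithfulness']}")
print(f"Answer Relevancy: {result['answer_relevancy']}")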
```

**See:** `references/evaluation-metrics.md` for complete RAGAS implementation guide

## Feature Stores

Feature stores solve the "training-serving skew" problem by providing consistent feature computation.
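
Training-serving skew appears when training and serving compute "the same" feature with different code paths. The sketch below shows the underlying idea in plain Python, independent of any feature store: one function defines the feature logic, and both paths call it. The function name, the order records, and the `avg_order_value` feature are hypothetical; `total_orders` mirrors the feature name used in the Feast example in this section.

```python
from datetime import datetime, timezone


def compute_user_features(orders, as_of):
    """Single definition of the feature logic, shared by training and serving."""
    past = [o for o in orders if o["created_at"] <= as_of]  # never look into the future
    total = sum(o["amount"] for o in past)
    return {
        "total_orders": len(past),
        "avg_order_value": total / len(past) if past else 0.0,
    }


# Hypothetical order history for one user
orders = [
    {"created_at": datetime(2025, 1, 5, tzinfo=timezone.utc), "amount": 20.0},
    {"created_at": datetime(2025, 2, 1, tzinfo=timezone.utc), "amount": 40.0},
    {"created_at": datetime(2025, 3, 9, tzinfo=timezone.utc), "amount": 60.0},
]

# Training: point-in-time features as of a historical label timestamp
training_row = compute_user_features(orders, as_of=datetime(2025, 2, 15, tzinfo=timezone.utc))

# Serving: the same function, evaluated at request time
serving_row = compute_user_features(orders, as_of=datetime(2025, 4, 1, tzinfo=timezone.utc))

print(training_row)  # {'total_orders': 2, 'avg_order_value': 30.0}
print(serving_row)   # {'total_orders': 3, 'avg_order_value': 40.0}
```

A feature store generalizes this pattern: feature definitions are registered once, then materialized to an offline store for training and an online store for low-latency serving.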

**Primary Recommendation: Feast** - Open source, with pluggable online and offline stores (PostgreSQL, Redis, DynamoDB, S3, BigQuery, Snowflake)

**Basic usage:**

```python
from feast import FeatureStore
store = FeatureStore(repo_path="feature_repo/")

# Online serving (low-latency)
features = store.get_online_features(
    features=["user_features:total_orders"],
    entity_rows=[{"user_id": 1001}]
).to_dict()
```

**See:** `references/feature-stores.md` for complete Feast setup and alternatives (Tecton, Hopsworks)

## LangChain Orchestration

LangChain is the primary framework for LLM orchestration with the largest ecosys