Skill390 repo starsupdated 7mo ago

embedding-optimization

This skill optimizes embedding generation for RAG and semantic search systems by providing frameworks for selecting cost-effective embedding models, implementing chunking strategies tailored to content types, and tuning performance parameters. Use it when building retrieval systems that need to balance embedding quality, API costs, and document processing efficiency across large corpora.

View source Repository: ai-design-components

Install in Claude Code

Copy

git clone --depth 1 https://github.com/ancoleman/ai-design-components /tmp/embedding-optimization && cp -r /tmp/embedding-optimization/skills/embedding-optimization ~/.claude/skills/embedding-optimization

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Embedding Optimization

Optimize embedding generation for cost, performance, and quality in RAG and semantic search systems.

## When to Use This Skill

Trigger this skill when:
- Building RAG (Retrieval Augmented Generation) systems
- Implementing semantic search or similarity detection
- Optimizing embedding API costs (reducing by 70-90%)
- Improving document retrieval quality through better chunking
- Processing large document corpora (thousands to millions of documents)
- Selecting between API-based vs. local embedding models

## Model Selection Framework

Choose the optimal embedding model based on requirements:

**Quick Recommendations:**
- **Startup/MVP:** `all-MiniLM-L6-v2` (local, 384 dims, zero API costs)
- **Production:** `text-embedding-3-small` (API, 1,536 dims, balanced quality/cost)
- **High Quality:** `text-embedding-3-large` (API, 3,072 dims, premium)
- **Multilingual:** `multilingual-e5-base` (local, 768 dims) or Cohere `embed-multilingual-v3.0`

For detailed decision frameworks including cost comparisons, quality benchmarks, and data privacy considerations, see `references/model-selection-guide.md`.

**Model Comparison Summary:**

| Model | Type | Dimensions | Cost per 1M tokens | Best For |
|-------|------|-----------|-------------------|----------|
| all-MiniLM-L6-v2 | Local | 384 | $0 (compute only) | High volume, tight budgets |
| BGE-base-en-v1.5 | Local | 768 | $0 (compute only) | Quality + cost balance |
| text-embedding-3-small | API | 1,536 | $0.02 | General purpose production |
| text-embedding-3-large | API | 3,072 | $0.13 | Premium quality requirements |
| embed-multilingual-v3.0 | API | 1,024 | $0.10 | 100+ language support |

## Chunking Strategies

Select chunking strategy based on content type and use case:

**Content Type → Strategy Mapping:**
- **Documentation:** Recursive (heading-aware), 800 chars, 100 overlap
- **Code:** Recursive (function-level), 1,000 chars, 100 overlap
- **Q&A/FAQ:** Fixed-size, 500 chars, 50 overlap (precise retrieval)
- **Legal/Technical:** Semantic (large), 1,500 chars, 200 overlap (context preservation)
- **Blog Posts:** Semantic (paragraph), 1,000 chars, 100 overlap
- **Academic Papers:** Recursive (section-aware), 1,200 chars, 150 overlap

For detailed chunking patterns, decision trees, and implementation guidance, see `references/chunking-strategies.md`.

**Quick Start with CLI:**
```bash
python scripts/chunk_document.py \
  --input document.txt \
  --content-type markdown \
  --chunk-size 800 \
  --overlap 100 \
  --output chunks.jsonl
```

## Caching Implementation

Achieve 80-90% cost reduction through content-addressable caching.

**Caching Architecture by Query Volume:**
- **<10K queries/month:** In-memory cache (Python `lru_cache`)
- **10K-100K queries/month:** Redis (fast, TTL-based expiration)
- **100K-1M queries/month:** Redis (hot) + PostgreSQL (warm)
- **>1M queries/month:** Multi-tier (Redis + PostgreSQL + S3)

**Production Caching with Redis:**
```bash
# Embed documents with caching enabled
python scripts/cached_embedder.py \
  --model text-embedding-3-small \
  --input documents.jsonl \
  --output embeddings.npy \
  --cache-backend redis \
  --cache-ttl 2592000  # 30 days
```

**Caching ROI Example:**
- 50,000 document chunks
- 20% duplicate content
- Without caching: $0.50 API cost
- With caching (60% hit rate): $0.20 API cost
- **Savings: 60% ($0.30)**

## Dimensionality Trade-offs

Balance storage, search speed, and quality:

| Dimensions | Storage (1M vectors) | Search Speed (p95) | Quality | Use Case |
|-----------|---------------------|-------------------|---------|----------|
| 384 | 1.5 GB | 10ms | Good | Large-scale search |
| 768 | 3 GB | 15ms | High | General purpose RAG |
| 1,536 | 6 GB | 25ms | Very High | High-quality retrieval |
| 3,072 | 12 GB | 40ms | Highest | Premium applications |

**Key Insight:** For most RAG applications, 768 dimensions (BGE-base-en-v1.5 local or equivalent) provides the best quality/cost/speed balance.

## Batch Processing Optimization

Maximize throughput for large-scale ingestion:

**OpenAI API:**
- Batch up to 2,048 inputs per request
- Implement rate limiting (tier-dependent: 500-5,000 RPM)
- Use parallel requests with backoff on rate limits

**Local Models (sentence-transformers):**
- GPU acceleration (CUDA, MPS for Apple Silicon)
- Batch size tuning (32-128 based on GPU memory)
- Multi-GPU support for maximum throughput

**Expected Throughput:**
- OpenAI API: 1,000-5,000 texts/minute (rate limit dependent)
- Local GPU (RTX 3090): 5,000-10,000 texts/minute
- Local CPU: 100-500 texts/minute

## Performance Monitoring

Track key metrics for optimization:

**Critical Metrics:**
- **Latency:** Embedding generation time (p50, p95, p99)
- **Throughput:** Embeddings per second/minute
- **Cost:** API usage tracking (USD per 1K/1M tokens)
- **Cache Efficiency:** Hit rate percentage

For detailed monitoring setup, metric collection patterns, and dashboarding, see `references/performance-monitoring.md`.

**Monitor with Wrapper:**
```python
from scripts.performance_monitor import MonitoredEmbedder

monitored = MonitoredEmbedder(
    embedder=your_embedder,
    cost_per_1k_tokens=0.00002  # OpenAI pricing
)

embeddings = monitored.embed_batch(texts)
metrics = monitored.get_metrics()
print(f"Cache hit rate: {metrics['cache_hit_rate_pct']}%")
print(f"Total cost: ${metrics['total_cost_usd']}")
```

## Working Examples

See `examples/` directory for complete implementations:

**Python Examples:**
- `examples/openai_cached.py` - OpenAI embeddings with Redis caching
- `examples/local_embedder.py` - sentence-transformers local embedding
- `examples/smart_chunker.py` - Content-aware recursive chunking
- `examples/performance_monitor.py` - Pipeline performance tracking
- `examples/batch_processor.py` - Large-scale document processing

All examples include:
- Complete, runnable code
- Dependency installation instructions
- Error handling and retry logic
- Configuration options

##