This skill optimizes embedding generation for RAG and semantic search systems by providing frameworks for selecting cost-effective embedding models, implementing chunking strategies tailored to content types, and tuning performance parameters. Use it when building retrieval systems that need to balance embedding quality, API costs, and document processing efficiency across large corpora.
git clone --depth 1 https://github.com/ancoleman/ai-design-components /tmp/embedding-optimization && cp -r /tmp/embedding-optimization/skills/embedding-optimization ~/.claude/skills/embedding-optimization
# Embedding Optimization
Optimize embedding generation for cost, performance, and quality in RAG and semantic search systems.
## When to Use This Skill
Trigger this skill when:
- Building RAG (Retrieval Augmented Generation) systems
- Implementing semantic search or similarity detection
- Optimizing embedding API costs (reductions of up to 70-90%, depending on cache hit rate and model choice)
- Improving document retrieval quality through better chunking
- Processing large document corpora (thousands to millions of documents)
- Selecting between API-based vs. local embedding models
## Model Selection Framework
Choose the optimal embedding model based on requirements:
**Quick Recommendations:**
- **Startup/MVP:** `all-MiniLM-L6-v2` (local, 384 dims, zero API costs)
- **Production:** `text-embedding-3-small` (API, 1,536 dims, balanced quality/cost)
- **High Quality:** `text-embedding-3-large` (API, 3,072 dims, premium)
- **Multilingual:** `multilingual-e5-base` (local, 768 dims) or Cohere `embed-multilingual-v3.0`
For detailed decision frameworks including cost comparisons, quality benchmarks, and data privacy considerations, see `references/model-selection-guide.md`.
**Model Comparison Summary:**
| Model | Type | Dimensions | Cost per 1M tokens | Best For |
|-------|------|-----------|-------------------|----------|
| all-MiniLM-L6-v2 | Local | 384 | $0 (compute only) | High volume, tight budgets |
| BGE-base-en-v1.5 | Local | 768 | $0 (compute only) | Quality + cost balance |
| text-embedding-3-small | API | 1,536 | $0.02 | General purpose production |
| text-embedding-3-large | API | 3,072 | $0.13 | Premium quality requirements |
| embed-multilingual-v3.0 | API | 1,024 | $0.10 | 100+ language support |
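To compare models for a specific corpus, the per-token prices in the table can be turned into a rough one-time embedding cost. The following is a minimal sketch: the function name `estimate_api_cost` and the corpus size in the usage loop are illustrative assumptions, and compute cost for local models is not modeled.

```python
# Prices in USD per 1M tokens, copied from the comparison table above.
PRICE_PER_1M_TOKENS = {
    "all-MiniLM-L6-v2": 0.0,         # local model: compute cost not modeled
    "BGE-base-en-v1.5": 0.0,         # local model: compute cost not modeled
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "embed-multilingual-v3.0": 0.10,
}


def estimate_api_cost(model: str, num_chunks: int, avg_tokens_per_chunk: int) -> float:
    """Estimate the one-time API cost (USD) of embedding a corpus once."""
    total_tokens = num_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * PRICE_PER_1M_TOKENS[model]


if __name__ == "__main__":
    # Hypothetical corpus: 1M chunks averaging 200 tokens each.
    for model in PRICE_PER_1M_TOKENS:
        cost = estimate_api_cost(model, num_chunks=1_000_000, avg_tokens_per_chunk=200)
        print(f"{model:<26} ${cost:,.2f}")
```

Re-embedding after a model change or a chunking change incurs the same cost again, so the estimate is best read as a per-ingestion figure.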
## Chunking Strategies
Select chunking strategy based on content type and use case:
**Content Type → Strategy Mapping:**
- **Documentation:** Recursive (heading-aware), 800 chars, 100 overlap
- **Code:** Recursive (function-level), 1,000 chars, 100 overlap
- **Q&A/FAQ:** Fixed-size, 500 chars, 50 overlap (precise retrieval)
- **Legal/Technical:** Semantic (large), 1,500 chars, 200 overlap (context preservation)
- **Blog Posts:** Semantic (paragraph), 1,000 chars, 100 overlap
- **Academic Papers:** Recursive (section-aware), 1,200 chars, 150 overlap
For detailed chunking patterns, decision trees, and implementation guidance, see `references/chunking-strategies.md`.
**Quick Start with CLI:**
```bash
python scripts/chunk_document.py \
  --input document.txt \
  --content-type markdown \
  --chunk-size 800 \
  --overlap 100 \
  --output chunks.jsonl
```
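For readers who want to see what chunk size and overlap mean in practice, here is a minimal sketch of the simplest strategy in the mapping, fixed-size chunking with character overlap. It is not the implementation behind `scripts/chunk_document.py`; the function name `chunk_text` is a hypothetical helper, and the defaults mirror the Q&A/FAQ settings (500 chars, 50 overlap).

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each iteration
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text,
        # so no trailing chunk consists only of overlap.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    sample = "".join(chr(97 + i % 26) for i in range(1200))
    pieces = chunk_text(sample, chunk_size=500, overlap=50)
    print([len(p) for p in pieces])
```

Recursive and semantic strategies differ in where they place boundaries (headings, functions, paragraphs), but the size and overlap parameters play the same role.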
## Caching Implementation
Cut API costs through content-addressable caching. Savings track the cache hit rate, so reductions of 80-90% are reachable only when most inputs repeat across runs.
**Caching Architecture by Query Volume:**
- **<10K queries/month:** In-memory cache (Python `functools.lru_cache`)
- **10K-100K queries/month:** Redis (fast, TTL-based expiration)
- **100K-1M queries/month:** Redis (hot) + PostgreSQL (warm)
- **>1M queries/month:** Multi-tier (Redis + PostgreSQL + S3)
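Content-addressable caching means the cache key is derived from the content itself, so identical text embedded with the same model is never sent to the API twice. The sketch below illustrates the idea with an in-memory dictionary. The class `InMemoryEmbeddingCache`, the `emb:` key prefix, and the `embed_fn` callable are assumptions made for illustration, not the code in `scripts/cached_embedder.py`; a Redis or PostgreSQL tier would replace the dictionary.

```python
import hashlib
from typing import Callable

Vector = list[float]


def cache_key(model: str, text: str) -> str:
    """Derive a key from model name and text so identical inputs share an entry."""
    digest = hashlib.sha256(f"{model}\x00{text}".encode("utf-8")).hexdigest()
    return f"emb:{digest}"


class InMemoryEmbeddingCache:
    """Hypothetical wrapper that only calls `embed_fn` for texts not seen before."""

    def __init__(self, model: str, embed_fn: Callable[[list[str]], list[Vector]]):
        self.model = model
        self.embed_fn = embed_fn
        self.store: dict[str, Vector] = {}
        self.hits = 0
        self.misses = 0

    def embed_batch(self, texts: list[str]) -> list[Vector]:
        keys = [cache_key(self.model, t) for t in texts]
        pending: dict[str, str] = {}  # unique uncached texts, in first-seen order
        for key, text in zip(keys, texts):
            if key in self.store or key in pending:
                self.hits += 1    # cached earlier, or a duplicate within this batch
            else:
                pending[key] = text
                self.misses += 1
        if pending:
            vectors = self.embed_fn(list(pending.values()))
            self.store.update(zip(pending.keys(), vectors))
        return [self.store[key] for key in keys]
```

Including the model name in the key keeps vectors from different models apart, which matters when a pipeline switches models or runs two side by side.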
**Production Caching with Redis:**
```bash
# Embed documents with caching enabled
python scripts/cached_embedder.py \
  --model text-embedding-3-small \
  --input documents.jsonl \
  --output embeddings.npy \
  --cache-backend redis \
  --cache-ttl 2592000  # 30 days
```
**Caching ROI Example:**
- 50,000 document chunks
- 20% duplicate content
- Without caching: $0.50 API cost
- With caching (60% hit rate): $0.20 API cost
- **Savings: 60% ($0.30)**
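The arithmetic behind this example can be reproduced in a few lines. This is a sketch that takes the 60% hit rate as given and assumes cost scales linearly with the number of uncached requests; the function name `cost_with_cache` is illustrative.

```python
def cost_with_cache(baseline_cost_usd: float, hit_rate: float) -> float:
    """Cost after caching, assuming every cache hit avoids one paid embedding."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return baseline_cost_usd * (1.0 - hit_rate)


baseline = 0.50                      # USD, without caching (from the example above)
cached = cost_with_cache(baseline, hit_rate=0.60)
savings = baseline - cached
print(f"With caching: ${cached:.2f}, savings: ${savings:.2f} ({savings / baseline:.0%})")
```

Under this linear model the percentage saved equals the hit rate, so measuring the hit rate on real traffic is the most direct way to forecast savings.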
## Dimensionality Trade-offs
Balance storage, search speed, and quality:
| Dimensions | Storage (1M float32 vectors) | Search Speed (p95, indicative) | Quality | Use Case |
|-----------|---------------------|-------------------|---------|----------|
| 384 | 1.5 GB | 10ms | Good | Large-scale search |
| 768 | 3 GB | 15ms | High | General purpose RAG |
| 1,536 | 6 GB | 25ms | Very High | High-quality retrieval |
| 3,072 | 12 GB | 40ms | Highest | Premium applications |
**Key Insight:** For most RAG applications, 768 dimensions (BGE-base-en-v1.5 local or equivalent) provides the best quality/cost/speed balance.
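The storage column follows directly from vector count, dimensionality, and value width. The sketch below assumes 4-byte float32 values and counts raw vector data only; index structures and metadata add overhead that is not modeled. The function name `index_size_gb` is illustrative.

```python
def index_size_gb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in GB (10^9 bytes), excluding index overhead."""
    return num_vectors * dims * bytes_per_value / 1e9


for dims in (384, 768, 1536, 3072):
    print(f"{dims:>5} dims: {index_size_gb(1_000_000, dims):.1f} GB")
```

Halving the value width (for example float16, `bytes_per_value=2`) halves these figures, which is one reason quantization is often considered alongside dimensionality.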
## Batch Processing Optimization
Maximize throughput for large-scale ingestion:
**OpenAI API:**
- Batch up to 2,048 inputs per request
- Implement rate limiting (tier-dependent: 500-5,000 RPM)
- Use parallel requests with backoff on rate limits
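The batching and backoff points can be sketched as follows. This is a sequential version under stated assumptions: `embed_fn` stands for whatever function sends one batch to the API, and `RateLimitError` is a hypothetical stand-in for the rate-limit exception raised by the client library in use. Parallel requests would wrap the same retry loop in a worker pool.

```python
import random
import time
from typing import Callable, Iterator

MAX_INPUTS_PER_REQUEST = 2048  # per-request input limit stated above


class RateLimitError(Exception):
    """Hypothetical stand-in for the client's rate-limit exception."""


def batched(texts: list[str], size: int = MAX_INPUTS_PER_REQUEST) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` texts."""
    for start in range(0, len(texts), size):
        yield texts[start:start + size]


def embed_with_backoff(
    embed_fn: Callable[[list[str]], list[list[float]]],
    texts: list[str],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[list[float]]:
    """Embed all texts batch by batch, retrying rate-limited batches with backoff."""
    vectors: list[list[float]] = []
    for batch in batched(texts):
        for attempt in range(max_retries + 1):
            try:
                vectors.extend(embed_fn(batch))
                break
            except RateLimitError:
                if attempt == max_retries:
                    raise  # give up after the final retry
                # Exponential backoff with jitter: about 1s, 2s, 4s, ...
                sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
    return vectors
```

The `sleep` parameter is injectable so the retry logic can be exercised in tests without waiting.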
**Local Models (sentence-transformers):**
- GPU acceleration (CUDA, MPS for Apple Silicon)
- Batch size tuning (32-128 based on GPU memory)
- Multi-GPU support for maximum throughput
**Expected Throughput:**
- OpenAI API: 1,000-5,000 texts/minute (rate limit dependent)
- Local GPU (RTX 3090): 5,000-10,000 texts/minute
- Local CPU: 100-500 texts/minute
## Performance Monitoring
Track key metrics for optimization:
**Critical Metrics:**
- **Latency:** Embedding generation time (p50, p95, p99)
- **Throughput:** Embeddings per second/minute
- **Cost:** API usage tracking (USD per 1K/1M tokens)
- **Cache Efficiency:** Hit rate percentage
For detailed monitoring setup, metric collection patterns, and dashboarding, see `references/performance-monitoring.md`.
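As an illustration of the latency metric, the percentiles can be computed from recorded per-call timings with the standard library alone. This is a minimal sketch, not the logic of `scripts/performance_monitor.py`; the helper names `timed` and `latency_summary` are assumptions.

```python
import statistics
import time
from typing import Any, Callable


def timed(fn: Callable[..., Any], *args: Any) -> tuple[Any, float]:
    """Call fn(*args) and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000.0


def latency_summary(latencies_ms: list[float]) -> dict[str, float]:
    """Return p50/p95/p99 from recorded latencies (needs at least two samples)."""
    # quantiles(n=100) returns 99 cut points; index k-1 holds the k-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


if __name__ == "__main__":
    samples = []
    for _ in range(20):
        _, elapsed_ms = timed(sum, range(10_000))  # placeholder for an embed call
        samples.append(elapsed_ms)
    print(latency_summary(samples))
```

Tail percentiles (p95, p99) are usually more informative than the mean for embedding calls, because rate-limit retries and cold caches show up there first.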
**Monitor with Wrapper:**
```python
from scripts.performance_monitor import MonitoredEmbedder

# `your_embedder` and `texts` are placeholders for your own embedder instance
# and the list of strings to embed.
monitored = MonitoredEmbedder(
    embedder=your_embedder,
    cost_per_1k_tokens=0.00002,  # $0.02 per 1M tokens (text-embedding-3-small)
)

embeddings = monitored.embed_batch(texts)
metrics = monitored.get_metrics()

print(f"Cache hit rate: {metrics['cache_hit_rate_pct']}%")
print(f"Total cost: ${metrics['total_cost_usd']}")
```
## Working Examples
See the `examples/` directory for complete implementations:
**Python Examples:**
- `examples/openai_cached.py` - OpenAI embeddings with Redis caching
- `examples/local_embedder.py` - sentence-transformers local embedding
- `examples/smart_chunker.py` - Content-aware recursive chunking
- `examples/performance_monitor.py` - Pipeline performance tracking
- `examples/batch_processor.py` - Large-scale document processing
All examples include:
- Complete, runnable code
- Dependency installation instructions
- Error handling and retry logic
- Configuration options