Skill · 110 repo stars · updated 7mo ago

ml-cv-specialist

The ml-cv-specialist skill provides decision frameworks and technical guidance for selecting machine learning and computer vision models, choosing between API-based and self-hosted solutions, and designing production ML systems. Use it when planning architecture for ML features, evaluating model options across text, vision, audio, or structured data tasks, or determining cost-effective deployment strategies for AI-powered applications.

Repository: claude-cto-team

Install in Claude Code

git clone --depth 1 https://github.com/alirezarezvani/claude-cto-team /tmp/ml-cv-specialist && cp -r /tmp/ml-cv-specialist/skills/ml-cv-specialist ~/.claude/skills/ml-cv-specialist

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# ML/CV Specialist

Provides specialized guidance for machine learning and computer vision system design, model selection, and production deployment.

## When to Use

- Selecting ML models for specific use cases
- Designing training and inference pipelines
- Optimizing ML system performance and cost
- Evaluating build vs. API for ML capabilities
- Planning data pipelines for ML workloads

## ML System Design Framework

### Model Selection Decision Tree

```
Use Case Identified
    │
    ├─► Text/Language Tasks
    │   ├─► Classification → BERT, DistilBERT, or API (OpenAI, Claude)
    │   ├─► Generation → GPT-4, Claude, Llama (self-hosted)
    │   ├─► Embeddings → OpenAI Ada, sentence-transformers
    │   └─► Search/RAG → Vector DB + Embeddings + LLM
    │
    ├─► Computer Vision Tasks
    │   ├─► Classification → ResNet, EfficientNet, ViT
    │   ├─► Object Detection → YOLOv8, DETR, Faster R-CNN
    │   ├─► Segmentation → SAM, Mask R-CNN, U-Net
    │   ├─► OCR → Tesseract, PaddleOCR, Cloud Vision API
    │   └─► Face Recognition → InsightFace, DeepFace
    │
    ├─► Audio Tasks
    │   ├─► Speech-to-Text → Whisper, DeepSpeech, Cloud APIs
    │   ├─► Text-to-Speech → ElevenLabs, Coqui TTS
    │   └─► Audio Classification → PANNs, AudioSet models
    │
    └─► Structured Data
        ├─► Tabular → XGBoost, LightGBM, CatBoost
        ├─► Time Series → Prophet, ARIMA, Transformer-based
        └─► Recommendations → Two-tower, matrix factorization
```
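
As a rough illustration, the tree above can be held as a nested lookup table so that a planning script can list candidates for a given task. The entries are copied from the tree; the dictionary layout, the key names, and the `candidates` function are assumptions made for this sketch, not part of the skill.

```python
# Candidate models copied from the decision tree above. The key names
# ("text", "search_rag", ...) are illustrative, not an established schema.
MODEL_CANDIDATES = {
    "text": {
        "classification": ["BERT", "DistilBERT", "API (OpenAI, Claude)"],
        "generation": ["GPT-4", "Claude", "Llama (self-hosted)"],
        "embeddings": ["OpenAI Ada", "sentence-transformers"],
        "search_rag": ["Vector DB + Embeddings + LLM"],
    },
    "vision": {
        "classification": ["ResNet", "EfficientNet", "ViT"],
        "object_detection": ["YOLOv8", "DETR", "Faster R-CNN"],
        "segmentation": ["SAM", "Mask R-CNN", "U-Net"],
        "ocr": ["Tesseract", "PaddleOCR", "Cloud Vision API"],
        "face_recognition": ["InsightFace", "DeepFace"],
    },
    "audio": {
        "speech_to_text": ["Whisper", "DeepSpeech", "Cloud APIs"],
        "text_to_speech": ["ElevenLabs", "Coqui TTS"],
        "audio_classification": ["PANNs", "AudioSet models"],
    },
    "structured": {
        "tabular": ["XGBoost", "LightGBM", "CatBoost"],
        "time_series": ["Prophet", "ARIMA", "Transformer-based"],
        "recommendations": ["Two-tower", "matrix factorization"],
    },
}


def candidates(domain: str, task: str) -> list[str]:
    """Return the tree's candidate models for a (domain, task) pair."""
    try:
        return MODEL_CANDIDATES[domain][task]
    except KeyError:
        raise ValueError(f"no entry for domain={domain!r}, task={task!r}") from None
```

For example, `candidates("vision", "object_detection")` returns the three detectors listed in the tree, and an unknown pair raises `ValueError` rather than guessing.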

---

## API vs. Self-Hosted Decision

### When to Use APIs vs. Self-Hosting

| Factor | API Preferred | Self-Hosted Preferred |
|--------|---------------|----------------------|
| **Volume** | < 10K requests/month | > 100K requests/month |
| **Latency** | > 500ms acceptable | < 100ms required |
| **Customization** | General use case | Domain-specific fine-tuning |
| **Data Privacy** | Non-sensitive data | PII, HIPAA, financial |
| **Team Expertise** | No ML engineers | ML team available |
| **Budget** | Predictable per-call costs | High volume justifies infra |
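
One way to apply the table is to count how many rows point toward self-hosting for a given workload. The sketch below does that with the thresholds from the table; the `Workload` fields, the `recommend` function, and the `min_votes` cutoff are hypothetical choices for illustration, and a real decision would weigh the factors rather than simply counting them.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a workload; field names are assumptions."""
    requests_per_month: int
    max_latency_ms: int
    needs_fine_tuning: bool
    sensitive_data: bool
    has_ml_engineers: bool


def recommend(w: Workload, min_votes: int = 3) -> tuple[str, list[str]]:
    """Count the table rows that favor self-hosting.

    min_votes is an arbitrary cutoff chosen for this sketch.
    """
    reasons = []
    if w.requests_per_month > 100_000:
        reasons.append("volume > 100K requests/month")
    if w.max_latency_ms < 100:
        reasons.append("latency < 100ms required")
    if w.needs_fine_tuning:
        reasons.append("domain-specific fine-tuning")
    if w.sensitive_data:
        reasons.append("PII, HIPAA, or financial data")
    if w.has_ml_engineers:
        reasons.append("ML team available")
    choice = "self-hosted" if len(reasons) >= min_votes else "api"
    return choice, reasons
```

A workload with 5K requests/month, relaxed latency, and no ML engineers collects no votes and comes out as `"api"`; the returned list of reasons makes the outcome easy to review.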

### Cost Comparison Framework

```markdown
## API Costs (Example: OpenAI GPT-4)
- Input: $0.03/1K tokens
- Output: $0.06/1K tokens
- Average request: 500 input + 200 output tokens
- Cost per request: $0.027
- 100K requests/month: $2,700

## Self-Hosted Costs (Example: Llama 70B)
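- GPU instance: $3/hour (A100 40GB; assumes an aggressively quantized model, since FP16 weights for 70B parameters need ~140 GB)
- Throughput: ~50 requests/minute = 3K/hour
- Cost per request: $0.001 (at full utilization)
- 100K requests/month: $100 GPU time (~33 busy hours) + $500 engineering time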

## Break-even Analysis
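- < ~50K requests/month: API likely cheaper (rule of thumb)
- > ~50K requests/month: Self-hosted may be cheaper
- With the figures above, the crossover ranges from ~19K requests/month (GPU billed only while busy) to ~100K (GPU running 24/7)
- Factor in: engineering time, ops burden, model quality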
```
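
The arithmetic in the framework above can be reproduced with a few lines of Python. This is a minimal sketch that uses only the example prices and throughput quoted above; the function names, the `always_on` flag, and the 730-hour month are assumptions added for illustration. It shows that the crossover point depends heavily on whether the GPU is billed only while serving requests or around the clock.

```python
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def api_cost_per_request(input_tokens=500, output_tokens=200,
                         input_price_per_1k=0.03, output_price_per_1k=0.06):
    """Per-request API cost from token counts and per-1K-token prices."""
    return (input_tokens / 1000 * input_price_per_1k
            + output_tokens / 1000 * output_price_per_1k)


def monthly_api_cost(requests):
    return requests * api_cost_per_request()


def monthly_self_hosted_cost(requests, gpu_per_hour=3.0, requests_per_hour=3000,
                             engineering=500.0, always_on=False):
    """GPU cost plus a flat engineering cost.

    always_on=False bills only the hours needed to serve the requests
    (the assumption behind the $0.001/request figure above);
    always_on=True bills one GPU for the whole month.
    """
    gpu_hours = HOURS_PER_MONTH if always_on else requests / requests_per_hour
    return gpu_hours * gpu_per_hour + engineering


def break_even_requests(gpu_per_hour=3.0, requests_per_hour=3000,
                        engineering=500.0, always_on=False):
    """Monthly request count at which API and self-hosted costs are equal."""
    api = api_cost_per_request()
    if always_on:
        return (HOURS_PER_MONTH * gpu_per_hour + engineering) / api
    gpu_per_request = gpu_per_hour / requests_per_hour
    return engineering / (api - gpu_per_request)
```

Under these example figures, `break_even_requests()` lands at roughly 19K requests/month and `break_even_requests(always_on=True)` at roughly 100K, so utilization is worth estimating before comparing the two options.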

---

## Training Pipeline Architecture

### Standard ML Pipeline

```
┌─────────────────────────────────────────────────────────────┐
│                    DATA LAYER                                │
├─────────────────────────────────────────────────────────────┤
│  Data Sources → ETL → Feature Store → Training Data         │
│  (S3, DBs)     (Airflow)  (Feast)     (Versioned)          │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                  TRAINING LAYER                              │
├─────────────────────────────────────────────────────────────┤
│  Experiment Tracking → Training Jobs → Model Registry       │
│  (MLflow, W&B)         (SageMaker)    (MLflow, S3)         │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                  SERVING LAYER                               │
├─────────────────────────────────────────────────────────────┤
│  Model Server → Load Balancer → Monitoring                  │
│  (TorchServe)   (K8s/ELB)      (Prometheus)                │
└─────────────────────────────────────────────────────────────┘
```

### Component Selection Guide

| Component | Options | Recommendation |
|-----------|---------|----------------|
| **Feature Store** | Feast, Tecton, SageMaker Feature Store | Feast (open source), Tecton (enterprise) |
| **Experiment Tracking** | MLflow, Weights & Biases, Neptune | MLflow (free), W&B (best UX) |
| **Training Orchestration** | Kubeflow, SageMaker, Vertex AI | SageMaker (AWS), Vertex (GCP) |
| **Model Registry** | MLflow, SageMaker, custom S3 | MLflow (standard) |
| **Model Serving** | TorchServe, TFServing, Triton | Triton (multi-framework) |

---

## Inference Architecture Patterns

### Pattern 1: Synchronous API

Best for: Low-latency requirements, simple integration

```
Client → API Gateway → Model Server → Response
                           │
                      Load Balancer
                           │
                    ┌──────┴──────┐
                    │             │
                Model Pod    Model Pod
```

**Latency targets**:
- P50: < 100ms
- P95: < 300ms
- P99: < 500ms
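
To check measured latencies against the targets above, percentiles can be computed with the standard library. The sketch below assumes latencies have already been collected as a list of milliseconds; `TARGETS_MS` mirrors the list above, while `latency_report` and its return shape are hypothetical names chosen for this example.

```python
import statistics

# Targets copied from the list above, in milliseconds.
TARGETS_MS = {"p50": 100, "p95": 300, "p99": 500}


def latency_report(samples_ms):
    """Return {percentile: (observed_ms, meets_target)} for the three targets.

    statistics.quantiles with n=100 returns 99 cut points, so index k-1
    holds the k-th percentile. At least two samples are required.
    """
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    observed = {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
    return {name: (value, value < TARGETS_MS[name])
            for name, value in observed.items()}
```

A service where 90% of requests take 50 ms and 10% take 1000 ms passes the P50 target but fails P95 and P99, which is the kind of tail problem an average would hide.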

### Pattern 2: Asynchronous Processing

Best for: Long-running inference, batch processing

```
Client → API → Queue (SQS) → Worker → Result Store → Webhook/Poll
                                          │
                                     S3/Redis
```

**Use when**:
- Inference > 5 seconds
- Batch processing required
- Variable load patterns
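
The submit, work, and poll flow of this pattern can be sketched in-process with the standard library. Here `queue.Queue` stands in for SQS and a dictionary stands in for the S3/Redis result store; `run_inference`, `submit`, `worker`, and `poll` are hypothetical names, and a production system would use durable services rather than in-memory objects.

```python
import queue
import threading
import uuid

jobs = queue.Queue()            # in-memory stand-in for SQS
results = {}                    # in-memory stand-in for S3/Redis
results_lock = threading.Lock()


def run_inference(payload):
    """Hypothetical placeholder for a slow model call."""
    return payload.upper()


def submit(payload) -> str:
    """Enqueue a job and return its id immediately, without waiting."""
    job_id = str(uuid.uuid4())
    jobs.put((job_id, payload))
    return job_id


def worker():
    """Process jobs until a None sentinel is received."""
    while True:
        item = jobs.get()
        try:
            if item is None:
                return
            job_id, payload = item
            try:
                record = {"status": "done", "output": run_inference(payload)}
            except Exception as exc:
                record = {"status": "failed", "error": str(exc)}
            with results_lock:
                results[job_id] = record
        finally:
            jobs.task_done()


def poll(job_id):
    """Return the stored result, or a pending marker if not finished yet."""
    with results_lock:
        return results.get(job_id, {"status": "pending"})
```

A caller would start `threading.Thread(target=worker, daemon=True)`, call `submit(...)` to get a job id, and then call `poll(job_id)` until the status is no longer `"pending"`; putting `None` on the queue stops the worker.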

### Pattern 3: Edge Inference

Best for: Privacy, offline capability, ultra-low latency

```
┌─────────────────────────────────────────┐
│              EDGE DEVICE                 │
│  ┌─────────┐    ┌─────────────────────┐ │
│  │ Camera  │───▶│ Optimized Model     │ │
│  └─────────┘    │ (ONNX, TFLite)      │ │
│                 └─────────────────────┘ │
│                          │              │
│                     Local Result        │
└─────────────────────────────────────────┘
                           │
                    Sync to Cloud
                    (non-blocking)
```

**Model optimization for edge**:
- Quantization (INT8): 4x smaller, 2-