Skip to main content
ClaudeWave
Skill28k repo starsupdated yesterday

nemo-curator

NeMo Curator is NVIDIA's GPU-accelerated toolkit for preparing training data for large language models, offering fast fuzzy deduplication, quality filtering with 30+ heuristics, PII redaction, and NSFW detection across text, image, video, and audio formats. Use it when preparing web-scraped datasets, deduplicating large corpora at scale, or filtering for quality and safety in multi-modal training data preparation.

Install in Claude Code
Copy
git clone --depth 1 https://github.com/davila7/claude-code-templates /tmp/nemo-curator && cp -r /tmp/nemo-curator/cli-tool/components/skills/ai-research/data-processing-nemo-curator ~/.claude/skills/nemo-curator
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# NeMo Curator - GPU-Accelerated Data Curation

NVIDIA's toolkit for preparing high-quality training data for LLMs.

## When to use NeMo Curator

**Use NeMo Curator when:**
- Preparing LLM training data from web scrapes (Common Crawl)
- Need fast deduplication (16× faster than CPU)
- Curating multi-modal datasets (text, images, video, audio)
- Filtering low-quality or toxic content
- Scaling data processing across GPU cluster

**Performance**:
- **16× faster** fuzzy deduplication (8TB RedPajama v2)
- **40% lower TCO** vs CPU alternatives
- **Near-linear scaling** across GPU nodes

**Use alternatives instead**:
- **datatrove**: CPU-based, open-source data processing
- **dolma**: Allen AI's data toolkit
- **Ray Data**: General ML data processing (no curation focus)

## Quick start

### Installation

```bash
# Text curation (CUDA 12)
uv pip install "nemo-curator[text_cuda12]"

# All modalities
uv pip install "nemo-curator[all_cuda12]"

# CPU-only (slower)
uv pip install "nemo-curator[cpu]"
```

### Basic text curation pipeline

```python
from nemo_curator import ScoreFilter, Modify
from nemo_curator.datasets import DocumentDataset
import pandas as pd

# Load data
df = pd.DataFrame({"text": ["Good document", "Bad doc", "Excellent text"]})
dataset = DocumentDataset(df)

# Quality filtering
def quality_score(doc):
    return len(doc["text"].split()) > 5  # Filter short docs

filtered = ScoreFilter(quality_score)(dataset)

# Deduplication
from nemo_curator.modules import ExactDuplicates
deduped = ExactDuplicates()(filtered)

# Save
deduped.to_parquet("curated_data/")
```

## Data curation pipeline

### Stage 1: Quality filtering

```python
from nemo_curator.filters import (
    WordCountFilter,
    RepeatedLinesFilter,
    UrlRatioFilter,
    NonAlphaNumericFilter
)

# Apply 30+ heuristic filters
from nemo_curator import ScoreFilter

# Word count filter
dataset = dataset.filter(WordCountFilter(min_words=50, max_words=100000))

# Remove repetitive content
dataset = dataset.filter(RepeatedLinesFilter(max_repeated_line_fraction=0.3))

# URL ratio filter
dataset = dataset.filter(UrlRatioFilter(max_url_ratio=0.2))
```

### Stage 2: Deduplication

**Exact deduplication**:
```python
from nemo_curator.modules import ExactDuplicates

# Remove exact duplicates
deduped = ExactDuplicates(id_field="id", text_field="text")(dataset)
```

**Fuzzy deduplication** (16× faster on GPU):
```python
from nemo_curator.modules import FuzzyDuplicates

# MinHash + LSH deduplication
fuzzy_dedup = FuzzyDuplicates(
    id_field="id",
    text_field="text",
    num_hashes=260,      # MinHash parameters
    num_buckets=20,
    hash_method="md5"
)

deduped = fuzzy_dedup(dataset)
```

**Semantic deduplication**:
```python
from nemo_curator.modules import SemanticDuplicates

# Embedding-based deduplication
semantic_dedup = SemanticDuplicates(
    id_field="id",
    text_field="text",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    threshold=0.8  # Cosine similarity threshold
)

deduped = semantic_dedup(dataset)
```

### Stage 3: PII redaction

```python
from nemo_curator.modules import Modify
from nemo_curator.modifiers import PIIRedactor

# Redact personally identifiable information
pii_redactor = PIIRedactor(
    supported_entities=["EMAIL_ADDRESS", "PHONE_NUMBER", "PERSON", "LOCATION"],
    anonymize_action="replace"  # or "redact"
)

redacted = Modify(pii_redactor)(dataset)
```

### Stage 4: Classifier filtering

```python
from nemo_curator.classifiers import QualityClassifier

# Quality classification
quality_clf = QualityClassifier(
    model_path="nvidia/quality-classifier-deberta",
    batch_size=256,
    device="cuda"
)

# Filter low-quality documents
high_quality = dataset.filter(lambda doc: quality_clf(doc["text"]) > 0.5)
```

## GPU acceleration

### GPU vs CPU performance

| Operation | CPU (16 cores) | GPU (A100) | Speedup |
|-----------|----------------|------------|---------|
| Fuzzy dedup (8TB) | 120 hours | 7.5 hours | 16× |
| Exact dedup (1TB) | 8 hours | 0.5 hours | 16× |
| Quality filtering | 2 hours | 0.2 hours | 10× |

### Multi-GPU scaling

```python
from nemo_curator import get_client
import dask_cuda

# Initialize GPU cluster
client = get_client(cluster_type="gpu", n_workers=8)

# Process with 8 GPUs
deduped = FuzzyDuplicates(...)(dataset)
```

## Multi-modal curation

### Image curation

```python
from nemo_curator.image import (
    AestheticFilter,
    NSFWFilter,
    CLIPEmbedder
)

# Aesthetic scoring
aesthetic_filter = AestheticFilter(threshold=5.0)
filtered_images = aesthetic_filter(image_dataset)

# NSFW detection
nsfw_filter = NSFWFilter(threshold=0.9)
safe_images = nsfw_filter(filtered_images)

# Generate CLIP embeddings
clip_embedder = CLIPEmbedder(model="openai/clip-vit-base-patch32")
image_embeddings = clip_embedder(safe_images)
```

### Video curation

```python
from nemo_curator.video import (
    SceneDetector,
    ClipExtractor,
    InternVideo2Embedder
)

# Detect scenes
scene_detector = SceneDetector(threshold=27.0)
scenes = scene_detector(video_dataset)

# Extract clips
clip_extractor = ClipExtractor(min_duration=2.0, max_duration=10.0)
clips = clip_extractor(scenes)

# Generate embeddings
video_embedder = InternVideo2Embedder()
video_embeddings = video_embedder(clips)
```

### Audio curation

```python
from nemo_curator.audio import (
    ASRInference,
    WERFilter,
    DurationFilter
)

# ASR transcription
asr = ASRInference(model="nvidia/stt_en_fastconformer_hybrid_large_pc")
transcribed = asr(audio_dataset)

# Filter by WER (word error rate)
wer_filter = WERFilter(max_wer=0.3)
high_quality_audio = wer_filter(transcribed)

# Duration filtering
duration_filter = DurationFilter(min_duration=1.0, max_duration=30.0)
filtered_audio = duration_filter(high_quality_audio)
```

## Common patterns

### Web scrape curation (Common Crawl)

```python
from nemo_curator import ScoreFilter, Modify
from nemo_curator.filters import *
from nemo_curator.modules import *
from nemo_
agent-expertSubagent

Use this agent when creating specialized Claude Code agents for the claude-code-templates components system. Specializes in agent design, prompt engineering, domain expertise modeling, and agent best practices. Examples: <example>Context: User wants to create a new specialized agent. user: 'I need to create an agent that specializes in React performance optimization' assistant: 'I'll use the agent-expert agent to create a comprehensive React performance agent with proper domain expertise and practical examples' <commentary>Since the user needs to create a specialized agent, use the agent-expert agent for proper agent structure and implementation.</commentary></example> <example>Context: User needs help with agent prompt design. user: 'How do I create an agent that can handle both frontend and backend security?' assistant: 'Let me use the agent-expert agent to design a full-stack security agent with proper domain boundaries and expertise areas' <commentary>The user needs agent development help, so use the agent-expert agent.</commentary></example>

blog-writerSubagent

Use this agent to create blog articles for aitmpl.com from Claude Code Templates components. Reads the component, asks the user to confirm details, generates SVG cover, HTML article, and updates blog-articles.json. Examples: <example>Context: User wants a blog for a component. user: 'Create a blog article for cli-tool/components/hooks/security/secret-scanner.json' assistant: 'I'll use the blog-writer agent to create the full blog article with cover image and proper structure' <commentary>The user wants a blog article from a component, use blog-writer for the full pipeline.</commentary></example>

build-checkerSubagent

Runs pre-deploy build checks on the dashboard. Validates Astro build, checks for common esbuild/JSX issues, verifies API endpoints compile, and reports errors with fixes. Use before merging PRs that touch dashboard/.

catalog-generatorSubagent

Regenerates the component catalog (docs/components.json) by running the Python script. Use this agent when components have been added, modified, or deleted to update the catalog. Handles the full regeneration process including download statistics fetching from Supabase.

cli-ui-designerSubagent

CLI interface design specialist. Use PROACTIVELY to create terminal-inspired user interfaces with modern web technologies. Expert in CLI aesthetics, terminal themes, and command-line UX patterns.

command-expertSubagent

Use this agent when creating CLI commands for the claude-code-templates components system. Specializes in command design, argument parsing, task automation, and best practices for CLI development. Examples: <example>Context: User wants to create a new CLI command. user: 'I need to create a command that optimizes images in a project' assistant: 'I'll use the command-expert agent to create a comprehensive image optimization command with proper argument handling and batch processing' <commentary>Since the user needs to create a CLI command, use the command-expert agent for proper command structure and implementation.</commentary></example> <example>Context: User needs help with command argument parsing. user: 'How do I create a command that accepts multiple file patterns?' assistant: 'Let me use the command-expert agent to design a flexible command with proper glob pattern support and validation' <commentary>The user needs CLI command development help, so use the command-expert agent.</commentary></example>

component-improverSubagent

Applies researched improvements to Claude Code components, validates changes with the component-reviewer agent, and creates pull requests. The only agent that modifies files and creates PRs.

component-migratorSubagent

Migrates components (agents, commands, skills, hooks, settings, MCPs) from external GitHub repositories to claude-code-templates, validates them with component-reviewer, and regenerates the catalog