
semtools

Semtools provides semantic search: it uses embedding-based similarity matching to find code and text by meaning rather than by exact keywords. Use it to discover functionality when you don't know specific names, to locate conceptually similar code across modules, or to search documents such as PDF and DOCX files by topic. It is best suited to exploration where keyword or syntax-based search falls short.

Install in Claude Code
git clone --depth 1 https://github.com/massgen/MassGen /tmp/semtools && cp -r /tmp/semtools/massgen/skills/semtools ~/.claude/skills/semtools
Then open a new Claude Code session; the skill loads automatically.

SKILL.md

# Semtools: Semantic Search

Perform semantic (meaning-based) search across code and documents using embedding-based similarity matching.

## Purpose

The semtools skill provides access to Semtools, a high-performance Rust-based CLI for semantic search and document processing. Traditional text search (ripgrep) matches literal strings or regex patterns, and structural search (ast-grep) matches syntax patterns; semtools instead matches on **semantic meaning** through embeddings.

**Key capabilities:**

1. **Semantic Search**: Find code/text by meaning, not just keywords
2. **Workspace Management**: Index large codebases for fast repeated searches
3. **Document Parsing**: Convert PDFs, DOCX, PPTX to searchable text (requires API key)

Semtools excels at **discovery** - finding relevant code when you don't know the exact keywords, function names, or syntax patterns.

## When to Use This Skill

Use the semtools skill when you need meaning-based search:

**Semantic Code Discovery:**
- Finding code that implements a concept ("error handling", "data validation")
- Discovering similar functionality across different modules
- Locating examples of a pattern when you don't know exact names
- Understanding what code does without reading everything

**Documentation & Knowledge:**
- Searching documentation by concept, not keywords
- Finding related discussions in comments or docs
- Discovering similar issues or solutions
- Analyzing technical documents (PDFs, reports)

**Use Cases:**
- "Find all authentication-related code" (without knowing function names)
- "Show me error handling patterns" (regardless of specific error types)
- "Find code similar to this implementation" (semantic similarity)
- "Search research papers for 'distributed consensus'" (document search)

**Choose semtools over file-search (ripgrep/ast-grep) when:**
- You know the **concept** but not the **keywords**
- Exact string matching misses relevant results
- You want semantically similar code, not exact matches
- Searching across languages or mixed content

**Still use file-search when:**
- You know exact keywords, function names, or patterns
- You need structural code matching (ast-grep)
- Speed is critical (ripgrep is faster for exact matches)
- You're searching for specific symbols or references

## Available Commands

Semtools provides three CLI commands you can use via `execute_command`:

- **`search`** - Semantic search across code and text files
- **`workspace`** - Manage workspaces for caching embeddings
- **`parse`** - Convert documents (PDF, DOCX, PPTX) to searchable text

**`search` and `workspace` work out of the box** in your execution environment. `parse` additionally requires the `LLAMA_CLOUD_API_KEY` environment variable to be set.

## Core Operations

### 1. Semantic Search (`search`)

Find files and code sections by semantic meaning:

```bash
# Basic semantic search
search "authentication logic" src/

# Search with more context (5 lines before/after)
search "error handling" --n-lines 5 src/

# Get more results (default: 3)
search "database queries" --top-k 10 src/

# Control the distance threshold (0.0-1.0, lower = stricter, higher = more lenient)
search "API endpoints" --max-distance 0.4 src/
```

**Parameters:**
- `--n-lines N`: Show N lines of context around matches (default: 3)
- `--top-k K`: Return top K most similar matches (default: 3)
- `--max-distance D`: Maximum embedding distance for a match (0.0-1.0, default: 0.3); lower values return only closer matches
- `-i`: Case-insensitive matching

**Output format** (closest match first; a lower score means a closer match):
```
Match 1 (similarity: 0.12)
File: src/auth/handlers.py
Lines: 42-47
----
def authenticate_user(username: str, password: str) -> Optional[User]:
    """Authenticate user credentials against database."""
    user = get_user_by_username(username)
    if user and verify_password(password, user.password_hash):
        return user
    return None
----

Match 2 (similarity: 0.18)
File: src/middleware/auth.py
...
```
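
When results need to feed another tool, the text output can be post-processed. The sketch below assumes `search` prints exactly the format shown above and that the value labelled `similarity` behaves like a distance (lower is closer); both are inferred from the sample, not confirmed. `filter_matches` and `sample_output` are hypothetical names, not part of Semtools.

```bash
# Hypothetical helper: turn search output (in the format shown above)
# into "score<TAB>file" lines, keeping only matches at or below a cutoff.
filter_matches() {
  # $1 = highest score to keep; the search output is read from stdin
  awk -v max="$1" '
    /^Match [0-9]+ \(similarity: / {
      score = $0
      sub(/^.*similarity: /, "", score)   # drop everything before the number
      sub(/\).*$/, "", score)             # drop the closing parenthesis
    }
    /^File: / {
      # "File: " is 6 characters, so the path starts at column 7
      if (score + 0 <= max + 0) print score "\t" substr($0, 7)
    }
  '
}

# Sample input copied from the output format above
sample_output='Match 1 (similarity: 0.12)
File: src/auth/handlers.py
Lines: 42-47
----
Match 2 (similarity: 0.18)
File: src/middleware/auth.py'

# Keep only matches scoring 0.15 or lower
printf '%s\n' "$sample_output" | filter_matches 0.15
```

In real use, the `printf` line would be replaced by piping the output of a `search` call into `filter_matches`.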

### 2. Workspace Management (`workspace`)

For large codebases, create workspaces to cache embeddings and enable fast repeated searches:

```bash
# Create/activate workspace
workspace use my-project

# Set workspace via environment variable
export SEMTOOLS_WORKSPACE=my-project

# Index files in workspace (workspace auto-detected from env var)
search "query" src/

# Check workspace status
workspace status

# Clean up old workspaces
workspace prune
```

**Benefits:**
- **Fast repeated searches**: Embeddings cached, no re-computation
- **Large codebases**: IVF_PQ indexing for scalability
- **Session persistence**: Maintain context across multiple searches

**When to use workspaces:**
- Searching the same codebase multiple times
- Very large projects (1000+ files)
- Interactive exploration sessions
- CI/CD pipelines with repeated searches
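
For repeated searches across several projects, one option is to derive the workspace name from the project directory so each codebase gets its own cache. This is a minimal sketch: `workspace_name` is a hypothetical helper, and the lowercase-and-dash naming rule is an assumption, not a Semtools requirement.

```bash
# Hypothetical helper: derive a workspace name from a project directory
workspace_name() {
  # Lowercase the directory's base name, then replace every character
  # outside [a-z0-9_-] (except the trailing newline) with "-"
  basename "$1" | tr '[:upper:]' '[:lower:]' | tr -c 'a-z0-9_\n-' '-'
}

# Use the current directory as the project root
SEMTOOLS_WORKSPACE="$(workspace_name "$PWD")"
export SEMTOOLS_WORKSPACE
echo "Using workspace: $SEMTOOLS_WORKSPACE"
```

Running `workspace use "$SEMTOOLS_WORKSPACE"` afterwards would then create or activate it, as in the commands above.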

### 3. Document Parsing (`parse`) ⚠️ Requires API Key

Convert documents to searchable markdown (requires LlamaParse API key):

```bash
# Parse PDFs to markdown
parse research_papers/*.pdf

# Parse Word documents
parse reports/*.docx

# Parse presentations
parse slides/*.pptx

# Parse and pipe to search
parse docs/*.pdf | xargs search "neural networks"
```

**Supported formats:**
- PDF (.pdf)
- Word (.docx)
- PowerPoint (.pptx)

**Configuration:**
```bash
# Via environment variable
export LLAMA_CLOUD_API_KEY="llx-..."

# Via config file
cat > ~/.parse_config.json << 'EOF'
{
  "api_key": "llx-...",
  "max_concurrent_requests": 10,
  "timeout_seconds": 3600
}
EOF
```

**Important:** Document parsing is **optional**. Semantic search works without it.
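
Because `parse` depends on a key while `search` does not, a script can check for credentials before attempting to parse. The sketch below only tests for the two configuration sources described above (the environment variable and `~/.parse_config.json`); `parse_key_configured` is a hypothetical helper, and it does not validate the key itself.

```bash
# Hypothetical pre-flight check: is a LlamaParse key configured?
parse_key_configured() {
  # True if the env var is non-empty or the config file exists
  [ -n "${LLAMA_CLOUD_API_KEY:-}" ] || [ -f "${HOME}/.parse_config.json" ]
}

if parse_key_configured; then
  echo "parse is configured"
else
  echo "parse is not configured; semantic search still works without it"
fi
```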

## Workflow Patterns

### Pattern 1: Concept Discovery

When you know what you're looking for conceptually but not by name:

```bash
# Step 1: Broad semantic search
search "rate limiting implementation" src/

# Step 2: Review results, refine query
search "throttle requests per user" src/ --top-k 10

# Step 3: Use ripgrep for exact follow-up
rg "RateLimiter" --type py src/
```
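
The hand-off from Step 2 to Step 3 needs concrete names to search for. One way to get them is to pull function and class names out of the matched snippets. This is a sketch under assumptions: it expects Python code in the results, formatted as in the output example earlier, and `extract_symbols` is a hypothetical helper.

```bash
# Hypothetical helper: list Python function/class names found in search output
extract_symbols() {
  # Print the identifier that follows "def" or "class", de-duplicated
  sed -n -E 's/^[[:space:]]*(def|class)[[:space:]]+([A-Za-z_][A-Za-z0-9_]*).*/\2/p' | sort -u
}

# Sample input modelled on the output format shown earlier
sample='Match 1 (similarity: 0.12)
File: src/auth/handlers.py
----
def authenticate_user(username: str, password: str) -> Optional[User]:
    user = get_user_by_username(username)
----'

printf '%s\n' "$sample" | extract_symbols
```

Each printed name can then go into an exact follow-up such as `rg -w authenticate_user --type py src/`.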

### Pattern 2: Similar Code Finder

When you want to find code similar to a reference implementation:

```bash
# Step 1: Extract key concepts
```