semtools

Semtools provides semantic search: it uses embedding-based similarity matching to find code and text by meaning rather than by exact keywords. Use it to discover functionality when you don't know specific names, to locate conceptually similar code across modules, or to search documents such as PDFs and DOCX files by topic. It is best suited to exploration where keyword or syntax-based search falls short.

View source Repository: MassGen

Install in Claude Code

git clone --depth 1 https://github.com/massgen/MassGen /tmp/semtools && cp -r /tmp/semtools/massgen/skills/semtools ~/.claude/skills/semtools

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Semtools: Semantic Search

Perform semantic (meaning-based) search across code and documents using embedding-based similarity matching.

## Purpose

The semtools skill provides access to Semtools, a high-performance Rust-based CLI for semantic search and document processing. Unlike traditional text search (ripgrep), which matches literal strings or regular expressions, or structural search (ast-grep), which matches syntax patterns, semtools matches on **semantic meaning** through embeddings.

**Key capabilities:**

1. **Semantic Search**: Find code/text by meaning, not just keywords
2. **Workspace Management**: Index large codebases for fast repeated searches
3. **Document Parsing**: Convert PDFs, DOCX, PPTX to searchable text (requires API key)

Semtools excels at **discovery** - finding relevant code when you don't know the exact keywords, function names, or syntax patterns.

## When to Use This Skill

Use the semtools skill when you need meaning-based search:

**Semantic Code Discovery:**
- Finding code that implements a concept ("error handling", "data validation")
- Discovering similar functionality across different modules
- Locating examples of a pattern when you don't know exact names
- Understanding what code does without reading everything

**Documentation & Knowledge:**
- Searching documentation by concept, not keywords
- Finding related discussions in comments or docs
- Discovering similar issues or solutions
- Analyzing technical documents (PDFs, reports)

**Use Cases:**
- "Find all authentication-related code" (without knowing function names)
- "Show me error handling patterns" (regardless of specific error types)
- "Find code similar to this implementation" (semantic similarity)
- "Search research papers for 'distributed consensus'" (document search)

**Choose semtools over file-search (ripgrep/ast-grep) when:**
- You know the **concept** but not the **keywords**
- Exact string matching misses relevant results
- You want semantically similar code, not exact matches
- Searching across languages or mixed content

**Still use file-search when:**
- You know exact keywords, function names, or patterns
- You need structural code matching (ast-grep)
- Speed is critical (ripgrep is faster for exact matches)
- You're searching for specific symbols or references
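
The choice between the two tools can be sketched as a small shell helper: try an exact ripgrep match first, and fall back to semantic search only when nothing matches. This is a minimal sketch, not part of semtools; the function name `find_code` is hypothetical, and it assumes both `rg` and `search` are on `PATH`.

```bash
# Hypothetical helper (not part of semtools): exact match first, semantic fallback.
# Usage: find_code <term-or-concept> [path]
find_code() {
  local query="$1" path="${2:-.}"
  # rg exits 0 only when at least one line matches; -F treats the query as a literal string.
  if rg --quiet -F -- "$query" "$path"; then
    echo "exact matches for '$query':"
    rg -n -F -- "$query" "$path"
  else
    echo "no exact match, trying semantic search for '$query':"
    search "$query" "$path"
  fi
}
```

For example, `find_code RateLimiter src/` would stay with ripgrep if the symbol exists, while `find_code "throttle requests per user" src/` would most likely fall through to `search`.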

## Available Commands

Semtools provides three CLI commands you can use via `execute_command`:

- **`search`** - Semantic search across code and text files
- **`workspace`** - Manage workspaces for caching embeddings
- **`parse`** - Convert documents (PDF, DOCX, PPTX) to searchable text

**`search` and `workspace` work out of the box** in your execution environment. `parse` additionally requires the `LLAMA_CLOUD_API_KEY` environment variable to be set.

## Core Operations

### 1. Semantic Search (`search`)

Find files and code sections by semantic meaning:

```bash
# Basic semantic search
search "authentication logic" src/

# Search with more context (5 lines before/after)
search "error handling" --n-lines 5 src/

# Get more results (default: 3)
search "database queries" --top-k 10 src/

# Control the distance threshold (0.0-1.0, lower = stricter)
search "API endpoints" --max-distance 0.4 src/
```

**Parameters:**
- `--n-lines N`: Show N lines of context around matches (default: 3)
- `--top-k K`: Return top K most similar matches (default: 3)
- `--max-distance D`: Maximum embedding distance (0.0-1.0, default: 0.3)
- `-i`: Case-insensitive matching
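
If these flags are assembled by a script, it may help to validate the values before invoking `search`. The wrapper below is a minimal sketch under stated assumptions: `semsearch` and `SEMSEARCH_DRY_RUN` are hypothetical names, the defaults and the 0.0-1.0 range are taken from the parameter list above, and `--max-distance` is passed only when the caller supplies it.

```bash
# Hypothetical wrapper (not part of semtools): validate flag values before calling `search`.
# Usage: semsearch <query> <path> [top_k] [n_lines] [max_distance]
# Set SEMSEARCH_DRY_RUN=1 to print the command instead of running it.
semsearch() {
  local query="$1" path="$2" top_k="${3:-3}" n_lines="${4:-3}" max_distance="${5:-}"
  local value

  # --top-k and --n-lines take non-negative integers.
  for value in "$top_k" "$n_lines"; do
    case "$value" in
      *[!0-9]*) echo "semsearch: '$value' is not a non-negative integer" >&2; return 2 ;;
    esac
  done

  # Build the command in the positional parameters so arguments keep their quoting.
  set -- search "$query" --top-k "$top_k" --n-lines "$n_lines"

  # Pass --max-distance only when given, and only inside the documented 0.0-1.0 range.
  if [ -n "$max_distance" ]; then
    if ! awk -v d="$max_distance" 'BEGIN { exit !(d ~ /^[0-9]*\.?[0-9]+$/ && d + 0 <= 1) }'; then
      echo "semsearch: max distance '$max_distance' is outside 0.0-1.0" >&2
      return 2
    fi
    set -- "$@" --max-distance "$max_distance"
  fi
  set -- "$@" "$path"

  if [ "${SEMSEARCH_DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}
```

With `SEMSEARCH_DRY_RUN=1`, calling `semsearch "API endpoints" src/ 10 5 0.4` prints the assembled command without running it, which is a cheap way to check the flags.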

**Output format** (best match first; the reported score appears to be the embedding distance, so lower means closer):
```
Match 1 (similarity: 0.12)
File: src/auth/handlers.py
Lines: 42-47
----
def authenticate_user(username: str, password: str) -> Optional[User]:
    """Authenticate user credentials against database."""
    user = get_user_by_username(username)
    if user and verify_password(password, user.password_hash):
        return user
    return None
----

Match 2 (similarity: 0.18)
File: src/middleware/auth.py
...
```
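
To hand results to another tool, the output can be reduced to `path:start-end` locations. The filter below is a sketch that assumes the output follows the format shown above exactly (`File: ` and `Lines: ` header lines); the function name `matches_to_locations` is hypothetical.

```bash
# Hypothetical helper: turn search output (in the format shown above) into "path:start-end" lines.
# Reads search output on stdin.
matches_to_locations() {
  awk '
    /^File: /  { file = substr($0, 7) }          # text after "File: "
    /^Lines: / { print file ":" substr($0, 8) }  # text after "Lines: "
  '
}
```

A typical use would be `search "authentication logic" src/ | matches_to_locations`. Note that a matched code line that itself begins with `File: ` or `Lines: ` would confuse this simple filter.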

### 2. Workspace Management (`workspace`)

For large codebases, create workspaces to cache embeddings and enable fast repeated searches:

```bash
# Create/activate workspace
workspace use my-project

# Set workspace via environment variable
export SEMTOOLS_WORKSPACE=my-project

# Search as usual; embeddings are cached in the workspace named by SEMTOOLS_WORKSPACE
search "query" src/

# Check workspace status
workspace status

# Clean up old workspaces
workspace prune
```

**Benefits:**
- **Fast repeated searches**: Embeddings cached, no re-computation
- **Large codebases**: IVF_PQ indexing for scalability
- **Session persistence**: Maintain context across multiple searches

**When to use workspaces:**
- Searching the same codebase multiple times
- Very large projects (1000+ files)
- Interactive exploration sessions
- CI/CD pipelines with repeated searches
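
In scripts and CI jobs it can be cleaner to scope the workspace to a single command than to export it for the whole shell. The helper below is a minimal sketch; `with_workspace` is a hypothetical name, and it assumes the workspace was already created with `workspace use`.

```bash
# Hypothetical helper: run one command with SEMTOOLS_WORKSPACE set only for that command.
# Usage: with_workspace <workspace-name> <command> [args...]
with_workspace() {
  if [ "$#" -lt 2 ]; then
    echo "usage: with_workspace <workspace-name> <command> [args...]" >&2
    return 2
  fi
  local name="$1"
  shift
  # The assignment applies to this one command and is not exported to the calling shell.
  SEMTOOLS_WORKSPACE="$name" "$@"
}
```

For example, `with_workspace my-project search "query" src/` runs a single search against the `my-project` workspace and leaves the surrounding environment untouched.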

### 3. Document Parsing (`parse`) ⚠️ Requires API Key

Convert documents to searchable markdown (requires LlamaParse API key):

```bash
# Parse PDFs to markdown
parse research_papers/*.pdf

# Parse Word documents
parse reports/*.docx

# Parse presentations
parse slides/*.pptx

# Parse, then search the converted files (parse prints their paths, which xargs passes to search)
parse docs/*.pdf | xargs search "neural networks"
```

**Supported formats:**
- PDF (.pdf)
- Word (.docx)
- PowerPoint (.pptx)

**Configuration:**
```bash
# Via environment variable
export LLAMA_CLOUD_API_KEY="llx-..."

# Via config file
cat > ~/.parse_config.json << 'EOF'
{
  "api_key": "llx-...",
  "max_concurrent_requests": 10,
  "timeout_seconds": 3600
}
EOF
```

**Important:** Document parsing is **optional**. Semantic search works without it.
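
Because parsing is optional, a script can check for a configured key before it calls `parse` and skip the step otherwise. The check below is a sketch only: `parse_configured` is a hypothetical name, and it looks just for the environment variable and config file described above without validating the key.

```bash
# Hypothetical preflight check (not part of semtools): is an API key configured?
# Returns 0 if LLAMA_CLOUD_API_KEY is set or ~/.parse_config.json has an "api_key" field.
parse_configured() {
  local config="$HOME/.parse_config.json"
  if [ -n "${LLAMA_CLOUD_API_KEY:-}" ]; then
    return 0
  fi
  # Plain-text check; it does not validate the JSON or the key itself.
  [ -f "$config" ] && grep -q '"api_key"' "$config"
}
```

A caller might write `parse_configured && parse docs/*.pdf`, so that environments without a key fall back to searching plain text files only.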

## Workflow Patterns

### Pattern 1: Concept Discovery

When you know what you're looking for conceptually but not by name:

```bash
# Step 1: Broad semantic search
search "rate limiting implementation" src/

# Step 2: Review results, refine query
search "throttle requests per user" src/ --top-k 10

# Step 3: Use ripgrep for exact follow-up
rg "RateLimiter" --type py src/
```
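
Steps 1 and 2 can also be combined: run several phrasings of the same concept and see which files keep coming back. The helper below is a sketch; `rank_files` is a hypothetical name, and it assumes `search` prints one `File: <path>` line per match, as in the output format shown earlier.

```bash
# Hypothetical helper: run several phrasings of one concept and count how often each file appears.
# Assumes `search` prints a "File: <path>" line per match.
# Usage: rank_files <path> <query> [query...]
rank_files() {
  local path="$1" query
  shift
  for query in "$@"; do
    search "$query" "$path"
  done |
    awk '/^File: / { count[substr($0, 7)]++ }
         END { for (f in count) print count[f], f }' |
    sort -k1,1nr -k2
}
```

Calling `rank_files src/ "rate limiting implementation" "throttle requests per user"` lists files with their hit counts, most frequent first; files that match several phrasings are usually the best place to start the ripgrep follow-up.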

### Pattern 2: Similar Code Finder

When you want to find code similar to a reference implementation:

```bash
# Step 1: Extract key concepts