
semantic-grep

In-process semantic search over text files or in-memory strings, using Gemini embeddings via the CF AI Gateway. Use when user wants fuzzy/conceptual search where exact-keyword grep would miss — "sessions discussing regulatory constraints", "code about retry logic", "notes mentioning burnout even if the word isn't there". Complements searching-codebases (regex/AST) and extracting-keywords (YAKE). Do NOT use when an exact string/regex match is what's wanted — grep/rg wins on speed and precision there.

Install in Claude Code
git clone --depth 1 https://github.com/oaustegard/claude-skills /tmp/semantic-grep && cp -r /tmp/semantic-grep/semantic-grep ~/.claude/skills/semantic-grep
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# Semantic Grep

jina-grep-style semantic search, done in-process via Python rather than as an external CLI. Embeds query + corpus chunks with `gemini-embedding-001`, ranks by cosine similarity, returns grep-format output.

## When Semantic Search Helps

The core trade-off (lifted from `jina-grep-cli`'s own docs and validated in testing):

| Task | Tool |
|------|------|
| Known exact string, filename, or regex | `grep` / `rg` / `searching-codebases` |
| "What files discuss concept X" when X may not appear verbatim | **semantic-grep** |
| Hybrid: prefilter with grep, rerank by concept | grep → `rerank_candidates()` |

**Regression test result (workshop session corpus, 135 docs):**
- *"handling regulatory constraints"* → top hit *"Engineering AI Systems Under Sovereignty Constraints"* (0.67). ✓
- *"sessions about GEPA"* → top hit *"Gemma, DeepMind's Family of Open Models"* (0.69). ✗ — false positive on phonetic neighbor. GEPA is mentioned verbatim in one session description; grep would find it correctly.

**Rule: when the user query reads like a named entity or keyword, try grep first. Only reach for semantic-grep when paraphrase/concept matching is actually needed.**
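One way to apply this rule is a cheap pre-check on the query before choosing a tool. The sketch below is a hypothetical heuristic, not part of this skill: `looks_like_keyword` and `pick_tool` are illustrative names, and the patterns (all-caps tokens, identifier-like tokens, double-quoted text, single-word queries) are assumptions that will misroute some queries, for example conceptual queries that happen to contain an acronym.

```python
import re

# Hypothetical helpers, not part of semantic-grep.
# Identifier-like tokens: snake_case, dotted names, file names, camelCase.
_IDENTIFIER = re.compile(r"[A-Za-z]+[_.][A-Za-z_.]+|[a-z]+[A-Z]\w*")
# All-caps tokens of two or more letters, such as GEPA.
_ACRONYM = re.compile(r"\b[A-Z]{2,}\b")


def looks_like_keyword(query: str) -> bool:
    """Guess whether the query is an exact-term lookup rather than a concept."""
    if '"' in query:  # explicitly quoted text suggests an exact match
        return True
    if _ACRONYM.search(query) or _IDENTIFIER.search(query):
        return True
    return len(query.split()) <= 1  # a single bare word


def pick_tool(query: str) -> str:
    """Return the tool to try first for this query."""
    return "grep" if looks_like_keyword(query) else "semantic-grep"
```

With this heuristic, the GEPA query from the regression test goes to grep first, while the regulatory-constraints query goes to semantic-grep.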

## Setup

Credentials via `proxy.env` (Cloudflare AI Gateway w/ BYOK — same pattern as `invoking-gemini`):

```
CF_ACCOUNT_ID=...
CF_GATEWAY_ID=...
CF_API_TOKEN=...
```

Direct-API fallback: `GOOGLE_API_KEY` or `GEMINI_API_KEY` env var. No dependencies beyond `requests` + `numpy`.
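The precedence described above (gateway credentials first, direct API key as fallback) can be sketched as follows. This is an illustration under assumptions, not the module's actual loader: `parse_env_text` and `pick_backend` are hypothetical names, and the `KEY=VALUE` parsing rules (comments, optional quotes) are guesses about the `proxy.env` format.

```python
CF_KEYS = ("CF_ACCOUNT_ID", "CF_GATEWAY_ID", "CF_API_TOKEN")


def parse_env_text(text):
    """Parse KEY=VALUE lines; blank lines and # comments are ignored."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def pick_backend(proxy_env, environ):
    """Choose "gateway" when all three CF values are set, else "direct"."""
    if all(proxy_env.get(key) for key in CF_KEYS):
        return "gateway"
    if environ.get("GOOGLE_API_KEY") or environ.get("GEMINI_API_KEY"):
        return "direct"
    raise RuntimeError("no gateway credentials or Gemini API key found")
```

In real use, `proxy_env` would come from reading `proxy.env` and `environ` would be `os.environ`.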

## Quick Start

```python
import sys
sys.path.insert(0, '/mnt/skills/user/semantic-grep/scripts')
from semantic_grep import semantic_grep, format_grep

# Directory of .txt files
results = semantic_grep("error handling under load", "/path/to/notes",
                        top_k=5, granularity="paragraph")
print(format_grep(results))
# notes/incidents.txt:42:  When the queue depth exceeds... [0.71]
# notes/postmortem.txt:8:  Under sustained traffic we saw... [0.68]
```

## Core API

### `semantic_grep(query, corpus, *, top_k=10, threshold=None, ...)`

Main search function.

- `query` *(str)* — the search query (embedded with `RETRIEVAL_QUERY` task type)
- `corpus` *(str | Path | list[Chunk])* — a file, directory, or pre-chunked list
- `top_k` *(int | None)* — max results; `None` = all above threshold
- `threshold` *(float | None)* — cosine similarity cutoff; `None` = no filter (top_k only)
- `granularity` *("paragraph" | "line")* — how to chunk files (default paragraph)
- `include` *(str)* — filename-glob filter when `corpus` is a directory (default `"*.txt"`). Matches against `Path.name` only, not the full path — `"*.md"` works, `"docs/*.md"` does not.
- `model` *(str)* — default `"gemini-embedding-001"`
- `dim` *(int)* — 128 / 768 / 1536 / 3072 (default 768; MRL-truncated + renormalized)
- `task` *("text" | "code")* — selects text vs code task types

Returns `list[Match]` where `Match` has `path`, `line`, `text`, `score`.
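How `top_k` and `threshold` interact can be sketched in plain Python. This is an assumption about the selection logic, not the module's code: `cosine` and `select` are hypothetical names, the threshold is treated as inclusive, and it is applied before the `top_k` cut.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select(scores, top_k=10, threshold=None):
    """Return indices of the kept scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if threshold is not None:
        order = [i for i in order if scores[i] >= threshold]
    return order if top_k is None else order[:top_k]
```

With `top_k=None` and a threshold set, every chunk at or above the cutoff is returned; with `threshold=None`, only the `top_k` cut applies.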

### `load_corpus(path, *, include="*.txt", granularity="paragraph") -> list[Chunk]`

Load and chunk a file or directory without embedding. Useful for inspecting what gets embedded before paying for the API call.
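For intuition about what a paragraph-granularity chunk looks like, here is a minimal sketch. It assumes paragraphs are separated by blank lines and that each chunk records the 1-based line number of its first line; `split_paragraphs` is a hypothetical helper and the real chunker may differ.

```python
def split_paragraphs(text):
    """Return (start_line, paragraph_text) pairs for blank-line-separated blocks."""
    chunks, buffer, start = [], [], None
    for lineno, line in enumerate(text.splitlines(), start=1):
        if line.strip():
            if start is None:
                start = lineno  # first line of a new paragraph
            buffer.append(line.strip())
        elif buffer:
            chunks.append((start, " ".join(buffer)))
            buffer, start = [], None
    if buffer:  # flush the final paragraph
        chunks.append((start, " ".join(buffer)))
    return chunks
```

Counting the resulting chunks gives a rough idea of how many texts an embedding call would send.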

### `embed_batch(texts, task_type, *, model, dim, group_size=100) -> np.ndarray`

Lower-level: embed a list of strings directly via `:batchEmbedContents`. Returns `(N, dim)` float32 array, rows normalized when `dim < 3072`.
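The two client-side steps implied here, grouping inputs by `group_size` and renormalizing MRL-truncated vectors, can be sketched without any API call. The helpers `groups` and `truncate_and_renormalize` are hypothetical, and plain lists are used to keep the example dependency-free, whereas the module itself returns NumPy arrays.

```python
import math


def groups(texts, group_size=100):
    """Split texts into consecutive groups of at most group_size items."""
    return [texts[i:i + group_size] for i in range(0, len(texts), group_size)]


def truncate_and_renormalize(vector, dim):
    """Keep the first dim components and rescale the result to unit length."""
    head = list(vector[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head
```

Renormalizing matters because a truncated prefix of a unit vector is generally shorter than unit length, which would otherwise skew dot-product scores.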

### `format_grep(matches, *, max_text_chars=200, show_score=True) -> str`

Format matches as grep output: `path:line: snippet  [score]`.
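A rough sketch of that line format, using a stand-in `Match` dataclass with the four fields listed earlier. The whitespace collapsing, the `...` truncation marker, and the two-decimal score are assumptions inferred from the Quick Start output; `format_line` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass
class Match:  # stand-in for the module's Match, same four fields
    path: str
    line: int
    text: str
    score: float


def format_line(match, max_text_chars=200, show_score=True):
    """Render one match as `path:line: snippet  [score]`."""
    snippet = " ".join(match.text.split())  # collapse newlines and runs of spaces
    if len(snippet) > max_text_chars:
        snippet = snippet[:max_text_chars].rstrip() + "..."
    out = f"{match.path}:{match.line}: {snippet}"
    return f"{out}  [{match.score:.2f}]" if show_score else out
```

Keeping the `path:line:` prefix means the output can be fed to editors and tools that already understand grep-style locations.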

## Pipe-mode Rerank Pattern

The highest-leverage use isn't naive full-corpus semantic search — it's hybrid retrieval: **fast coarse filter → semantic rerank**.

```python
import subprocess
from semantic_grep import Chunk, semantic_grep, format_grep

# Stage 1: fast exact/regex prefilter with rg
result = subprocess.run(
    ["rg", "-n", "--no-heading", "error|fail|timeout", "logs/"],
    capture_output=True, text=True,
)

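# Parse `path:line:text` into Chunks, skipping any line that lacks that shape
chunks = []
for raw in result.stdout.splitlines():
    parts = raw.split(":", 2)
    if len(parts) != 3 or not parts[1].isdigit():
        continue
    path, line, text = parts
    chunks.append(Chunk(path=path, line=int(line), text=text))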

# Stage 2: semantic rerank on the prefiltered subset
ranked = semantic_grep("intermittent queue saturation during peak traffic",
                       chunks, top_k=10)
print(format_grep(ranked))
```

This is how you scale past the "embed the whole corpus every call" limit without needing a vector DB. The exact-match stage cheaply cuts millions of lines to thousands; the semantic stage then reranks only those.

## Task Types (Gemini)

- **text mode** (default): query → `RETRIEVAL_QUERY`, docs → `RETRIEVAL_DOCUMENT`. Asymmetric — documented to outperform symmetric encoding for retrieval.
- **code mode**: query → `CODE_RETRIEVAL_QUERY`, docs → `RETRIEVAL_DOCUMENT`. Use when searching code with natural-language queries.

Use `SEMANTIC_SIMILARITY` (symmetric) only if you're doing pairwise similarity, not retrieval. This module doesn't expose that path yet.
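The mode-to-task-type mapping described above amounts to a small lookup table. The sketch below is illustrative only; `TASK_TYPES` and `task_types_for` are hypothetical names, and only the task-type strings come from the two modes listed here.

```python
# (query task type, document task type) per mode
TASK_TYPES = {
    "text": ("RETRIEVAL_QUERY", "RETRIEVAL_DOCUMENT"),
    "code": ("CODE_RETRIEVAL_QUERY", "RETRIEVAL_DOCUMENT"),
}


def task_types_for(task="text"):
    """Return the (query, document) task types for a mode."""
    try:
        return TASK_TYPES[task]
    except KeyError:
        raise ValueError(f"unknown task {task!r}; expected 'text' or 'code'") from None
```

Both modes embed documents the same way; only the query-side task type changes.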

## Model Notes

`gemini-embedding-001` (GA since Feb 2026):
- 2,048 input token limit per text. Longer texts are truncated at ~8K chars (approximation).
- Matryoshka (MRL) — 3072 native dims, safely truncatable to 1536/768/256/128.
- 3072 is auto-normalized; lower dims need client-side renorm (handled here).
- Pricing: $0.15 / 1M input tokens. 135 medium paragraphs ≈ 15K tokens ≈ $0.002 per query.
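The per-query estimate follows directly from the price and the token count. The sketch below only restates that arithmetic: the 15K-token figure is the estimate quoted above, and `embedding_cost_usd` is a hypothetical helper.

```python
PRICE_PER_MILLION_TOKENS_USD = 0.15  # from the pricing bullet above


def embedding_cost_usd(tokens, calls=1):
    """Cost of embedding `tokens` input tokens, repeated `calls` times."""
    return tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS_USD * calls


# 15K tokens per call works out to about $0.00225, i.e. roughly $0.002.
per_query = embedding_cost_usd(15_000)
# The corpus is re-embedded on every call, so cost grows linearly with queries.
per_hundred_queries = embedding_cost_usd(15_000, calls=100)
```

The query text itself adds a handful of tokens on top of this, which is negligible next to the corpus.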

`gemini-embedding-2-preview` (March 2026) is multimodal and currently top of MTEB. Set `model="gemini-embedding-2-preview"` to opt in once the preview stabilizes.

## Limitations (v0.1.1)

- **No persistent index.** Every call re-embeds the corpus. Fine for <~1K chunks; prohibitive for real knowledge bases. Phase 2: cache embeddings by content hash.
- **Token budget is approximated by char count (×1.5).** Conservative for mixed-script text; over-truncates English slightly. An exact count would need the Gemini tokenizer endpoint, which costs an extra call per embed.
- **Batch bulk-failure diagnostic.** If one text in a group of 100 overflows