
pinecone:full-text-search

This skill enables creation, data ingestion, and querying of Pinecone full-text-search indexes using the preview API (2026-01.alpha). Use it when building searchable text indexes with dense or sparse vector fields, ingesting documents via the included `scripts/ingest.py` helper, or constructing queries with text matching, vector scoring, and metadata filters. The skill requires Pinecone SDK v9.0 or later and provides inline documentation for composing `documents.search()` calls.

Repository: pinecone-claude-code-plugin

Install in Claude Code

git clone --depth 1 https://github.com/pinecone-io/pinecone-claude-code-plugin /tmp/pinecone-full-text-search && cp -r /tmp/pinecone-full-text-search/skills/full-text-search ~/.claude/skills/pinecone-full-text-search

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Pinecone Full-Text Search

> **Requires `pinecone` Python SDK ≥ 9.0** (`pip install "pinecone>=9.0"`; quote the specifier so the shell does not treat `>` as a redirect). The FTS document-schema API lives under `pinecone.preview` and is incomplete or absent in earlier SDK builds. The packaged helper scripts pin `pinecone==9.0.0` via PEP 723 inline metadata; if you're writing your own code against this skill, pin v9 explicitly. The wire API version is `2026-01.alpha`.

> **Authoritative reference (last resort).** If you hit a question this skill and its `references/*.md` files don't answer, the official Pinecone FTS docs are at <https://docs.pinecone.io/guides/search/full-text-search>. Prefer this skill's content for anything covered here — the docs may describe surfaces (e.g. classic vector API) that don't apply to the document-schema FTS path. Consult the link only when you're genuinely stuck.

> **Tell the user up front:** "This skill ships a helper at `scripts/ingest.py` that handles bulk ingestion safely (batched upsert, error inspection, readiness polling). When we get to the ingest step, I'll use it." Surface this at the start of the conversation so the user knows the helper exists. Query construction is hand-written `documents.search(...)` per the **Querying** section below — there is no query helper.

A workflow skill for building a Pinecone full-text-search index with the preview API (`pinecone.preview`, API version `2026-01.alpha`, public preview as of April 2026). Covers schema design (text, dense vector, sparse vector, filterable metadata), ingestion (including async indexing and polling), and query construction (`text` / `query_string` / `dense_vector` / `sparse_vector` scoring; `$match_phrase` / `$match_all` / `$match_any` text-match filters; `$eq` / `$in` / `$gte` / `$exists` / `$and` / `$or` / `$not` metadata filters).

## Scope — this skill is for the document-schema FTS API only

This skill covers `pc.preview.indexes.create(..., schema=...)`, `pc.preview.index(name)`, `idx.documents.upsert(...)` / `idx.documents.batch_upsert(...)` / `idx.documents.search(...)`. If you find yourself reaching for any of the following, **stop** — those are different Pinecone APIs and this skill's guidance and helpers won't apply:

- **Classic vector / records API**: `pc.Index(name)`, `index.upsert(vectors=[...])` / `index.upsert_records(...)`, `index.query(vector=..., sparse_vector=...)`, `index.search_records(...)`, `pc.create_index(...)` with `ServerlessSpec`, and the legacy `pinecone_text.sparse.BM25Encoder` for sparse-dense hybrid. These are for indexes WITHOUT a schema (raw vectors).
- **Integrated-embedding indexes**: `pc.create_index_for_model(...)` with `embed={...}`. Pinecone vectorizes text server-side. Different upsert/search shapes. Cannot be combined with `full_text_search` fields in the same index.

If the user already has a non-document-schema index, they can stand up a separate document-schema index alongside it — the two are independent — but you can't add FTS fields to a classic index after the fact.

## Querying — construct `documents.search(...)` calls

For any task that asks you to query an FTS index, you write a `documents.search(...)` call directly. The schema is authoritative — describe the index live before constructing the call so you know which fields are FTS-enabled, which are filterable, and which are vectors.

**Workflow:**

1. **Discover the schema.** Call `pc.preview.indexes.describe(<index>)` and read the `schema.fields` dict. Each field's class indicates its type (`PreviewStringField`, `PreviewIntegerField`, `PreviewDenseVectorField`, etc.); attributes tell you whether it's FTS-enabled (`full_text_search`), filterable, or carries a `dimension`. Skip this step only if you've already seen the schema in this conversation.
2. **Construct the call** matching the rules below — one scoring type per request, hard requirements in `filter`, ranking signals in `score_by`, `include_fields` explicit on every call.
3. **Execute** with `idx = pc.preview.index(name=<index>); resp = idx.documents.search(...)` and read `resp.matches`.
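The three workflow steps can be sketched as one function. This is a minimal sketch, not the skill's own helper: `fts_field_names` and `keyword_search` are hypothetical names, `pc` is assumed to be an already-constructed `Pinecone` client from SDK v9, and the check on `full_text_search` assumes the attribute is `None` or `False` on fields where FTS is not enabled. Verify that against a live `describe` result before relying on it.

```python
def fts_field_names(schema):
    """Return the names of fields that look FTS-enabled.

    Assumption: a field without FTS either lacks the `full_text_search`
    attribute or has it set to None/False.
    """
    return [
        name
        for name, field in schema.fields.items()
        if getattr(field, "full_text_search", None) not in (None, False)
    ]


def keyword_search(pc, index_name, query, top_k=10):
    """Run a BM25 keyword search on the first FTS-enabled field (hypothetical helper)."""
    # Step 1: discover the schema so the field name is not guessed.
    description = pc.preview.indexes.describe(index_name)
    fts_fields = fts_field_names(description.schema)
    if not fts_fields:
        raise ValueError(f"index {index_name!r} has no FTS-enabled fields")

    # Step 2: construct the call: one scoring type, include_fields explicit.
    search_kwargs = {
        "namespace": "__default__",
        "top_k": top_k,
        "score_by": [{"type": "text", "field": fts_fields[0], "query": query}],
        "include_fields": ["*"],
    }

    # Step 3: execute and return the matches.
    idx = pc.preview.index(name=index_name)
    resp = idx.documents.search(**search_kwargs)
    return resp.matches
```

Taking `pc` as a parameter keeps the function free of credentials and lets it be exercised against a stand-in client before it touches a real index.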

**Canonical shapes:**

```python
# Pure BM25 keyword search
resp = idx.documents.search(
    namespace="__default__",
    top_k=10,
    score_by=[{"type": "text", "field": "body", "query": "machine learning"}],
    filter={"year": {"$gt": 2024}, "category": {"$eq": "ai"}},  # optional
    include_fields=["*"],   # always pass explicitly
)

# Hybrid: dense ranking with a lexical filter (one type in score_by + filter narrows)
resp = idx.documents.search(
    namespace="__default__",
    top_k=10,
    score_by=[{"type": "dense_vector", "field": "embedding", "values": query_embedding}],
    filter={"body": {"$match_all": "TensorFlow"}, "year": {"$gt": 2024}},
    include_fields=["*"],
)
```

**Key rules** (the server enforces these; following them locally keeps the agent loop tight):

- `score_by` is a list of clauses, but **exactly one scoring type per request** (server rejects mixed types). Multi-field BM25 is the one exception: multiple `text` clauses, or one `query_string` with `fields: [...]`. To combine BM25 + dense signals, restrict the dense search with a text-match filter (`$match_all` / `$match_phrase` / `$match_any`); do NOT mix scoring types in `score_by`.
- `filter` keys are field names (must exist in schema and be filterable) OR logical operators (`$and`, `$or`, `$not`). Field values are operator dicts (`{"$gt": 5}`, NOT bare values).
- `include_fields` is required on every call. Pass `["*"]` for all stored fields, `[]` for ids+score only, or a list of names. Some SDK builds fail with an HTTP 400 or 422 if it's omitted.
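The key rules above can also be checked locally before a request is sent. The sketch below is illustrative only: `check_score_by`, `check_filter`, and `build_search_kwargs` are hypothetical helpers, not part of the SDK or of this skill's scripts. It assumes `$and` / `$or` take a list of sub-filters and `$not` takes a single sub-filter, and it expects the caller to pass the set of field names the schema allows in `filter`.

```python
LOGICAL_OPERATORS = {"$and", "$or", "$not"}


def check_score_by(score_by):
    """Reject an empty score_by or one that mixes scoring types."""
    types = {clause["type"] for clause in score_by}
    if len(types) != 1:
        raise ValueError(
            f"exactly one scoring type per request, got {sorted(types)}"
        )


def check_filter(flt, allowed_fields):
    """Check that keys are known fields or logical operators, and that
    every field value is an operator dict rather than a bare value."""
    for key, value in flt.items():
        if key in LOGICAL_OPERATORS:
            # Assumption: $and/$or hold a list of filters, $not holds one.
            children = value if isinstance(value, list) else [value]
            for child in children:
                check_filter(child, allowed_fields)
        elif key not in allowed_fields:
            raise ValueError(f"{key!r} is not a filterable schema field")
        elif not (
            isinstance(value, dict)
            and value
            and all(op.startswith("$") for op in value)
        ):
            raise ValueError(
                f"{key!r} needs an operator dict such as {{'$eq': ...}}"
            )


def build_search_kwargs(score_by, include_fields, allowed_fields=(),
                        flt=None, top_k=10, namespace="__default__"):
    """Validate locally, then return kwargs for idx.documents.search(**kwargs)."""
    check_score_by(score_by)
    if include_fields is None:
        raise ValueError("pass include_fields explicitly, e.g. ['*'] or []")
    kwargs = {
        "namespace": namespace,
        "top_k": top_k,
        "score_by": score_by,
        "include_fields": include_fields,
    }
    if flt is not None:
        check_filter(flt, set(allowed_fields))
        kwargs["filter"] = flt
    return kwargs
```

A call such as `build_search_kwargs(score_by=[...], include_fields=["*"], allowed_fields={"body", "year"}, flt={"year": {"$gt": 2024}})` returns a dict that can be unpacked into `idx.documents.search(**kwargs)`. The server remains the authority; these checks only surface the common mistakes earlier.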

**Clause shapes** (for `score_by`):

| `type` | Required keys | When to pick this |
|---|---|---|
| `text` | `field` (string FTS), `query` | Open-ended keyword search; BM25 ranking on one field |
| `query_string` | `query` (Lucene), `fields` optional | Lucene boost (`^N`), proximity (`~N`), cross-field boolean, phrase prefix |
| `dense_vector` | `field` (dense_vector), `values` (list of floats) | Semantic / mood / t