algo-nlp-similarity

Calculate text similarity using lexical and semantic methods for matching and deduplication. Use this skill when the user needs to find similar documents, detect near-duplicates, or measure semantic closeness between texts — even if they say 'how similar are these texts', 'find duplicates', or 'semantic matching'.

Install in Claude Code
git clone --depth 1 https://github.com/asgard-ai-platform/skills /tmp/algo-nlp-similarity && cp -r /tmp/algo-nlp-similarity/algo-nlp-similarity ~/.claude/skills/algo-nlp-similarity
Then open a new Claude Code session; the skill loads automatically.

SKILL.md

# Text Similarity

## Overview

Text similarity measures how close two texts are in meaning or surface form. Lexical methods (Jaccard, cosine on TF-IDF) compare word overlap. Semantic methods (sentence embeddings) capture meaning even when the wording differs. The choice depends on whether you need surface-form matching or meaning matching.

## When to Use

**Trigger conditions:**
- Finding similar or duplicate documents in a collection
- Matching queries to FAQ answers or knowledge base entries
- Detecting plagiarism or content reuse

**When NOT to use:**
- For topic-level grouping (use topic modeling / LDA)
- For entity extraction from text (use NER)

## Algorithm

```
IRON LAW: Lexical Similarity ≠ Semantic Similarity
"The car is fast" and "The automobile is speedy" have LOW lexical
similarity (different words) but HIGH semantic similarity (same meaning).
"Bank of the river" and "Bank account" share the word "bank" (lexical
overlap) but have LOW semantic similarity (different senses). Choose the
method that matches your definition of "similar."
```

### Phase 1: Input Validation
Determine: similarity type needed (lexical or semantic), text preprocessing requirements, scale (pairwise vs all-pairs vs query-to-corpus).
**Gate:** Texts preprocessed, method selected.

### Phase 2: Core Algorithm
**Lexical methods:**
- Jaccard: |A∩B| / |A∪B| on word sets
- Cosine on TF-IDF vectors: cos(θ) = (A·B) / (‖A‖ × ‖B‖), where A and B are the TF-IDF vectors and ‖·‖ is the Euclidean norm
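
The two lexical measures can be sketched with the standard library alone. This is a minimal illustration, not a reference implementation: the `tokenize` helper, the raw term-frequency weighting, and the smoothed IDF formula are assumptions chosen for brevity, and a library vectorizer may weight terms differently.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (assumed preprocessing)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| on word sets; returns 0.0 when both texts are empty."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document.

    Assumes raw term frequency and a smoothed IDF, ln((1 + N) / (1 + df)) + 1.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]


def cosine(u, v):
    """Cosine similarity of two sparse dict vectors; 0.0 if either is all-zero."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = ["The car is fast", "The automobile is speedy", "The car is fast and red"]
vecs = tfidf_vectors(docs)
print(round(jaccard(docs[0], docs[1]), 2))  # 0.33: only "the" and "is" are shared
print(round(cosine(vecs[0], vecs[2]), 2))
```

Note that the first pair still scores 0.33 on Jaccard because of shared stop words; removing stop words before comparison would bring it to 0.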

**Semantic methods:**
- Sentence embeddings: encode texts with sentence-transformers (all-MiniLM-L6-v2)
- Cosine similarity on embedding vectors
- For large-scale: use FAISS or Annoy for approximate nearest neighbor search
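
A rough sketch of the semantic path follows. The `embed` helper assumes the `sentence-transformers` package is installed and is only one way to call it; `cosine` and `top_k` are hypothetical helper names, and `top_k` is an exact brute-force search that an approximate index such as FAISS or Annoy would replace at scale.

```python
import math


def embed(texts, model_name="all-MiniLM-L6-v2"):
    """Encode texts into unit-length vectors with sentence-transformers.

    Imported lazily so the helpers below work without the package installed.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(texts, normalize_embeddings=True).tolist()


def cosine(u, v):
    """Cosine similarity of two dense vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, corpus_vecs, k=3):
    """Exact nearest neighbours by cosine similarity, O(N) per query."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(corpus_vecs)]
    scored.sort(reverse=True)
    return [(idx, score) for score, idx in scored[:k]]


# Toy 3-dimensional vectors standing in for real embeddings.
corpus = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
print(top_k([1.0, 0.0, 0.0], corpus, k=2))
```

With real data, `corpus = embed(documents)` and `top_k(embed([query])[0], corpus)` would give the query-to-corpus case.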

### Phase 3: Verification
Spot-check: highly similar pairs should be genuinely similar. Low-similarity pairs should be genuinely different. Check threshold calibration.
**Gate:** Similarity scores align with human judgment on sample pairs.
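
One possible way to check threshold calibration against human judgment is to sweep candidate cut-offs over a small labeled sample and keep the one with the best F1. The sample data and the function names below are hypothetical, and F1 is an assumed objective; a deduplication pipeline might prefer to optimize precision instead.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 when pairs scoring >= threshold are predicted 'similar'."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def calibrate_threshold(scores, labels):
    """Try each observed score as a cut-off and keep the one with the best F1."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: (f1_at_threshold(scores, labels, t), t))


# Hypothetical human-labeled sample: (similarity score, judged similar?).
sample = [(0.95, True), (0.88, True), (0.81, True),
          (0.74, False), (0.62, False), (0.40, False)]
scores = [s for s, _ in sample]
labels = [y for _, y in sample]

best = calibrate_threshold(scores, labels)
print(best, f1_at_threshold(scores, labels, best))  # 0.81 1.0
```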

### Phase 4: Output
Return similarity scores or nearest neighbors.

## Output Format

```json
{
  "similarities": [{"text_a": "doc1", "text_b": "doc5", "score": 0.92, "method": "semantic_cosine"}],
  "metadata": {"method": "sentence-transformers", "model": "all-MiniLM-L6-v2", "pairs_computed": 500}
}
```
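
As an illustration, a result in the shape above could be assembled like this. `build_output` and its parameter names are hypothetical; only the JSON keys come from the format shown.

```python
import json


def build_output(scored_pairs, pair_method, library, model):
    """Assemble the result dict from (text_a_id, text_b_id, score) tuples."""
    ordered = sorted(scored_pairs, key=lambda p: p[2], reverse=True)
    similarities = [
        {"text_a": a, "text_b": b, "score": round(score, 2), "method": pair_method}
        for a, b, score in ordered
    ]
    metadata = {"method": library, "model": model, "pairs_computed": len(ordered)}
    return {"similarities": similarities, "metadata": metadata}


result = build_output(
    [("doc1", "doc5", 0.9234), ("doc2", "doc7", 0.4102)],
    pair_method="semantic_cosine",
    library="sentence-transformers",
    model="all-MiniLM-L6-v2",
)
print(json.dumps(result, indent=2))
```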

## Examples

### Sample I/O
**Input:** Text A: "How to reset my password", Text B: "I forgot my login credentials"
**Expected:** Lexical (Jaccard) ≈ 0.11 (only "my" is shared among 9 distinct words; 0 after stop-word removal). Semantic ≈ 0.82 (same intent; the exact value varies by model).

### Edge Cases
| Input | Expected | Why |
|-------|----------|-----|
| Identical texts | Score = 1.0 | Exact match |
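| Empty text | Undefined or 0 | Jaccard becomes 0/0 and cosine divides by a zero norm; return 0 or flag the pair instead of raising |
| Different languages | Lexical ≈ 0, semantic depends on model | Word overlap is rare across languages; multilingual models can match cross-language |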
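
The empty-text row can be handled by returning a sentinel instead of dividing by zero. The sketch below returns `None` for the undefined case, which is one assumed convention; returning 0 is the other option the table allows. The function names are hypothetical.

```python
import math


def safe_cosine(u, v):
    """Cosine similarity, or None when either vector has zero norm."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return None  # undefined: avoid ZeroDivisionError, let the caller decide
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def safe_jaccard(tokens_a, tokens_b):
    """Jaccard on token sets, or None when both sets are empty (0/0)."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return None
    return len(set_a & set_b) / len(set_a | set_b)


print(safe_jaccard([], []))                 # None
print(safe_jaccard(["reset"], []))          # 0.0
print(safe_cosine([0.0, 0.0], [1.0, 2.0]))  # None
```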

## Gotchas

- **Threshold is use-case specific**: 0.8 similarity might mean "duplicate" for deduplication but "somewhat related" for recommendation. Calibrate threshold on labeled examples.
- **Text length effects**: Cosine normalizes for vector magnitude, but TF-IDF scores are still affected by document length. Very short texts yield sparse vectors, so one shared or missing word can swing the score. Use embeddings for short texts.
- **Embedding model choice**: Different models have different strengths. all-MiniLM-L6-v2 is fast but less accurate than larger models. Match model to performance needs.
- **Computational scaling**: All-pairs similarity on N documents is O(N²). For large corpora, use approximate methods (locality-sensitive hashing, FAISS).
- **Domain adaptation**: General-purpose embedding models may not capture domain-specific similarity (legal, medical). Fine-tune on domain data for best results.
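
To make the scaling point concrete, here is a minimal MinHash sketch with banding, one form of locality-sensitive hashing for Jaccard similarity. It is an illustration under assumptions: the seeded SHA-256 hash, the 64-hash signature, and the 16-band split are arbitrary choices, all function names are hypothetical, and a production system would more likely use a dedicated library.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def _hash(token, seed):
    """64-bit hash of a token under a given seed (seeded SHA-256, truncated)."""
    digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")


def minhash_signature(tokens, num_hashes=64):
    """Minimum hash per seed; tokens must be non-empty."""
    unique = set(tokens)
    return [min(_hash(t, seed) for t in unique) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(signatures, bands=16):
    """Documents sharing at least one identical band become candidate pairs."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy cat",
    "c": "completely unrelated sentence about databases",
}
sigs = {doc_id: minhash_signature(text.split()) for doc_id, text in docs.items()}
print(estimate_jaccard(sigs["a"], sigs["b"]))  # estimate of the true value 7/9
print(candidate_pairs(sigs))
```

Only candidate pairs need an exact similarity computation afterwards, which avoids scoring all N² pairs.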

## References

- For embedding model comparison and benchmarks, see `references/model-benchmarks.md`
- For approximate nearest neighbor search at scale, see `references/ann-search.md`