algo-nlp-lda
Implement LDA topic modeling to discover latent topics in document collections. Use this skill when the user needs to extract topics from a text corpus, categorize documents by theme, or explore thematic structure — even if they say 'what are the main topics', 'topic extraction', or 'document clustering by theme'.
git clone --depth 1 https://github.com/asgard-ai-platform/skills /tmp/algo-nlp-lda && cp -r /tmp/algo-nlp-lda/algo-nlp-lda ~/.claude/skills/algo-nlp-ldaSKILL.md
# LDA Topic Modeling
## Overview
Latent Dirichlet Allocation models each document as a mixture of topics and each topic as a distribution over words. Discovers K latent topics from a corpus without supervision. Uses Gibbs sampling or variational inference. Complexity: O(N × K × iterations) where N = total word tokens.
## When to Use
**Trigger conditions:**
- Discovering latent themes in a large document collection
- Organizing/categorizing documents by automatically discovered topics
- Exploratory text analysis when categories are unknown
**When NOT to use:**
- When categories are known (use supervised classification)
- For short texts (tweets, titles) — too few words per document for reliable topic assignment
- When you need semantic understanding (use embeddings)
## Algorithm
```
IRON LAW: The Number of Topics K Must Be Chosen, Not Discovered
LDA does NOT tell you how many topics exist. K is a hyperparameter.
Too few topics: overly broad, mixed themes. Too many: fragmented,
redundant topics. Use coherence score (C_v) to compare K values,
but the final choice requires human judgment on topic interpretability.
```
### Phase 1: Input Validation
Preprocess: tokenize, remove stop words, apply lemmatization. Build document-term matrix. Filter: remove terms appearing in <5 or >50% of documents.
**Gate:** Clean DTM, vocabulary size reasonable (1K-50K terms).
### Phase 2: Core Algorithm
1. Choose K (start with √(N/2), try range K=5,10,15,20,...)
2. Set hyperparameters: α = 50/K (document-topic density), β = 0.01 (topic-word density)
3. Run LDA (Gibbs sampling: 1000+ iterations, or variational inference)
4. Extract: topic-word distributions (top 10-20 words per topic) and document-topic distributions
### Phase 3: Verification
Evaluate: topic coherence (C_v score, higher is better), manual inspection of top words per topic, check for "junk" topics (mixed incoherent words).
**Gate:** Coherence score acceptable, topics are humanly interpretable.
### Phase 4: Output
Return topics with top words and document assignments.
## Output Format
```json
{
"topics": [{"id": 0, "label": "finance", "top_words": ["revenue", "profit", "quarter", "growth"], "coherence": 0.55}],
"doc_topics": [{"doc_id": "d1", "dominant_topic": 0, "topic_distribution": [0.7, 0.1, 0.2]}],
"metadata": {"K": 10, "coherence_avg": 0.48, "documents": 5000, "vocabulary": 8000}
}
```
## Examples
### Sample I/O
**Input:** 1000 news articles, K=5
**Expected:** Topics like: {politics, sports, technology, business, entertainment} with coherent top words per topic.
### Edge Cases
| Input | Expected | Why |
|-------|----------|-----|
| Very short documents | Poor topic assignment | Too few words for reliable mixture estimation |
| Homogeneous corpus | 1-2 topics dominate | All documents are similar, limited topic diversity |
| K=1 | Single topic = corpus vocabulary | Degenerate case, no discrimination |
## Gotchas
- **Stop words MUST be removed**: LDA will create "junk" topics dominated by common words ("the", "is", "and") if stop words remain.
- **Topic labeling is manual**: LDA gives word distributions, NOT topic names. You must interpret and label topics based on top words.
- **Reproducibility**: Gibbs sampling is stochastic. Different random seeds give different topics. Run multiple times and check stability.
- **Dynamic topics**: Standard LDA assumes topics are static. For evolving corpora (news over years), use Dynamic Topic Models.
- **Hyperparameter sensitivity**: Low α produces documents with fewer, more distinct topics. Low β produces topics with fewer, more specific words. Tune or use automatic methods.
## References
- For coherence metrics and K selection, see `references/topic-evaluation.md`
- For dynamic and correlated topic models, see `references/advanced-lda.md`Implement and select ad bidding strategies from manual CPC to automated target-CPA and target-ROAS. Use this skill when the user needs to choose a bidding strategy, set up automated bidding, or optimize bid parameters — even if they say 'what bidding strategy should I use', 'target CPA setup', or 'smart bidding configuration'.
Optimize advertising budget allocation across campaigns using marginal returns analysis. Use this skill when the user needs to distribute budget across multiple campaigns, optimize spend pacing, or maximize overall ROAS under budget constraints — even if they say 'how to split my ad budget', 'campaign budget optimization', or 'diminishing returns on ad spend'.
Build CTR prediction models for estimating ad click-through rates from features. Use this skill when the user needs to predict click probability, build an ad ranking model, or evaluate ad creative performance — even if they say 'predict click rate', 'ad relevance scoring', or 'which ad will get more clicks'.
Implement Generalized Second Price auction for ad slot allocation and pricing. Use this skill when the user needs to understand search ad auctions, compute ad positions and costs-per-click, or analyze bidding dynamics — even if they say 'how does Google Ads auction work', 'ad rank calculation', or 'second price auction for ads'.
Implement VCG mechanism for incentive-compatible ad slot allocation with truthful bidding. Use this skill when the user needs to design a truthful auction mechanism, compute externality-based payments, or understand why platforms may prefer GSP over VCG — even if they say 'truthful auction design', 'VCG payments', or 'incentive-compatible mechanism'.
Explain blockchain fundamentals including distributed ledger architecture, consensus mechanisms, and block structure. Use this skill when the user needs to understand blockchain concepts, evaluate whether blockchain fits a use case, or design a blockchain-based solution — even if they say 'how does blockchain work', 'do I need blockchain', or 'distributed ledger'.
Design and implement smart contracts as self-executing programmatic agreements on blockchain. Use this skill when the user needs to build automated on-chain logic, evaluate smart contract security, or design tokenized business rules — even if they say 'smart contract development', 'automated agreement', or 'on-chain logic'.
Implement BM25 ranking function for e-commerce product search relevance scoring. Use this skill when the user needs to build a text-based product search engine, improve search result relevance, or replace basic TF-IDF with a more robust ranking function — even if they say 'product search ranking', 'search relevance', or 'BM25 implementation'.