celltypist-cell-annotation
Automated scRNA-seq cell type annotation via pre-trained logistic regression. 45+ models: immune, gut, lung, brain, fetal, cancer microenvironments. Input normalized AnnData; outputs per-cell labels, majority-vote cluster labels, confidence scores. Use for fast, reference-backed annotation without manual marker inspection.
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/celltypist-cell-annotation && cp -r /tmp/celltypist-cell-annotation/skills/genomics-bioinformatics/single-cell/celltypist-cell-annotation ~/.claude/skills/celltypist-cell-annotation
# CellTypist Cell Type Annotation
## Overview
CellTypist is an automated cell type classifier for single-cell RNA-seq data built on logistic regression models trained on curated reference atlases. Given a normalized AnnData object, it predicts cell type labels at the single-cell level and optionally applies majority voting within user-defined clusters to produce consensus, biologically coherent annotations. The tool ships with 45+ ready-to-use models spanning pan-immune, organ-specific, and developmental contexts, and supports training custom models from labeled data.
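The majority-voting step described above can be illustrated with a small, dependency-free sketch. It is not CellTypist's implementation: the function name `majority_vote` is hypothetical, and the `min_prop` threshold and the `"Heterogeneous"` fallback label are assumptions modeled on how the tool's options are commonly described.

```python
from collections import Counter, defaultdict


def majority_vote(labels, clusters, min_prop=0.0):
    """Assign each cluster the most frequent per-cell label among its members."""
    by_cluster = defaultdict(list)
    for label, cluster in zip(labels, clusters):
        by_cluster[cluster].append(label)

    consensus = {}
    for cluster, members in by_cluster.items():
        top_label, top_count = Counter(members).most_common(1)[0]
        # Assumed behavior: use a placeholder when no label is dominant enough
        if top_count / len(members) < min_prop:
            top_label = "Heterogeneous"
        consensus[cluster] = top_label

    # Broadcast the cluster-level label back to every cell
    return [consensus[c] for c in clusters]


# Toy per-cell predictions and cluster assignments (made-up values)
per_cell = ["CD4+ T", "CD4+ T", "CD8+ T", "B", "B", "B"]
clusters = ["0", "0", "0", "1", "1", "1"]
voted = majority_vote(per_cell, clusters)
print(voted)
```

In this toy input the single `CD8+ T` call in cluster `0` is overruled by its two neighbors, which is the smoothing effect that majority voting is meant to provide.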
## When to Use
- Annotating PBMC, whole-blood, lymph node, or other immune cell datasets using a single standardized reference model
- Generating a first-pass cell type annotation before manual curation with canonical marker genes
- Annotating cluster-level cell types in published or in-house datasets using majority voting to smooth noisy per-cell predictions
- Comparing annotation results across multiple tissue-specific models to determine the most biologically relevant reference
- Training a custom CellTypist model from a labeled reference dataset for a tissue or species not covered by pre-built models
- Quantifying annotation confidence to flag low-certainty cells (confidence score < 0.5) for manual review or exclusion
- Use **scVI/scANVI** (scvi-tools-single-cell) instead when you need probabilistic label transfer with batch correction and uncertainty quantification via a variational autoencoder
- Use **popV** (popv-cell-annotation) instead when you want ensemble consensus from 10+ methods including deep learning and KNN-based approaches
## Prerequisites
- **Python packages**: `celltypist>=1.6`, `scanpy>=1.9`, `anndata`
- **Data requirements**: AnnData with log1p-transformed counts in `adata.X`, normalized to a target sum of 10,000 counts per cell. Raw counts must be normalized before calling CellTypist.
- **Environment**: Python 3.8+; 8 GB RAM sufficient for most datasets; internet access required for model downloads (first run only)
```bash
pip install celltypist "scanpy[leiden]" anndata
```
## Quick Start
Minimal pipeline: annotate a preprocessed AnnData with the pan-immune model.
```python
import celltypist
import scanpy as sc
# Load a preprocessed AnnData (normalized + log1p, Leiden clusters already in adata.obs)
adata = sc.read_h5ad("preprocessed_pbmc.h5ad")
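# Run annotation with majority voting over the existing Leiden clusters
# (assumes they are stored in adata.obs["leiden"]; omit over_clustering to let
# CellTypist over-cluster the data itself)
predictions = celltypist.annotate(
    adata,
    model="Immune_All_Low.pkl",
    majority_voting=True,
    over_clustering="leiden",
)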
adata = predictions.to_adata()
print(adata.obs[["predicted_labels", "majority_voting", "conf_score"]].head(10))
#   predicted_labels  majority_voting  conf_score
#   CD4+ T cells      CD4+ T cells     0.92
#   ...
```
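Once confidence scores are available, cells below the 0.5 cutoff mentioned under "When to Use" can be flagged for manual review. The sketch below uses plain Python lists so it runs without any dependencies; the helper name `flag_low_confidence` and the example scores are hypothetical.

```python
def flag_low_confidence(conf_scores, threshold=0.5):
    """Return one boolean per cell: True when the score is below the threshold."""
    return [score < threshold for score in conf_scores]


# Made-up confidence scores standing in for adata.obs["conf_score"]
conf_scores = [0.92, 0.41, 0.77, 0.12]
flags = flag_low_confidence(conf_scores)
n_flagged = sum(flags)
print(f"{n_flagged}/{len(flags)} cells flagged for review")
```

On a real AnnData the same idea is a single pandas comparison, for example `adata.obs["low_conf"] = adata.obs["conf_score"] < 0.5`, where `low_conf` is an arbitrary column name chosen for illustration.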
## Workflow
### Step 1: Installation and Model Setup
Install CellTypist and download pre-trained models. Models are cached locally after the first download.
```bash
pip install celltypist "scanpy[leiden]" anndata
```
```python
import celltypist
from celltypist import models
# Download all available models (only needed once; ~2 GB total)
models.download_models(force_update=False)
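# List available models; models_description() returns at least "model" and "description"
models_df = models.models_description()
# Select the extra metadata columns only if the installed version provides them
wanted = ["model", "description", "n_celltypes", "n_cells"]
print(models_df[[c for c in wanted if c in models_df.columns]].to_string())
# Illustrative output (excerpt; available columns depend on the installed version):
#   model                 description                                 n_celltypes  n_cells
#   Immune_All_Low.pkl    Pan-immune low-hierarchy (98 cell types)    98           324,320
#   Immune_All_High.pkl   Pan-immune high-hierarchy (30 cell types)   30           324,320
#   Human_Lung_Atlas.pkl  Lung cell types from Human Lung Atlas       61           584,944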
```
### Step 2: Data Preparation
CellTypist requires normalized, log1p-transformed counts in `adata.X`, so normalize before annotating. Keep a copy of the raw counts in a separate layer so they remain available for downstream steps.
```python
import scanpy as sc
# Load raw count matrix
adata = sc.read_h5ad("raw_counts.h5ad")
# Alternatively from 10X:
# adata = sc.read_10x_mtx("filtered_feature_bc_matrix/")
# adata.var_names_make_unique()
# Store raw counts before normalization
adata.layers["counts"] = adata.X.copy()
# Normalize to 10,000 UMIs per cell and log1p-transform
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
print(f"Prepared: {adata.n_obs} cells x {adata.n_vars} genes")
print(f"adata.X max: {adata.X.max():.3f} (cannot exceed log1p(1e4) ≈ 9.21 after this normalization)")
```
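A more direct sanity check is to undo the log1p transform and confirm that each cell sums to roughly 10,000. The sketch below shows the arithmetic on plain Python lists; the function name `looks_log1p_normalized`, the tolerance `tol`, and the toy values are assumptions for illustration, not part of CellTypist.

```python
import math


def looks_log1p_normalized(rows, target_sum=1e4, tol=1.0):
    """Check that undoing log1p gives about target_sum counts in every cell."""
    for row in rows:
        try:
            total = sum(math.expm1(value) for value in row)
        except OverflowError:
            # Values this large cannot be log1p-transformed data
            return False
        if abs(total - target_sum) > tol:
            return False
    return True


# Toy cells already scaled to 10,000 counts each, then log1p-transformed
scaled = [[2500.0, 7500.0, 0.0], [10000.0, 0.0, 0.0]]
log_cells = [[math.log1p(value) for value in row] for row in scaled]
# Toy raw counts that were never normalized
raw_cells = [[3.0, 1.0, 0.0], [0.0, 5.0, 2.0]]

ok_normalized = looks_log1p_normalized(log_cells)
ok_raw = looks_log1p_normalized(raw_cells)
print(ok_normalized, ok_raw)
```

On an AnnData object the equivalent check would apply `expm1` to `adata.X` and sum across genes for each cell.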
### Step 3: Model Selection
Choose the model that best matches your tissue type and desired annotation resolution.
```python
from celltypist import models
# Show the full model table
models_df = models.models_description()
# Filter to models whose description mentions "immune" (case-insensitive)
immune_models = models_df[models_df["description"].str.contains("immune", case=False)]
print(immune_models[["model", "description"]].to_string())
# Load a specific model to inspect its cell type labels
model = models.Model.load("Immune_All_Low.pkl")
print(f"Model cell types ({len(model.cell_types)}):")
print(model.cell_types[:20]) # first 20 labels
```
**Available models (key selection guide):**
| Model | Cell Types | Best For |
|-------|-----------|---------|
| `Immune_All_Low.pkl` | 98 | Pan-immune with fine subtypes (e.g., MAIT, Tfh, cDC1) |
| `Immune_All_High.pkl` | 30 | Pan-immune major lineages (T, B, NK, monocyte, DC) |
| `Human_Lung_Atlas.pkl` | 61 | Lung: alveolar, stromal, immune, endothelial |
| `Pan_Fetal_Human.pkl` | 139 | Fetal human multi-organ development |
| `Developing_Human_Brain.pkl` | 51 | Brain development: progenitors, neurons, glia |
| `Human_Colorectal_Cancer.pkl` | 62 | Colorectal cancer cells + tumor microenvironment |
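When several models in the table are plausible, one rough way to compare them is to annotate the same data with each and rank the models by mean per-cell confidence. The sketch below shows only the ranking step; the helper `rank_models_by_confidence` and the scores are hypothetical. Mean confidence is a heuristic rather than a CellTypist feature, since a mismatched model can still produce confident calls, so treat the ranking as a prompt for marker-gene checks.

```python
from statistics import mean


def rank_models_by_confidence(conf_by_model):
    """Sort model names by mean per-cell confidence, highest first."""
    summary = {name: mean(scores) for name, scores in conf_by_model.items()}
    return sorted(summary.items(), key=lambda item: item[1], reverse=True)


# Made-up per-cell confidence scores collected from two separate annotation runs
conf_by_model = {
    "Immune_All_Low.pkl": [0.9, 0.8, 0.7],
    "Human_Lung_Atlas.pkl": [0.4, 0.5, 0.3],
}
ranking = rank_models_by_confidence(conf_by_model)
for name, score in ranking:
    print(f"{name}: mean conf_score = {score:.2f}")
```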
### Step 4: Automated Annotation
Run `celltypist.annotate()` with `majority_voting=True` for cluster-level consensus labels alongside per-cell predictions.
```python
import celltypist
import scanpy as sc
# Ensure Leiden clusters exist for majority voting
# If
```