celltypist-cell-annotation

CellTypist automatically assigns cell type labels to individual cells in single-cell RNA-seq data using pre-trained logistic regression models built from curated reference atlases. With 45+ available models covering immune, organ-specific, and developmental contexts, it outputs per-cell predictions with confidence scores and optional cluster-level consensus labels via majority voting. Use it for rapid, reference-backed annotation of normalized scRNA-seq datasets when you need fast, standardized cell type classification without manual marker gene inspection.

Repository: SciAgent-Skills

Install in Claude Code

git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/celltypist-cell-annotation && cp -r /tmp/celltypist-cell-annotation/skills/genomics-bioinformatics/single-cell/celltypist-cell-annotation ~/.claude/skills/celltypist-cell-annotation

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# CellTypist Cell Type Annotation

## Overview

CellTypist is an automated cell type classifier for single-cell RNA-seq data built on logistic regression models trained on curated reference atlases. Given a normalized AnnData object, it predicts cell type labels at the single-cell level and optionally applies majority voting within user-defined clusters to produce consensus, biologically coherent annotations. The tool ships with 45+ ready-to-use models spanning pan-immune, organ-specific, and developmental contexts, and supports training custom models from labeled data.
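
As a rough illustration of what "logistic regression classifier with a confidence score" means, the sketch below scores toy cells against a small one-vs-rest weight matrix and reports the most probable label with its probability. The weights, intercepts, gene count, and the helper name `predict_with_confidence` are made up for this example; it is not CellTypist's internal code.

```python
import numpy as np

def predict_with_confidence(X, weights, intercepts, labels):
    """Return (labels, confidences) per cell from a one-vs-rest logistic model.

    X          : (n_cells, n_genes) log1p-normalized expression
    weights    : (n_types, n_genes) per-cell-type coefficients
    intercepts : (n_types,) per-cell-type intercepts
    """
    scores = X @ weights.T + intercepts      # linear decision scores
    probs = 1.0 / (1.0 + np.exp(-scores))    # sigmoid: one probability per type
    best = probs.argmax(axis=1)              # index of the most probable type
    return [labels[i] for i in best], probs.max(axis=1)

# Toy data (hypothetical): 2 cells, 3 genes, 2 cell types
X = np.array([[2.0, 0.0, 0.5],
              [0.0, 3.0, 0.1]])
weights = np.array([[1.5, -1.0, 0.0],
                    [-1.0, 1.2, 0.0]])
intercepts = np.array([-1.0, -1.0])
labels = ["T cell", "B cell"]

pred, conf = predict_with_confidence(X, weights, intercepts, labels)
print(pred, conf.round(2))
```

The label is the cell type with the highest probability, and that probability plays the role of the per-cell confidence score.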

## When to Use

- Annotating PBMC, whole-blood, lymph node, or other immune cell datasets using a single standardized reference model
- Generating a first-pass cell type annotation before manual curation with canonical marker genes
- Annotating cluster-level cell types in published or in-house datasets using majority voting to smooth noisy per-cell predictions
- Comparing annotation results across multiple tissue-specific models to determine the most biologically relevant reference
- Training a custom CellTypist model from a labeled reference dataset for a tissue or species not covered by pre-built models
- Quantifying annotation confidence to flag low-certainty cells (confidence score < 0.5) for manual review or exclusion
- Use **scVI/scANVI** (scvi-tools-single-cell) instead when you need probabilistic label transfer with batch correction and uncertainty quantification via a variational autoencoder
- Use **popV** (popv-cell-annotation) instead when you want ensemble consensus from 10+ methods including deep learning and KNN-based approaches
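
For the confidence-based review mentioned in the list above, a minimal sketch of flagging cells below the 0.5 cutoff might look like this. It uses a toy pandas DataFrame as a stand-in for `adata.obs`; the helper name `flag_low_confidence` and the example rows are hypothetical.

```python
import pandas as pd

def flag_low_confidence(obs, threshold=0.5, score_col="conf_score"):
    """Return a boolean Series that is True where the score is below threshold."""
    return obs[score_col] < threshold

# Toy stand-in for adata.obs after annotation (values are made up)
obs = pd.DataFrame(
    {
        "predicted_labels": ["CD4+ T cells", "B cells", "NK cells", "B cells"],
        "conf_score": [0.92, 0.31, 0.55, 0.48],
    },
    index=["cell_1", "cell_2", "cell_3", "cell_4"],
)

obs["low_confidence"] = flag_low_confidence(obs)
n_flagged = int(obs["low_confidence"].sum())
print(f"{n_flagged} of {len(obs)} cells flagged for review")
print(obs[obs["low_confidence"]])
```

On a real dataset the same function can be applied to `adata.obs` directly, and the flagged cells either reviewed by hand or excluded.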

## Prerequisites

- **Python packages**: `celltypist>=1.6`, `scanpy>=1.9`, `anndata`
- **Data requirements**: AnnData with counts in `adata.X` normalized to 10,000 per cell and then log1p-transformed. Raw counts must be normalized this way before calling CellTypist
- **Environment**: Python 3.8+; 8 GB RAM sufficient for most datasets; internet access required for model downloads (first run only)

```bash
pip install celltypist "scanpy[leiden]" anndata
```

## Quick Start

Minimal pipeline: annotate a preprocessed AnnData with the pan-immune model.

```python
import celltypist
import scanpy as sc

# Load a preprocessed AnnData (normalized + log1p, Leiden clusters already in adata.obs)
adata = sc.read_h5ad("preprocessed_pbmc.h5ad")

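# Run annotation with majority voting across the existing Leiden clusters.
# Assumes the clusters are stored in adata.obs["leiden"]; if over_clustering
# is omitted, CellTypist runs its own over-clustering instead.
predictions = celltypist.annotate(
    adata,
    model="Immune_All_Low.pkl",
    majority_voting=True,
    over_clustering="leiden",
)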
adata = predictions.to_adata()

print(adata.obs[["predicted_labels", "majority_voting", "conf_score"]].head(10))
# predicted_labels  majority_voting  conf_score
# CD4+ T cells      CD4+ T cells     0.92
# ...
```
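
The `majority_voting` column is a per-cluster consensus of the per-cell labels. A simplified sketch of that idea, using a plain pandas group-by on toy labels, is shown below. The helper `majority_vote` is hypothetical and leaves out details of CellTypist's own implementation.

```python
import pandas as pd

def majority_vote(labels, clusters):
    """Assign every cell the most frequent per-cell label in its cluster."""
    df = pd.DataFrame({"label": list(labels), "cluster": list(clusters)})
    # Most common label per cluster; on a tie, mode() sorts and we take the first
    winners = df.groupby("cluster")["label"].agg(lambda s: s.mode().iloc[0])
    return df["cluster"].map(winners).tolist()

# Toy per-cell predictions and cluster assignments (made up)
per_cell = ["CD4+ T cells", "CD4+ T cells", "B cells",
            "B cells", "B cells", "CD4+ T cells"]
clusters = ["0", "0", "0", "1", "1", "1"]

consensus = majority_vote(per_cell, clusters)
print(consensus)
```

Each cluster takes the label held by most of its cells, which smooths out isolated per-cell misclassifications.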

## Workflow

### Step 1: Installation and Model Setup

Install CellTypist and download pre-trained models. Models are cached locally after the first download.

```bash
pip install celltypist "scanpy[leiden]" anndata
```

```python
import celltypist
from celltypist import models

# Download all available models (only needed once; ~2 GB total)
models.download_models(force_update=False)

# List available models with their descriptions
# (models_description() returns the columns "model" and "description")
models_df = models.models_description()
print(models_df.to_string())
# Output: one row per model, e.g. Immune_All_Low.pkl, Immune_All_High.pkl,
# Human_Lung_Atlas.pkl, each with a short description
```

### Step 2: Data Preparation

CellTypist requires counts in `adata.X` that are normalized to 10,000 per cell and log1p-transformed, so run normalization before annotation. If you need the raw counts later, keep a copy in a separate layer before normalizing.

```python
import scanpy as sc

# Load raw count matrix
adata = sc.read_h5ad("raw_counts.h5ad")
# Alternatively from 10X:
# adata = sc.read_10x_mtx("filtered_feature_bc_matrix/")
# adata.var_names_make_unique()

# Store raw counts before normalization
adata.layers["counts"] = adata.X.copy()

# Normalize to 10,000 UMIs per cell and log1p-transform
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

print(f"Prepared: {adata.n_obs} cells x {adata.n_vars} genes")
print(f"adata.X max: {adata.X.max():.3f}  (cannot exceed log1p(1e4) ≈ 9.21 after this normalization)")
```
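
One way to confirm the matrix is in the expected format is to undo the log1p transform and check that every cell sums to roughly 10,000. The sketch below does this with NumPy on a toy dense matrix. The helper name `looks_log1p_normalized` and the tolerance are assumptions for illustration.

```python
import numpy as np

def looks_log1p_normalized(X, target_sum=1e4, tol=1.0):
    """Check that expm1(X) sums to about target_sum in every cell (row).

    X must be a dense 2-D array; for a sparse matrix pass X.toarray()
    (or a subset of rows to save memory).
    """
    totals = np.expm1(np.asarray(X, dtype=float)).sum(axis=1)
    return bool(np.all(np.abs(totals - target_sum) <= tol))

# Toy raw counts: 3 cells x 4 genes
counts = np.array([[10, 0, 5, 85],
                   [0, 3, 3, 0],
                   [1, 1, 1, 1]], dtype=float)

# Scale each cell to 10,000 total counts, then apply log1p
normalized = np.log1p(counts / counts.sum(axis=1, keepdims=True) * 1e4)

print(looks_log1p_normalized(normalized))  # True
print(looks_log1p_normalized(counts))      # False
```

Running this on a few rows of `adata.X` before annotation catches the common mistakes of passing raw counts or a matrix normalized to a different target sum.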

### Step 3: Model Selection

Choose the model that best matches your tissue type and desired annotation resolution.

```python
from celltypist import models

# Show the full model table (columns: "model", "description")
models_df = models.models_description()

# Filter to models whose description mentions "immune" (case-insensitive)
immune_models = models_df[models_df["description"].str.contains("immune", case=False)]
print(immune_models.to_string())

# Load a specific model to inspect its cell type labels
model = models.Model.load("Immune_All_Low.pkl")
print(f"Model cell types ({len(model.cell_types)}):")
print(model.cell_types[:20])  # first 20 labels
```

**Available models (key selection guide):**

| Model | Cell Types | Best For |
|-------|-----------|---------|
| `Immune_All_Low.pkl` | 98 | Pan-immune with fine subtypes (e.g., MAIT, Tfh, cDC1) |
| `Immune_All_High.pkl` | 30 | Pan-immune major lineages (T, B, NK, monocyte, DC) |
| `Human_Lung_Atlas.pkl` | 61 | Lung: alveolar, stromal, immune, endothelial |
| `Pan_Fetal_Human.pkl` | 139 | Fetal human multi-organ development |
| `Developing_Human_Brain.pkl` | 51 | Brain development: progenitors, neurons, glia |
| `Human_Colorectal_Cancer.pkl` | 62 | Colorectal cancer cells + tumor microenvironment |
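
When several models in the table look plausible, one heuristic is to annotate with each and compare the resulting confidence scores. The sketch below summarizes toy score lists per model. The helper `summarize_confidence` and the numbers are hypothetical, and higher confidence is only a hint of a better match, so it should be weighed against marker genes.

```python
import pandas as pd

def summarize_confidence(conf_by_model, threshold=0.5):
    """Summarize per-cell confidence scores for each candidate model."""
    rows = []
    for name, scores in conf_by_model.items():
        s = pd.Series(scores, dtype=float)
        rows.append({
            "model": name,
            "mean_conf": s.mean(),
            "frac_confident": (s >= threshold).mean(),  # share of cells at or above threshold
        })
    out = pd.DataFrame(rows).sort_values("mean_conf", ascending=False)
    return out.reset_index(drop=True)

# Toy scores; in practice, collect the conf_score column from each model's run
conf_by_model = {
    "Immune_All_Low.pkl": [0.9, 0.8, 0.4, 0.7],
    "Human_Lung_Atlas.pkl": [0.3, 0.6, 0.2, 0.5],
}

summary = summarize_confidence(conf_by_model)
print(summary)
```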

### Step 4: Automated Annotation

Run `celltypist.annotate()` with `majority_voting=True` for cluster-level consensus labels alongside per-cell predictions.

```python
import celltypist
import scanpy as sc

# Ensure Leiden clusters exist for majority voting
# If