scvi-tools-single-cell

scvi-tools is a PyTorch-based probabilistic modeling framework for single-cell genomics that implements variational autoencoders to learn low-dimensional cell representations while explicitly modeling batch effects, noise distributions, and multi-modal data. Use it for batch-corrected integration across scRNA-seq datasets, differential expression with uncertainty quantification, semi-supervised cell type annotation via transfer learning, joint RNA-protein modeling in CITE-seq data, or spatial transcriptomics deconvolution when you need statistically grounded outputs rather than simple linear corrections or clustering.

Repository: SciAgent-Skills

Install in Claude Code

git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/scvi-tools-single-cell && cp -r /tmp/scvi-tools-single-cell/skills/genomics-bioinformatics/single-cell/scvi-tools-single-cell ~/.claude/skills/scvi-tools-single-cell

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# scvi-tools — Single-Cell Deep Generative Models

## Overview

scvi-tools is a probabilistic modeling framework for single-cell genomics built on PyTorch. It implements variational autoencoders (VAEs) that learn low-dimensional latent representations of cells while explicitly modeling batch effects, count noise distributions, and multi-modal data. All models share a unified API: call `setup_anndata()` on the model class to register the data, instantiate the model, call `train()`, then extract latent representations, normalized expression, or differential expression results. Models operate on raw count data in AnnData format and return statistically grounded outputs with uncertainty estimates.

## When to Use

- Integrating multiple scRNA-seq batches or studies with probabilistic batch correction that preserves biological variation
- Performing differential expression with uncertainty quantification and composite hypotheses (not just fold-change thresholding)
- Annotating cell types via semi-supervised transfer learning from a partially-labeled reference (scANVI)
- Jointly modeling CITE-seq protein and RNA data to obtain denoised protein estimates and joint embeddings (totalVI)
- Adapting a pretrained model to a new query dataset without full retraining (scARCHES transfer learning)
- Deconvolving spatial transcriptomics spots into cell type proportions using a matched scRNA-seq reference (DestVI)
- Detecting doublets in scRNA-seq data as a QC preprocessing step (Solo)
- Use **harmony-batch-correction** instead when you need fast linear batch correction (seconds vs minutes) without deep learning overhead
- For **multi-modal MuData workflows** (joint RNA+ATAC Multiome, combined modality objects), use **muon** instead
- For **standard clustering and visualization** without batch effects or probabilistic DE, use **scanpy-scrna-seq**

## Prerequisites

- **Python packages**: `scvi-tools>=1.1`, `scanpy`, `anndata`
- **Data requirements**: AnnData (`.h5ad`) with **raw counts** — not log-normalized. Store counts in `adata.layers["counts"]` if `adata.X` has been normalized.
- **Hardware**: CPU works for <50k cells; GPU (8GB+ VRAM) recommended for larger datasets. Training time: 5–30 minutes on GPU for 100k cells.

```bash
pip install scvi-tools scanpy
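# GPU acceleration (recommended for >50k cells): scvi-tools uses the installed PyTorch build,
# so install a CUDA-enabled PyTorch first. Extras names vary by release; recent installation
# docs list "scvi-tools[cuda]", so verify against the guide for your version.
pip install "scvi-tools[cuda]"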
```
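
The raw-count requirement above is easy to violate when `adata.X` has already been normalized. As a minimal sketch, a heuristic check along these lines could be run before `setup_anndata()`. The helper name `looks_like_raw_counts` is hypothetical (it is not part of scvi-tools), and it assumes a dense NumPy array; for a sparse `adata.X`, passing its `.data` attribute should work the same way.

```python
import numpy as np


def looks_like_raw_counts(matrix, max_check=100_000):
    """Heuristic: raw counts are non-negative and integer-valued.

    Hypothetical helper, not part of scvi-tools. Only the first
    `max_check` values are inspected to keep the check cheap.
    """
    values = np.asarray(matrix, dtype=float).ravel()[:max_check]
    if values.size == 0:
        return False
    non_negative = bool(np.all(values >= 0))
    integer_valued = bool(np.all(values == np.round(values)))
    return non_negative and integer_valued


# Toy cells x genes matrix and a log-normalized copy of it
raw = np.array([[0, 3, 1], [5, 0, 2]])
normalized = np.log1p(raw / raw.sum(axis=1, keepdims=True) * 1e4)

print(looks_like_raw_counts(raw))         # True
print(looks_like_raw_counts(normalized))  # False
```

A check like this only catches obvious cases (log-transformed or scaled values); it cannot tell raw counts apart from other integer-valued data.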

## Quick Start

Minimal scVI batch integration on a built-in example dataset:

```python
import scvi
import scanpy as sc

# Load example data with batch labels
adata = scvi.data.heart_cell_atlas_subsampled()

# Preprocessing: filter genes, select HVGs (subset to save training time)
sc.pp.filter_genes(adata, min_counts=3)
adata.layers["counts"] = adata.X.copy()  # preserve raw counts
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000, batch_key="cell_source", subset=True)

# Register data, train, extract
scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="cell_source")
model = scvi.model.SCVI(adata, n_latent=30)
model.train(max_epochs=200, early_stopping=True)

adata.obsm["X_scVI"] = model.get_latent_representation()
sc.pp.neighbors(adata, use_rep="X_scVI")
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.5)
sc.pl.umap(adata, color=["cell_source", "leiden"])
print(f"Latent shape: {adata.obsm['X_scVI'].shape}")
# Latent shape: (n_cells, 30)
```

## Core API

### 1. Data Registration (setup_anndata)

All models share the same registration pattern. Call `setup_anndata()` on the model class before instantiation to tell scvi-tools where to find counts, batch labels, and covariates.

```python
import scvi

# Minimal: counts in a layer, batch column in obs
scvi.model.SCVI.setup_anndata(
    adata,
    layer="counts",        # Key in adata.layers with raw counts; None = adata.X
    batch_key="batch",     # Column in adata.obs for technical batch
)

# Extended: additional categorical and continuous covariates
scvi.model.SCVI.setup_anndata(
    adata,
    layer="counts",
    batch_key="batch",
    categorical_covariate_keys=["donor", "protocol"],   # Discrete nuisance covariates to correct for
    continuous_covariate_keys=["percent_mito", "log_n_counts"],  # Continuous nuisance covariates
)

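# Inspect what was registered. Recent scvi-tools releases (including the >=1.1
# required here) no longer store this under adata.uns["_scvi"]; read it from a model instance.
model = scvi.model.SCVI(adata)
print(model.summary_stats)
# Includes n_vars, n_cells, n_batch, n_extra_categorical_covs, n_extra_continuous_covs, ...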
```
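
The covariate keys passed to `setup_anndata()` must already exist as columns in `adata.obs`. The sketch below shows one way the `percent_mito` and `log_n_counts` columns used above might be derived from a dense count matrix with NumPy alone. The helper `qc_covariates`, the `MT-` gene prefix, and the use of `log1p` for `log_n_counts` are all assumptions made for illustration, not conventions defined by scvi-tools.

```python
import numpy as np


def qc_covariates(counts, gene_names, mito_prefix="MT-"):
    """Return (percent_mito, log_n_counts) per cell.

    Hypothetical helper: `counts` is a dense cells x genes matrix of raw counts,
    and mitochondrial genes are assumed to start with `mito_prefix`.
    """
    counts = np.asarray(counts, dtype=float)
    is_mito = np.array([name.upper().startswith(mito_prefix) for name in gene_names])
    n_counts = counts.sum(axis=1)
    mito_counts = counts[:, is_mito].sum(axis=1)
    # Cells with zero total counts get 0% instead of a division-by-zero result
    percent_mito = 100.0 * np.divide(
        mito_counts, n_counts, out=np.zeros_like(n_counts), where=n_counts > 0
    )
    log_n_counts = np.log1p(n_counts)
    return percent_mito, log_n_counts


counts = np.array([[8, 2, 0], [5, 0, 5]])
genes = ["ACTB", "MT-CO1", "MT-ND1"]
percent_mito, log_n_counts = qc_covariates(counts, genes)
print(percent_mito)  # [20. 50.]

# With an AnnData object, the columns would be stored before setup_anndata():
# adata.obs["percent_mito"] = percent_mito
# adata.obs["log_n_counts"] = log_n_counts
```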

### 2. scVI — Batch Correction and Integration

The core unsupervised model. Learns a batch-corrected latent space and a denoised expression layer. Starting point for any multi-batch scRNA-seq analysis.

```python
import scvi

model = scvi.model.SCVI(
    adata,
    n_latent=30,              # Latent space dimensions; 10–50 typical
    n_layers=2,               # Hidden layers in encoder/decoder; 1–3
    n_hidden=128,             # Neurons per hidden layer; 64–256
    gene_likelihood="zinb",   # "zinb" (zero-inflated NB), "nb", or "poisson"
    dispersion="gene",        # "gene" or "gene-batch" for batch-specific dispersion
)
model.train(max_epochs=200, early_stopping=True)

# Extract results
latent = model.get_latent_representation()        # ndarray (n_cells, n_latent)
normalized = model.get_normalized_expression(
    library_size=1e4, n_samples=25, return_mean=True
)
print(f"Latent: {latent.shape}")
print(f"Denoised expression: {normalized.shape}")
# Latent: (45000, 30)
# Denoised expression: (45000, 2000)

adata.obsm["X_scVI"] = latent
adata.layers["scvi_normalized"] = normalized
```
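
To make the `gene_likelihood` options more concrete, the sketch below evaluates the textbook negative binomial (NB) and zero-inflated negative binomial (ZINB) log-probabilities for a single count, using the mean and inverse-dispersion parameterization. It illustrates the distributions only and is not scvi-tools' internal implementation; the function names and the parameter names `mu`, `theta`, and `pi` are chosen here for illustration.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under NB with mean mu > 0 and inverse dispersion theta > 0."""
    log_theta_mu = math.log(theta + mu)
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * (math.log(theta) - log_theta_mu)
        + x * (math.log(mu) - log_theta_mu)
    )


def zinb_log_pmf(x, mu, theta, pi):
    """Log-probability under ZINB: with probability pi (0 <= pi < 1) the count is a forced zero."""
    nb = nb_log_pmf(x, mu, theta)
    if x == 0:
        # A zero can come from the dropout component or from the NB itself
        return math.log(pi + (1.0 - pi) * math.exp(nb))
    return math.log(1.0 - pi) + nb


# With mu=2 and theta=1, NB(0) = 1/3, so ZINB(0) = 0.25 + 0.75 * (1/3) = 0.5
print(round(math.exp(zinb_log_pmf(0, mu=2.0, theta=1.0, pi=0.25)), 6))  # 0.5
```

Setting `pi` to 0 recovers the plain NB, which is the relationship between the `"zinb"` and `"nb"` options at the level of the distributions.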

### 3. scANVI — Semi-Supervised Cell Annotation

Extends scVI with label supervision for cell type transfer learning. Accepts partially labeled data (unannotated cells labeled `"Unknown"`) and predicts cell types for unlabeled cells.

```python
import scvi

# Register with cell type labels; mark unannotated cells as "Unknown"
scvi.model.SCANVI.setup_anndata(
    adata,
    layer="counts",
    batch_key="batch",
    labels_key="cell_type",           # Column in adata.obs with known labels