scvi-tools-single-cell
Deep generative models for single-cell omics: probabilistic batch correction (scVI), semi-supervised annotation (scANVI), CITE-seq RNA+protein (totalVI), transfer learning (scARCHES), and DE with uncertainty. Unified setup→train→extract API on AnnData. Use harmony-batch-correction for fast linear correction without deep learning; muon for multi-modal MuData workflows.
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/scvi-tools-single-cell && cp -r /tmp/scvi-tools-single-cell/skills/genomics-bioinformatics/single-cell/scvi-tools-single-cell ~/.claude/skills/scvi-tools-single-cell
# scvi-tools — Single-Cell Deep Generative Models
## Overview
scvi-tools is a probabilistic modeling framework for single-cell genomics built on PyTorch. It implements variational autoencoders (VAEs) that learn low-dimensional latent representations of cells while explicitly modeling batch effects, count noise distributions, and multi-modal data. All models share a unified API: `setup_anndata()` to register data, instantiate the model, `train()`, then extract latent representations, normalized expression, or differential expression results. Models operate on raw count data in AnnData format and return statistically grounded outputs with uncertainty estimates.
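To make "count noise distributions" concrete, the sketch below writes out the zero-inflated negative binomial (ZINB), one of the gene likelihoods scVI supports (`gene_likelihood="zinb"`). It is a plain-Python illustration of the distribution only, not scvi-tools code; the function names `nb_log_pmf` and `zinb_pmf` and the parameter values are made up for this example.

```python
import math


def nb_log_pmf(k: int, mu: float, theta: float) -> float:
    """Log-probability of count k under a negative binomial with mean mu
    and inverse dispersion theta (variance = mu + mu**2 / theta)."""
    return (
        math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
        + theta * math.log(theta / (theta + mu))
        + k * math.log(mu / (theta + mu))
    )


def zinb_pmf(k: int, mu: float, theta: float, pi: float) -> float:
    """Zero-inflated NB: with probability pi the count is a structural zero,
    otherwise it is drawn from NB(mu, theta)."""
    nb = math.exp(nb_log_pmf(k, mu, theta))
    return pi * (k == 0) + (1.0 - pi) * nb


# Hypothetical parameters for one gene in one cell
mu, theta, pi = 5.0, 2.0, 0.3
probs = [zinb_pmf(k, mu, theta, pi) for k in range(2000)]
total = sum(probs)
mean = sum(k * p for k, p in enumerate(probs))
print(f"sum of pmf: {total:.4f}, mean count: {mean:.4f}")
```

The probabilities sum to one and the mean is `(1 - pi) * mu`, so zero inflation lowers the expected count without changing the NB component. In the model itself these parameters are produced per cell and gene by the decoder network rather than fixed by hand.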
## When to Use
- Integrating multiple scRNA-seq batches or studies with probabilistic batch correction that preserves biological variation
- Performing differential expression with uncertainty quantification and composite hypotheses (not just fold-change thresholding)
- Annotating cell types via semi-supervised transfer learning from a partially-labeled reference (scANVI)
- Jointly modeling CITE-seq protein and RNA data to obtain denoised protein estimates and joint embeddings (totalVI)
- Adapting a pretrained model to a new query dataset without full retraining (scARCHES transfer learning)
- Deconvolving spatial transcriptomics spots into cell type proportions using a matched scRNA-seq reference (DestVI)
- Detecting doublets in scRNA-seq data as a QC preprocessing step (Solo)
- Use **harmony-batch-correction** instead when you need fast linear batch correction (seconds vs minutes) without deep learning overhead
- For **multi-modal MuData workflows** (joint RNA+ATAC Multiome, combined modality objects), use **muon** instead
- For **standard clustering and visualization** without batch effects or probabilistic DE, use **scanpy-scrna-seq**
## Prerequisites
- **Python packages**: `scvi-tools>=1.1`, `scanpy`, `anndata`
- **Data requirements**: AnnData (`.h5ad`) with **raw counts** — not log-normalized. Store counts in `adata.layers["counts"]` if `adata.X` has been normalized.
- **Hardware**: CPU works for <50k cells; GPU (8GB+ VRAM) recommended for larger datasets. Training time: 5–30 minutes on GPU for 100k cells.
```bash
pip install scvi-tools scanpy
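# GPU acceleration (recommended for >50k cells): install a CUDA-enabled PyTorch
# build for your driver first (see the PyTorch install selector), then install
# scvi-tools as above. Extras such as "scvi-tools[cuda12]" are not available in
# every release; check the scvi-tools installation docs for your version.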
```
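Because the models expect raw counts, a quick sanity check before registration can catch a normalized matrix early. The helper below is a heuristic sketch, assuming NumPy is available; `looks_like_raw_counts` is a hypothetical name, not part of scvi-tools or scanpy.

```python
import numpy as np


def looks_like_raw_counts(matrix, n_check: int = 10_000) -> bool:
    """Heuristic: raw counts are non-negative and integer-valued.

    Accepts a dense array or a SciPy CSR/CSC sparse matrix. For sparse
    input only the stored non-zero values are inspected.
    """
    # CSR/CSC matrices expose `indptr`; their non-zero entries live in `.data`
    values = matrix.data if hasattr(matrix, "indptr") else np.asarray(matrix)
    values = np.ravel(values)[:n_check]  # checking a prefix keeps this cheap
    if values.size == 0:
        return True
    return bool(np.all(values >= 0) and np.all(values == np.round(values)))


counts = np.array([[0, 3, 1], [2, 0, 7]])
log_norm = np.log1p(counts / counts.sum(axis=1, keepdims=True) * 1e4)
print(looks_like_raw_counts(counts))    # True
print(looks_like_raw_counts(log_norm))  # False
```

In practice this would be called as `looks_like_raw_counts(adata.layers["counts"])` before `setup_anndata()`. Only the first `n_check` values are examined, so a pass is an indication, not a guarantee.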
## Quick Start
Minimal scVI batch integration on a built-in example dataset:
```python
import scvi
import scanpy as sc
# Load example data with batch labels
adata = scvi.data.heart_cell_atlas_subsampled()
# Preprocessing: filter genes, select HVGs (subset to save training time)
sc.pp.filter_genes(adata, min_counts=3)
adata.layers["counts"] = adata.X.copy() # preserve raw counts
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000, batch_key="cell_source", subset=True)
# Register data, train, extract
scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="cell_source")
model = scvi.model.SCVI(adata, n_latent=30)
model.train(max_epochs=200, early_stopping=True)
adata.obsm["X_scVI"] = model.get_latent_representation()
sc.pp.neighbors(adata, use_rep="X_scVI")
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.5)
sc.pl.umap(adata, color=["cell_source", "leiden"])
print(f"Latent shape: {adata.obsm['X_scVI'].shape}")
# Latent shape: (n_cells, 30)
```
## Core API
### 1. Data Registration (setup_anndata)
All models share the same registration pattern. Call `setup_anndata()` on the model class before instantiation to tell scvi-tools where to find counts, batch labels, and covariates.
```python
import scvi
# Minimal: counts in a layer, batch column in obs
scvi.model.SCVI.setup_anndata(
    adata,
    layer="counts",      # Key in adata.layers with raw counts; None = adata.X
    batch_key="batch",   # Column in adata.obs for technical batch
)
# Extended: additional categorical and continuous covariates
scvi.model.SCVI.setup_anndata(
    adata,
    layer="counts",
    batch_key="batch",
    categorical_covariate_keys=["donor", "protocol"],  # Discrete biological/technical vars
    continuous_covariate_keys=["percent_mito", "log_n_counts"],  # Continuous covariates
)
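# Inspect what was registered. Recent scvi-tools releases (including the >=1.1
# required here) keep this on the model rather than in adata.uns["_scvi"].
model = scvi.model.SCVI(adata)
print(model.summary_stats)
# Illustrative values: n_vars=2000, n_cells=45000, n_batch=6, n_extra_categorical_covs=2, ...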
```
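The extended registration above assumes that `percent_mito` and `log_n_counts` already exist as columns in `adata.obs`. A minimal NumPy sketch of how such covariates could be derived from a dense cells x genes count matrix follows; `qc_covariates` is a hypothetical helper, and the `"MT-"` prefix is an assumption that holds for human and mouse gene symbols.

```python
import numpy as np


def qc_covariates(counts, gene_names, mito_prefix: str = "MT-"):
    """Return (percent_mito, log_n_counts) per cell from a dense
    cells x genes raw-count matrix."""
    counts = np.asarray(counts, dtype=float)
    # Case-insensitive match so mouse symbols like "mt-Co1" are also caught
    is_mito = np.array([g.upper().startswith(mito_prefix) for g in gene_names])
    n_counts = counts.sum(axis=1)
    mito_counts = counts[:, is_mito].sum(axis=1)
    # Guard against empty cells to avoid division by zero
    percent_mito = 100.0 * mito_counts / np.maximum(n_counts, 1.0)
    log_n_counts = np.log1p(n_counts)
    return percent_mito, log_n_counts


genes = ["MT-CO1", "ACTB", "MT-ND1", "GAPDH"]
counts = np.array([[10, 30, 10, 50],
                   [0, 20, 0, 20]])
pct, log_n = qc_covariates(counts, genes)
print(pct.tolist())  # [20.0, 0.0]
```

The two arrays can then be assigned to `adata.obs["percent_mito"]` and `adata.obs["log_n_counts"]` before calling `setup_anndata()`. scanpy's `sc.pp.calculate_qc_metrics` computes comparable metrics under its own column names and is likely the more convenient choice on sparse data.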
### 2. scVI — Batch Correction and Integration
The core unsupervised model. Learns a batch-corrected latent space and a denoised expression layer. Starting point for any multi-batch scRNA-seq analysis.
```python
import scvi
model = scvi.model.SCVI(
    adata,
    n_latent=30,             # Latent space dimensions; 10–50 typical
    n_layers=2,              # Hidden layers in encoder/decoder; 1–3
    n_hidden=128,            # Neurons per hidden layer; 64–256
    gene_likelihood="zinb",  # "zinb" (zero-inflated NB), "nb", or "poisson"
    dispersion="gene",       # "gene" or "gene-batch" for batch-specific dispersion
)
model.train(max_epochs=200, early_stopping=True)
# Extract results
latent = model.get_latent_representation() # ndarray (n_cells, n_latent)
normalized = model.get_normalized_expression(
    library_size=1e4, n_samples=25, return_mean=True,
    return_numpy=True,  # ndarray rather than DataFrame, so it stores cleanly in adata.layers
)
print(f"Latent: {latent.shape}")
print(f"Denoised expression: {normalized.shape}")
# Latent: (45000, 30)
# Denoised expression: (45000, 2000)
adata.obsm["X_scVI"] = latent
adata.layers["scvi_normalized"] = normalized
```
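What `library_size=1e4`, `n_samples=25`, and `return_mean=True` mean can be illustrated with a toy NumPy computation. This is a conceptual sketch, not scvi-tools internals: random logits stand in for the trained decoder's outputs, and the array sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_cells, n_genes = 25, 4, 6

# Stand-in for decoder outputs across posterior samples (random here;
# in scVI these come from the trained decoder network).
logits = rng.normal(size=(n_samples, n_cells, n_genes))

# Softmax over genes: each cell's expression becomes proportions summing to 1
exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
proportions = exp / exp.sum(axis=-1, keepdims=True)

# Scale to a common library size, then average over posterior samples
library_size = 1e4
normalized = (proportions * library_size).mean(axis=0)

print(normalized.shape)        # (4, 6)
print(normalized.sum(axis=1))  # each cell sums to approximately 10000
```

Because every cell is scaled to the same total, the resulting values are comparable across cells and batches regardless of sequencing depth, and averaging over samples reduces the Monte Carlo noise of any single posterior draw.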
### 3. scANVI — Semi-Supervised Cell Annotation
Extends scVI with label supervision for cell type transfer learning. Accepts partially labeled data (unannotated cells labeled `"Unknown"`) and predicts cell types for unlabeled cells.
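Before registration, every unannotated cell needs the placeholder label in the label column. The helper below is a minimal sketch of that preparation step in plain Python; `mask_labels` is a hypothetical name, and treating `None`, NaN, and empty strings as "missing" is an assumption about how the annotations were stored.

```python
import math


def mask_labels(labels, unlabeled: str = "Unknown"):
    """Replace missing annotations (None, NaN, empty string) with the
    placeholder category used for unlabeled cells."""
    cleaned = []
    for label in labels:
        missing = (
            label is None
            or (isinstance(label, float) and math.isnan(label))
            or (isinstance(label, str) and label.strip() == "")
        )
        cleaned.append(unlabeled if missing else str(label))
    return cleaned


raw = ["T cell", None, "B cell", float("nan"), ""]
cleaned = mask_labels(raw)
print(cleaned)  # ['T cell', 'Unknown', 'B cell', 'Unknown', 'Unknown']
```

Applied to an AnnData object this would look like `adata.obs["cell_type"] = mask_labels(adata.obs["cell_type"])`, after which the column can be registered as shown next.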
```python
import scvi
# Register with cell type labels; mark unannotated cells as "Unknown"
scvi.model.SCANVI.setup_anndata(
    adata,
    layer="counts",
    batch_key="batch",
    labels_key="cell_type",        # Column in adata.obs with known labels
    unlabeled_category="Unknown",  # Value in labels_key that marks unannotated cells
)
```