Skill · 199 repo stars · updated 16d ago

scvi-tools-single-cell

Deep generative models for single-cell omics: probabilistic batch correction (scVI), semi-supervised annotation (scANVI), CITE-seq RNA+protein (totalVI), transfer learning (scArches), and DE with uncertainty. Unified setup→train→extract API on AnnData. Use harmony-batch-correction for fast linear correction without deep learning; muon for multi-modal MuData workflows.

Install in Claude Code
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/scvi-tools-single-cell && cp -r /tmp/scvi-tools-single-cell/skills/genomics-bioinformatics/single-cell/scvi-tools-single-cell ~/.claude/skills/scvi-tools-single-cell
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# scvi-tools — Single-Cell Deep Generative Models

## Overview

scvi-tools is a probabilistic modeling framework for single-cell genomics built on PyTorch. It implements variational autoencoders (VAEs) that learn low-dimensional latent representations of cells while explicitly modeling batch effects, count noise distributions, and multi-modal data. All models share a unified API: `setup_anndata()` to register data, instantiate the model, `train()`, then extract latent representations, normalized expression, or differential expression results. Models operate on raw count data in AnnData format and return statistically grounded outputs with uncertainty estimates.

## When to Use

- Integrating multiple scRNA-seq batches or studies with probabilistic batch correction that preserves biological variation
- Performing differential expression with uncertainty quantification and composite hypotheses (not just fold-change thresholding)
- Annotating cell types via semi-supervised transfer learning from a partially-labeled reference (scANVI)
- Jointly modeling CITE-seq protein and RNA data to obtain denoised protein estimates and joint embeddings (totalVI)
- Adapting a pretrained reference model to a new query dataset without full retraining (scArches transfer learning)
- Deconvolving spatial transcriptomics spots into cell type proportions using a matched scRNA-seq reference (DestVI)
- Detecting doublets in scRNA-seq data as a QC preprocessing step (Solo)
- Use **harmony-batch-correction** instead when you need fast linear batch correction (seconds vs minutes) without deep learning overhead
- For **multi-modal MuData workflows** (joint RNA+ATAC Multiome, combined modality objects), use **muon** instead
- For **standard clustering and visualization** without batch effects or probabilistic DE, use **scanpy-scrna-seq**

## Prerequisites

- **Python packages**: `scvi-tools>=1.1`, `scanpy`, `anndata`
- **Data requirements**: AnnData (`.h5ad`) with **raw counts** — not log-normalized. Store counts in `adata.layers["counts"]` if `adata.X` has been normalized.
- **Hardware**: CPU works for <50k cells; GPU (8GB+ VRAM) recommended for larger datasets. Training time: 5–30 minutes on GPU for 100k cells.

```bash
pip install scvi-tools scanpy
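# GPU acceleration (recommended for >50k cells). Extras names depend on the release:
# recent scvi-tools versions document a "cuda" extra; on older versions, install a
# CUDA-enabled PyTorch build first and then scvi-tools.
pip install "scvi-tools[cuda]"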
```
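
Because every model here expects raw counts, it can help to sanity-check the matrix before calling `setup_anndata()`. The sketch below is a heuristic only. The helper name `looks_like_raw_counts` and its `n_check` parameter are hypothetical (not part of scvi-tools); the function treats a matrix as raw counts when the sampled values are non-negative and integer-valued.

```python
import numpy as np


def looks_like_raw_counts(X, n_check=10_000):
    """Heuristic check (hypothetical helper): are the values non-negative integers?"""
    # scipy.sparse matrices expose `nnz` and keep stored entries in `.data`;
    # implicit zeros are valid counts, so only stored entries need checking.
    if hasattr(X, "nnz"):
        values = np.asarray(X.data)
    else:
        values = np.asarray(X).ravel()

    values = values[:n_check]  # inspect a bounded number of entries
    if values.size == 0:
        return True

    non_negative = np.all(values >= 0)
    integer_valued = np.all(np.mod(values, 1) == 0)
    return bool(non_negative and integer_valued)


raw = np.array([[0, 3], [5, 1]])
print(looks_like_raw_counts(raw))
print(looks_like_raw_counts(np.log1p(raw.astype(float))))
```

For an AnnData object, one might call `looks_like_raw_counts(adata.layers["counts"])` before registration; a `False` result usually means the layer has been normalized or log-transformed.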

## Quick Start

Minimal scVI batch integration on a built-in example dataset:

```python
import scvi
import scanpy as sc

# Load example data with batch labels
adata = scvi.data.heart_cell_atlas_subsampled()

# Preprocessing: filter genes, select HVGs (subset to save training time)
sc.pp.filter_genes(adata, min_counts=3)
adata.layers["counts"] = adata.X.copy()  # preserve raw counts
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000, batch_key="cell_source", subset=True)

# Register data, train, extract
scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="cell_source")
model = scvi.model.SCVI(adata, n_latent=30)
model.train(max_epochs=200, early_stopping=True)

adata.obsm["X_scVI"] = model.get_latent_representation()
sc.pp.neighbors(adata, use_rep="X_scVI")
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.5)
sc.pl.umap(adata, color=["cell_source", "leiden"])
print(f"Latent shape: {adata.obsm['X_scVI'].shape}")
# Latent shape: (n_cells, 30), e.g. (18641, 30) for this subsampled atlas
```

## Core API

### 1. Data Registration (setup_anndata)

All models share the same registration pattern. Call `setup_anndata()` on the model class before instantiation to tell scvi-tools where to find counts, batch labels, and covariates.

```python
import scvi

# Minimal: counts in a layer, batch column in obs
scvi.model.SCVI.setup_anndata(
    adata,
    layer="counts",        # Key in adata.layers with raw counts; None = adata.X
    batch_key="batch",     # Column in adata.obs for technical batch
)

# Extended: additional categorical and continuous covariates
scvi.model.SCVI.setup_anndata(
    adata,
    layer="counts",
    batch_key="batch",
    categorical_covariate_keys=["donor", "protocol"],   # Discrete biological/technical vars
    continuous_covariate_keys=["percent_mito", "log_n_counts"],  # Continuous covariates
)

# Inspect the registered summary (exposed on a model instance in scvi-tools 1.x,
# not under adata.uns["_scvi"])
model = scvi.model.SCVI(adata)
print(model.summary_stats)
# Includes n_vars, n_cells, n_batch, n_extra_categorical_covs, ...
```
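
The extended registration above assumes `adata.obs` already holds `percent_mito` and `log_n_counts`. A minimal sketch of how such covariates could be derived from a raw count matrix follows. The helper `qc_covariates`, the `MT-` prefix for mitochondrial genes (human gene symbols), and the choice of `log1p` of total counts for `log_n_counts` are all assumptions for illustration, not definitions taken from scvi-tools.

```python
import numpy as np


def qc_covariates(counts, gene_names, mito_prefix="MT-"):
    """Per-cell QC covariates from a dense (n_cells, n_genes) raw count matrix.

    Hypothetical helper: the names and definitions here are illustrative assumptions.
    """
    counts = np.asarray(counts, dtype=float)
    is_mito = np.array([name.startswith(mito_prefix) for name in gene_names])

    total = counts.sum(axis=1)             # library size per cell
    mito = counts[:, is_mito].sum(axis=1)  # mitochondrial counts per cell

    # Cells with zero total counts get 0% instead of a division-by-zero NaN
    percent_mito = np.divide(
        100.0 * mito, total, out=np.zeros_like(total), where=total > 0
    )
    log_n_counts = np.log1p(total)
    return {"percent_mito": percent_mito, "log_n_counts": log_n_counts}


cov = qc_covariates([[90, 10], [5, 5]], ["ACTB", "MT-CO1"])
print(cov["percent_mito"], cov["log_n_counts"])
```

With an AnnData object, the returned arrays could be stored as `adata.obs["percent_mito"]` and `adata.obs["log_n_counts"]` before calling `setup_anndata()`, using the raw counts layer and `adata.var_names` as inputs (convert a sparse layer to dense or adapt the sums accordingly).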

### 2. scVI — Batch Correction and Integration

The core unsupervised model. Learns a batch-corrected latent space and a denoised expression layer. Starting point for any multi-batch scRNA-seq analysis.

```python
import scvi

model = scvi.model.SCVI(
    adata,
    n_latent=30,              # Latent space dimensions; 10–50 typical
    n_layers=2,               # Hidden layers in encoder/decoder; 1–3
    n_hidden=128,             # Neurons per hidden layer; 64–256
    gene_likelihood="zinb",   # "zinb" (zero-inflated NB), "nb", or "poisson"
    dispersion="gene",        # "gene" or "gene-batch" for batch-specific dispersion
)
model.train(max_epochs=200, early_stopping=True)

# Extract results
latent = model.get_latent_representation()        # ndarray (n_cells, n_latent)
normalized = model.get_normalized_expression(
    library_size=1e4, n_samples=25, return_mean=True
)
print(f"Latent: {latent.shape}")
print(f"Denoised expression: {normalized.shape}")
# Latent: (45000, 30)
# Denoised expression: (45000, 2000)

adata.obsm["X_scVI"] = latent
adata.layers["scvi_normalized"] = normalized
```
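
To make the `gene_likelihood` options more concrete, here is a small sketch of what zero inflation adds on top of a negative binomial. It uses the standard mean (`mu`) and inverse-dispersion (`theta`) parameterization with a dropout probability `pi`. The function names are hypothetical, and this is plain Python for illustration, not the tensor-based implementation scvi-tools uses internally.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under NB with mean mu and inverse dispersion theta."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_pmf(x, mu, theta, pi):
    """ZINB probability: with probability pi emit a structural zero, else draw from NB."""
    nb = math.exp(nb_log_pmf(x, mu, theta))
    if x == 0:
        return pi + (1.0 - pi) * nb
    return (1.0 - pi) * nb


# Compare the probability of observing a zero with and without zero inflation
mu, theta = 2.0, 1.5
print("NB   P(x=0):", zinb_pmf(0, mu, theta, pi=0.0))
print("ZINB P(x=0):", zinb_pmf(0, mu, theta, pi=0.3))
```

Setting `pi=0` recovers the plain negative binomial, which is one way to think about the difference between `"zinb"` and `"nb"`: the former carries an extra per-gene mechanism for excess zeros.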

### 3. scANVI — Semi-Supervised Cell Annotation

Extends scVI with label supervision for cell type transfer learning. Accepts partially labeled data (unannotated cells labeled `"Unknown"`) and predicts cell types for unlabeled cells.

```python
import scvi

# Register with cell type labels; mark unannotated cells as "Unknown"
scvi.model.SCANVI.setup_anndata(
    adata,
    layer="counts",
    batch_key="batch",
    labels_key="cell_type",           # Column in adata.obs with known labels
    unlabeled_category="Unknown",     # Value in labels_key that marks unannotated cells
)
```