popv-cell-annotation
Consensus cell type annotation: runs 10+ algorithms (KNN-Harmony/BBKNN/Scanorama/scVI, CellTypist, ONCLASS, Random Forest, SCANVI, SVM, XGBoost) on a labeled reference and transfers labels via majority voting. Outputs per-method labels, consensus, agreement score. Use when single-method annotation is insufficient or you need ensemble uncertainty for novel states.
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/popv-cell-annotation && cp -r /tmp/popv-cell-annotation/skills/genomics-bioinformatics/single-cell/popv-cell-annotation ~/.claude/skills/popv-cell-annotation
# popV Multi-Method Cell Type Transfer
## Overview
popV (Population Voting for single-cell annotation) annotates a query scRNA-seq dataset by running 10+ independent classification algorithms against a labeled reference atlas and aggregating results via majority voting. Each method produces its own label; the final `popv_prediction` is the consensus across all methods, and the `popv_agreement` score quantifies how many methods agree. This ensemble strategy is robust to individual method failures on unusual datasets and provides a principled uncertainty estimate: low agreement highlights novel cell states or annotation gaps.
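The voting step described above can be illustrated with a small, library-free sketch. This is not popV's implementation: the function name `majority_vote`, the column names, and the toy labels are all made up for illustration, and the agreement value is assumed to be a simple count of methods that voted for the winning label.

```python
from collections import Counter

import pandas as pd


def majority_vote(per_method: pd.DataFrame) -> pd.DataFrame:
    """Return the most common label per cell and how many methods voted for it."""
    predictions, agreements = [], []
    for _, row in per_method.iterrows():
        # most_common(1) returns [(label, count)]; ties go to the label seen first
        label, count = Counter(row).most_common(1)[0]
        predictions.append(label)
        agreements.append(count)
    return pd.DataFrame(
        {"consensus": predictions, "agreement": agreements},
        index=per_method.index,
    )


# Toy per-method predictions for three cells (made-up labels, not popV output)
votes = pd.DataFrame(
    {
        "method_a": ["T cell", "B cell", "NK cell"],
        "method_b": ["T cell", "B cell", "T cell"],
        "method_c": ["T cell", "monocyte", "monocyte"],
    },
    index=["cell_1", "cell_2", "cell_3"],
)
result = majority_vote(votes)
print(result)
```

In this toy example, `cell_1` gets unanimous agreement, while `cell_3` has no two methods agreeing, which is the kind of cell the agreement score is meant to surface.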
## When to Use
- Annotating a query dataset by transferring labels from a well-curated reference atlas when you want a consensus rather than a single model's judgment
- Identifying novel or ambiguous cell states as cells where methods disagree (low `popv_agreement` score)
- Benchmarking annotation reliability by comparing per-method labels to detect systematic disagreements
- Annotating large atlas datasets (>100k cells) where batch effects between reference and query are substantial
- Producing annotation for downstream analyses that require high-confidence labels (clinical data, regulatory submissions)
- Use **CellTypist** (celltypist-cell-annotation) instead when speed matters and a pre-trained model matches your tissue; popV is slower because it trains multiple models on your reference
- Use **scANVI** (scvi-tools-single-cell) instead when you need a single probabilistic deep generative model with formal uncertainty quantification and do not require the ensemble
## Prerequisites
- **Python packages**: `popv>=0.6`, `scanpy>=1.9`, `anndata`, `scvi-tools>=1.0`, `harmonypy`, `bbknn`, `celltypist`
- **Data requirements**: Two AnnData objects: a labeled reference (`adata_ref`) with cell type labels in `obs`, and an unlabeled query (`adata_query`). Both must come from the same species, have overlapping gene sets, and store raw counts in `adata.X`, because popV applies its own normalization internally
- **Environment**: Python 3.9+; GPU recommended for scVI/SCANVI methods (falls back to CPU); 32 GB RAM recommended for >200k reference cells
```bash
pip install popv scvi-tools harmonypy bbknn celltypist
```
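Because the data requirement above calls for raw counts in `adata.X`, a quick sanity check before preprocessing can catch matrices that were already normalized or log-transformed. The helper below is a heuristic sketch, not part of popV: `looks_like_raw_counts` is a hypothetical name, and it only inspects the first rows of a dense or CSR/CSC matrix.

```python
import numpy as np
from scipy import sparse


def looks_like_raw_counts(X, n_rows: int = 200) -> bool:
    """Heuristic: True if the first n_rows of X hold only non-negative integers."""
    block = X[:n_rows]
    # For sparse matrices, checking the stored values is enough (implicit zeros pass)
    values = block.data if sparse.issparse(block) else np.asarray(block).ravel()
    if values.size == 0:
        return True
    return bool(np.all(values >= 0) and np.all(np.mod(values, 1) == 0))


# Toy matrices: integer counts vs. a log1p-transformed copy
counts = sparse.csr_matrix(np.array([[0, 3, 1], [2, 0, 0]], dtype=np.float32))
normalized = np.log1p(counts.toarray())

raw_ok = looks_like_raw_counts(counts)
norm_ok = looks_like_raw_counts(normalized)
print(raw_ok, norm_ok)
```

On real data the same call would be `looks_like_raw_counts(adata_ref.X)` and `looks_like_raw_counts(adata_query.X)`. A `False` result suggests looking for the counts in a layer or in `adata.raw` before running popV.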
## Quick Start
Minimal pipeline from labeled reference and unlabeled query to annotated result:
```python
import popv
import scanpy as sc
# Load reference (labeled) and query (unlabeled) AnnData objects
adata_ref = sc.read_h5ad("reference_atlas.h5ad") # adata_ref.obs["cell_type"] exists
adata_query = sc.read_h5ad("query_dataset.h5ad")
# Prepare combined object with popV preprocessing
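# popV's documented positional order is (query, reference), so pass both by keyword
adata = popv.preprocessing.Process_Query(
    query_adata=adata_query,
    ref_adata=adata_ref,
    ref_labels_key="cell_type",
    ref_batch_key="batch",
    query_batch_key="batch",
    unknown_celltype_label="unknown",
    save_path_trained_models="./popv_models/",
    n_epochs_unsupervised=50,
).adata  # Process_Query is a class; the combined AnnData is its .adata attribute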
# Run all annotation methods
popv.annotation.annotate_data(adata)
# Inspect consensus results for query cells
query_mask = adata.obs["_dataset"] == "query"
print(adata[query_mask].obs[["popv_prediction", "popv_agreement"]].head(10))
```
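Once the consensus columns exist, a common follow-up is to summarize agreement per predicted type and flag low-agreement query cells for manual review. The sketch below uses a toy DataFrame in place of `adata.obs`, so it runs without popV. It assumes `popv_agreement` is a count of agreeing methods, as the Overview describes, and `MIN_AGREEMENT` is a hypothetical cutoff, not a popV default.

```python
import pandas as pd

# Toy stand-in for adata.obs after annotation (values are made up)
obs = pd.DataFrame(
    {
        "_dataset": ["ref", "query", "query", "query", "query"],
        "popv_prediction": ["T cell", "T cell", "T cell", "B cell", "B cell"],
        "popv_agreement": [8, 8, 7, 3, 6],
    }
)

MIN_AGREEMENT = 5  # hypothetical cutoff; tune per dataset and number of methods

# Restrict to query cells, then flag those where few methods agree
query_obs = obs[obs["_dataset"] == "query"]
low_confidence = query_obs[query_obs["popv_agreement"] < MIN_AGREEMENT]

# Per-type summary: mean and minimum agreement, plus the number of cells
summary = query_obs.groupby("popv_prediction")["popv_agreement"].agg(
    ["mean", "min", "count"]
)
print(summary)
print(f"{len(low_confidence)} of {len(query_obs)} query cells fall below the cutoff")
```

With a real result, replacing `obs` with `adata.obs` should give the same summary, provided the column names match those produced by the installed popV version.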
## Core API
### Module 1: Reference and Query Data Setup
Both AnnData objects must share a gene space and have required metadata columns. popV will subset to the intersection of genes automatically.
```python
import anndata as ad
import scanpy as sc
import numpy as np
# Reference: must have cell type labels and (optionally) batch metadata
adata_ref = sc.read_h5ad("reference_atlas.h5ad")
print(f"Reference: {adata_ref.n_obs} cells x {adata_ref.n_vars} genes")
print(f"Cell types: {adata_ref.obs['cell_type'].nunique()} unique labels")
print(f"Reference cell type counts:\n{adata_ref.obs['cell_type'].value_counts().head(10)}")
# Query: no labels required; batch metadata optional
adata_query = sc.read_h5ad("query_dataset.h5ad")
print(f"\nQuery: {adata_query.n_obs} cells x {adata_query.n_vars} genes")
# Check gene overlap (popV will handle subsetting but >70% overlap is recommended)
shared_genes = adata_ref.var_names.intersection(adata_query.var_names)
pct_shared = len(shared_genes) / adata_ref.n_vars
print(f"\nShared genes: {len(shared_genes)} ({pct_shared:.1%} of reference genes)")
if pct_shared < 0.5:
    print("WARNING: <50% gene overlap; annotation quality may be reduced")
```
```python
# Verify required fields before popV setup
assert "cell_type" in adata_ref.obs.columns, "Reference needs cell type labels"
# Add batch column if absent (popV requires it even for single-batch data)
if "batch" not in adata_ref.obs.columns:
    adata_ref.obs["batch"] = "ref_batch"
if "batch" not in adata_query.obs.columns:
    adata_query.obs["batch"] = "query_batch"
print("Reference obs columns:", adata_ref.obs.columns.tolist())
print("Query obs columns: ", adata_query.obs.columns.tolist())
```
### Module 2: POPV Object Creation (Process_Query)
`Process_Query` combines reference and query, normalizes counts, selects HVGs, and prepares the joint embedding needed by all annotation methods.
```python
import popv
# Create processed combined AnnData
# popV's documented positional order is (query, reference), so pass both by keyword
adata = popv.preprocessing.Process_Query(
    query_adata=adata_query,
    ref_adata=adata_ref,
    ref_labels_key="cell_type",        # obs column with reference labels
    ref_batch_key="batch",             # obs column with reference batch info
    query_batch_key="batch",           # obs column with query batch info
    unknown_celltype_label="unknown",  # label to use for query cells before annotation
    save_path_trained_models="./popv_models/",  # directory for scVI/SCANVI model checkpoints
    n_epochs_unsupervised=50,          # scVI training epochs (increase to 100–200 for large datasets)
    n_epochs_semisupervised=20,        # scANVI fine-tuning epochs
    use_gpu=True,                      # GPU for scVI/SCANVI (falls back to CPU if unavailable)
    hvg=4000,                          # number of highly variable genes to use
).adata  # Process_Query is a class; the combined AnnData is its .adata attribute
print(f"Combined object: {adata.n_obs} cells")
```