popv-cell-annotation

popv-cell-annotation runs an ensemble of 10+ classification algorithms (KNN-based methods, CellTypist, OnClass, Random Forest, scANVI, SVM, XGBoost) against a labeled reference dataset and transfers cell type labels to a query dataset by majority voting. Use this skill when single-method annotation is insufficient, when you need robust consensus predictions across novel or ambiguous cell states, or when you need uncertainty quantification through per-method agreement scores for high-confidence downstream analyses.

Repository: SciAgent-Skills

Install in Claude Code

git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/popv-cell-annotation && cp -r /tmp/popv-cell-annotation/skills/genomics-bioinformatics/single-cell/popv-cell-annotation ~/.claude/skills/popv-cell-annotation

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# popV Multi-Method Cell Type Transfer

## Overview

popV (Population Voting for single-cell annotation) annotates a query scRNA-seq dataset by running 10+ independent classification algorithms against a labeled reference atlas and aggregating results via majority voting. Each method produces its own label; the final `popv_prediction` is the consensus across all methods, and the `popv_agreement` score quantifies how many methods agree. This ensemble strategy is robust to individual method failures on unusual datasets and provides a principled uncertainty estimate: low agreement highlights novel cell states or annotation gaps.
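
The voting step described above can be sketched in a few lines. This is a minimal illustration of majority voting over per-method labels, not popV's internal implementation; the function name `consensus_vote`, the method names, and the tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter


def consensus_vote(method_labels):
    """Return (consensus label, number of methods that voted for it) for one cell.

    method_labels maps a method name to the label it predicted.
    Ties go to the label seen first in the input, an illustrative choice only.
    """
    if not method_labels:
        raise ValueError("need at least one method prediction")
    counts = Counter(method_labels.values())
    label, votes = counts.most_common(1)[0]
    return label, votes


# Hypothetical per-method predictions for a single query cell
cell_predictions = {
    "knn": "T cell",
    "celltypist": "T cell",
    "random_forest": "T cell",
    "svm": "NK cell",
    "xgboost": "T cell",
}

label, n_agree = consensus_vote(cell_predictions)
print(f"{label}: {n_agree} of {len(cell_predictions)} methods agree")
```

A cell where only a small share of the methods back the winning label is the kind of low-agreement case the overview flags as a possible novel state or annotation gap.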

## When to Use

- Annotating a query dataset by transferring labels from a well-curated reference atlas when you want a consensus rather than a single model's judgment
- Identifying novel or ambiguous cell states as cells where methods disagree (low `popv_agreement` score)
- Benchmarking annotation reliability by comparing per-method labels to detect systematic disagreements
- Annotating large atlas datasets (>100k cells) where batch effects between reference and query are substantial
- Producing annotations for downstream analyses that require high-confidence labels (clinical data, regulatory submissions)
- Use **CellTypist** (celltypist-cell-annotation) instead when speed matters and a pre-trained model matches your tissue; popV is slower because it trains multiple models on your reference
- Use **scANVI** (scvi-tools-single-cell) instead when you need a single probabilistic deep generative model with formal uncertainty quantification and do not require the ensemble
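
For the benchmarking use case in the list above, one simple way to spot systematic disagreement is to count how often each pair of methods assigns different labels. The sketch below assumes the per-method labels have already been collected into one dictionary per cell; the helper name `pairwise_disagreement` and the method names are hypothetical and not part of popV.

```python
from collections import Counter
from itertools import combinations


def pairwise_disagreement(per_cell_labels):
    """Fraction of cells on which each pair of methods assigns different labels.

    per_cell_labels is a list of dicts, one per cell, mapping method name -> label.
    All cells are assumed to carry predictions from the same set of methods.
    """
    if not per_cell_labels:
        return {}
    methods = sorted(per_cell_labels[0])
    pairs = list(combinations(methods, 2))
    differing = Counter()
    for cell in per_cell_labels:
        for a, b in pairs:
            if cell[a] != cell[b]:
                differing[(a, b)] += 1
    n_cells = len(per_cell_labels)
    return {pair: differing[pair] / n_cells for pair in pairs}


# Hypothetical predictions from three methods on four cells
cells = [
    {"svm": "T cell", "rf": "T cell", "knn": "T cell"},
    {"svm": "B cell", "rf": "B cell", "knn": "B cell"},
    {"svm": "NK cell", "rf": "T cell", "knn": "T cell"},
    {"svm": "NK cell", "rf": "T cell", "knn": "NK cell"},
]

rates = pairwise_disagreement(cells)
for (a, b), rate in rates.items():
    print(f"{a} vs {b}: {rate:.0%} of cells differ")
```

A pair with a much higher rate than the others points to a method that behaves differently on this dataset and may deserve a closer look before its votes are trusted.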

## Prerequisites

- **Python packages**: `popv>=0.6`, `scanpy>=1.9`, `anndata`, `scvi-tools>=1.0`, `harmonypy`, `bbknn`, `celltypist`
- **Data requirements**: Two AnnData objects: a labeled reference (`adata_ref`) with cell type labels in `obs`, and an unlabeled query (`adata_query`). Both must come from the same species and have overlapping gene sets, and `adata.X` must hold raw counts because popV applies its own normalization internally
- **Environment**: Python 3.9+; GPU recommended for scVI/SCANVI methods (falls back to CPU); 32 GB RAM recommended for >200k reference cells

```bash
pip install popv scvi-tools harmonypy bbknn celltypist
```
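
Because the data requirements call for raw counts in `adata.X`, a quick sanity check before running popV can catch already-normalized input. The helper below is a heuristic sketch and not part of popV: the name `looks_like_raw_counts` and the `max_checked` sampling limit are assumptions, and it only tests whether the inspected values are non-negative whole numbers.

```python
from itertools import islice


def looks_like_raw_counts(values, max_checked=100_000):
    """Heuristic: True if every inspected value is a non-negative whole number.

    values is any iterable of numbers; only the first max_checked are inspected.
    Log-normalized or scaled data fails because it contains fractions or negatives.
    """
    for value in islice(values, max_checked):
        value = float(value)
        if value < 0 or not value.is_integer():
            return False
    return True


print(looks_like_raw_counts([0, 3, 1, 12]))      # integer counts
print(looks_like_raw_counts([0.0, 1.386, 2.3]))  # log-normalized values
```

With an AnnData object, the values could be taken from `adata.X.data` when `X` is a SciPy sparse matrix, or from `adata.X.ravel()` when it is a dense array. Passing the check does not prove the data are raw counts, since rounded or integer-encoded values also pass.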

## Quick Start

Minimal pipeline from labeled reference and unlabeled query to annotated result:

```python
import popv
import scanpy as sc

# Load reference (labeled) and query (unlabeled) AnnData objects
adata_ref = sc.read_h5ad("reference_atlas.h5ad")  # adata_ref.obs["cell_type"] exists
adata_query = sc.read_h5ad("query_dataset.h5ad")

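# Prepare combined object with popV preprocessing.
# popV's documented signature is Process_Query(query_adata, ref_adata, ...),
# so both objects are passed by keyword to avoid swapping them, and the
# processed AnnData is read from the .adata attribute.
# Confirm both details against your installed popV version.
adata = popv.preprocessing.Process_Query(
    query_adata=adata_query,
    ref_adata=adata_ref,
    ref_labels_key="cell_type",
    ref_batch_key="batch",
    query_batch_key="batch",
    unknown_celltype_label="unknown",
    save_path_trained_models="./popv_models/",
    n_epochs_unsupervised=50,
).adata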

# Run all annotation methods
popv.annotation.annotate_data(adata)

# Inspect consensus results for query cells
query_mask = adata.obs["_dataset"] == "query"
print(adata[query_mask].obs[["popv_prediction", "popv_agreement"]].head(10))
```

## Core API

### Module 1: Reference and Query Data Setup

Both AnnData objects must share a gene space and have required metadata columns. popV will subset to the intersection of genes automatically.

```python
import anndata as ad
import scanpy as sc
import numpy as np

# Reference: must have cell type labels and (optionally) batch metadata
adata_ref = sc.read_h5ad("reference_atlas.h5ad")
print(f"Reference: {adata_ref.n_obs} cells x {adata_ref.n_vars} genes")
print(f"Cell types: {adata_ref.obs['cell_type'].nunique()} unique labels")
print(f"Reference cell type counts:\n{adata_ref.obs['cell_type'].value_counts().head(10)}")

# Query: no labels required; batch metadata optional
adata_query = sc.read_h5ad("query_dataset.h5ad")
print(f"\nQuery: {adata_query.n_obs} cells x {adata_query.n_vars} genes")

# Check gene overlap (popV will handle subsetting but >70% overlap is recommended)
shared_genes = adata_ref.var_names.intersection(adata_query.var_names)
pct_shared = len(shared_genes) / adata_ref.n_vars
print(f"\nShared genes: {len(shared_genes)} ({pct_shared:.1%} of reference genes)")
if pct_shared < 0.5:
    print("WARNING: <50% gene overlap — annotation quality may be reduced")
```

```python
# Verify required fields before popV setup
assert "cell_type" in adata_ref.obs.columns, "Reference needs cell type labels"

# Add batch column if absent (popV requires it even for single-batch data)
if "batch" not in adata_ref.obs.columns:
    adata_ref.obs["batch"] = "ref_batch"
if "batch" not in adata_query.obs.columns:
    adata_query.obs["batch"] = "query_batch"

print("Reference obs columns:", adata_ref.obs.columns.tolist())
print("Query obs columns:    ", adata_query.obs.columns.tolist())
```

### Module 2: POPV Object Creation (Process_Query)

`Process_Query` combines reference and query, normalizes counts, selects HVGs, and prepares the joint embedding needed by all annotation methods.

```python
import popv

# Create processed combined AnnData.
# popV's documented signature is Process_Query(query_adata, ref_adata, ...),
# so both objects are passed by keyword to avoid swapping them, and the
# processed AnnData is read from the .adata attribute.
# Confirm both details against your installed popV version.
adata = popv.preprocessing.Process_Query(
    query_adata=adata_query,
    ref_adata=adata_ref,
    ref_labels_key="cell_type",       # obs column with reference labels
    ref_batch_key="batch",            # obs column with reference batch info
    query_batch_key="batch",          # obs column with query batch info
    unknown_celltype_label="unknown", # label to use for query cells before annotation
    save_path_trained_models="./popv_models/",  # directory for scVI/scANVI model checkpoints
    n_epochs_unsupervised=50,         # scVI training epochs (increase to 100–200 for large datasets)
    n_epochs_semisupervised=20,       # scANVI fine-tuning epochs
    use_gpu=True,                     # GPU for scVI/scANVI (falls back to CPU if unavailable)
    hvg=4000,                         # number of highly variable genes to use
).adata

print(f"Combined object: {adata.n_obs} cells