single-cell-annotation

Best practices for single-cell RNA-seq cell type annotation including marker-based, reference-based, and automated classification approaches.

Install in Claude Code
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/single-cell-annotation && cp -r /tmp/single-cell-annotation/legacy/single-cell-annotation ~/.claude/skills/single-cell-annotation
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# Single Cell RNA-seq Cell Type Annotation

---

## Metadata

**Short Description**: Best practices for annotating cell types in single-cell RNA-seq data using marker-based, automated, and reference-based approaches.

**Authors**: Distilled from "Single-cell best practices" by Luecken, M.D. et al.

**Affiliations**: Helmholtz Munich, Wellcome Sanger Institute, Harvard Medical School, and contributors

**Version**: 1.0

**Last Updated**: January 2025

**License**: CC BY 4.0

**Commercial Use**: ✅ Allowed

**Source**: https://www.sc-best-practices.org/cellular_structure/annotation.html

**Citation**: Luecken, M.D. and Theis, F.J. (2019). Current best practices in single-cell RNA-seq analysis: a tutorial. Molecular Systems Biology, 15(6), e8746.

---

## Overview

Cell type annotation is the process of assigning cell type labels to clusters or individual cells in single-cell RNA-seq data. This guide covers three main approaches and their practical implementation.

## Key Concepts

### Cell Type vs. Cell State
A **cell type** is a stable identity defined by a developmental trajectory and core marker gene program (e.g., CD4+ T cell, hepatocyte). A **cell state** is a transient condition (activated, cycling, stressed) overlaid on a cell type. Annotation should target cell types first; states are attributes that may further subdivide a type but should not be conflated with type identity.

### Marker Genes and Marker Panels
Marker genes are genes whose expression is enriched in a specific cell type relative to other cells in the same tissue context. Reliable annotation uses **panels of multiple markers** (typically 3-5 per type) rather than a single gene, because expression is noisy in droplet-based scRNA-seq and many markers are shared across related types. Markers come in two flavors: **canonical** (literature-derived, e.g., CD3D for T cells) and **data-derived** (from differential expression on the dataset).
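
A minimal sketch of why a panel is more robust than a single gene, assuming per-cluster mean expression is already available as a plain dictionary. The expression values, the panel contents, and the `score_panels` helper are hypothetical and only illustrate the idea; they are not part of Scanpy or Seurat.

```python
from statistics import mean

# Hypothetical canonical panels (literature-derived gene symbols).
PANELS = {
    "T cells": ["CD3D", "CD3E", "CD2"],
    "B cells": ["CD19", "MS4A1", "CD79A"],
}

def score_panels(cluster_means, panels):
    """Average the expression of each panel's genes; missing genes count as 0."""
    return {
        cell_type: mean(cluster_means.get(gene, 0.0) for gene in genes)
        for cell_type, genes in panels.items()
    }

# Made-up mean log-normalized expression for one cluster.
# CD3E has dropped out, but the rest of the T cell panel still carries the call.
cluster_means = {"CD3D": 2.1, "CD3E": 0.0, "CD2": 1.8, "MS4A1": 0.1}

scores = score_panels(cluster_means, PANELS)
best = max(scores, key=scores.get)
print(best, scores)
```

A single-gene rule based on CD3E alone would have missed this cluster, while the panel average still favors the T cell label.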

### Reference Atlases and Label Transfer
A reference atlas is a previously annotated dataset (e.g., Human Cell Atlas, Tabula Sapiens) used to project labels onto a new "query" dataset. Label transfer methods (scArches, scANVI, Azimuth, SingleR) compare query cells with the reference, typically by mapping them into a shared latent space and assigning labels from nearby reference cells or a classifier trained on the reference; SingleR instead correlates expression profiles directly. Quality of transfer depends on tissue match, technology match (e.g., 10x v3 vs. Smart-seq2), and species match.
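
The nearest-neighbor idea behind label transfer can be sketched in a few lines, assuming reference and query cells already share a latent space. The coordinates, the `transfer_labels` helper, and the choice of `k` are hypothetical; real tools learn the embedding and handle batch effects, which this sketch does not attempt.

```python
from collections import Counter
from math import dist

def transfer_labels(ref_embed, ref_labels, query_embed, k=3):
    """Label each query cell by majority vote among its k nearest reference cells.

    Returns (label, fraction_of_votes) per query cell; the fraction is a crude
    confidence that can be thresholded to flag uncertain cells.
    """
    results = []
    for q in query_embed:
        order = sorted(range(len(ref_embed)), key=lambda i: dist(q, ref_embed[i]))
        votes = Counter(ref_labels[i] for i in order[:k])
        label, count = votes.most_common(1)[0]
        results.append((label, count / k))
    return results

# Toy 2-D latent coordinates (hypothetical values).
ref_embed = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
ref_labels = ["T cell", "T cell", "T cell", "B cell", "B cell", "B cell"]
query_embed = [(0.05, 0.1), (4.9, 5.0)]

calls = transfer_labels(ref_embed, ref_labels, query_embed)
print(calls)
```

Query cells that fall between reference populations receive split votes, which is one simple signal that a cell type may be missing from the reference.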

## Decision Framework

Use this tree to choose an annotation approach:

```
                Do you have a well-characterized tissue
                with a high-quality reference atlas?
                            │
              ┌─────────────┴─────────────┐
              │                           │
             YES                          NO
              │                           │
              ▼                           ▼
   Is this a standard tissue       Are you studying
   (PBMC, lung, gut) with a       novel cell types or
   pre-trained classifier?         exploratory data?
              │                           │
        ┌─────┴─────┐               ┌─────┴─────┐
        │           │               │           │
       YES          NO             YES          NO
        │           │               │           │
        ▼           ▼               ▼           ▼
   Automated    Reference-     Manual marker  Manual +
   (CellTypist) based          based          automated
                (scArches,     (Scanpy,       cross-check
                 Azimuth,      Seurat)
                 SingleR)
```
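
For use in a scripted pipeline, the tree above could be encoded roughly as follows. The function name `choose_approach` and its boolean parameters are illustrative assumptions; the returned strings simply mirror the leaves of the diagram.

```python
def choose_approach(has_reference, has_pretrained_classifier=False, exploratory=False):
    """Return the annotation approach suggested by the decision tree."""
    if has_reference:
        # Left branch: a high-quality reference atlas exists.
        if has_pretrained_classifier:
            return "Automated (CellTypist)"
        return "Reference-based (scArches, Azimuth, SingleR)"
    # Right branch: no suitable reference.
    if exploratory:
        return "Manual marker-based (Scanpy, Seurat)"
    return "Manual + automated cross-check"

print(choose_approach(True, has_pretrained_classifier=True))
print(choose_approach(False, exploratory=True))
```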

### Decision Table

| Scenario | Approach | Primary Tool | Validation |
|----------|----------|--------------|------------|
| Standard human PBMC, large dataset (>100k cells) | Automated | CellTypist | Spot-check with manual markers |
| Well-characterized tissue (lung, kidney, brain) | Reference-based label transfer | scArches / Azimuth | Marker consistency on top clusters |
| Novel/rare tissue, no good reference | Manual marker-based | Scanpy / Seurat | Hierarchical, broad-to-fine |
| Cross-species (e.g., zebrafish) | Manual markers + ortholog mapping | Scanpy + custom panel | Compare to closest reference species |
| Developmental / continuous trajectory | Reference-based with state-aware model | scANVI / scArches | Trajectory coherence + markers |
| Disease tissue with known perturbation | Manual + automated cross-check | CellTypist + Scanpy | Confirm disease-specific states separately |

## Three Annotation Approaches

### 1. Manual Marker-Based Annotation
Identify cell types by examining expression of known marker genes in each cluster.

**Tools**: Scanpy, Seurat
**Best for**: Small datasets, novel cell types, analyses that need high-confidence labels

### 2. Automated Annotation
Use pre-trained classifiers to automatically assign cell type labels.

**Tools**: CellTypist, scAnnotate
**Best for**: Standard tissues, quick preliminary annotation, large datasets

### 3. Reference-Based Label Transfer
Transfer labels from annotated reference datasets to your query data.

**Tools**: scArches, scANVI, Azimuth, SingleR
**Best for**: Well-characterized tissues, integration with public data

## Recommended Workflow

### Step 1: Quality Control First
- **Remove low-quality cells before annotation**
- Filter doublets (expected doublet rate: roughly 0.8% per 1,000 cells recovered on droplet platforms such as 10x Chromium, i.e., about 8% at 10,000 cells)
- Check for ambient RNA contamination
- Verify cluster quality and resolution
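
The doublet rule of thumb in the list above scales with the number of cells recovered. A small sketch of the arithmetic, assuming the rate grows linearly at 0.8% per 1,000 cells; the helper names are hypothetical, and the actual rate depends on the platform and loading.

```python
def expected_doublet_rate(n_cells, rate_per_1000=0.008):
    """Expected doublet fraction under a linear rule of thumb."""
    return rate_per_1000 * n_cells / 1000

def expected_doublets(n_cells, rate_per_1000=0.008):
    """Expected number of doublets among n_cells recovered cells."""
    return n_cells * expected_doublet_rate(n_cells, rate_per_1000)

for n in (1_000, 5_000, 10_000):
    print(n, f"{expected_doublet_rate(n):.1%}", round(expected_doublets(n)))
```

The resulting fraction can serve as the expected-rate input for a doublet detection tool, or as a sanity check on how many cells such a tool flags.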

### Step 2: Initial Marker-Based Assessment

```python
# Scanpy example
# Assumes `adata` is a preprocessed AnnData object with Leiden clusters in adata.obs['leiden']
import scanpy as sc

# Calculate marker genes for clusters
sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')

# Visualize top markers
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)

# Plot known markers
markers = {
    'T cells': ['CD3D', 'CD3E', 'CD4', 'CD8A'],
    'B cells': ['CD19', 'MS4A1', 'CD79A'],
    'Monocytes': ['CD14', 'FCGR3A', 'LYZ'],
    'NK cells': ['NCAM1', 'NKG7', 'GNLY']
}

sc.pl.dotplot(adata, markers, groupby='leiden')
```
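
Marker plots typically fail when a requested gene is absent from the dataset, for example because it was filtered out or is named differently. A minimal sketch of a pre-filter, assuming the available genes are passed in as a plain collection (a stand-in for `adata.var_names`); `filter_markers` is a hypothetical helper, not a Scanpy function.

```python
def filter_markers(markers, available_genes):
    """Keep only markers present in the dataset; report what was dropped."""
    available = set(available_genes)
    kept, missing = {}, {}
    for cell_type, genes in markers.items():
        present = [g for g in genes if g in available]
        absent = [g for g in genes if g not in available]
        if present:
            kept[cell_type] = present
        if absent:
            missing[cell_type] = absent
    return kept, missing

markers = {
    "T cells": ["CD3D", "CD3E"],
    "NK cells": ["NCAM1", "NKG7"],
    "Plasma cells": ["JCHAIN"],
}
available_genes = {"CD3D", "CD3E", "NKG7", "GNLY"}  # stand-in for adata.var_names

kept, missing = filter_markers(markers, available_genes)
print("plot these:", kept)
print("not in dataset:", missing)
```

The `kept` dictionary can then be passed to the dot plot in place of the full marker dictionary, and `missing` is worth reviewing for gene-symbol mismatches.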

### Step 3: Use Automated Tools for Validation
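
Whichever classifier produces the automated labels, the validation itself comes down to comparing them with the manual calls cluster by cluster. The sketch below assumes both label sets have been harmonized to the same vocabulary and are available as plain lists; `cluster_agreement` and the 0.8 threshold are hypothetical choices for illustration.

```python
from collections import Counter, defaultdict

def cluster_agreement(clusters, manual_labels, predicted_labels):
    """Per cluster: the most common predicted label and the fraction of cells
    whose predicted label matches the manual label."""
    by_cluster = defaultdict(list)
    for cluster, manual, predicted in zip(clusters, manual_labels, predicted_labels):
        by_cluster[cluster].append((manual, predicted))
    report = {}
    for cluster, pairs in by_cluster.items():
        top_predicted = Counter(p for _, p in pairs).most_common(1)[0][0]
        agreement = sum(m == p for m, p in pairs) / len(pairs)
        report[cluster] = (top_predicted, agreement)
    return report

# Toy per-cell data (made up).
clusters = ["0", "0", "0", "1", "1", "1", "1"]
manual = ["T cells"] * 3 + ["B cells"] * 4
predicted = ["T cells", "T cells", "T cells",
             "B cells", "Monocytes", "Monocytes", "Monocytes"]

report = cluster_agreement(clusters, manual, predicted)
flagged = [c for c, (_, agreement) in report.items() if agreement < 0.8]
print(report)
print("clusters to re-examine:", flagged)
```

Flagged clusters are candidates for a closer look at marker expression rather than automatic relabeling in either direction.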

```python
# CellTypist example (fast, accurate for immune cells)
```