ClaudeWave
Skill · 199 repo stars · updated 16d ago

single-cell-annotation-guide

Decision framework for manual marker-based, automated (CellTypist), and reference-based (popV) cell type annotation in scRNA-seq. Three-tier strategy: Tier 1 manual markers, Tier 2 CellTypist, Tier 3 popV ensemble transfer. Use when planning or troubleshooting annotation.

Install in Claude Code
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/single-cell-annotation-guide && cp -r /tmp/single-cell-annotation-guide/skills/genomics-bioinformatics/single-cell/single-cell-annotation-guide ~/.claude/skills/single-cell-annotation-guide
Then open a new Claude Code session; the skill loads automatically.

SKILL.md

# Single-Cell RNA-seq Cell Type Annotation Guide

## Overview

Cell type annotation is the process of assigning biological identities to computationally defined clusters in single-cell RNA-seq data. It is one of the most consequential analytical decisions in a scRNA-seq project: annotation errors propagate into downstream analyses of differential expression, trajectory inference, and cell-cell communication. This guide presents a three-tier decision strategy (manual marker-based annotation first, automated classification with pre-trained models second, and ensemble reference-based label transfer third) and explains when each approach is most appropriate.

The guide synthesizes community know-how on CellTypist (Domínguez Conde et al., Science 2022), popV (Ergen et al., Nature Genetics 2024), and classical marker-based approaches, following standards established by the Human Cell Atlas project.

The three tiers run from effort-intensive but transparent (manual) to efficient and scalable (automated). The tier numbers rank the methods by transparency rather than prescribing an execution order, and the tiers are not mutually exclusive: a practical workflow is to run automated annotation first to generate hypotheses and then validate with manual marker inspection. For high-stakes biological claims (rare cell types, novel disease states, clinical applications), all three tiers should be used in parallel and discordant results resolved explicitly before publication.
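As a rough illustration of resolving discordant results explicitly, the sketch below compares per-cluster labels from the three tiers and reports where they disagree. The function name `reconcile_tiers`, the input dictionaries, and the status strings are hypothetical, and the comparison assumes the labels have already been harmonized to a shared vocabulary.

```python
from collections import Counter

def reconcile_tiers(manual, celltypist, popv):
    """Compare per-cluster labels from the three tiers.

    Each argument is a dict mapping cluster id -> label (hypothetical inputs).
    Returns cluster id -> (consensus label or None, status string).
    """
    report = {}
    for cluster in sorted(manual):
        votes = Counter([manual[cluster], celltypist[cluster], popv[cluster]])
        label, count = votes.most_common(1)[0]
        if count == 3:
            report[cluster] = (label, "concordant")
        elif count == 2:
            report[cluster] = (label, "majority, review dissenting tier")
        else:
            # No two tiers agree: leave unlabeled for manual resolution.
            report[cluster] = (None, "discordant, resolve manually")
    return report

# Toy labels for three clusters (illustrative only).
manual = {"0": "T cell", "1": "B cell", "2": "monocyte"}
celltypist = {"0": "T cell", "1": "B cell", "2": "dendritic cell"}
popv = {"0": "T cell", "1": "plasma cell", "2": "macrophage"}

report = reconcile_tiers(manual, celltypist, popv)
for cluster, (label, status) in report.items():
    print(cluster, label, status)
```

Exact string matching is a deliberate simplification; in practice labels at different granularities would need to be mapped to a common level before voting.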

## Key Concepts

### Cell Type Markers and Cluster Identity

A cell type marker is a gene whose expression distinguishes one cell population from all others in a dataset. Canonical markers are those validated across many studies and tissues, for example CD3E for T cells, CD19 and MS4A1 (CD20) for B cells, CD14 for classical monocytes, and EPCAM for epithelial cells. Effective markers fulfill three criteria: they are highly expressed in the target cell type (high sensitivity), they are absent or very low in all other cell types (high specificity), and they are corroborated by at least one other independent marker, so that every assigned identity rests on at least two markers.
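The sensitivity and specificity criteria can be made concrete with a small calculation. The following is a minimal sketch in plain Python, assuming per-cell expression values for one gene and per-cell cluster labels; `marker_stats` and the toy values are hypothetical, and "expressed" is simply defined here as a value above a threshold.

```python
def marker_stats(expression, clusters, target, threshold=0.0):
    """Fraction of cells expressing a gene inside vs. outside a target cluster.

    expression: per-cell expression values for one gene
    clusters:   per-cell cluster labels, in the same order
    Returns (sensitivity, specificity) under the threshold definition above.
    """
    inside = [x > threshold for x, c in zip(expression, clusters) if c == target]
    outside = [x > threshold for x, c in zip(expression, clusters) if c != target]
    sensitivity = sum(inside) / len(inside)          # expressed in target cells
    specificity = 1 - sum(outside) / len(outside)    # absent elsewhere
    return sensitivity, specificity

# Toy CD3E values for eight cells in two clusters (illustrative only).
cd3e = [2.1, 1.8, 0.0, 2.5, 0.0, 0.0, 0.3, 0.0]
clusters = ["T", "T", "T", "T", "B", "B", "B", "B"]

sens, spec = marker_stats(cd3e, clusters, target="T")
print(sens, spec)
```

A gene that scores well on both measures is a reasonable marker candidate; the third criterion still applies, so a second independent marker should be checked the same way.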

Clusters produced by algorithms such as Leiden or Louvain represent groups of transcriptionally similar cells; they do not inherently correspond to biological cell types. A single true cell type may appear as multiple clusters if the resolution parameter is too high (overclustering), and biologically distinct cell types may be merged if resolution is too low. Annotation quality depends on both the quality of the clustering and the quality of the evidence used to assign identities.

### Reference Atlases and Label Transfer

Reference atlases are large, curated scRNA-seq datasets with expert-validated cell type labels that serve as a "ground truth" for annotation of new query datasets. Prominent examples include the Human Cell Atlas (HCA), the Human Lung Cell Atlas (HLCA, 2.4M cells), Tabula Sapiens (about 500K cells across 24 tissues in its first release), and the Human BM Atlas for bone marrow. Label transfer is the computational process of projecting query cells onto the reference space and assigning labels based on proximity.

Label transfer quality depends critically on biological match between query and reference. Adult tissue query data annotated with a fetal reference will produce systematic errors for cell types that differ between developmental stages. Similarly, a blood-derived reference will poorly annotate tumor-infiltrating immune cells, which have altered transcriptional programs compared to their circulating counterparts.
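To illustrate what "assigning labels based on proximity" can mean, here is a minimal sketch of k-nearest-neighbor voting in a shared embedding. It is a generic technique, not the algorithm used by popV or any specific tool; `transfer_label`, the two-dimensional coordinates, and `k=5` are assumptions for the example.

```python
import math
from collections import Counter

def transfer_label(query, reference, k=3):
    """Assign a label to one query cell by majority vote among its
    k nearest reference cells (Euclidean distance in the embedding).

    reference: list of (embedding, label) pairs
    Returns (label, agreement), where agreement is the winning vote share.
    """
    neighbors = sorted(reference, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    label, count = votes.most_common(1)[0]
    return label, count / k

# Toy reference embedding with two well-separated populations.
reference = [
    ((0.0, 0.0), "T cell"), ((0.1, 0.2), "T cell"), ((0.2, 0.1), "T cell"),
    ((5.0, 5.0), "B cell"), ((5.1, 4.9), "B cell"), ((4.8, 5.2), "B cell"),
]

label, agreement = transfer_label((0.1, 0.1), reference, k=5)
print(label, agreement)
```

The agreement value drops when neighbors disagree, but it cannot by itself reveal a biologically mismatched reference: a query cell far from every reference cell still receives the label of its nearest neighbors.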

### Annotation Confidence and Validation

Automated annotation tools produce confidence scores (probability or agreement rate) that quantify the certainty of each cell's label. These scores should never be ignored. Cells with low confidence scores may represent novel cell types absent from the reference, doublets (two cells captured together), or transitional states along a differentiation continuum.
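One simple way to act on confidence scores is to withhold labels below a cutoff so that uncertain cells are reviewed rather than silently accepted. The sketch below assumes per-cell labels and scores are already available as lists; the function name, the 0.5 cutoff, and the "Unassigned" placeholder are illustrative choices, not defaults of any tool.

```python
def apply_confidence_threshold(labels, scores, min_score=0.5, fallback="Unassigned"):
    """Keep a predicted label only when its confidence reaches min_score."""
    return [lab if s >= min_score else fallback for lab, s in zip(labels, scores)]

# Toy predictions (illustrative only).
predicted = ["T cell", "B cell", "T cell", "monocyte"]
confidence = [0.95, 0.42, 0.50, 0.10]

filtered = apply_confidence_threshold(predicted, confidence)
print(filtered)

# Share of cells left for manual review.
unassigned_fraction = filtered.count("Unassigned") / len(filtered)
```

A high unassigned fraction concentrated in one cluster is a cue to inspect that cluster for doublets, transitional states, or a population missing from the reference.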

Validation refers to verifying that automated labels are consistent with independent biological evidence — typically canonical marker gene expression. Even when an automated method reports high confidence, a few minutes spent visualizing marker genes in a UMAP or dotplot is essential to catch systematic misannotations, especially for rare or disease-specific cell populations absent from training data.
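Alongside visual inspection, the consistency check can be approximated numerically. This sketch computes, for each automated label, the fraction of cells expressing at least one expected canonical marker. The data layout (a list of dicts with `label` and `expr`), the function name `marker_support`, and the "expression above zero" rule are assumptions for illustration; the marker genes are the canonical examples named earlier in this guide.

```python
CANONICAL = {
    "T cell": ["CD3E"],
    "B cell": ["CD19", "MS4A1"],
    "classical monocyte": ["CD14"],
}

def marker_support(cells, canonical):
    """For each label present, return the fraction of its cells that
    express (> 0) at least one of the label's canonical markers."""
    support = {}
    for label, genes in canonical.items():
        labelled = [c for c in cells if c["label"] == label]
        if not labelled:
            continue  # label not present in this dataset
        hits = sum(any(c["expr"].get(g, 0.0) > 0 for g in genes) for c in labelled)
        support[label] = hits / len(labelled)
    return support

# Toy cells; the last one carries a B cell label but expresses only CD3E.
cells = [
    {"label": "T cell", "expr": {"CD3E": 2.0}},
    {"label": "T cell", "expr": {"CD3E": 1.5}},
    {"label": "B cell", "expr": {"MS4A1": 1.2}},
    {"label": "B cell", "expr": {"CD3E": 2.2}},
]

support = marker_support(cells, CANONICAL)
print(support)
```

Labels with low support are candidates for systematic misannotation and deserve a closer look in a UMAP or dotplot.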

### Doublets and Technical Artifacts

Doublets are droplets containing two or more cells, which appear as a single cell with an anomalously high gene count and mixed transcriptional identity. Annotating doublets before removing them produces spurious "cell types" that combine markers from two real populations (e.g., an apparent NK-B cell hybrid). Tools such as Scrublet and DoubletFinder should be applied before annotation; doublets should be marked and excluded from the annotation workflow.
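The "mixed transcriptional identity" signature can be illustrated with a crude co-expression check. This is only a heuristic sketch and not a substitute for Scrublet or DoubletFinder; the NK markers (NKG7, GNLY), the expression cutoff, and the function names are assumptions chosen for the example.

```python
# Illustrative lineage markers; B cell markers come from this guide,
# NK markers are an assumption for the example.
LINEAGE_MARKERS = {
    "B": {"CD19", "MS4A1"},
    "NK": {"NKG7", "GNLY"},
}

def lineages_detected(expr, min_expr=1.0):
    """Return the lineages for which a cell expresses at least one marker."""
    return sorted(
        lineage
        for lineage, genes in LINEAGE_MARKERS.items()
        if any(expr.get(g, 0.0) >= min_expr for g in genes)
    )

def is_suspect_doublet(expr, min_expr=1.0):
    """Flag cells that co-express markers of two or more lineages."""
    return len(lineages_detected(expr, min_expr)) >= 2

hybrid = {"MS4A1": 2.3, "NKG7": 1.9}   # apparent NK-B hybrid
b_cell = {"MS4A1": 2.0, "CD19": 1.1}

print(lineages_detected(hybrid), is_suspect_doublet(hybrid))
print(lineages_detected(b_cell), is_suspect_doublet(b_cell))
```

Flags from a check like this are best treated as prompts for review, since ambient RNA or genuine shared expression can also produce co-expression.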

Batch effects — systematic technical differences between samples processed at different times, with different reagents, or on different platforms — can mimic cell type differences. If one batch contains more stressed cells (higher stress gene expression), they may cluster separately and be misannotated as a distinct cell type. Batch correction with Harmony, scVI, or BBKNN should be applied before annotation when multiple batches are present.
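A quick diagnostic for batch-driven clusters is to tabulate batch composition per cluster. The sketch below flags clusters in which one batch supplies nearly all cells; the function name and the 0.9 cutoff are hypothetical, and the inputs are assumed to be per-cell cluster and batch labels in matching order.

```python
from collections import Counter, defaultdict

def batch_dominated_clusters(clusters, batches, max_fraction=0.9):
    """Return clusters where a single batch contributes at least
    max_fraction of the cells, as cluster -> (batch, fraction)."""
    counts = defaultdict(Counter)
    for cluster, batch in zip(clusters, batches):
        counts[cluster][batch] += 1

    flagged = {}
    for cluster, per_batch in counts.items():
        batch, n = per_batch.most_common(1)[0]
        fraction = n / sum(per_batch.values())
        if fraction >= max_fraction:
            flagged[cluster] = (batch, fraction)
    return flagged

# Toy data: cluster "0" is mixed, cluster "1" comes entirely from batch B.
clusters = ["0", "0", "0", "0", "1", "1", "1", "1"]
batches = ["A", "B", "A", "B", "B", "B", "B", "B"]

flagged = batch_dominated_clusters(clusters, batches)
print(flagged)
```

A flagged cluster is not necessarily an artifact, because a population can be genuinely sample-specific; the flag indicates where stress genes and batch covariates should be examined before a label is assigned.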

### Cell Ontology and Annotation Hierarchies

Cell ontologies are formal vocabularies that define cell type names, synonyms, and parent-child relationships. The Cell Ontology (CL), maintained by the OBO Foundry, is the standard used by the Human Cell Atlas and most major atlases. Using ontology-compliant cell type names (e.g., "CD8-positive, alpha-beta T cell" rather than "CD8 T cell") enables cross-dataset comparison and interoperability. CellTypist and popV return labels that map to the Cell Ontology when tissue-matched models are used.
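Harmonizing informal labels to ontology terms can start with a small lookup table. The following sketch assumes a hand-curated mapping; the function name is hypothetical, and the Cell Ontology identifiers shown should be verified against the current CL release before use.

```python
# Hand-curated mapping from informal labels to (CL id, CL name).
# Identifiers are given for illustration and should be checked against
# the current Cell Ontology release.
CL_TERMS = {
    "cd8 t cell": ("CL:0000625", "CD8-positive, alpha-beta T cell"),
    "b cell": ("CL:0000236", "B cell"),
    "classical monocyte": ("CL:0000860", "classical monocyte"),
}

def to_cell_ontology(label):
    """Return (CL id, CL name) for an informal label, or None if unmapped."""
    return CL_TERMS.get(label.strip().lower())

print(to_cell_ontology("CD8 T cell"))
print(to_cell_ontology("mystery cluster"))
```

Returning `None` for unmapped labels, rather than guessing, keeps unresolved names visible so they can be curated by hand.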

Annotation hierarchies reflect that cell identities exist at multiple levels of granularity. At the coarsest level, cells are classified as broad lineages (immune, epithelial, stromal). At intermediate granularity, they are classified as cell types (