scanpy-scrna-seq
scRNA-seq with Scanpy: QC, normalization, HVG selection, PCA, neighborhood graph, UMAP/t-SNE, Leiden clustering, markers, cell annotation, trajectory inference. Standard scRNA-seq exploration.
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/scanpy-scrna-seq && cp -r /tmp/scanpy-scrna-seq/skills/genomics-bioinformatics/single-cell/scanpy-scrna-seq ~/.claude/skills/scanpy-scrna-seqSKILL.md
# Scanpy Single-Cell RNA-seq Analysis
## Overview
Scanpy is a scalable Python toolkit for analyzing single-cell RNA-seq data built on the AnnData format. This skill covers the end-to-end standard workflow: quality control, normalization, highly variable gene selection, dimensionality reduction, clustering, marker gene identification, and cell type annotation. It produces annotated datasets and publication-quality visualizations.
## When to Use
- Analyzing single-cell RNA-seq count matrices (10X Genomics, h5ad, CSV, loom)
- Performing quality control filtering on scRNA-seq datasets (mitochondrial %, gene counts)
- Running dimensionality reduction: PCA, UMAP, t-SNE
- Identifying cell clusters via Leiden community detection
- Finding differentially expressed marker genes per cluster (Wilcoxon, t-test, logistic regression)
- Annotating cell types from known marker gene panels
- Conducting trajectory inference and pseudotime analysis (PAGA, diffusion pseudotime)
- Generating publication-quality single-cell plots (dot plots, heatmaps, stacked violins)
- Comparing gene expression across experimental conditions within cell types
- Use **Seurat** (R/Bioconductor) instead for scRNA-seq analysis in an existing R workflow or when Seurat-specific assay types are required
## Prerequisites
- **Python packages**: `scanpy>=1.10`, `leidenalg`, `igraph`, `anndata`
- **Data requirements**: Gene expression count matrix (cells x genes). Common formats: 10X Cell Ranger output (`filtered_feature_bc_matrix/`), `.h5ad`, `.h5`, `.csv`
- **Environment**: Python 3.9+; 16GB+ RAM recommended for >50k cells
```bash
pip install "scanpy[leiden]" anndata
```
## Workflow
### Step 1: Setup and Data Loading
Configure scanpy settings and read the count matrix into an AnnData object. Ensure gene names are unique.
```python
import scanpy as sc
import numpy as np
import pandas as pd
# Configure settings
sc.settings.verbosity = 3 # errors=0, warnings=1, info=2, hints=3
sc.settings.set_figure_params(dpi=80, facecolor="white")
# Load data — pick the reader matching your format
adata = sc.read_10x_mtx(
"path/to/filtered_feature_bc_matrix/",
var_names="gene_symbols",
cache=True,
)
# Alternatives:
# adata = sc.read_h5ad("data.h5ad")
# adata = sc.read_10x_h5("data.h5")
adata.var_names_make_unique()
print(f"Loaded: {adata.n_obs} cells x {adata.n_vars} genes")
print(f"Sparsity: {1 - adata.X.nnz / (adata.n_obs * adata.n_vars):.1%}")
```
### Step 2: Quality Control
Annotate mitochondrial/ribosomal genes, compute QC metrics, visualize distributions, and filter low-quality cells.
```python
# Annotate gene groups
adata.var["mt"] = adata.var_names.str.startswith("MT-") # human; use "mt-" for mouse
adata.var["ribo"] = adata.var_names.str.startswith(("RPS", "RPL"))
# Calculate QC metrics
sc.pp.calculate_qc_metrics(
adata, qc_vars=["mt", "ribo"], percent_top=None, log1p=False, inplace=True
)
# Visualize QC distributions
sc.pl.violin(
adata,
["n_genes_by_counts", "total_counts", "pct_counts_mt"],
jitter=0.4,
multi_panel=True,
)
sc.pl.scatter(adata, x="total_counts", y="pct_counts_mt")
sc.pl.scatter(adata, x="total_counts", y="n_genes_by_counts")
# Filter — adjust thresholds based on the QC plots above
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
adata = adata[adata.obs.n_genes_by_counts < 5000, :] # remove potential doublets
adata = adata[adata.obs.pct_counts_mt < 20, :] # remove dying cells
print(f"After QC: {adata.n_obs} cells x {adata.n_vars} genes")
```
### Step 3: Normalization and Feature Selection
Normalize library sizes, log-transform, and identify highly variable genes (HVGs). Store raw counts for later differential expression.
```python
# Preserve raw counts in a layer
adata.layers["counts"] = adata.X.copy()
# Normalize to 10,000 counts per cell, then log-transform
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
# Save normalized data as .raw for visualization
adata.raw = adata
# Identify highly variable genes
sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor="seurat_v3", layer="counts")
sc.pl.highly_variable_genes(adata)
# Subset to HVGs
adata = adata[:, adata.var.highly_variable]
print(f"HVG subset: {adata.n_obs} cells x {adata.n_vars} genes")
```
### Step 4: Scaling and Regression
Regress out confounders and scale gene expression values. This prepares data for PCA.
```python
# Regress out unwanted sources of variation
sc.pp.regress_out(adata, ["total_counts", "pct_counts_mt"])
# Scale to unit variance, clip extreme values
sc.pp.scale(adata, max_value=10)
print(f"Scaled matrix: mean={adata.X.mean():.4f}, std={adata.X.std():.4f}")
```
### Step 5: Dimensionality Reduction
Compute PCA, inspect the variance ratio elbow plot, then build a neighborhood graph and embed with UMAP.
```python
# PCA
sc.tl.pca(adata, svd_solver="arpack", n_comps=50)
sc.pl.pca_variance_ratio(adata, log=True, n_pcs=50)
# Determine n_pcs from elbow plot (typically 30-50)
n_pcs = 40
# Compute k-nearest neighbor graph
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=n_pcs)
# UMAP embedding
sc.tl.umap(adata)
sc.pl.umap(adata, color=["n_genes_by_counts", "total_counts", "pct_counts_mt"])
print(f"UMAP shape: {adata.obsm['X_umap'].shape}")
```
### Step 6: Clustering
Run Leiden clustering at multiple resolutions and select the best granularity by visual inspection.
```python
# Test multiple resolutions
for res in [0.3, 0.5, 0.8, 1.0, 1.2]:
sc.tl.leiden(adata, resolution=res, key_added=f"leiden_res{res}", flavor="igraph", n_iterations=2)
# Compare resolutions side-by-side
sc.pl.umap(
adata,
color=["leiden_res0.3", "leiden_res0.5", "leiden_res0.8", "leiden_res1.0"],
ncols=2,
legend_loc="on data",
)
# Pick final resolution based on biological plausibility
adata.obs["leiden"] = adata.obs["leiden_res0.5"]
n_clusters = adata.obs["leiden"].nunique()
print(f"Selected resolution=0.5: {n_clusters} clusters")
```
### S|
Opentrons Protocol API v2 for OT-2/Flex: Python protocols for pipetting, serial dilutions, PCR, plate replication; control thermocycler, heater-shaker, magnetic, temperature modules. Use pylabrobot for multi-vendor.
Interactive visualization with Plotly. 40+ chart types (scatter, line, heatmap, 3D, geographic) with hover, zoom, pan. Two APIs: Plotly Express (DataFrame) and Graph Objects (fine control). For static publication figures use matplotlib; for statistical grammar use seaborn.
Statistical visualization on matplotlib + pandas. Distributions (histplot, kdeplot, violin, box), relational (scatter, line), categorical, regression, correlation heatmaps. Auto aggregation/CIs. Use plotly for interactive; matplotlib for low-level.
Best practices for single-cell RNA-seq cell type annotation including marker-based, reference-based, and automated classification approaches.
Bayesian modeling with PyMC 5: priors, likelihood, NUTS/ADVI sampling, diagnostics (R-hat, ESS), LOO/WAIC comparison, prediction. Hierarchical, logistic, GP variants; predictive checks.
Time-to-event modeling with scikit-survival: Cox PH (elastic net), Random Survival Forests, Boosting, SVMs for censored data. C-index, Brier, time-dependent AUC; Kaplan-Meier, Nelson-Aalen, competing risks. Pipeline/GridSearchCV compatible. Use statsmodels for frequentist, pymc for Bayesian, lifelines for parametric.
>-