Skill284 repo starsupdated 4d ago

scanpy-scrna-seq

This Scanpy skill provides an end-to-end single-cell RNA-seq analysis workflow covering quality control, normalization, dimensionality reduction via PCA and UMAP/t-SNE, Leiden clustering, and marker gene identification through statistical tests like Wilcoxon and logistic regression. Use it when analyzing count matrices from 10X Genomics or similar platforms to filter cells by mitochondrial content and gene counts, identify cell clusters, annotate cell types from marker panels, and generate publication-quality visualizations including dot plots and heatmaps.

View source Repository: SciAgent-Skills

Install in Claude Code

Copy

git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/scanpy-scrna-seq && cp -r /tmp/scanpy-scrna-seq/skills/genomics-bioinformatics/single-cell/scanpy-scrna-seq ~/.claude/skills/scanpy-scrna-seq

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Scanpy Single-Cell RNA-seq Analysis

## Overview

Scanpy is a scalable Python toolkit for analyzing single-cell RNA-seq data built on the AnnData format. This skill covers the end-to-end standard workflow: quality control, normalization, highly variable gene selection, dimensionality reduction, clustering, marker gene identification, and cell type annotation. It produces annotated datasets and publication-quality visualizations.

## When to Use

- Analyzing single-cell RNA-seq count matrices (10X Genomics, h5ad, CSV, loom)
- Performing quality control filtering on scRNA-seq datasets (mitochondrial %, gene counts)
- Running dimensionality reduction: PCA, UMAP, t-SNE
- Identifying cell clusters via Leiden community detection
- Finding differentially expressed marker genes per cluster (Wilcoxon, t-test, logistic regression)
- Annotating cell types from known marker gene panels
- Conducting trajectory inference and pseudotime analysis (PAGA, diffusion pseudotime)
- Generating publication-quality single-cell plots (dot plots, heatmaps, stacked violins)
- Comparing gene expression across experimental conditions within cell types
- Use **Seurat** (R/Bioconductor) instead for scRNA-seq analysis in an existing R workflow or when Seurat-specific assay types are required

## Prerequisites

- **Python packages**: `scanpy>=1.10`, `leidenalg`, `igraph`, `anndata`
- **Data requirements**: Gene expression count matrix (cells x genes). Common formats: 10X Cell Ranger output (`filtered_feature_bc_matrix/`), `.h5ad`, `.h5`, `.csv`
- **Environment**: Python 3.9+; 16GB+ RAM recommended for >50k cells

```bash
pip install "scanpy[leiden]" anndata
```

## Workflow

### Step 1: Setup and Data Loading

Configure scanpy settings and read the count matrix into an AnnData object. Ensure gene names are unique.

```python
import scanpy as sc
import numpy as np
import pandas as pd

# Configure settings
sc.settings.verbosity = 3  # errors=0, warnings=1, info=2, hints=3
sc.settings.set_figure_params(dpi=80, facecolor="white")

# Load data — pick the reader matching your format
adata = sc.read_10x_mtx(
    "path/to/filtered_feature_bc_matrix/",
    var_names="gene_symbols",
    cache=True,
)
# Alternatives:
# adata = sc.read_h5ad("data.h5ad")
# adata = sc.read_10x_h5("data.h5")

adata.var_names_make_unique()
print(f"Loaded: {adata.n_obs} cells x {adata.n_vars} genes")
print(f"Sparsity: {1 - adata.X.nnz / (adata.n_obs * adata.n_vars):.1%}")
```

### Step 2: Quality Control

Annotate mitochondrial/ribosomal genes, compute QC metrics, visualize distributions, and filter low-quality cells.

```python
# Annotate gene groups
adata.var["mt"] = adata.var_names.str.startswith("MT-")    # human; use "mt-" for mouse
adata.var["ribo"] = adata.var_names.str.startswith(("RPS", "RPL"))

# Calculate QC metrics
sc.pp.calculate_qc_metrics(
    adata, qc_vars=["mt", "ribo"], percent_top=None, log1p=False, inplace=True
)

# Visualize QC distributions
sc.pl.violin(
    adata,
    ["n_genes_by_counts", "total_counts", "pct_counts_mt"],
    jitter=0.4,
    multi_panel=True,
)
sc.pl.scatter(adata, x="total_counts", y="pct_counts_mt")
sc.pl.scatter(adata, x="total_counts", y="n_genes_by_counts")

# Filter — adjust thresholds based on the QC plots above
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
adata = adata[adata.obs.n_genes_by_counts < 5000, :]  # remove potential doublets
adata = adata[adata.obs.pct_counts_mt < 20, :]         # remove dying cells

print(f"After QC: {adata.n_obs} cells x {adata.n_vars} genes")
```

### Step 3: Normalization and Feature Selection

Normalize library sizes, log-transform, and identify highly variable genes (HVGs). Store raw counts for later differential expression.

```python
# Preserve raw counts in a layer
adata.layers["counts"] = adata.X.copy()

# Normalize to 10,000 counts per cell, then log-transform
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Save normalized data as .raw for visualization
adata.raw = adata

# Identify highly variable genes
sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor="seurat_v3", layer="counts")
sc.pl.highly_variable_genes(adata)

# Subset to HVGs
adata = adata[:, adata.var.highly_variable]
print(f"HVG subset: {adata.n_obs} cells x {adata.n_vars} genes")
```

### Step 4: Scaling and Regression

Regress out confounders and scale gene expression values. This prepares data for PCA.

```python
# Regress out unwanted sources of variation
sc.pp.regress_out(adata, ["total_counts", "pct_counts_mt"])

# Scale to unit variance, clip extreme values
sc.pp.scale(adata, max_value=10)

print(f"Scaled matrix: mean={adata.X.mean():.4f}, std={adata.X.std():.4f}")
```

### Step 5: Dimensionality Reduction

Compute PCA, inspect the variance ratio elbow plot, then build a neighborhood graph and embed with UMAP.

```python
# PCA
sc.tl.pca(adata, svd_solver="arpack", n_comps=50)
sc.pl.pca_variance_ratio(adata, log=True, n_pcs=50)

# Determine n_pcs from elbow plot (typically 30-50)
n_pcs = 40

# Compute k-nearest neighbor graph
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=n_pcs)

# UMAP embedding
sc.tl.umap(adata)
sc.pl.umap(adata, color=["n_genes_by_counts", "total_counts", "pct_counts_mt"])

print(f"UMAP shape: {adata.obsm['X_umap'].shape}")
```

### Step 6: Clustering

Run Leiden clustering at multiple resolutions and select the best granularity by visual inspection.

```python
# Test multiple resolutions
for res in [0.3, 0.5, 0.8, 1.0, 1.2]:
    sc.tl.leiden(adata, resolution=res, key_added=f"leiden_res{res}", flavor="igraph", n_iterations=2)

# Compare resolutions side-by-side
sc.pl.umap(
    adata,
    color=["leiden_res0.3", "leiden_res0.5", "leiden_res0.8", "leiden_res1.0"],
    ncols=2,
    legend_loc="on data",
)

# Pick final resolution based on biological plausibility
adata.obs["leiden"] = adata.obs["leiden_res0.5"]
n_clusters = adata.obs["leiden"].nunique()
print(f"Selected resolution=0.5: {n_clusters} clusters")
```

### S