Skip to main content
ClaudeWave
Skill199 estrellas del repoactualizado 16d ago

harmony-batch-correction

Harmony batch correction for scRNA-seq and other omics. Removes batch effects from PCA embeddings while preserving biology. Run after PCA, before UMAP. Scales to millions of cells. Python (harmonypy, scanpy) and R (Seurat).

Instalar en Claude Code
Copiar
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/harmony-batch-correction && cp -r /tmp/harmony-batch-correction/skills/genomics-bioinformatics/single-cell/harmony-batch-correction ~/.claude/skills/harmony-batch-correction
Después abre una sesión nueva de Claude Code; el skill carga automáticamente.

SKILL.md

# Harmony Batch Correction

## Overview

Harmony is a fast, scalable algorithm for batch integration in single-cell data. It takes a PCA embedding (cells × PCs) as input and returns a corrected embedding from which batch effects have been regressed out via iterative soft-clustering and per-cluster linear regression. The corrected embedding is then used to compute neighbors, UMAP, and downstream clustering — the raw count matrix is never modified. Harmony works for single-cell RNA-seq, ATAC-seq, and other omics modalities where a PCA-like embedding is available.

## When to Use

- Integrating scRNA-seq datasets from different samples, donors, sequencing runs, or experimental batches that should contain the same cell types
- Removing technical variation (library preparation protocol, 10x chemistry version, sequencing depth, sequencing platform) while preserving biological differences between cell types and conditions
- Performing fast, scalable batch correction on datasets with millions of cells where deep generative model training would be prohibitively slow
- Correcting for multiple confounding variables simultaneously (batch, donor, sequencing platform, tissue processing protocol)
- Preparing a corrected embedding for UMAP visualization, Leiden clustering, or label transfer without modifying the gene expression count matrix
- Use **scVI/scvi-tools** instead when you need probabilistic batch correction with a variational autoencoder (deep learning), differential expression with uncertainty estimates, or multi-modal integration (RNA + protein)
- Use **BBKNN** instead when you want graph-based integration that avoids constructing a corrected embedding altogether and directly builds a cross-batch nearest-neighbor graph
- Use **Seurat Integration / CCA** (R) instead when you are already in a Seurat workflow and prefer anchor-based integration methods

## Prerequisites

- **Python packages**: `harmonypy`, `scanpy>=1.10`, `leidenalg`, `igraph`, `anndata`, `pandas`, `matplotlib`
- **R packages** (optional): `harmony`, `Seurat>=4.0` (for R workflow in Step 7)
- **Data requirements**: AnnData object with raw counts, batch/sample metadata in `adata.obs`, and PCA embedding already computed (`adata.obsm["X_pca"]`)
- **Environment**: Python 3.9+; 8 GB RAM sufficient for up to ~500k cells; linear memory scaling

```bash
pip install harmonypy "scanpy[leiden]" anndata pandas matplotlib
```

```r
# R installation (for Step 7 / Seurat integration)
install.packages("harmony")
# If using Seurat:
install.packages("Seurat")
```

## Quick Start

Minimal pipeline — load preprocessed data, run Harmony via scanpy, produce UMAP:

```python
import scanpy as sc

# Load preprocessed AnnData (counts normalized, HVGs selected, PCA run)
adata = sc.read_h5ad("preprocessed.h5ad")   # must have adata.obs["batch"]

# Run Harmony batch correction (corrects adata.obsm["X_pca"] → "X_pca_harmony")
sc.external.pp.harmony_integrate(adata, key="batch")

# Build neighborhood graph on corrected embedding, then UMAP
sc.pp.neighbors(adata, use_rep="X_pca_harmony")
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.5)

# Visualize
sc.pl.umap(adata, color=["batch", "leiden"], ncols=2)
print(f"Clusters: {adata.obs['leiden'].nunique()} | Cells: {adata.n_obs}")
```

## Workflow

### Step 1: Quality Control and Preprocessing

Filter, normalize, select highly variable genes (HVGs), and run PCA. Harmony is applied to the PCA embedding produced here.

```python
import scanpy as sc
import numpy as np

sc.settings.set_figure_params(dpi=80, facecolor="white")

# Load multi-batch data — adata.obs must contain a batch column
adata = sc.read_h5ad("multi_batch_counts.h5ad")
# Alternatively, concatenate multiple AnnData objects:
# adata = sc.concat([adata1, adata2, adata3], label="batch",
#                   keys=["batch1", "batch2", "batch3"])

print(f"Loaded: {adata.n_obs} cells × {adata.n_vars} genes")
print(f"Batches: {adata.obs['batch'].value_counts().to_dict()}")

# QC: annotate mitochondrial genes and filter low-quality cells
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_cells(adata, max_genes=6000)
sc.pp.filter_genes(adata, min_cells=3)
adata = adata[adata.obs["pct_counts_mt"] < 20].copy()
print(f"After QC: {adata.n_obs} cells × {adata.n_vars} genes")

# Normalize and log-transform (store raw counts first)
adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Select HVGs and compute PCA
sc.pp.highly_variable_genes(adata, n_top_genes=3000, batch_key="batch")
sc.pp.pca(adata, n_comps=50, use_highly_variable=True)
print(f"PCA shape: {adata.obsm['X_pca'].shape}")  # (n_cells, 50)
```

### Step 2: Run Harmony via Scanpy Integration

The scanpy wrapper calls `harmonypy` under the hood and stores the corrected embedding in `adata.obsm["X_pca_harmony"]`.

```python
import scanpy as sc

# Run Harmony — corrects adata.obsm["X_pca"] in-place and stores result as "X_pca_harmony"
sc.external.pp.harmony_integrate(
    adata,
    key="batch",           # obs column with batch labels
    basis="X_pca",         # source embedding (default)
    adjusted_basis="X_pca_harmony",  # destination key (default)
    max_iter_harmony=20,   # maximum Harmony iterations (default 10; increase for complex batches)
    theta=2.0,             # diversity penalty per batch variable (default 2)
    sigma=0.1,             # width of soft-clustering Gaussian kernel
    random_state=42,
)

print(f"Corrected embedding: {adata.obsm['X_pca_harmony'].shape}")
# Expected: (n_cells, 50) — same shape as X_pca
```

### Step 3: Run Harmony via harmonypy Directly

Use the lower-level `harmonypy` API when you need finer control, access to the HarmonyObject, or integration outside of scanpy.

```python
import harmonypy
import pandas as pd
import numpy as np

# Extract PCA embedding and metadata
pca_embeddings = ad
sciagent-skill-creatorSkill

|

opentrons-integrationSkill

Opentrons Protocol API v2 for OT-2/Flex: Python protocols for pipetting, serial dilutions, PCR, plate replication; control thermocycler, heater-shaker, magnetic, temperature modules. Use pylabrobot for multi-vendor.

plotly-interactive-visualizationSkill

Interactive visualization with Plotly. 40+ chart types (scatter, line, heatmap, 3D, geographic) with hover, zoom, pan. Two APIs: Plotly Express (DataFrame) and Graph Objects (fine control). For static publication figures use matplotlib; for statistical grammar use seaborn.

seaborn-statistical-visualizationSkill

Statistical visualization on matplotlib + pandas. Distributions (histplot, kdeplot, violin, box), relational (scatter, line), categorical, regression, correlation heatmaps. Auto aggregation/CIs. Use plotly for interactive; matplotlib for low-level.

single-cell-annotationSkill

Best practices for single-cell RNA-seq cell type annotation including marker-based, reference-based, and automated classification approaches.

pymc-bayesian-modelingSkill

Bayesian modeling with PyMC 5: priors, likelihood, NUTS/ADVI sampling, diagnostics (R-hat, ESS), LOO/WAIC comparison, prediction. Hierarchical, logistic, GP variants; predictive checks.

scikit-survival-analysisSkill

Time-to-event modeling with scikit-survival: Cox PH (elastic net), Random Survival Forests, Boosting, SVMs for censored data. C-index, Brier, time-dependent AUC; Kaplan-Meier, Nelson-Aalen, competing risks. Pipeline/GridSearchCV compatible. Use statsmodels for frequentist, pymc for Bayesian, lifelines for parametric.

statistical-analysisSkill

>-