pydeseq2-differential-expression
Bulk RNA-seq DE with PyDESeq2: load counts, normalize, fit negative binomial models, Wald test (BH-FDR), LFC shrinkage, volcano/MA plots. Use for two-group comparisons, multi-factor designs with batch correction, multiple contrasts.
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/pydeseq2-differential-expression && cp -r /tmp/pydeseq2-differential-expression/skills/genomics-bioinformatics/rnaseq/pydeseq2-differential-expression ~/.claude/skills/pydeseq2-differential-expressionSKILL.md
# PyDESeq2 Differential Expression Analysis
## Overview
PyDESeq2 is a Python reimplementation of the R DESeq2 package for differential gene expression analysis from bulk RNA-seq count data. It fits negative binomial generalized linear models per gene, estimates dispersion with empirical Bayes shrinkage, and performs Wald tests with Benjamini-Hochberg FDR correction. This skill covers the full pipeline from raw counts to publication-ready result tables and visualizations.
## When to Use
- Identifying differentially expressed genes between two or more experimental conditions from bulk RNA-seq
- Performing two-group comparisons (e.g., treated vs control) with proper statistical testing
- Running multi-factor designs that account for batch effects or covariates (e.g., `~batch + condition`)
- Applying log2 fold change shrinkage (apeGLM) for ranking and visualization
- Generating volcano plots, MA plots, and heatmaps from differential expression results
- Converting R-based DESeq2 workflows to a pure Python environment
- Integrating DE analysis into larger Python bioinformatics pipelines (e.g., with scanpy, pandas)
- Use **DESeq2** (R/Bioconductor) or **edgeR** instead for the reference R implementations with the broadest method support and community validation
## Prerequisites
- **Python packages**: `pydeseq2>=0.4`, `pandas>=1.4`, `numpy>=1.23`, `scipy>=1.11`, `scikit-learn>=1.1`, `anndata>=0.8`
- **Data requirements**: Raw (unnormalized) integer count matrix (samples x genes) + sample metadata DataFrame
- **Environment**: Python 3.10+; optional `matplotlib`, `seaborn` for visualization
```bash
pip install pydeseq2 matplotlib seaborn
```
## Workflow
### Step 1: Data Loading and Validation
Load the count matrix and metadata. PyDESeq2 expects counts as a samples x genes DataFrame with non-negative integers, and metadata as a samples x variables DataFrame with matching indices.
```python
import pandas as pd
# Load data — typical CSV has genes as rows, samples as columns
counts_raw = pd.read_csv("counts.csv", index_col=0)
metadata = pd.read_csv("metadata.csv", index_col=0)
# Transpose if needed: PyDESeq2 requires samples x genes
if counts_raw.shape[0] > counts_raw.shape[1]:
counts_df = counts_raw.T # genes x samples → samples x genes
else:
counts_df = counts_raw
# Validate alignment
common_samples = counts_df.index.intersection(metadata.index)
counts_df = counts_df.loc[common_samples]
metadata = metadata.loc[common_samples]
print(f"Samples: {counts_df.shape[0]}, Genes: {counts_df.shape[1]}")
print(f"Metadata columns: {list(metadata.columns)}")
print(f"Condition counts:\n{metadata['condition'].value_counts()}")
```
### Step 2: Gene Filtering
Remove lowly expressed genes to improve statistical power and reduce multiple testing burden.
```python
# Filter genes with total counts below threshold
min_total_counts = 10
gene_counts = counts_df.sum(axis=0)
genes_to_keep = gene_counts[gene_counts >= min_total_counts].index
counts_df = counts_df[genes_to_keep]
# Optional: require minimum counts in a minimum number of samples
min_count_per_sample = 5
min_samples = 3
genes_expressed = (counts_df >= min_count_per_sample).sum(axis=0) >= min_samples
counts_df = counts_df.loc[:, genes_expressed]
print(f"Genes after filtering: {counts_df.shape[1]}")
```
### Step 3: DeseqDataSet Initialization and Fitting
Create the DESeq dataset object, specify the design formula, and run the full pipeline (size factor estimation, dispersion estimation, model fitting).
```python
from pydeseq2.dds import DeseqDataSet
dds = DeseqDataSet(
counts=counts_df,
metadata=metadata,
design="~condition", # Wilkinson-style formula
refit_cooks=True, # Refit after Cook's outlier removal
n_cpus=4 # Parallel threads
)
# Run: size factors → dispersions → trend → MAP shrinkage → LFC fitting
dds.deseq2()
# Inspect normalization
print(f"Size factors (first 5): {dds.obsm['size_factors'][:5]}")
print(f"Size factor range: {dds.obsm['size_factors'].min():.2f} - {dds.obsm['size_factors'].max():.2f}")
```
### Step 4: Statistical Testing (Wald Test)
Perform Wald tests to identify differentially expressed genes. Specify the contrast as `[variable, test_level, reference_level]`.
```python
from pydeseq2.ds import DeseqStats
ds = DeseqStats(
dds,
contrast=["condition", "treated", "control"],
alpha=0.05, # FDR threshold
cooks_filter=True, # Filter Cook's outliers
independent_filter=True # Independent filtering for power
)
ds.summary()
# Access full results
results = ds.results_df
print(f"Total genes tested: {len(results)}")
print(f"Significant (padj < 0.05): {(results.padj < 0.05).sum()}")
```
### Step 5: LFC Shrinkage (Optional)
Apply apeGLM shrinkage to reduce noise in log2 fold change estimates. Use shrunk values for visualization and ranking, not for significance calls.
```python
# Apply shrinkage — modifies results_df.log2FoldChange in place
ds.lfc_shrink()
# Compare pre/post shrinkage effect
print(f"Max |LFC| after shrinkage: {results.log2FoldChange.abs().max():.2f}")
print(f"Genes with |LFC| > 2: {(results.log2FoldChange.abs() > 2).sum()}")
```
### Step 6: Result Filtering and Export
Filter significant genes and export results for downstream analysis.
```python
import numpy as np
# Significance + effect size filter
significant = results[
(results.padj < 0.05) &
(results.log2FoldChange.abs() > 1.0)
].copy()
# Separate up/down-regulated
up = significant[significant.log2FoldChange > 0].sort_values("padj")
down = significant[significant.log2FoldChange < 0].sort_values("padj")
print(f"Upregulated: {len(up)}, Downregulated: {len(down)}")
# Export
results.to_csv("deseq2_all_results.csv")
significant.to_csv("deseq2_significant.csv")
up.to_csv("deseq2_upregulated.csv")
down.to_csv("deseq2_downregulated.csv")
print("Results exported to CSV files")
```
### Step 7: Visualization — Volcano Plot
```python
import matplotlib.pyplo|
Opentrons Protocol API v2 for OT-2/Flex: Python protocols for pipetting, serial dilutions, PCR, plate replication; control thermocycler, heater-shaker, magnetic, temperature modules. Use pylabrobot for multi-vendor.
Interactive visualization with Plotly. 40+ chart types (scatter, line, heatmap, 3D, geographic) with hover, zoom, pan. Two APIs: Plotly Express (DataFrame) and Graph Objects (fine control). For static publication figures use matplotlib; for statistical grammar use seaborn.
Statistical visualization on matplotlib + pandas. Distributions (histplot, kdeplot, violin, box), relational (scatter, line), categorical, regression, correlation heatmaps. Auto aggregation/CIs. Use plotly for interactive; matplotlib for low-level.
Best practices for single-cell RNA-seq cell type annotation including marker-based, reference-based, and automated classification approaches.
Bayesian modeling with PyMC 5: priors, likelihood, NUTS/ADVI sampling, diagnostics (R-hat, ESS), LOO/WAIC comparison, prediction. Hierarchical, logistic, GP variants; predictive checks.
Time-to-event modeling with scikit-survival: Cox PH (elastic net), Random Survival Forests, Boosting, SVMs for censored data. C-index, Brier, time-dependent AUC; Kaplan-Meier, Nelson-Aalen, competing risks. Pipeline/GridSearchCV compatible. Use statsmodels for frequentist, pymc for Bayesian, lifelines for parametric.
>-