Skip to main content
ClaudeWave
Skill199 repo starsupdated 16d ago

pydeseq2-differential-expression

Bulk RNA-seq DE with PyDESeq2: load counts, normalize, fit negative binomial models, Wald test (BH-FDR), LFC shrinkage, volcano/MA plots. Use for two-group comparisons, multi-factor designs with batch correction, multiple contrasts.

Install in Claude Code
Copy
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/pydeseq2-differential-expression && cp -r /tmp/pydeseq2-differential-expression/skills/genomics-bioinformatics/rnaseq/pydeseq2-differential-expression ~/.claude/skills/pydeseq2-differential-expression
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# PyDESeq2 Differential Expression Analysis

## Overview

PyDESeq2 is a Python reimplementation of the R DESeq2 package for differential gene expression analysis from bulk RNA-seq count data. It fits negative binomial generalized linear models per gene, estimates dispersion with empirical Bayes shrinkage, and performs Wald tests with Benjamini-Hochberg FDR correction. This skill covers the full pipeline from raw counts to publication-ready result tables and visualizations.

## When to Use

- Identifying differentially expressed genes between two or more experimental conditions from bulk RNA-seq
- Performing two-group comparisons (e.g., treated vs control) with proper statistical testing
- Running multi-factor designs that account for batch effects or covariates (e.g., `~batch + condition`)
- Applying log2 fold change shrinkage (apeGLM) for ranking and visualization
- Generating volcano plots, MA plots, and heatmaps from differential expression results
- Converting R-based DESeq2 workflows to a pure Python environment
- Integrating DE analysis into larger Python bioinformatics pipelines (e.g., with scanpy, pandas)
- Use **DESeq2** (R/Bioconductor) or **edgeR** instead for the reference R implementations with the broadest method support and community validation

## Prerequisites

- **Python packages**: `pydeseq2>=0.4`, `pandas>=1.4`, `numpy>=1.23`, `scipy>=1.11`, `scikit-learn>=1.1`, `anndata>=0.8`
- **Data requirements**: Raw (unnormalized) integer count matrix (samples x genes) + sample metadata DataFrame
- **Environment**: Python 3.10+; optional `matplotlib`, `seaborn` for visualization

```bash
pip install pydeseq2 matplotlib seaborn
```

## Workflow

### Step 1: Data Loading and Validation

Load the count matrix and metadata. PyDESeq2 expects counts as a samples x genes DataFrame with non-negative integers, and metadata as a samples x variables DataFrame with matching indices.

```python
import pandas as pd

# Load data — typical CSV has genes as rows, samples as columns
counts_raw = pd.read_csv("counts.csv", index_col=0)
metadata = pd.read_csv("metadata.csv", index_col=0)

# Transpose if needed: PyDESeq2 requires samples x genes
if counts_raw.shape[0] > counts_raw.shape[1]:
    counts_df = counts_raw.T  # genes x samples → samples x genes
else:
    counts_df = counts_raw

# Validate alignment
common_samples = counts_df.index.intersection(metadata.index)
counts_df = counts_df.loc[common_samples]
metadata = metadata.loc[common_samples]

print(f"Samples: {counts_df.shape[0]}, Genes: {counts_df.shape[1]}")
print(f"Metadata columns: {list(metadata.columns)}")
print(f"Condition counts:\n{metadata['condition'].value_counts()}")
```

### Step 2: Gene Filtering

Remove lowly expressed genes to improve statistical power and reduce multiple testing burden.

```python
# Filter genes with total counts below threshold
min_total_counts = 10
gene_counts = counts_df.sum(axis=0)
genes_to_keep = gene_counts[gene_counts >= min_total_counts].index
counts_df = counts_df[genes_to_keep]

# Optional: require minimum counts in a minimum number of samples
min_count_per_sample = 5
min_samples = 3
genes_expressed = (counts_df >= min_count_per_sample).sum(axis=0) >= min_samples
counts_df = counts_df.loc[:, genes_expressed]

print(f"Genes after filtering: {counts_df.shape[1]}")
```

### Step 3: DeseqDataSet Initialization and Fitting

Create the DESeq dataset object, specify the design formula, and run the full pipeline (size factor estimation, dispersion estimation, model fitting).

```python
from pydeseq2.dds import DeseqDataSet

dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design="~condition",   # Wilkinson-style formula
    refit_cooks=True,      # Refit after Cook's outlier removal
    n_cpus=4               # Parallel threads
)

# Run: size factors → dispersions → trend → MAP shrinkage → LFC fitting
dds.deseq2()

# Inspect normalization
print(f"Size factors (first 5): {dds.obsm['size_factors'][:5]}")
print(f"Size factor range: {dds.obsm['size_factors'].min():.2f} - {dds.obsm['size_factors'].max():.2f}")
```

### Step 4: Statistical Testing (Wald Test)

Perform Wald tests to identify differentially expressed genes. Specify the contrast as `[variable, test_level, reference_level]`.

```python
from pydeseq2.ds import DeseqStats

ds = DeseqStats(
    dds,
    contrast=["condition", "treated", "control"],
    alpha=0.05,              # FDR threshold
    cooks_filter=True,       # Filter Cook's outliers
    independent_filter=True  # Independent filtering for power
)

ds.summary()

# Access full results
results = ds.results_df
print(f"Total genes tested: {len(results)}")
print(f"Significant (padj < 0.05): {(results.padj < 0.05).sum()}")
```

### Step 5: LFC Shrinkage (Optional)

Apply apeGLM shrinkage to reduce noise in log2 fold change estimates. Use shrunk values for visualization and ranking, not for significance calls.

```python
# Apply shrinkage — modifies results_df.log2FoldChange in place
ds.lfc_shrink()

# Compare pre/post shrinkage effect
print(f"Max |LFC| after shrinkage: {results.log2FoldChange.abs().max():.2f}")
print(f"Genes with |LFC| > 2: {(results.log2FoldChange.abs() > 2).sum()}")
```

### Step 6: Result Filtering and Export

Filter significant genes and export results for downstream analysis.

```python
import numpy as np

# Significance + effect size filter
significant = results[
    (results.padj < 0.05) &
    (results.log2FoldChange.abs() > 1.0)
].copy()

# Separate up/down-regulated
up = significant[significant.log2FoldChange > 0].sort_values("padj")
down = significant[significant.log2FoldChange < 0].sort_values("padj")

print(f"Upregulated: {len(up)}, Downregulated: {len(down)}")

# Export
results.to_csv("deseq2_all_results.csv")
significant.to_csv("deseq2_significant.csv")
up.to_csv("deseq2_upregulated.csv")
down.to_csv("deseq2_downregulated.csv")
print("Results exported to CSV files")
```

### Step 7: Visualization — Volcano Plot

```python
import matplotlib.pyplo
sciagent-skill-creatorSkill

|

opentrons-integrationSkill

Opentrons Protocol API v2 for OT-2/Flex: Python protocols for pipetting, serial dilutions, PCR, plate replication; control thermocycler, heater-shaker, magnetic, temperature modules. Use pylabrobot for multi-vendor.

plotly-interactive-visualizationSkill

Interactive visualization with Plotly. 40+ chart types (scatter, line, heatmap, 3D, geographic) with hover, zoom, pan. Two APIs: Plotly Express (DataFrame) and Graph Objects (fine control). For static publication figures use matplotlib; for statistical grammar use seaborn.

seaborn-statistical-visualizationSkill

Statistical visualization on matplotlib + pandas. Distributions (histplot, kdeplot, violin, box), relational (scatter, line), categorical, regression, correlation heatmaps. Auto aggregation/CIs. Use plotly for interactive; matplotlib for low-level.

single-cell-annotationSkill

Best practices for single-cell RNA-seq cell type annotation including marker-based, reference-based, and automated classification approaches.

pymc-bayesian-modelingSkill

Bayesian modeling with PyMC 5: priors, likelihood, NUTS/ADVI sampling, diagnostics (R-hat, ESS), LOO/WAIC comparison, prediction. Hierarchical, logistic, GP variants; predictive checks.

scikit-survival-analysisSkill

Time-to-event modeling with scikit-survival: Cox PH (elastic net), Random Survival Forests, Boosting, SVMs for censored data. C-index, Brier, time-dependent AUC; Kaplan-Meier, Nelson-Aalen, competing risks. Pipeline/GridSearchCV compatible. Use statsmodels for frequentist, pymc for Bayesian, lifelines for parametric.

statistical-analysisSkill

>-