deseq2-differential-expression
Bulk RNA-seq DE with R/Bioconductor DESeq2. Negative binomial GLM, empirical Bayes shrinkage, Wald/LRT tests, multi-factor designs, Salmon tximeta import, apeglm LFC shrinkage, MA/volcano/heatmap viz. R gold standard. Use pydeseq2-differential-expression for Python; use edgeR for TMM normalization.
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/deseq2-differential-expression && cp -r /tmp/deseq2-differential-expression/skills/genomics-bioinformatics/rnaseq/deseq2-differential-expression ~/.claude/skills/deseq2-differential-expressionSKILL.md
# DESeq2 Differential Expression Analysis (R/Bioconductor)
## Overview
DESeq2 is the Bioconductor R package for differential gene expression analysis from bulk RNA-seq count data. It fits a negative binomial generalized linear model per gene, estimates dispersion parameters using empirical Bayes shrinkage across genes, and tests differential expression using Wald tests (two-group) or likelihood ratio tests (complex designs). DESeq2 is the R gold standard for RNA-seq DE analysis, with native Bioconductor integration for seamless import from Salmon (tximeta/tximport), featureCounts, or HTSeq.
## When to Use
- Identifying differentially expressed genes between two experimental conditions (treated vs. control, disease vs. healthy) from bulk RNA-seq count data
- Analyzing multi-factor designs that account for batch effects or covariates (e.g., `~ batch + condition`)
- Testing complex hypotheses with interaction terms (e.g., time × treatment) or reduced models using likelihood ratio tests (LRT)
- Importing Salmon pseudoalignment output via tximeta or tximport for transcript-level uncertainty propagation
- Performing LFC shrinkage with apeglm for ranked gene lists, volcano plots, and downstream pathway analysis
- Conducting time-series experiments or any design with more than two levels requiring model comparison
- Working in an R/Bioconductor ecosystem where integration with SummarizedExperiment, clusterProfiler, or EnhancedVolcano is needed
- Use **pydeseq2-differential-expression** instead for Python-based pipelines with the same statistical model
- Use **edgeR** for negative binomial DE with TMM normalization, quasi-likelihood F-tests, or TREAT testing
- Use **gseapy-gene-enrichment** after DE to interpret results at the pathway level
## Prerequisites
- **R packages**: `DESeq2` (Bioconductor), `tximeta` or `tximport` (Salmon import), `apeglm` (LFC shrinkage), `pheatmap`, `ggplot2`, `EnhancedVolcano`
- **Data requirements**: Raw (unnormalized) integer count matrix — gene rows × sample columns — plus a sample metadata data frame with matching column names. If using Salmon: per-sample `quant.sf` files and a transcript-to-gene mapping
- **Environment**: R ≥ 4.2, Bioconductor ≥ 3.16
```r
# Install Bioconductor packages
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c("DESeq2", "tximeta", "tximport", "apeglm",
"EnhancedVolcano"))
install.packages(c("pheatmap", "ggplot2", "dplyr"))
```
## Quick Start
Complete two-group comparison from a count matrix in under 20 lines.
```r
library(DESeq2)
# Load count matrix (genes x samples) and metadata
counts <- as.matrix(read.csv("counts.csv", row.names = 1))
coldata <- read.csv("metadata.csv", row.names = 1)
coldata$condition <- factor(coldata$condition)
# Build DESeqDataSet, run full pipeline
dds <- DESeqDataSetFromMatrix(countData = counts,
colData = coldata,
design = ~ condition)
dds <- dds[rowSums(counts(dds)) >= 10, ] # pre-filter
dds <- DESeq(dds)
# Extract results (treated vs. control)
res <- results(dds, contrast = c("condition", "treated", "control"),
alpha = 0.05)
summary(res)
# LFC shrinkage for visualization
res_shrunk <- lfcShrink(dds, contrast = c("condition", "treated", "control"),
type = "apeglm", coef = 2)
```
## Workflow
### Step 1: Prepare Count Matrix and Sample Metadata
Build a `DESeqDataSet` from a gene × sample count matrix and a colData data frame. Column names of the count matrix must match row names of colData.
```r
library(DESeq2)
library(dplyr)
# Load raw count matrix (genes as rows, samples as columns)
counts <- as.matrix(read.csv("featureCounts_matrix.csv", row.names = 1))
# Or from featureCounts tab-delimited output — skip first comment line
# fc <- read.table("featurecounts_output.txt", header = TRUE, skip = 1, row.names = 1)
# counts <- as.matrix(fc[, 7:ncol(fc)]) # columns 7+ are sample counts
# Load sample metadata
coldata <- data.frame(
condition = factor(c("control", "control", "control",
"treated", "treated", "treated")),
batch = factor(c("A", "A", "B", "A", "B", "B")),
row.names = colnames(counts)
)
# Build DESeqDataSet
dds <- DESeqDataSetFromMatrix(countData = counts,
colData = coldata,
design = ~ condition)
cat("Samples:", ncol(dds), "\n")
cat("Genes:", nrow(dds), "\n")
cat("Condition levels:", levels(coldata$condition), "\n")
```
### Step 2: Import from Salmon via tximeta
When reads were quantified with Salmon, use `tximeta` to import transcript-level estimates with proper offset correction that accounts for transcript length and GC bias.
```r
library(tximeta)
# Build a coldata with a "files" column pointing to quant.sf files
quant_dirs <- file.path("salmon_output", coldata$sample_id, "quant.sf")
coldata$files <- quant_dirs
coldata$names <- coldata$sample_id
# Import with tximeta — automatically fetches transcript metadata from Ensembl
se <- tximeta(coldata) # SummarizedExperiment with transcript-level data
# Summarize to gene level (requires Ensembl or custom txdb)
gse <- summarizeToGene(se) # gene-level SummarizedExperiment
# Build DESeqDataSet from the SummarizedExperiment
dds <- DESeqDataSet(gse, design = ~ condition)
cat("Gene-level SE dimensions:", dim(gse), "\n")
cat("Assay names:", assayNames(gse), "\n")
# Assay "counts" = estimated counts; "abundance" = TPM; "length" = eff. lengths
```
### Step 3: Pre-Filtering and Quality Control
Remove genes with very low counts to improve statistical power and reduce the multiple testing burden. Explore sample quality with PCA on variance-stabilized counts.
```r
library(ggplot2)
# Pre-filter: keep genes with at least 10 reads total
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep, ]
cat("Genes after pre-filtering:", nrow(dds), "|
Opentrons Protocol API v2 for OT-2/Flex: Python protocols for pipetting, serial dilutions, PCR, plate replication; control thermocycler, heater-shaker, magnetic, temperature modules. Use pylabrobot for multi-vendor.
Interactive visualization with Plotly. 40+ chart types (scatter, line, heatmap, 3D, geographic) with hover, zoom, pan. Two APIs: Plotly Express (DataFrame) and Graph Objects (fine control). For static publication figures use matplotlib; for statistical grammar use seaborn.
Statistical visualization on matplotlib + pandas. Distributions (histplot, kdeplot, violin, box), relational (scatter, line), categorical, regression, correlation heatmaps. Auto aggregation/CIs. Use plotly for interactive; matplotlib for low-level.
Best practices for single-cell RNA-seq cell type annotation including marker-based, reference-based, and automated classification approaches.
Bayesian modeling with PyMC 5: priors, likelihood, NUTS/ADVI sampling, diagnostics (R-hat, ESS), LOO/WAIC comparison, prediction. Hierarchical, logistic, GP variants; predictive checks.
Time-to-event modeling with scikit-survival: Cox PH (elastic net), Random Survival Forests, Boosting, SVMs for censored data. C-index, Brier, time-dependent AUC; Kaplan-Meier, Nelson-Aalen, competing risks. Pipeline/GridSearchCV compatible. Use statsmodels for frequentist, pymc for Bayesian, lifelines for parametric.
>-