Skill284 repo starsupdated 4d ago

deseq2-differential-expression

DESeq2 is an R/Bioconductor package for identifying differentially expressed genes in bulk RNA-seq data using negative binomial generalized linear models with empirical Bayes shrinkage estimation. Use it when analyzing raw count matrices between experimental conditions, accounting for batch effects or complex designs, importing Salmon pseudoalignment output, or performing log-fold-change shrinkage for ranked gene lists and visualization. Choose pydeseq2-differential-expression for Python pipelines or edgeR for TMM normalization alternatives.

View source Repository: SciAgent-Skills

Install in Claude Code

Copy

git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/deseq2-differential-expression && cp -r /tmp/deseq2-differential-expression/skills/genomics-bioinformatics/rnaseq/deseq2-differential-expression ~/.claude/skills/deseq2-differential-expression

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# DESeq2 Differential Expression Analysis (R/Bioconductor)

## Overview

DESeq2 is the Bioconductor R package for differential gene expression analysis from bulk RNA-seq count data. It fits a negative binomial generalized linear model per gene, estimates dispersion parameters using empirical Bayes shrinkage across genes, and tests differential expression using Wald tests (two-group) or likelihood ratio tests (complex designs). DESeq2 is the R gold standard for RNA-seq DE analysis, with native Bioconductor integration for seamless import from Salmon (tximeta/tximport), featureCounts, or HTSeq.

## When to Use

- Identifying differentially expressed genes between two experimental conditions (treated vs. control, disease vs. healthy) from bulk RNA-seq count data
- Analyzing multi-factor designs that account for batch effects or covariates (e.g., `~ batch + condition`)
- Testing complex hypotheses with interaction terms (e.g., time × treatment) or reduced models using likelihood ratio tests (LRT)
- Importing Salmon pseudoalignment output via tximeta or tximport for transcript-level uncertainty propagation
- Performing LFC shrinkage with apeglm for ranked gene lists, volcano plots, and downstream pathway analysis
- Conducting time-series experiments or any design with more than two levels requiring model comparison
- Working in an R/Bioconductor ecosystem where integration with SummarizedExperiment, clusterProfiler, or EnhancedVolcano is needed
- Use **pydeseq2-differential-expression** instead for Python-based pipelines with the same statistical model
- Use **edgeR** for negative binomial DE with TMM normalization, quasi-likelihood F-tests, or TREAT testing
- Use **gseapy-gene-enrichment** after DE to interpret results at the pathway level

## Prerequisites

- **R packages**: `DESeq2` (Bioconductor), `tximeta` or `tximport` (Salmon import), `apeglm` (LFC shrinkage), `pheatmap`, `ggplot2`, `EnhancedVolcano`
- **Data requirements**: Raw (unnormalized) integer count matrix — gene rows × sample columns — plus a sample metadata data frame with matching column names. If using Salmon: per-sample `quant.sf` files and a transcript-to-gene mapping
- **Environment**: R ≥ 4.2, Bioconductor ≥ 3.16

```r
# Install Bioconductor packages
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install(c("DESeq2", "tximeta", "tximport", "apeglm",
                        "EnhancedVolcano"))
install.packages(c("pheatmap", "ggplot2", "dplyr"))
```

## Quick Start

Complete two-group comparison from a count matrix in under 20 lines.

```r
library(DESeq2)

# Load count matrix (genes x samples) and metadata
counts <- as.matrix(read.csv("counts.csv", row.names = 1))
coldata <- read.csv("metadata.csv", row.names = 1)
coldata$condition <- factor(coldata$condition)

# Build DESeqDataSet, run full pipeline
dds <- DESeqDataSetFromMatrix(countData = counts,
                               colData   = coldata,
                               design    = ~ condition)
dds <- dds[rowSums(counts(dds)) >= 10, ]   # pre-filter
dds <- DESeq(dds)

# Extract results (treated vs. control)
res <- results(dds, contrast = c("condition", "treated", "control"),
               alpha = 0.05)
summary(res)
# LFC shrinkage for visualization
res_shrunk <- lfcShrink(dds, contrast = c("condition", "treated", "control"),
                         type = "apeglm", coef = 2)
```

## Workflow

### Step 1: Prepare Count Matrix and Sample Metadata

Build a `DESeqDataSet` from a gene × sample count matrix and a colData data frame. Column names of the count matrix must match row names of colData.

```r
library(DESeq2)
library(dplyr)

# Load raw count matrix (genes as rows, samples as columns)
counts <- as.matrix(read.csv("featureCounts_matrix.csv", row.names = 1))
# Or from featureCounts tab-delimited output — skip first comment line
# fc <- read.table("featurecounts_output.txt", header = TRUE, skip = 1, row.names = 1)
# counts <- as.matrix(fc[, 7:ncol(fc)])  # columns 7+ are sample counts

# Load sample metadata
coldata <- data.frame(
    condition = factor(c("control", "control", "control",
                          "treated", "treated", "treated")),
    batch     = factor(c("A", "A", "B", "A", "B", "B")),
    row.names = colnames(counts)
)

# Build DESeqDataSet
dds <- DESeqDataSetFromMatrix(countData = counts,
                               colData   = coldata,
                               design    = ~ condition)

cat("Samples:", ncol(dds), "\n")
cat("Genes:", nrow(dds), "\n")
cat("Condition levels:", levels(coldata$condition), "\n")
```

### Step 2: Import from Salmon via tximeta

When reads were quantified with Salmon, use `tximeta` to import transcript-level estimates with proper offset correction that accounts for transcript length and GC bias.

```r
library(tximeta)

# Build a coldata with a "files" column pointing to quant.sf files
quant_dirs <- file.path("salmon_output", coldata$sample_id, "quant.sf")
coldata$files <- quant_dirs
coldata$names <- coldata$sample_id

# Import with tximeta — automatically fetches transcript metadata from Ensembl
se <- tximeta(coldata)           # SummarizedExperiment with transcript-level data

# Summarize to gene level (requires Ensembl or custom txdb)
gse <- summarizeToGene(se)       # gene-level SummarizedExperiment

# Build DESeqDataSet from the SummarizedExperiment
dds <- DESeqDataSet(gse, design = ~ condition)

cat("Gene-level SE dimensions:", dim(gse), "\n")
cat("Assay names:", assayNames(gse), "\n")
# Assay "counts" = estimated counts; "abundance" = TPM; "length" = eff. lengths
```

### Step 3: Pre-Filtering and Quality Control

Remove genes with very low counts to improve statistical power and reduce the multiple testing burden. Explore sample quality with PCA on variance-stabilized counts.

```r
library(ggplot2)

# Pre-filter: keep genes with at least 10 reads total
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep, ]
cat("Genes after pre-filtering:", nrow(dds), "