Skip to main content
ClaudeWave
Skill199 repo starsupdated 16d ago

deseq2-differential-expression

Bulk RNA-seq DE with R/Bioconductor DESeq2. Negative binomial GLM, empirical Bayes shrinkage, Wald/LRT tests, multi-factor designs, Salmon tximeta import, apeglm LFC shrinkage, MA/volcano/heatmap viz. R gold standard. Use pydeseq2-differential-expression for Python; use edgeR for TMM normalization.

Install in Claude Code
Copy
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/deseq2-differential-expression && cp -r /tmp/deseq2-differential-expression/skills/genomics-bioinformatics/rnaseq/deseq2-differential-expression ~/.claude/skills/deseq2-differential-expression
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# DESeq2 Differential Expression Analysis (R/Bioconductor)

## Overview

DESeq2 is the Bioconductor R package for differential gene expression analysis from bulk RNA-seq count data. It fits a negative binomial generalized linear model per gene, estimates dispersion parameters using empirical Bayes shrinkage across genes, and tests differential expression using Wald tests (two-group) or likelihood ratio tests (complex designs). DESeq2 is the R gold standard for RNA-seq DE analysis, with native Bioconductor integration for seamless import from Salmon (tximeta/tximport), featureCounts, or HTSeq.

## When to Use

- Identifying differentially expressed genes between two experimental conditions (treated vs. control, disease vs. healthy) from bulk RNA-seq count data
- Analyzing multi-factor designs that account for batch effects or covariates (e.g., `~ batch + condition`)
- Testing complex hypotheses with interaction terms (e.g., time × treatment) or reduced models using likelihood ratio tests (LRT)
- Importing Salmon pseudoalignment output via tximeta or tximport for transcript-level uncertainty propagation
- Performing LFC shrinkage with apeglm for ranked gene lists, volcano plots, and downstream pathway analysis
- Conducting time-series experiments or any design with more than two levels requiring model comparison
- Working in an R/Bioconductor ecosystem where integration with SummarizedExperiment, clusterProfiler, or EnhancedVolcano is needed
- Use **pydeseq2-differential-expression** instead for Python-based pipelines with the same statistical model
- Use **edgeR** for negative binomial DE with TMM normalization, quasi-likelihood F-tests, or TREAT testing
- Use **gseapy-gene-enrichment** after DE to interpret results at the pathway level

## Prerequisites

- **R packages**: `DESeq2` (Bioconductor), `tximeta` or `tximport` (Salmon import), `apeglm` (LFC shrinkage), `pheatmap`, `ggplot2`, `EnhancedVolcano`
- **Data requirements**: Raw (unnormalized) integer count matrix — gene rows × sample columns — plus a sample metadata data frame with matching column names. If using Salmon: per-sample `quant.sf` files and a transcript-to-gene mapping
- **Environment**: R ≥ 4.2, Bioconductor ≥ 3.16

```r
# Install Bioconductor packages
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install(c("DESeq2", "tximeta", "tximport", "apeglm",
                        "EnhancedVolcano"))
install.packages(c("pheatmap", "ggplot2", "dplyr"))
```

## Quick Start

Complete two-group comparison from a count matrix in under 20 lines.

```r
library(DESeq2)

# Load count matrix (genes x samples) and metadata
counts <- as.matrix(read.csv("counts.csv", row.names = 1))
coldata <- read.csv("metadata.csv", row.names = 1)
coldata$condition <- factor(coldata$condition)

# Build DESeqDataSet, run full pipeline
dds <- DESeqDataSetFromMatrix(countData = counts,
                               colData   = coldata,
                               design    = ~ condition)
dds <- dds[rowSums(counts(dds)) >= 10, ]   # pre-filter
dds <- DESeq(dds)

# Extract results (treated vs. control)
res <- results(dds, contrast = c("condition", "treated", "control"),
               alpha = 0.05)
summary(res)
# LFC shrinkage for visualization
res_shrunk <- lfcShrink(dds, contrast = c("condition", "treated", "control"),
                         type = "apeglm", coef = 2)
```

## Workflow

### Step 1: Prepare Count Matrix and Sample Metadata

Build a `DESeqDataSet` from a gene × sample count matrix and a colData data frame. Column names of the count matrix must match row names of colData.

```r
library(DESeq2)
library(dplyr)

# Load raw count matrix (genes as rows, samples as columns)
counts <- as.matrix(read.csv("featureCounts_matrix.csv", row.names = 1))
# Or from featureCounts tab-delimited output — skip first comment line
# fc <- read.table("featurecounts_output.txt", header = TRUE, skip = 1, row.names = 1)
# counts <- as.matrix(fc[, 7:ncol(fc)])  # columns 7+ are sample counts

# Load sample metadata
coldata <- data.frame(
    condition = factor(c("control", "control", "control",
                          "treated", "treated", "treated")),
    batch     = factor(c("A", "A", "B", "A", "B", "B")),
    row.names = colnames(counts)
)

# Build DESeqDataSet
dds <- DESeqDataSetFromMatrix(countData = counts,
                               colData   = coldata,
                               design    = ~ condition)

cat("Samples:", ncol(dds), "\n")
cat("Genes:", nrow(dds), "\n")
cat("Condition levels:", levels(coldata$condition), "\n")
```

### Step 2: Import from Salmon via tximeta

When reads were quantified with Salmon, use `tximeta` to import transcript-level estimates with proper offset correction that accounts for transcript length and GC bias.

```r
library(tximeta)

# Build a coldata with a "files" column pointing to quant.sf files
quant_dirs <- file.path("salmon_output", coldata$sample_id, "quant.sf")
coldata$files <- quant_dirs
coldata$names <- coldata$sample_id

# Import with tximeta — automatically fetches transcript metadata from Ensembl
se <- tximeta(coldata)           # SummarizedExperiment with transcript-level data

# Summarize to gene level (requires Ensembl or custom txdb)
gse <- summarizeToGene(se)       # gene-level SummarizedExperiment

# Build DESeqDataSet from the SummarizedExperiment
dds <- DESeqDataSet(gse, design = ~ condition)

cat("Gene-level SE dimensions:", dim(gse), "\n")
cat("Assay names:", assayNames(gse), "\n")
# Assay "counts" = estimated counts; "abundance" = TPM; "length" = eff. lengths
```

### Step 3: Pre-Filtering and Quality Control

Remove genes with very low counts to improve statistical power and reduce the multiple testing burden. Explore sample quality with PCA on variance-stabilized counts.

```r
library(ggplot2)

# Pre-filter: keep genes with at least 10 reads total
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep, ]
cat("Genes after pre-filtering:", nrow(dds), "
sciagent-skill-creatorSkill

|

opentrons-integrationSkill

Opentrons Protocol API v2 for OT-2/Flex: Python protocols for pipetting, serial dilutions, PCR, plate replication; control thermocycler, heater-shaker, magnetic, temperature modules. Use pylabrobot for multi-vendor.

plotly-interactive-visualizationSkill

Interactive visualization with Plotly. 40+ chart types (scatter, line, heatmap, 3D, geographic) with hover, zoom, pan. Two APIs: Plotly Express (DataFrame) and Graph Objects (fine control). For static publication figures use matplotlib; for statistical grammar use seaborn.

seaborn-statistical-visualizationSkill

Statistical visualization on matplotlib + pandas. Distributions (histplot, kdeplot, violin, box), relational (scatter, line), categorical, regression, correlation heatmaps. Auto aggregation/CIs. Use plotly for interactive; matplotlib for low-level.

single-cell-annotationSkill

Best practices for single-cell RNA-seq cell type annotation including marker-based, reference-based, and automated classification approaches.

pymc-bayesian-modelingSkill

Bayesian modeling with PyMC 5: priors, likelihood, NUTS/ADVI sampling, diagnostics (R-hat, ESS), LOO/WAIC comparison, prediction. Hierarchical, logistic, GP variants; predictive checks.

scikit-survival-analysisSkill

Time-to-event modeling with scikit-survival: Cox PH (elastic net), Random Survival Forests, Boosting, SVMs for censored data. C-index, Brier, time-dependent AUC; Kaplan-Meier, Nelson-Aalen, competing risks. Pipeline/GridSearchCV compatible. Use statsmodels for frequentist, pymc for Bayesian, lifelines for parametric.

statistical-analysisSkill

>-