gseapy-gene-enrichment
GSEA and over-representation analysis (ORA) for RNA-seq and proteomics. Wraps Enrichr for ORA against MSigDB, KEGG, GO, and 200+ databases; runs preranked GSEA on ranked DE gene lists. Outputs enrichment tables and running-score plots. Use after DESeq2 or edgeR for pathway-level interpretation.
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/gseapy-gene-enrichment && cp -r /tmp/gseapy-gene-enrichment/skills/genomics-bioinformatics/rnaseq/gseapy-gene-enrichment ~/.claude/skills/gseapy-gene-enrichmentSKILL.md
# GSEApy — Gene Set Enrichment Analysis in Python
## Overview
GSEApy provides Python implementations of GSEA and over-representation analysis (ORA) for interpreting gene expression changes at the pathway level. The `enrich` module queries the Enrichr API to test a gene list against 200+ databases (GO, KEGG, MSigDB Hallmarks, Reactome, WikiPathways). The `prerank` and `gsea` modules run the GSEA algorithm on a pre-ranked gene list or expression matrix — computing normalized enrichment scores (NES) and FDR values for each gene set. GSEApy integrates directly with pandas DataFrames from DESeq2 or scanpy differential expression output, making it the standard Python tool for pathway analysis in RNA-seq workflows.
## When to Use
- Interpreting DESeq2 or edgeR differential expression results at pathway/GO-term level
- Running fast ORA (over-representation analysis) against Enrichr's 200+ databases including GO, KEGG, and MSigDB Hallmarks
- Performing GSEA prerank analysis on a log2-fold-change-ranked gene list without an expression matrix
- Identifying enriched pathways in scRNA-seq cluster marker genes
- Generating publication-ready enrichment dot plots and GSEA running-score plots
- Use **GSEA Java application** for the official GUI-based analysis with full GSEA desktop interface
- Use **fgsea** (R) as an alternative with fast permutation-based p-values; GSEApy is preferred for Python-native pipelines
## Prerequisites
- **Python packages**: `gseapy`, `pandas`, `matplotlib`
- **Internet access**: `enrich` module queries the Enrichr API (requires connection)
```bash
pip install gseapy
# Verify
python -c "import gseapy; print(gseapy.__version__)"
# 1.1.3
```
## Quick Start
```python
import gseapy as gp
# ORA: test a gene list against GO Biological Process
gene_list = ["TP53", "BRCA1", "CDK2", "CCND1", "MYC", "EGFR", "KRAS", "PTEN"]
enr = gp.enrichr(gene_list=gene_list,
gene_sets=["GO_Biological_Process_2023"],
organism="human",
outdir=None)
print(enr.results.head(5)[["Term", "P-value", "Adjusted P-value", "Genes"]])
```
## Workflow
### Step 1: Over-Representation Analysis with Enrichr (ORA)
Test a gene list against pathway databases via the Enrichr API.
```python
import gseapy as gp
import pandas as pd
# Gene list from DESeq2 (significant upregulated genes)
sig_genes = ["TP53", "BRCA1", "CDK2", "CCND1", "MYC", "EGFR",
"KRAS", "PTEN", "RB1", "AKT1", "PIK3CA", "MDM2"]
# Run ORA against multiple databases
enr = gp.enrichr(
gene_list=sig_genes,
gene_sets=[
"GO_Biological_Process_2023",
"KEGG_2021_Human",
"MSigDB_Hallmark_2020",
"Reactome_2022",
],
organism="human",
outdir="enrichr_results/",
cutoff=0.05,
)
# Display top results
results = enr.results
print(f"Enriched terms: {len(results[results['Adjusted P-value'] < 0.05])}")
print(results[results["Adjusted P-value"] < 0.05].sort_values("Adjusted P-value")
.head(10)[["Gene_set", "Term", "Adjusted P-value", "Combined Score"]])
```
### Step 2: List Available Gene Set Databases
Discover the 200+ databases available through Enrichr.
```python
import gseapy as gp
# List all available gene set libraries
libraries = gp.get_library_name(organism="human")
print(f"Available databases: {len(libraries)}")
print("Selected databases:")
for lib in sorted(libraries):
if any(kw in lib for kw in ["GO_Bio", "KEGG", "Hallmark", "Reactome"]):
print(f" {lib}")
# Mouse databases
mouse_libs = gp.get_library_name(organism="mouse")
print(f"\nMouse databases: {len(mouse_libs)}")
```
### Step 3: GSEA Prerank — Ranked Gene List Analysis
Run GSEA on a log2 fold-change ranked gene list from differential expression.
```python
import gseapy as gp
import pandas as pd
import numpy as np
# Load DESeq2 results (or create example ranked list)
# deseq_results = pd.read_csv("deseq2_results.tsv", sep="\t", index_col=0)
# ranked = deseq_results["log2FoldChange"].dropna().sort_values(ascending=False)
# Example ranked gene list (gene → log2FC)
np.random.seed(42)
gene_names = [f"GENE_{i}" for i in range(1000)]
log2fc = np.random.normal(0, 2, 1000)
ranked = pd.Series(log2fc, index=gene_names).sort_values(ascending=False)
# Run preranked GSEA against MSigDB Hallmarks
pre_res = gp.prerank(
rnk=ranked,
gene_sets="MSigDB_Hallmark_2020",
threads=4,
min_size=15,
max_size=500,
permutation_num=1000,
outdir="gsea_results/prerank/",
seed=42,
verbose=True,
)
# View results
res_df = pre_res.res2d
sig = res_df[res_df["FDR q-val"] < 0.25]
print(f"Significant gene sets (FDR < 0.25): {len(sig)}")
print(sig.sort_values("NES", ascending=False)[["Term", "NES", "NOM p-val", "FDR q-val"]].head(10))
```
### Step 4: Plot GSEA Running Score
Visualize the enrichment score curve for a specific gene set.
```python
import gseapy as gp
from gseapy.plot import gseaplot
import matplotlib.pyplot as plt
# Re-use pre_res from Step 3 (or load saved results)
# Select the top enriched gene set
top_term = pre_res.res2d.sort_values("NES", ascending=False).index[0]
print(f"Top enriched gene set: {top_term}")
# Plot running enrichment score
ax = gseaplot(
rank_metric=pre_res.ranking,
term=top_term,
**pre_res.results[top_term],
ofname="gsea_results/top_geneset_enrichment.pdf",
)
plt.tight_layout()
plt.savefig("gsea_enrichment_plot.png", dpi=150)
print("Saved: gsea_enrichment_plot.png")
```
### Step 5: Enrichment Dot Plot for Multiple Terms
Generate a dot plot showing enrichment significance and gene ratio across top pathways.
```python
import gseapy as gp
import matplotlib.pyplot as plt
from gseapy.plot import dotplot
# Run ORA and plot results
enr = gp.enrichr(
gene_list=["TP53", "BRCA1", "CDK2", "CCND1", "MYC", "EGFR",
"KRAS", "PTEN", "RB1", "AKT1", "PIK3CA", "MDM2",
"BCL2", "CDKN1A", "E2F1", "CCNE1"],
gene_sets=["KEGG_2021_Human"],
organism="human",
outdir=None|
Opentrons Protocol API v2 for OT-2/Flex: Python protocols for pipetting, serial dilutions, PCR, plate replication; control thermocycler, heater-shaker, magnetic, temperature modules. Use pylabrobot for multi-vendor.
Interactive visualization with Plotly. 40+ chart types (scatter, line, heatmap, 3D, geographic) with hover, zoom, pan. Two APIs: Plotly Express (DataFrame) and Graph Objects (fine control). For static publication figures use matplotlib; for statistical grammar use seaborn.
Statistical visualization on matplotlib + pandas. Distributions (histplot, kdeplot, violin, box), relational (scatter, line), categorical, regression, correlation heatmaps. Auto aggregation/CIs. Use plotly for interactive; matplotlib for low-level.
Best practices for single-cell RNA-seq cell type annotation including marker-based, reference-based, and automated classification approaches.
Bayesian modeling with PyMC 5: priors, likelihood, NUTS/ADVI sampling, diagnostics (R-hat, ESS), LOO/WAIC comparison, prediction. Hierarchical, logistic, GP variants; predictive checks.
Time-to-event modeling with scikit-survival: Cox PH (elastic net), Random Survival Forests, Boosting, SVMs for censored data. C-index, Brier, time-dependent AUC; Kaplan-Meier, Nelson-Aalen, competing risks. Pipeline/GridSearchCV compatible. Use statsmodels for frequentist, pymc for Bayesian, lifelines for parametric.
>-