ClaudeWave
Skill · 1.4k repo stars · updated today

tooluniverse-gene-enrichment

The tooluniverse-gene-enrichment skill performs gene-set enrichment analysis against GO (Biological Process, Molecular Function, Cellular Component), KEGG, Reactome, and other pathway databases through gseapy and clusterProfiler. Use it to interpret differentially expressed gene lists, screening hits, or any gene-to-pathway query. It runs deterministic CLI scripts that handle simplification cutoffs and denominator conventions, and it checks for pre-computed results first so that its answers do not drift from the published analysis through recalculation.

Install in Claude Code
git clone --depth 1 https://github.com/mims-harvard/ToolUniverse /tmp/tooluniverse-gene-enrichment && cp -r /tmp/tooluniverse-gene-enrichment/plugin/skills/tooluniverse-gene-enrichment ~/.claude/skills/tooluniverse-gene-enrichment
Then open a new Claude Code session; the skill loads automatically.

SKILL.md

## COMPUTE, DON'T DESCRIBE
When analysis requires computation (statistics, data processing, scoring, enrichment), write and run Python code via Bash. Don't describe what you would do — execute it and report actual results. Use ToolUniverse tools to retrieve data, then Python (pandas, scipy, statsmodels, matplotlib) to analyze it.

# Gene Enrichment and Pathway Analysis

## RULE ZERO — Check for pre-computed results FIRST

Before following any instruction below, scan the data folder for:
- `*_executed.ipynb` → read with `tu run read_executed_notebook '{"data_folder":"<path>","search":"<keyword>"}'` and cite its cell outputs as the authoritative answer
- Pre-computed enrichment files (CSV/TSV named `*enrich*`, `*go*`, `*kegg*`, `*reactome*`, `*ego*`, `*_simplified.csv`) → read directly
- Canonical analysis scripts (`analysis.R`, `run_*.py`, `find_*.R`, `*.Rmd`) → execute as-is and read the output

Only follow this skill's re-analysis recipe below if **none** of the above exist. Re-running enrichment from raw DEG lists is much slower and can produce numbers that differ from the published answer because of subtle differences in upstream filtering.
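
The scan described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill's scripts: the function name `find_precomputed` and the three result keys are hypothetical, and only the glob patterns come from the list above. Note that patterns such as `*go*` are broad and will also match unrelated file names, so treat the result as a list of candidates to inspect.

```python
import fnmatch
from pathlib import Path

# Patterns copied from the RULE ZERO list; the grouping names are illustrative.
NOTEBOOK_PATTERNS = ["*_executed.ipynb"]
ENRICHMENT_PATTERNS = ["*enrich*", "*go*", "*kegg*", "*reactome*", "*ego*", "*_simplified.csv"]
SCRIPT_PATTERNS = ["analysis.R", "run_*.py", "find_*.R", "*.Rmd"]
TABLE_SUFFIXES = {".csv", ".tsv"}


def find_precomputed(data_folder):
    """Return candidate pre-computed artifacts found under data_folder."""
    found = {"notebooks": [], "enrichment_tables": [], "scripts": []}
    for path in sorted(Path(data_folder).rglob("*")):
        if not path.is_file():
            continue
        name = path.name
        lower = name.lower()
        if any(fnmatch.fnmatchcase(name, p) for p in NOTEBOOK_PATTERNS):
            found["notebooks"].append(path)
        elif path.suffix.lower() in TABLE_SUFFIXES and any(
            fnmatch.fnmatchcase(lower, p) for p in ENRICHMENT_PATTERNS
        ):
            # Enrichment tables are matched case-insensitively on CSV/TSV files only.
            found["enrichment_tables"].append(path)
        elif any(fnmatch.fnmatchcase(name, p) for p in SCRIPT_PATTERNS):
            found["scripts"].append(path)
    return found
```

If all three lists come back empty, the re-analysis recipe below applies; otherwise read or execute what was found first.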

---

## PRIMARY SCRIPTS — use these FIRST

Three deterministic CLI scripts cover the bulk of enrichment questions.
Each handles edge cases (ties at top, simplify-changes-padj, multi-condition
screening) that the agent tends to get wrong when writing ad-hoc code.
**Always write outputs to `/tmp/...` — never into the data folder.**

### 1. `scripts/gseapy_enrichment_runner.py` — gseapy enrichr / prerank

**When to use**: the question references `gseapy`, `enrichr`, "Enrichr library", or any GO BP/MF/CC, KEGG, Reactome, WikiPathways, MSigDB enrichment via the gseapy package.

```bash
python skills/tooluniverse-gene-enrichment/scripts/gseapy_enrichment_runner.py \
    --gene-list /tmp/sig_symbols.txt \
    --library GO_Biological_Process_2021,Reactome_2022 \
    --organism Human \
    --top 5 \
    --candidate "negative regulation of epithelial cell proliferation" \
    --workdir /tmp/gseapy_run
```

What it reports (parseable lines):
- `# TOP_BY_ADJ_PVALUE: <term>` — what `df.sort_values('Adjusted P-value').iloc[0]` returns (this is what published notebooks usually print)
- `# TIES_AT_TOP: n=K` — number of terms tied at the lowest Adjusted P-value
- `# TOP_TIE_BROKEN: <term>` — deterministic tie-break (adj_p, raw_p, overlap desc, alphabetic)
- `# TOPN_BY_ADJ_PVALUE:` — full top N listing
- `# CANDIDATE_RANK '<term>': rank=R adj_p=...` — for any `--candidate` substring you pass
- `# SUBSTRING_COUNT_TOPN '<sub>': K` — for `--count-substring` queries (e.g., "how many top-20 terms contain 'Oxidative'")

Pass `--mode prerank --ranked-list /tmp/lfc.tsv` for GSEA preranked.
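
The tie-break order listed for `# TOP_TIE_BROKEN` (adjusted p, raw p, overlap descending, alphabetic) can be sketched with pandas as follows. This is an illustration under stated assumptions, not the runner's actual code: the column names (`Term`, `Overlap`, `P-value`, `Adjusted P-value`) are assumed to match gseapy's Enrichr results table, `Overlap` is assumed to be a `"k/n"` string whose numerator is the overlap count, ties are detected by exact equality, and the helper names `rank_terms` and `ties_at_top` are hypothetical.

```python
import pandas as pd


def rank_terms(df):
    """Sort terms by adjusted p, raw p, overlap count (descending), then name."""
    out = df.copy()
    # Assumption: Overlap looks like "5/120"; use the numerator as the overlap size.
    out["_overlap_n"] = out["Overlap"].str.split("/").str[0].astype(int)
    out = out.sort_values(
        by=["Adjusted P-value", "P-value", "_overlap_n", "Term"],
        ascending=[True, True, False, True],
    ).reset_index(drop=True)
    return out.drop(columns="_overlap_n")


def ties_at_top(df):
    """Count terms that share the lowest adjusted p-value (exact equality)."""
    best = df["Adjusted P-value"].min()
    return int((df["Adjusted P-value"] == best).sum())


# Toy table with three terms tied on adjusted p-value.
example = pd.DataFrame(
    {
        "Term": ["term B", "term A", "term C"],
        "Overlap": ["3/50", "3/40", "5/60"],
        "P-value": [0.001, 0.001, 0.004],
        "Adjusted P-value": [0.01, 0.01, 0.01],
    }
)
print(ties_at_top(example))
print(rank_terms(example)["Term"].tolist())
```

A plain single-column sort on `Adjusted P-value` gives no guarantee about which of the tied rows comes first, which is why the tie count and the tie-broken term are reported separately.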

### 2. `scripts/enrichgo_runner.py` — clusterProfiler::enrichGO + simplify

**When to use**: the question references `enrichGO`, `clusterProfiler`, `simplify`, `simplify(cutoff=0.7)`, or the data folder contains an `analysis.R` / `find_*.R` that uses these. This is the canonical R workflow — gseapy does NOT reproduce it faithfully because `simplify` changes the multiple-testing denominator and thus the p.adjust values for surviving terms.

```bash
python skills/tooluniverse-gene-enrichment/scripts/enrichgo_runner.py \
    --gene-list /tmp/sig_ensembl.txt \
    --background /tmp/bg_ensembl.txt \
    --keytype ENSEMBL \
    --ontology BP \
    --simplify-cutoff 0.7 \
    --candidate "regulation of T cell activation" \
    --candidate "potassium ion transmembrane transport" \
    --workdir /tmp/enrichgo_run
```

What it reports:
- `# TOP10_RAW:` — top 10 from `as.data.frame(ego)` (BEFORE simplify; raw p.adjust)
- `# TOP10_SIMPLIFIED:` — top 10 from `as.data.frame(simplify(ego, cutoff=0.7))` (AFTER simplify; p.adjust differs)
- `# CANDIDATE '<term>': raw_rank=R raw_padj=... simp_rank=R simp_padj=...` — both pre- and post-simplify ranks for each candidate. `simp_rank=NA (collapsed by simplify)` means the term was redundant with a more-significant parent/sibling and was dropped.

When a question says "in the simplified results" or "after simplify", read **simp_padj**. When it just says "the most enriched" without mentioning simplify, still default to the simplified frame if the canonical `analysis.R` calls `simplify`.
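
To see why an adjusted p-value depends on how many terms are being tested, here is a small Benjamini-Hochberg sketch in plain Python. It is only an arithmetic illustration of the denominator effect this section describes; it is not clusterProfiler's or the runner's implementation, and how the runner derives its reported `simp_padj` values is not shown here. The function name `bh_adjust` and the toy p-values are hypothetical.

```python
def bh_adjust(pvalues, n_tests=None):
    """Benjamini-Hochberg adjusted p-values.

    n_tests is the multiple-testing denominator; it defaults to len(pvalues).
    """
    m = len(pvalues)
    n = m if n_tests is None else n_tests
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Five toy terms adjusted together, then three of them adjusted on their own.
raw = [0.001, 0.002, 0.003, 0.02, 0.04]
survivors = [raw[0], raw[2], raw[4]]
print(bh_adjust(raw))
print(bh_adjust(survivors))
```

The same raw p-value receives a different adjusted value once the set of tested terms shrinks, so raw and simplified ranks should always be read from the frame the question refers to.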

Requires R packages `clusterProfiler`, `org.Hs.eg.db` (or `org.Mm.eg.db` for mouse). Install via `Rscript skills/evals/install_r_packages.R` if missing.

### 3. `scripts/condition_enrichment_screen.py` — per-condition enrichment

**When to use**: the question asks "what fraction/percentage of conditions/screens/timepoints/groups had significant enrichment of <category>", or you have an N-by-many gene table and need per-condition enrichment.

```bash
# Per-condition gene-list files:
python skills/tooluniverse-gene-enrichment/scripts/condition_enrichment_screen.py \
    --condition-genes acute=/tmp/acute_sig.txt \
    --condition-genes round1=/tmp/r1_sig.txt \
    --condition-genes round2=/tmp/r2_sig.txt \
    --condition-genes round3=/tmp/r3_sig.txt \
    --library /path/to/local_pathways.gmt \
    --background /tmp/expressed.txt \
    --keyword immune --keyword cytokine --keyword interferon \
    --workdir /tmp/cond_screen
```

Or pass a single 2-col TSV (`condition<TAB>gene`) via `--conditions-tsv`.

What it reports:
- Per condition: `n_genes`, `sig_terms` (Adj P < cutoff), `sig_terms_keyword` (sig terms whose Term contains any --keyword)
- `# n_with_any_sig=N pct_with_any_sig=N%` — the fraction with any significant term
- `# n_with_keyword_sig=N pct_with_keyword_sig=N%` — the fraction whose sig terms include a category keyword
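
The two summary fractions can be reproduced from per-condition counts with a few lines of Python. This is a hedged sketch rather than the script's code: the function `summarize_conditions`, its input shape, and the `exclude` argument (standing in for conditions such as controls that are left out of the denominator) are all hypothetical, and the counts in the example are made up for illustration.

```python
def summarize_conditions(sig_counts, exclude=()):
    """Summarize per-condition significance counts.

    sig_counts maps condition label -> (n_sig_terms, n_sig_terms_keyword).
    Labels in exclude are dropped before the denominator is computed.
    """
    excluded = set(exclude)
    kept = {c: v for c, v in sig_counts.items() if c not in excluded}
    n = len(kept)
    if n == 0:
        raise ValueError("no conditions left after exclusion")
    n_any = sum(1 for sig, _ in kept.values() if sig > 0)
    n_kw = sum(1 for _, kw in kept.values() if kw > 0)
    return {
        "n_conditions": n,
        "n_with_any_sig": n_any,
        "pct_with_any_sig": 100.0 * n_any / n,
        "n_with_keyword_sig": n_kw,
        "pct_with_keyword_sig": 100.0 * n_kw / n,
    }


# Made-up counts: (significant terms, significant terms matching a keyword).
counts = {
    "acute": (4, 2),
    "round1": (0, 0),
    "round2": (3, 0),
    "round3": (1, 1),
    "control": (5, 1),
}
summary = summarize_conditions(counts, exclude=["control"])
print(summary)
```

The denominator is the number of conditions remaining after exclusion, so dropping a control changes both percentages even when the numerators stay the same.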

Notes:
- The `--library` can be either an Enrichr library name (online) or a path to a local `.gmt` file. **Prefer the local GMT if the data folder ships one** (avoids rate-limits and exactly reproduces published results).
- Use `--exclude-condition <label>` for "control" / "baseline" conditions that the question wants excluded from the denominator.
- When the que
setup-tooluniverse (Skill)

Install and configure ToolUniverse for any use case — MCP server (chat-based), CLI (command line with 9 subcommands), or Python SDK (Coding API with 3 calling patterns). Covers uv/uvx setup, MCP configuration for 12+ AI clients (Cursor, Claude Desktop, Windsurf, VS Code, Codex, Gemini CLI, Trae, Cline, etc.), full CLI reference (tu list/grep/find/info/run/test/status/build/serve), Coding API quickstart, agentic tools, code executor, API key walkthrough, skill installation, and upgrading. Use when user asks how to set up ToolUniverse, which access mode to use (MCP vs CLI vs SDK), configuring MCP servers, using the CLI, troubleshooting installation, upgrading, or mentions installing ToolUniverse or setting up scientific tools. Also triggers for "how do I use ToolUniverse", "what's the best way to access tools", "command line", "tu command", "coding API", "tu build".

tooluniverse-acmg-variant-classification (Skill)

Systematic ACMG/AMP germline variant classification with all 28 criteria (PVS1, PS1-4, PM1-6, PP1-5, BA1, BS1-4, BP1-7) for clinical significance. Produces 5-tier verdict (Pathogenic / Likely Pathogenic / VUS / Likely Benign / Benign) with cited evidence per criterion. Use for variant interpretation, VUS resolution, and pathogenicity assessment. Combines ClinVar, gnomAD, computational predictors, and gene-mechanism context.

tooluniverse-admet-prediction (Skill)

Comprehensive ADMET (Absorption, Distribution, Metabolism, Excretion, Toxicity) profiling for drug candidates. Integrates ADMET-AI predictions, SwissADME drug-likeness, PubChemTox experimental toxicity, ChEMBL clinical data, Lipinski rule-of-five, and CYP interaction data. Use for drug-likeness assessment, BBB penetration, bioavailability, hepatotoxicity prediction, ADME/PK profiling, or screening compound libraries before lab testing.

tooluniverse-adverse-event-detection (Skill)

Detect and analyze adverse drug event signals using FDA FAERS reports, drug labels, and disproportionality statistics (PRR, ROR, IC). Generates quantitative safety signal scores (0-100) with evidence grading. Use for post-market surveillance, pharmacovigilance, drug safety assessment, regulatory submissions, and detecting rare AE signals not visible in clinical trials.

tooluniverse-adverse-outcome-pathway (Skill)

Map environmental and industrial chemicals to adverse outcome pathways (AOPs) — molecular initiating event to organ-level toxicity. Uses AOPWiki, GHS classification, IARC carcinogen status, and LD50 data. Use for environmental/industrial chemical risk assessment, regulatory-grade hazard characterization, and AOP stressor mapping. Distinct from drug-safety analysis (use tooluniverse-pharmacovigilance for drugs).

tooluniverse-aging-senescence (Skill)

Aging biology, cellular senescence, and longevity research. Covers senescence markers (p16/CDKN2A, SASP, SA-beta-gal), aging hallmarks, senolytic drug discovery (dasatinib+quercetin, fisetin, navitoclax), epigenetic clocks, telomere biology, and longevity GWAS. Use for senescence-pathway analysis, age-related disease genetics, senolytic-target discovery, and centenarian-genetics queries. Distinguishes correlative vs causal evidence (knockout, intervention).

tooluniverse-antibody-engineering (Skill)

Therapeutic antibody engineering and optimization, lead-to-clinical-candidate. Covers sequence humanization (germline alignment, framework retention), affinity maturation, developability (aggregation, stability, PTMs), structure modeling (AlphaFold/PDB CDR analysis), immunogenicity prediction, and manufacturing feasibility. Use for biologic-drug optimization, mAb design review, biosimilar engineering, and clinical-precedent comparison.

tooluniverse-binder-discovery (Skill)

Discover novel small-molecule binders for protein targets using structure-based and ligand-based screening. Covers druggability assessment, known-ligand mining (ChEMBL, BindingDB), similarity expansion, ADMET filtering, and synthesis feasibility. Use for hit identification, virtual screening, target-to-compounds workflows, and lead-finding before commit-to-medchem.