ClaudeWave

tooluniverse-comparative-genomics

This Claude Code skill integrates multiple genomic databases (Ensembl Compara, NCBI Gene, UniProt, OLS, Monarch, OpenTargets) to identify orthologs, paralogs, and assess sequence and functional conservation across species. Use it when mapping genes between organisms, determining which model organism best represents human gene function, analyzing evolutionary conservation patterns, or tracing phylogenetic gene history for comparative genomics research.

Install in Claude Code
git clone --depth 1 https://github.com/mims-harvard/ToolUniverse /tmp/tooluniverse-comparative-genomics && cp -r /tmp/tooluniverse-comparative-genomics/plugin/skills/tooluniverse-comparative-genomics ~/.claude/skills/tooluniverse-comparative-genomics
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# Comparative Genomics & Ortholog Analysis

Cross-species gene comparison, ortholog identification, sequence retrieval, and functional conservation analysis integrating Ensembl Compara, NCBI, UniProt, OLS, Monarch, and OpenTargets.

## LOOK UP, DON'T GUESS
When uncertain about any scientific fact, SEARCH databases first (PubMed, UniProt, ChEMBL, ClinVar, etc.) rather than reasoning from memory. A database-verified answer is always more reliable than a guess.

## COMPUTE, DON'T DESCRIBE
When analysis requires computation (statistics, data processing, scoring, enrichment), write and run Python code via Bash. Don't describe what you would do — execute it and report actual results. Use ToolUniverse tools to retrieve data, then Python (pandas, scipy, statsmodels, matplotlib) to analyze it.

## When to Use This Skill

**Triggers**:
- "Find the mouse ortholog of [human gene]"
- "Compare [gene] across species"
- "Is [gene] conserved in [organism]?"
- "What are the orthologs of [gene]?"
- "Cross-species comparison of [gene/protein]"
- "Evolutionary conservation of [gene]"
- "Compare GO annotations between human and mouse [gene]"

**Use Cases**:
1. **Ortholog Discovery**: Find equivalent genes in other species for a human gene
2. **Conservation Analysis**: Assess how conserved a gene is across evolutionary distance
3. **Functional Comparison**: Compare GO terms, domains, and annotations across orthologs
4. **Model Organism Selection**: Determine which model organism best recapitulates human gene function
5. **Gene Tree Analysis**: Visualize evolutionary history of a gene family
6. **Cross-Species Phenotype Bridging**: Link human disease phenotypes to model organism phenotypes via orthologs

---

## Conservation Reasoning Framework

Understanding conservation requires distinguishing between types of evolutionary patterns and what they imply about function.

**High conservation signals functional constraint.** When a gene is maintained as a 1:1 ortholog from yeast to humans, purifying selection has limited sequence divergence, which implies that the gene's function is essential and cannot be easily altered. Highly conserved positions (PhastCons > 0.8 or GERP RS > 4; both are scored per nucleotide on the genome, not per protein residue) are under strong constraint; mutations at these positions are disproportionately pathogenic. For non-coding regions, a mammalian PhastCons score > 0.5 suggests a candidate regulatory element.
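
The thresholds above can be applied mechanically once scores are in hand. The following is a minimal sketch, not part of the skill: the function name `constraint_label` and its return strings are hypothetical, and only the cutoffs (0.8, 4, 0.5) come from the text.

```python
from typing import Optional


def constraint_label(phastcons: Optional[float] = None,
                     gerp_rs: Optional[float] = None,
                     coding: bool = True) -> str:
    """Label a genomic position using the cutoffs quoted in this section.

    The name and the returned labels are illustrative assumptions.
    """
    if phastcons is None and gerp_rs is None:
        return "no score available"
    # Strong constraint: PhastCons > 0.8 or GERP RS > 4.
    if (phastcons is not None and phastcons > 0.8) or \
       (gerp_rs is not None and gerp_rs > 4):
        return "strong constraint"
    # Non-coding positions: PhastCons > 0.5 flags a candidate regulatory element.
    if not coding and phastcons is not None and phastcons > 0.5:
        return "candidate regulatory element"
    return "no strong evidence of constraint"


if __name__ == "__main__":
    print(constraint_label(phastcons=0.95))
    print(constraint_label(phastcons=0.6, coding=False))
```

A label of this kind is a triage aid only; the scores themselves should still be reported alongside it.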

**Low conservation in one lineage has two possible explanations: relaxed selection or positive selection.** Use the dN/dS ratio (nonsynonymous to synonymous substitution rate) to distinguish them. A dN/dS ratio near 1 suggests neutral evolution — the gene is no longer under purifying selection (relaxed constraint, possibly reflecting loss of function in that lineage). A dN/dS ratio > 1 indicates positive selection — the gene is diverging faster than neutral expectation, often because it is adapting to a new environment or function. A dN/dS ratio << 1 is the signature of purifying selection (functional constraint). When a vertebrate gene shows high divergence in a specific branch of the tree, ask which explanation applies before concluding that function is lost.
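
As a rough illustration of this decision rule, the sketch below maps a dN/dS value to one of the three regimes. The numeric cutoffs (a "near 1" band of 0.8 to 1.2 and a "much less than 1" cutoff of 0.3) are assumptions chosen for illustration; the text does not define them, and in practice they should be set per analysis.

```python
from typing import Optional, Tuple


def interpret_dnds(omega: Optional[float],
                   neutral_band: Tuple[float, float] = (0.8, 1.2),
                   strong_purifying: float = 0.3) -> str:
    """Map a dN/dS ratio to a selection regime.

    neutral_band and strong_purifying are illustrative assumptions,
    not values taken from the skill.
    """
    if omega is None:
        return "undetermined"
    low, high = neutral_band
    if omega > high:
        return "positive selection"
    if omega >= low:
        return "neutral or relaxed constraint"
    if omega < strong_purifying:
        return "strong purifying selection"
    return "purifying selection"


if __name__ == "__main__":
    for value in (0.05, 0.5, 1.0, 2.5, None):
        print(value, "->", interpret_dnds(value))
```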

**Computing dN/dS** (no TU tool does this — use the bundled script). The data path: `ensembl_get_homology(...)` → the 1:1 ortholog IDs → `EnsemblSeq_get_id_sequence(id=..., type="cds")` for each → codon-align the two CDS (orthologous CDS are usually directly alignable; for divergent pairs align the protein and back-translate) → run `scripts/dnds.py`:

```bash
python scripts/dnds.py human_CDS.fasta mouse_CDS.fasta   # or --seq1 ATG... --seq2 ATG...
```
It implements the Nei-Gojobori estimator with Jukes-Cantor correction (dN validated against Biopython NG86) and returns dN, dS, dN/dS, and an interpretation (>1 positive, ~1 neutral/relaxed, <<1 purifying). `dN/dS` is `null` when dS is 0 or uncorrectable (too few/too many substitutions) — do not over-interpret a single high-divergence pair without enough synonymous sites.
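
To show why the ratio can come back as `null`, here is a minimal sketch of the Jukes-Cantor step alone. It is not the code in `scripts/dnds.py`: the function names are hypothetical, and the inputs `pn` and `ps` are assumed to be the proportions of nonsynonymous and synonymous differences per site that the Nei-Gojobori counting step would produce.

```python
import math
from typing import Optional, Tuple


def jukes_cantor(p: float) -> Optional[float]:
    """Jukes-Cantor distance d = -3/4 * ln(1 - 4p/3).

    Returns None when p is outside [0, 0.75), where the correction
    is undefined (the sites are saturated).
    """
    if p < 0 or p >= 0.75:
        return None
    if p == 0:
        return 0.0
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def dnds_ratio(pn: float, ps: float) -> Tuple[Optional[float], Optional[float], Optional[float]]:
    """Return (dN, dS, dN/dS); the ratio is None when dS is 0 or uncorrectable."""
    dn = jukes_cantor(pn)
    ds = jukes_cantor(ps)
    if dn is None or ds is None or ds == 0:
        return dn, ds, None
    return dn, ds, dn / ds


if __name__ == "__main__":
    print(dnds_ratio(0.02, 0.20))   # ratio well below 1
    print(dnds_ratio(0.02, 0.0))    # no synonymous differences: ratio is None
    print(dnds_ratio(0.02, 0.80))   # saturated synonymous sites: ratio is None
```

Both failure cases correspond to the "too few/too many substitutions" condition described above.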

**Ortholog relationship type shapes interpretation.** A 1:1 ortholog (one gene in human, one in mouse) is the highest-confidence functional equivalent — it has not been duplicated in either lineage, so it most likely performs the same ancestral role. A 1:many relationship (one gene in human, multiple in mouse) means the target species has duplicated the gene; the copies may have subfunctionalized (each copy performs a subset of the original roles) or neofunctionalized (one copy gained a new role). Do not assume both copies retain full ancestral function. A many:many relationship reflects complex duplication history in both species and requires analyzing each paralog pair individually.
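
One way to keep these distinctions visible in a report is to translate each homology type into a short caution. The sketch below assumes the type strings follow Ensembl Compara's naming (`ortholog_one2one`, `ortholog_one2many`, `ortholog_many2many`); check them against the actual `ensembl_get_homology` output before relying on them. The function name and the note wording are hypothetical.

```python
from typing import Dict, Tuple

# Keys are assumed to match Ensembl Compara homology type strings.
ORTHOLOG_NOTES: Dict[str, Tuple[str, str]] = {
    "ortholog_one2one": (
        "1:1",
        "Highest-confidence functional equivalent; no duplication in either lineage.",
    ),
    "ortholog_one2many": (
        "1:many",
        "Duplicated in one species; copies may be sub- or neofunctionalized.",
    ),
    "ortholog_many2many": (
        "many:many",
        "Duplications in both species; analyze each paralog pair individually.",
    ),
}


def interpret_homology(homology_type: str) -> Tuple[str, str]:
    """Return (relationship label, interpretation note) for a homology type."""
    return ORTHOLOG_NOTES.get(
        homology_type,
        ("unknown", "Not an ortholog type covered by this table."),
    )


if __name__ == "__main__":
    label, note = interpret_homology("ortholog_one2many")
    print(label, "-", note)
```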

**Conservation depth predicts essentiality.** A gene conserved across all vertebrates likely participates in a fundamental cellular process. A gene conserved only in mammals suggests a more specialized, mammal-specific innovation. A gene present only in primates or only in humans is likely a recent evolutionary acquisition, possibly involved in primate- or human-specific biology but often lacking the depth of functional characterization available for deeply conserved genes.
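
Conservation depth can be summarized from the list of species in which an ortholog was found. The sketch below is illustrative only: the clade ladder, the species-to-clade table, and the function name are assumptions made for this example, and the table covers just a handful of model organisms relative to human.

```python
from typing import Iterable

# Ordered from shallowest to deepest; an assumption for illustration.
CLADE_DEPTH = [
    "no ortholog found in surveyed species",
    "primates",
    "mammals",
    "vertebrates",
    "animals",
    "eukaryotes",
]

# Broadest clade each species shares with human (hypothetical lookup table).
SHARED_CLADE_WITH_HUMAN = {
    "pan_troglodytes": "primates",
    "mus_musculus": "mammals",
    "danio_rerio": "vertebrates",
    "drosophila_melanogaster": "animals",
    "saccharomyces_cerevisiae": "eukaryotes",
}


def conservation_depth(species_with_ortholog: Iterable[str]) -> str:
    """Return the deepest clade spanned by species that have an ortholog."""
    deepest = 0
    for species in species_with_ortholog:
        clade = SHARED_CLADE_WITH_HUMAN.get(species)
        if clade is None:
            continue  # species not in the lookup table; ignored in this sketch
        deepest = max(deepest, CLADE_DEPTH.index(clade))
    return CLADE_DEPTH[deepest]


if __name__ == "__main__":
    print(conservation_depth(["pan_troglodytes", "mus_musculus"]))
    print(conservation_depth(["mus_musculus", "saccharomyces_cerevisiae"]))
```

The result only reflects the species actually surveyed, so an empty or shallow answer should prompt the checks described in the next paragraph rather than a conclusion of lineage specificity.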

**Absence of an ortholog is a finding, not an error.** Lineage-specific genes exist and are biologically meaningful. Before concluding a gene is lineage-specific, check: (1) whether BLAST with relaxed thresholds finds distant homologs, (2) whether a highly divergent ortholog exists that Ensembl Compara missed, and (3) whether the gene belongs to a rapidly evolving family (immune genes, olfactory receptors, reproductive proteins) where turnover is expected.

---

## Workflow Overview

```
Input (gene symbol/ID + reference species)
  |
  v
Phase 1: Gene Identification & Validation
  |
  v
Phase 2: Ortholog Discovery (Ensembl Compara + OpenTargets)
  |
  v
Phase 3: Sequence Retrieval (NCBI + Ensembl)
  |
  v
Phase 4: Functional Annotation Comparison (UniProt + OLS GO terms)
  |
  v
Phase 5: Cross-Species Phenotype Bridging (
```
setup-tooluniverse (Skill)

Install and configure ToolUniverse for any use case — MCP server (chat-based), CLI (command line with 9 subcommands), or Python SDK (Coding API with 3 calling patterns). Covers uv/uvx setup, MCP configuration for 12+ AI clients (Cursor, Claude Desktop, Windsurf, VS Code, Codex, Gemini CLI, Trae, Cline, etc.), full CLI reference (tu list/grep/find/info/run/test/status/build/serve), Coding API quickstart, agentic tools, code executor, API key walkthrough, skill installation, and upgrading. Use when user asks how to set up ToolUniverse, which access mode to use (MCP vs CLI vs SDK), configuring MCP servers, using the CLI, troubleshooting installation, upgrading, or mentions installing ToolUniverse or setting up scientific tools. Also triggers for "how do I use ToolUniverse", "what's the best way to access tools", "command line", "tu command", "coding API", "tu build".

tooluniverse-acmg-variant-classification (Skill)

Systematic ACMG/AMP germline variant classification with all 28 criteria (PVS1, PS1-4, PM1-6, PP1-5, BA1, BS1-4, BP1-7) for clinical significance. Produces 5-tier verdict (Pathogenic / Likely Pathogenic / VUS / Likely Benign / Benign) with cited evidence per criterion. Use for variant interpretation, VUS resolution, and pathogenicity assessment. Combines ClinVar, gnomAD, computational predictors, and gene-mechanism context.

tooluniverse-admet-prediction (Skill)

Comprehensive ADMET (Absorption, Distribution, Metabolism, Excretion, Toxicity) profiling for drug candidates. Integrates ADMET-AI predictions, SwissADME drug-likeness, PubChemTox experimental toxicity, ChEMBL clinical data, Lipinski rule-of-five, and CYP interaction data. Use for drug-likeness assessment, BBB penetration, bioavailability, hepatotoxicity prediction, ADME/PK profiling, or screening compound libraries before lab testing.

tooluniverse-adverse-event-detection (Skill)

Detect and analyze adverse drug event signals using FDA FAERS reports, drug labels, and disproportionality statistics (PRR, ROR, IC). Generates quantitative safety signal scores (0-100) with evidence grading. Use for post-market surveillance, pharmacovigilance, drug safety assessment, regulatory submissions, and detecting rare AE signals not visible in clinical trials.

tooluniverse-adverse-outcome-pathway (Skill)

Map environmental and industrial chemicals to adverse outcome pathways (AOPs) — molecular initiating event to organ-level toxicity. Uses AOPWiki, GHS classification, IARC carcinogen status, and LD50 data. Use for environmental/industrial chemical risk assessment, regulatory-grade hazard characterization, and AOP stressor mapping. Distinct from drug-safety analysis (use tooluniverse-pharmacovigilance for drugs).

tooluniverse-aging-senescence (Skill)

Aging biology, cellular senescence, and longevity research. Covers senescence markers (p16/CDKN2A, SASP, SA-beta-gal), aging hallmarks, senolytic drug discovery (dasatinib+quercetin, fisetin, navitoclax), epigenetic clocks, telomere biology, and longevity GWAS. Use for senescence-pathway analysis, age-related disease genetics, senolytic-target discovery, and centenarian-genetics queries. Distinguishes correlative vs causal evidence (knockout, intervention).

tooluniverse-antibody-engineering (Skill)

Therapeutic antibody engineering and optimization, lead-to-clinical-candidate. Covers sequence humanization (germline alignment, framework retention), affinity maturation, developability (aggregation, stability, PTMs), structure modeling (AlphaFold/PDB CDR analysis), immunogenicity prediction, and manufacturing feasibility. Use for biologic-drug optimization, mAb design review, biosimilar engineering, and clinical-precedent comparison.

tooluniverse-binder-discovery (Skill)

Discover novel small-molecule binders for protein targets using structure-based and ligand-based screening. Covers druggability assessment, known-ligand mining (ChEMBL, BindingDB), similarity expansion, ADMET filtering, and synthesis feasibility. Use for hit identification, virtual screening, target-to-compounds workflows, and lead-finding before commit-to-medchem.