
tooluniverse-data-integration-analysis

This skill integrates statistical analysis results with biological context from ToolUniverse databases including UniProt, GO, Reactome, ClinVar, and OpenTargets. Use it after computational analyses identify significant genes, variants, or metabolites to interpret findings mechanistically through functional annotations, pathway membership, and disease associations rather than relying solely on p-values.

Repository: ToolUniverse

Install in Claude Code

git clone --depth 1 https://github.com/mims-harvard/ToolUniverse /tmp/tooluniverse-data-integration-analysis && cp -r /tmp/tooluniverse-data-integration-analysis/plugin/skills/tooluniverse-data-integration-analysis ~/.claude/skills/tooluniverse-data-integration-analysis

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

## COMPUTE, DON'T DESCRIBE
When analysis requires computation (statistics, data processing, scoring, enrichment), write and run Python code via Bash. Don't describe what you would do -- execute it and report actual results. Use ToolUniverse tools to retrieve data, then Python (pandas, scipy, statsmodels, matplotlib) to analyze it.

# Data Integration Analysis

Bridge the gap between statistical results and biological understanding. After any computational analysis produces significant findings, this skill teaches how to interpret them using ToolUniverse's biological knowledge tools -- the key advantage over platforms that only do data analysis.

**IMPORTANT**: Always use English terms in tool calls (gene names, pathway names, organism names), even if the user writes in another language. Respond in the user's language.

---

## When to Use This Skill

Apply when:
- Statistical analysis produced a list of significant genes, variants, metabolites, or exposures
- Users want to go beyond p-values to understand WHY something is significant
- Combining computational results with published evidence
- Interpreting differential expression, GWAS hits, or association study results biologically
- Users ask "what does this result mean?" after running an analysis

**NOT for** (use other skills instead):
- Running the statistical analysis itself --> Use `tooluniverse-statistical-modeling` or `tooluniverse-rnaseq-deseq2`
- Pure gene enrichment without prior analysis --> Use `tooluniverse-gene-enrichment`
- Pure literature review --> Use `tooluniverse-literature-deep-research`
- Single variant interpretation --> Use `tooluniverse-variant-interpretation`

---

## Step 1: Statistical Results to Biological Questions

Map each type of significant finding to the right biological question:

| Finding Type | Biological Question | Tool Discovery Query |
|---|---|---|
| Significant gene list | What pathways are enriched? What functions converge? | `find_tools("gene enrichment pathway analysis")` |
| Significant variant (rsID) | What is the functional impact? Which gene is affected? | `find_tools("variant annotation functional impact")` |
| Significant exposure/chemical | What is the biological mechanism? Which pathways? | `find_tools("chemical gene pathway toxicology")` |
| Significant drug association | What is the molecular target? What is the MOA? | `find_tools("drug target mechanism action")` |
| Significant metabolite | Which metabolic pathway is perturbed? | `find_tools("metabolite pathway identification")` |

**Key principle**: Do not stop at "gene X is significant." Ask: significant in what context? Through what mechanism? With what downstream consequence?
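The mapping in the table can be kept as a small lookup so that every finding is routed to a discovery query the same way. This is a minimal sketch: the query strings are copied from the table, while `FindingType`, `DISCOVERY_QUERIES`, and `discovery_query` are hypothetical names introduced here for illustration. Passing the returned string to `find_tools` is left to the caller.

```python
from enum import Enum


class FindingType(Enum):
    """Kinds of significant findings listed in the Step 1 table."""
    GENE_LIST = "gene_list"
    VARIANT = "variant"
    EXPOSURE = "exposure"
    DRUG = "drug"
    METABOLITE = "metabolite"


# Query strings copied verbatim from the Step 1 table.
DISCOVERY_QUERIES = {
    FindingType.GENE_LIST: "gene enrichment pathway analysis",
    FindingType.VARIANT: "variant annotation functional impact",
    FindingType.EXPOSURE: "chemical gene pathway toxicology",
    FindingType.DRUG: "drug target mechanism action",
    FindingType.METABOLITE: "metabolite pathway identification",
}


def discovery_query(finding_type: FindingType) -> str:
    """Return the tool discovery query for a finding type."""
    return DISCOVERY_QUERIES[finding_type]


if __name__ == "__main__":
    for ft in FindingType:
        print(f"{ft.name}: {discovery_query(ft)}")
```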

---

## Step 2: Multi-Database Evidence Integration

For each significant finding, query multiple sources and synthesize. The pattern:

1. **Literature evidence**: Search PubMed/EuropePMC for published studies linking your finding to the phenotype. Look for meta-analyses and systematic reviews first.
2. **Genetic association evidence**: Query GWAS Catalog or OpenTargets to check whether genetic evidence independently supports the association.
3. **Pathway context**: Query KEGG, Reactome, or WikiPathways to place the finding in a biological pathway. Identify upstream regulators and downstream effectors.
4. **Interaction networks**: Query STRING or BioGRID for protein-protein interactions. Look for whether your significant genes cluster in the same network neighborhood.
5. **Clinical relevance**: Check ClinVar for variant clinical significance, DGIdb or ChEMBL for druggability, or ClinicalTrials.gov for ongoing interventions.
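One way to keep the five evidence categories honest is to record, per finding, which categories have actually been queried. The sketch below is a hypothetical bookkeeping helper, not part of ToolUniverse: `FindingEvidence` and the category labels are assumptions, and the notes are free text that the analyst fills in after running the real tool calls.

```python
from dataclasses import dataclass, field

# Labels for the five evidence categories in the Step 2 list (names are assumptions).
EVIDENCE_CATEGORIES = (
    "literature",
    "genetic_association",
    "pathway",
    "interaction_network",
    "clinical",
)


@dataclass
class FindingEvidence:
    """Collects short notes per evidence category for one significant finding."""
    finding: str
    notes: dict = field(default_factory=dict)

    def add(self, category: str, note: str) -> None:
        """Attach a note to a category, rejecting unknown category labels."""
        if category not in EVIDENCE_CATEGORIES:
            raise ValueError(f"unknown evidence category: {category}")
        self.notes.setdefault(category, []).append(note)

    def missing(self) -> list:
        """Categories that have no notes yet, in checklist order."""
        return [c for c in EVIDENCE_CATEGORIES if c not in self.notes]


if __name__ == "__main__":
    # "GENE_A" and the note text are placeholders, not real results.
    ev = FindingEvidence("GENE_A")
    ev.add("pathway", "placeholder: pathway membership note")
    print("still to query:", ev.missing())
```

Reporting `missing()` alongside the synthesis makes it visible when a conclusion rests on only one or two of the five source types.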

**Evidence grading** (grade each piece of evidence):

| Grade | Source Type | Example |
|---|---|---|
| T1 (Strong) | Randomized clinical trial, Mendelian randomization | "RCT showed drug X reduces outcome Y" |
| T2 (Moderate) | Large cohort study, GWAS with replication | "GWAS meta-analysis in 500k subjects" |
| T3 (Suggestive) | Case-control study, animal model | "Mouse knockout shows phenotype" |
| T4 (Hypothesis) | In silico prediction, pathway inference | "Network analysis suggests involvement" |
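The grading table translates directly into a lookup. The following sketch assumes each piece of evidence has been labeled with one of the source types from the table; `GRADE_BY_SOURCE`, `grade`, and `strongest_grade` are hypothetical helper names, and deciding which source type a given study belongs to remains a manual judgment.

```python
# Source types from the grading table, mapped to their tier.
GRADE_BY_SOURCE = {
    "randomized clinical trial": "T1",
    "mendelian randomization": "T1",
    "large cohort study": "T2",
    "gwas with replication": "T2",
    "case-control study": "T3",
    "animal model": "T3",
    "in silico prediction": "T4",
    "pathway inference": "T4",
}


def grade(source_type: str) -> str:
    """Return the evidence tier for a source type (case-insensitive)."""
    return GRADE_BY_SOURCE[source_type.strip().lower()]


def strongest_grade(source_types) -> str:
    """Return the strongest tier among several pieces of evidence.

    Tier labels sort lexicographically ("T1" < "T2" < ...), so min() picks
    the strongest one.
    """
    return min(grade(s) for s in source_types)


if __name__ == "__main__":
    evidence = ["Pathway inference", "GWAS with replication"]
    print(strongest_grade(evidence))
```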

---

## Step 3: Causal Reasoning

Statistical association is not causation. Apply these reasoning frameworks:

**DAG construction**: Before interpreting, sketch the causal directed acyclic graph (DAG).
- Identify potential **confounders** (common causes of exposure and outcome) -- these must be adjusted for.
- Identify potential **mediators** (on the causal path) -- do NOT adjust for these if estimating total effect.
- Identify **colliders** (common effects) -- conditioning on colliders introduces bias.
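A small simulation can make the confounder rule concrete. The sketch below uses purely simulated data under an assumed DAG (Z causes both X and Y, and X affects Y with a true effect of 0.5); it is an illustration, not an analysis of real data. The helper names are hypothetical, and adjustment is done by residualizing both variables on Z, which gives the same slope as including Z in a linear regression.

```python
import random


def slope(x, y):
    """Ordinary least-squares slope of y on a single predictor x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residuals(x, y):
    """Residuals of y after regressing it on x (with intercept)."""
    b = slope(x, y)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]


rng = random.Random(0)
n = 20000

# Assumed DAG for the simulation: Z -> X, Z -> Y, X -> Y (true effect 0.5).
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 1) for zi in z]
y = [0.5 * xi + zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]

# Crude estimate ignores the confounder Z and is biased away from 0.5.
crude = slope(x, y)

# Adjusted estimate: residualize X and Y on Z, then regress the residuals.
adjusted = slope(residuals(z, x), residuals(z, y))

if __name__ == "__main__":
    print(f"crude slope:    {crude:.3f}")
    print(f"adjusted slope: {adjusted:.3f}")
```

Under this setup the crude slope should land near 1.0 and the adjusted slope near the true 0.5. Applying the same residualization to a mediator or a collider would bias the estimate instead, which is why the DAG has to be drawn first.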

**Triangulation**: The same finding supported by different methods with different biases strengthens causal inference.
- Observational association + Mendelian randomization + animal experiment = strong triangulated evidence
- If MR contradicts observational data, suspect confounding in the observational study

**Mendelian randomization logic**: Genetic variants (instruments) are fixed at conception, so they are largely free of confounding by lifestyle and cannot be altered by the disease itself (no reverse causation). If a genetic variant that increases exposure X also increases the risk of disease Y, this supports X causing Y. Check instrument strength (F-statistic > 10), the exclusion restriction (the variant affects Y only through X), and directional pleiotropy (for example, an MR-Egger intercept that differs from zero signals a problem).
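For a single instrument, the instrument-strength check and the causal estimate can be computed from summary statistics. This is a minimal sketch assuming per-variant effect sizes and standard errors are already available: `beta_gx` and `se_gx` describe the variant-exposure association, `beta_gy` and `se_gy` the variant-outcome association. It uses the Wald ratio with a first-order standard error and the common single-instrument approximation F ≈ (beta/SE)². The function names and the example numbers are illustrative assumptions, not real data.

```python
def instrument_f_statistic(beta_gx: float, se_gx: float) -> float:
    """Approximate F-statistic for one instrument: (beta / SE) squared."""
    return (beta_gx / se_gx) ** 2


def wald_ratio(beta_gx: float, beta_gy: float, se_gy: float):
    """Wald ratio estimate of the effect of X on Y and its approximate SE.

    The SE uses the first-order approximation se_gy / |beta_gx|, which
    ignores uncertainty in the variant-exposure estimate.
    """
    if beta_gx == 0:
        raise ValueError("variant-exposure effect must be non-zero")
    estimate = beta_gy / beta_gx
    se = abs(se_gy / beta_gx)
    return estimate, se


if __name__ == "__main__":
    # Made-up summary statistics for illustration only.
    f_stat = instrument_f_statistic(beta_gx=0.2, se_gx=0.02)
    estimate, se = wald_ratio(beta_gx=0.2, beta_gy=0.1, se_gy=0.05)
    print(f"F = {f_stat:.1f} (weak instrument if below 10)")
    print(f"Wald ratio = {estimate:.2f} (SE {se:.2f})")
```

With several instruments, an established MR package that implements inverse-variance weighting and MR-Egger is a better choice than extending this by hand.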

**Mediation analysis**: If gene G is associated with both exposure and outcome, ask: does the exposure effect on outcome go through G? Use the finding's pathway context (Step 2) to propose mediators, then check if adjusting for the mediator attenuates the effect.
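The attenuation check can be summarized with the difference method: compare the exposure coefficient before and after adjusting for the proposed mediator. The sketch below assumes both coefficients come from linear models fitted elsewhere and that there is no exposure-mediator interaction; `proportion_mediated` is a hypothetical helper and the inputs in the example are illustrative, not real estimates.

```python
def proportion_mediated(total_effect: float, direct_effect: float) -> float:
    """Share of the total effect that disappears after mediator adjustment.

    total_effect:  exposure coefficient without the mediator in the model
    direct_effect: exposure coefficient with the mediator in the model
    """
    if total_effect == 0:
        raise ValueError("total effect is zero; proportion mediated is undefined")
    return (total_effect - direct_effect) / total_effect


if __name__ == "__main__":
    # Illustrative coefficients only.
    share = proportion_mediated(total_effect=0.40, direct_effect=0.30)
    print(f"proportion mediated: {share:.2f}")
```

A value near zero suggests the effect does not run through the proposed mediator; values outside the 0 to 1 range usually indicate suppression or model misspecification and should not be reported as a percentage.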

---

## Step 4: Cross-Validation

Before reporting a finding as robust, attempt to falsify it:

1. **Replication**: Search literature and datasets (DataCite, GEO, ArrayExpress) for independent datasets where the same finding can be tested. A finding that replicates in an independent cohort is much stronger.
2. **Biological plausibility**: Does the mechanism make biological sense?
