ClaudeWave
Skill · 1.4k repo stars · updated today

tooluniverse-expression-data-retrieval

The tooluniverse-expression-data-retrieval skill searches ArrayExpress and BioStudies databases to find gene expression and multi-omics datasets filtered by organism, tissue, and experimental design. Use this skill when locating RNA-seq or microarray studies for comparative analysis, identifying datasets with sufficient replicates and quality annotations before download, or disambiguating gene names to ensure accurate retrieval across different species databases.

Install in Claude Code
git clone --depth 1 https://github.com/mims-harvard/ToolUniverse /tmp/tooluniverse-expression-data-retrieval && cp -r /tmp/tooluniverse-expression-data-retrieval/plugin/skills/tooluniverse-expression-data-retrieval ~/.claude/skills/tooluniverse-expression-data-retrieval
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# Gene Expression & Omics Data Retrieval

Retrieve gene expression experiments and multi-omics datasets with disambiguation and quality assessment.

**IMPORTANT**: Always use English terms in tool calls. Respond in the user's language.

**LOOK UP DON'T GUESS**: Never assume which datasets exist or their accessions. Always search to confirm.

## Domain Reasoning

Before retrieving, determine: organism, tissue, experimental design (case-control/time-series/dose-response). These affect which database to search and how to interpret results. RNA-seq provides wider dynamic range; microarray has extensive legacy data. Prioritize experiments with >=3 biological replicates, complete annotations, and both raw+processed data.

## Workflow

```
Phase 0: Clarify (if ambiguous) → Phase 1: Disambiguate → Phase 2: Search & Retrieve → Phase 3: Report
```

---

## Phase 0: Clarification (When Needed)

Ask ONLY if: gene name ambiguous, tissue/condition unclear, organism not specified.
Skip for: specific accessions (E-MTAB-*, E-GEOD-*, S-BSST*), clear disease/tissue+organism, explicit platform requests.
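
The skip rule for explicit accessions can be checked mechanically. The sketch below is illustrative only: `find_accessions` and `needs_clarification` are hypothetical helpers, the digit patterns after each prefix are an assumption, and only the "organism not specified" trigger is modeled (gene-name and tissue ambiguity still need judgment).

```python
import re
from typing import List, Optional

# Prefixes come from the skip rule above; the digit patterns are an assumption.
ACCESSION_RE = re.compile(r"\b(?:E-(?:MTAB|GEOD)-\d+|S-BSST\d+)\b", re.IGNORECASE)


def find_accessions(query: str) -> List[str]:
    """Return every accession-like token in the query, upper-cased."""
    return [match.upper() for match in ACCESSION_RE.findall(query)]


def needs_clarification(query: str, organism: Optional[str] = None) -> bool:
    """Hypothetical Phase 0 gate: ask only when no accession and no organism is given."""
    if find_accessions(query):
        return False  # a specific accession is unambiguous, so skip Phase 0
    return organism is None
```

A query that names an accession goes straight to retrieval; a free-text query without an organism triggers one clarifying question.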

---

## Phase 1: Query Disambiguation

Resolve official gene symbol (HGNC for human, MGI for mouse). Note common aliases for search expansion.

| User Query Type | Search Strategy |
|-----------------|-----------------|
| Specific accession | Direct retrieval |
| Gene + condition | "[gene] [condition]" + species filter |
| Disease only | "[disease]" + species filter |
| Technology-specific | Add platform keywords |
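
One way to read the strategy table is as a small dispatcher that turns a parsed request into search arguments. This is a minimal sketch: `build_search` is a hypothetical helper, and the `mode` key is an illustrative convention, not part of any tool. The `keywords` and `species` names follow the ArrayExpress parameters listed under Search Parameters.

```python
from typing import Dict, Optional


def build_search(
    accession: Optional[str] = None,
    gene: Optional[str] = None,
    condition: Optional[str] = None,
    disease: Optional[str] = None,
    species: Optional[str] = None,
    platform: Optional[str] = None,
) -> Dict[str, str]:
    """Map a parsed user request onto the strategy table (hypothetical helper)."""
    if accession:
        # Specific accession: direct retrieval, no keyword search needed.
        return {"mode": "direct", "accession": accession}
    # Gene + condition, disease only, and platform keywords all become free text.
    terms = [term for term in (gene, condition, disease, platform) if term]
    params = {"mode": "search", "keywords": " ".join(terms)}
    if species:
        params["species"] = species  # species filter
    return params
```

Aliases noted during disambiguation can be passed through the same function one at a time to expand the search.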

---

## Phase 2: Data Retrieval (Internal)

Search silently. Do NOT narrate the process.

```python
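# Setup (assumption: standard ToolUniverse SDK initialization; adjust to your install)
from tooluniverse import ToolUniverse

tu = ToolUniverse()
tu.load_tools()

# ArrayExpress search; bracketed strings are placeholders to fill in
result = tu.tools.arrayexpress_search_experiments(keywords="[gene/disease]", species="[species]", limit=20)

# Get experiment details, samples, and files for an accession taken from the search results
accession = "[accession]"
details = tu.tools.arrayexpress_get_experiment(accession=accession)
samples = tu.tools.arrayexpress_get_experiment_samples(accession=accession)
files = tu.tools.arrayexpress_get_experiment_files(accession=accession)

# BioStudies for multi-omics
biostudies = tu.tools.biostudies_search(query="[keywords]", limit=10)
study_accession = "[study accession]"
study = tu.tools.biostudies_get_study(accession=study_accession)
study_files = tu.tools.biostudies_get_study_files(accession=study_accession)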
```

### Fallback Chains

| Primary | Fallback |
|---------|----------|
| ArrayExpress search | BioStudies search |
| arrayexpress_get_experiment | biostudies_get_study |
| arrayexpress_get_experiment_files | Note "Files unavailable" |
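
The fallback chains above amount to "try the primary, then the fallback, then record a note". A minimal sketch of that pattern follows; `first_successful` is a hypothetical helper, and the stub functions stand in for the real tool calls so the example runs without network access.

```python
from typing import Any, Callable, Optional, Sequence


def first_successful(calls: Sequence[Callable[[], Any]]) -> Optional[Any]:
    """Try each zero-argument callable in order; return the first non-empty result."""
    for call in calls:
        try:
            result = call()
        except Exception:
            continue  # treat any failure as "move to the next source"
        if result:
            return result
    return None


# Stubs standing in for the primary and fallback tools (not real API calls).
def primary_search():
    raise TimeoutError("simulated primary failure")


def fallback_search():
    return [{"accession": "PLACEHOLDER", "source": "fallback"}]


hits = first_successful([primary_search, fallback_search])
# Last row of the table: with no fallback tool, record a note instead.
files = first_successful([lambda: []]) or "Files unavailable"
```

Wrapping each tool call in a `lambda` keeps the helper independent of any particular tool signature.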

---

## Phase 3: Report Dataset Profile

Present as a **Dataset Search Report**. Hide search process. Include:

1. **Search Summary**: query, databases searched, result count
2. **Top Experiments** (per experiment):
   - Accession, organism, type (RNA-seq/microarray), platform, sample count, date
   - Description, experimental design (conditions, replicates, tissue)
   - Sample groups table, data files table
   - Quality assessment (●●●/●●○/●○○/○○○, per the Data Quality Tiers table)
3. **Multi-Omics Studies** (from BioStudies): accession, type, data types included
4. **Summary Table**: all experiments ranked
5. **Recommendations**: best dataset for user's purpose, integration notes
6. **Data Access**: download links, database URLs
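
As an illustration of item 4, the ranked summary table could be rendered as below. The ranking rule (quality tier first, then sample count) is an assumption, since the skill does not fix one. The record fields are hypothetical, and `ACC-A`/`ACC-B` are placeholders, not real datasets.

```python
from typing import Dict, List

# Rank order for the tier symbols used in this skill (best first).
TIER_RANK = {"●●●": 0, "●●○": 1, "●○○": 2, "○○○": 3}


def summary_table(experiments: List[Dict]) -> str:
    """Render a markdown table: best tier first, larger studies first within a tier."""
    ranked = sorted(experiments, key=lambda e: (TIER_RANK[e["tier"]], -e["samples"]))
    rows = [
        "| Rank | Accession | Type | Samples | Quality |",
        "|------|-----------|------|---------|---------|",
    ]
    for rank, e in enumerate(ranked, start=1):
        rows.append(f"| {rank} | {e['accession']} | {e['type']} | {e['samples']} | {e['tier']} |")
    return "\n".join(rows)


# Placeholder records for demonstration only.
demo = [
    {"accession": "ACC-A", "type": "microarray", "samples": 12, "tier": "●●○"},
    {"accession": "ACC-B", "type": "RNA-seq", "samples": 24, "tier": "●●●"},
]
print(summary_table(demo))
```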

---

## Data Quality Tiers

| Tier | Symbol | Criteria |
|------|--------|----------|
| High | ●●● | >=3 bio replicates, complete metadata, processed data available |
| Medium | ●●○ | 2-3 replicates OR some metadata gaps |
| Low | ●○○ | No replicates, sparse metadata, or access issues |
| Caution | ○○○ | Single sample, no replication, outdated platform |
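
The tiers can be assigned with a short rule cascade. The sketch below is one possible reading of the table, not a definitive rule: `quality_tier` and its parameters are hypothetical, and the handling of borderline cases (for example, three replicates with metadata gaps mapping to Medium) is an assumption.

```python
from typing import Optional


def quality_tier(
    min_replicates: int,
    metadata_complete: bool,
    processed_available: bool,
    n_samples: Optional[int] = None,
) -> str:
    """Assign a tier symbol; min_replicates is the smallest group size in the design."""
    if n_samples == 1:
        return "○○○"  # Caution: single sample
    if min_replicates >= 3 and metadata_complete and processed_available:
        return "●●●"  # High
    if min_replicates >= 2:
        return "●●○"  # Medium: replicated, but fewer replicates or some gaps
    return "●○○"      # Low: no replicates
```

Access problems and outdated platforms are not modeled here and would need extra inputs.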

---

## Reasoning Framework

**Dataset quality**: Prioritize >=3 biological replicates, complete annotations, and both raw and processed data. Single-replicate experiments can inform an analysis but should not be its sole evidence.

**Platform comparison**: RNA-seq = wider dynamic range, novel transcripts. Microarray = probe-limited but extensive legacy data. Cross-platform combining requires batch correction.

**Metadata scoring**: Rate 0-5 on: (1) sample annotations, (2) design documented, (3) pipeline described, (4) raw data deposited, (5) publication linked. Score <=2 warrants caution.
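
The metadata score above is a count over five yes/no checks. A minimal sketch, assuming each criterion contributes one point (the class and function names are hypothetical):

```python
from dataclasses import astuple, dataclass


@dataclass
class MetadataChecklist:
    sample_annotations: bool   # (1) sample annotations present
    design_documented: bool    # (2) experimental design documented
    pipeline_described: bool   # (3) processing pipeline described
    raw_data_deposited: bool   # (4) raw data deposited
    publication_linked: bool   # (5) publication linked


def metadata_score(checklist: MetadataChecklist) -> int:
    """Count satisfied criteria, giving a score from 0 to 5."""
    return sum(astuple(checklist))


def warrants_caution(score: int) -> bool:
    """A score of 2 or lower warrants caution, per the rule above."""
    return score <= 2
```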

**GEO vs ArrayExpress**: GEO has broader coverage (older studies); ArrayExpress enforces stricter metadata. BioStudies captures multi-omics. Search both.

### Synthesis Questions
1. Does the dataset have sufficient replication and metadata for the intended analysis?
2. Are there batch effects or confounding variables?
3. Do multiple datasets show concordant patterns, and can they be integrated?

---

## Error Handling

| Error | Response |
|-------|----------|
| "No experiments found" | Broaden keywords, remove species filter, try synonyms |
| "Accession not found" | Verify format, check if withdrawn |
| "Files not available" | Note: "Data files restricted by submitter" |
| "API timeout" | Retry once, note "(metadata retrieval incomplete)" |
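
The "retry once" row can be sketched as a small wrapper. `call_with_one_retry` is a hypothetical helper. Catching Python's built-in `TimeoutError` is an assumption, because the exception type the tools actually raise is not stated here.

```python
from typing import Any, Callable, Optional, Tuple

INCOMPLETE_NOTE = "(metadata retrieval incomplete)"


def call_with_one_retry(call: Callable[[], Any]) -> Tuple[Optional[Any], Optional[str]]:
    """Run call(); on timeout retry once, then return (None, note) if it still fails."""
    for _attempt in range(2):  # first try plus exactly one retry
        try:
            return call(), None
        except TimeoutError:
            continue
    return None, INCOMPLETE_NOTE
```

The returned note can be attached to the affected experiment in the report rather than aborting the whole search.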

---

## Tool Reference

**ArrayExpress**: `arrayexpress_search_experiments` (search), `arrayexpress_get_experiment` (metadata), `arrayexpress_get_experiment_files` (downloads), `arrayexpress_get_experiment_samples` (annotations)

**BioStudies**: `biostudies_search` (search), `biostudies_get_study` (metadata+sections), `biostudies_get_study_files` (files)

**Additional Sources**:
- `GEO_search_rnaseq_datasets` / `geo_search_datasets` -- GEO (largest RNA-seq repo)
- `OmicsDI_search_datasets` -- cross-repository aggregation (GEO+ArrayExpress+PRIDE+MassIVE)
- `GTEx_get_expression_summary` -- baseline tissue expression (54 normal tissues, param: `gene_symbol`)
- `ENAPortal_search_studies` -- sequencing studies (param: `query` with `description="..."`)
- `CxGDisc_search_datasets` -- single-cell datasets (needs exact disease ontology terms)
- `PubMed_search_articles` -- dataset discovery via publications

---

## Search Parameters

**ArrayExpress**: `keywords` (free text), `species` (scientific name), `array` (platform filter), `limit`

**BioStudies**: `query` (free text), `limit`