tooluniverse-dataset-discovery

The Dataset Discovery skill maps research questions to required study designs (longitudinal vs cross-sectional, observational vs experimental) and identifies appropriate scientific repositories. Use when seeking specific datasets, cohorts, or surveys; the skill covers major repositories including GEO, dbGaP, NHANES, UK Biobank, ClinicalTrials.gov, GWAS Catalog, and 30+ additional scientific databases. It first identifies minimum data requirements before systematically searching through cross-repository tools, domain-specific repositories, and literature-based discovery methods.

Repository: ToolUniverse

Install in Claude Code

git clone --depth 1 https://github.com/mims-harvard/ToolUniverse /tmp/tooluniverse-dataset-discovery && cp -r /tmp/tooluniverse-dataset-discovery/plugin/skills/tooluniverse-dataset-discovery ~/.claude/skills/tooluniverse-dataset-discovery

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Dataset Discovery

## When to Use
- User asks "find me data about X" or "where can I get data on Y"
- User wants to analyze a relationship between variables
- User needs specific study designs (longitudinal, cross-sectional, experimental)
- User asks about specific surveys or cohorts

## Step 1: Understand What the Research Question Requires

Before searching, determine the **minimum data requirements**:

**Study design needed:**
- "Does X predict CHANGES in Y over time?" → longitudinal (same people measured repeatedly). Cross-sectional data CANNOT answer this — don't settle for it.
- "Is X associated with Y?" → cross-sectional is sufficient (one-time measurement)
- "Does intervention X cause outcome Y?" → experimental (clinical trial with controls)
- "What genes/proteins are involved in X?" → omics (sequencing, expression, proteomics)

**Variables needed:**
- List the specific exposure, outcome, and confounder variables
- For each variable, note the measurement type (continuous, categorical, biomarker vs self-report)
- Identify minimum confounders needed (age, sex are almost always required; domain-specific confounders depend on the question)

**Population needed:**
- Age range, geography, clinical status, sample size requirements
- Power analysis: detecting a small correlation (r = 0.1) at 80% power with a two-sided α = 0.05 requires roughly 780 subjects
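
The sample-size figure for a small correlation can be reproduced approximately with the Fisher z approximation. The sketch below is one common way to do this, not necessarily the method behind the number above; the function name `n_for_correlation` and its defaults (two-sided α = 0.05, 80% power) are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def n_for_correlation(r: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate N needed to detect correlation r with a two-sided test.

    Uses the Fisher z approximation: N = ((z_alpha + z_beta) / atanh(r))^2 + 3.
    """
    if not 0 < abs(r) < 1:
        raise ValueError("r must be non-zero and strictly between -1 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    return math.ceil(((z_alpha + z_beta) / math.atanh(abs(r))) ** 2 + 3)


for r in (0.1, 0.3, 0.5):
    print(f"r={r}: n ~ {n_for_correlation(r)}")
```

Required N falls quickly as the expected effect grows, so it is worth fixing the smallest effect of interest before ruling datasets in or out on size alone.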

## Step 2: Search Strategy

Search from broadest to most specific. Use `find_tools` to discover available dataset search tools — don't rely on memorized tool names.

**Layer 1 — Cross-repository search (cast wide net):**
Search tools that index datasets across thousands of repositories. These find datasets you didn't know existed.
- Search by: research topic keywords, variable names, population descriptors
- Look for: DOI-registered datasets, repository listings, government data portals

**Layer 2 — Domain-specific repositories:**
Search repositories specialized for your data type.
- Health surveys: CDC, NHANES (search by variable name, not topic keywords)
- Genomics: SRA, ENA, ArrayExpress, GEO
- Proteomics: PRIDE, MassIVE
- Metabolomics: MetaboLights, Metabolomics Workbench
- Clinical: ClinicalTrials.gov (for trial data with results)

**Layer 3 — Literature-based discovery:**
Many datasets aren't in any repository — they're described in paper methods sections.
- Search PubMed/EuropePMC for papers that analyzed the relationship you're interested in
- Read their methods: "We used data from [DATASET NAME]" tells you exactly what exists
- Check supplementary materials for deposited data (GEO/SRA accession numbers)
- This is often the MOST effective strategy for finding niche datasets
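
Accession numbers in methods sections and supplementary text can be pulled out with a regular expression. The sketch below covers only a few common prefixes (GEO `GSE`/`GSM`/`GDS`, SRA/ENA/DDBJ run-level and project-level IDs, BioProject `PRJ...`); the pattern is an assumption and will miss other repositories, and the accession numbers in the sample text are made up.

```python
import re

# Assumed pattern: a small subset of GEO, SRA/ENA/DDBJ, and BioProject prefixes.
ACCESSION_RE = re.compile(
    r"\b(GSE\d+|GSM\d+|GDS\d+|PRJ[EDN][A-Z]\d+|[SED]R[PRXS]\d{6,})\b"
)


def find_accessions(text: str) -> list[str]:
    """Return unique accessions in order of first appearance."""
    return list(dict.fromkeys(m.group(1) for m in ACCESSION_RE.finditer(text)))


# Made-up methods text for illustration only.
methods = (
    "Raw reads were deposited in SRA under SRP123456 (BioProject PRJNA654321); "
    "processed counts are available at GEO (GSE98765). See also GSE98765."
)
print(find_accessions(methods))
```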

## Step 3: Evaluate Dataset Fitness

For each candidate dataset, assess these dimensions:

**Variables:**
- Does it contain your SPECIFIC exposure and outcome variables?
- Are they measured the way you need? (biomarker vs self-report, continuous vs categorical)
- Are key confounders available? (missing confounders = biased analysis)
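
The variable check can be made explicit by comparing the required exposure, outcome, and confounder variables against a candidate dataset's column names. This is a minimal sketch; the helper `missing_variables` and all variable names are hypothetical, and real codebooks often use coded names that need mapping first.

```python
def missing_variables(
    required: dict[str, list[str]], available: set[str]
) -> dict[str, list[str]]:
    """Return, per role, the required variables that the dataset lacks."""
    gaps = {
        role: [name for name in names if name not in available]
        for role, names in required.items()
    }
    # Keep only roles that actually have gaps.
    return {role: names for role, names in gaps.items() if names}


# Hypothetical requirements and dataset columns.
required = {
    "exposure": ["sleep_hours"],
    "outcome": ["hba1c"],
    "confounder": ["age", "sex", "bmi"],
}
available = {"age", "sex", "hba1c", "sleep_hours", "income"}

print(missing_variables(required, available))
```

An empty result means every required variable is present by name; it says nothing about whether each one is measured the way the analysis needs.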

**Design match:**
- If you need longitudinal: does it follow the SAME individuals over time? How many waves? What's the follow-up interval?
- Beware: "repeated cross-sections" (different people each wave) are NOT longitudinal
- If you need experimental: is there a proper control group? Randomization?
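
One quick way to tell a true panel from repeated cross-sections is to count how many distinct waves each participant ID appears in. The sketch below assumes a long-format table with one row per person per wave; the column names `person_id` and `wave` are hypothetical. It also assumes IDs are stable across waves, which should be confirmed in the study documentation because some surveys reuse or reissue IDs.

```python
import pandas as pd


def share_followed(df: pd.DataFrame, id_col: str = "person_id",
                   wave_col: str = "wave", min_waves: int = 2) -> float:
    """Fraction of participants observed in at least `min_waves` distinct waves."""
    waves_per_person = df.groupby(id_col)[wave_col].nunique()
    return float((waves_per_person >= min_waves).mean())


# Toy data: a panel (same people re-measured) vs repeated cross-sections.
panel = pd.DataFrame({"person_id": [1, 1, 2, 2, 3], "wave": [1, 2, 1, 2, 1]})
repeated = pd.DataFrame({"person_id": [1, 2, 3, 4], "wave": [1, 1, 2, 2]})

print("panel:", share_followed(panel))
print("repeated cross-sections:", share_followed(repeated))
```

A share near zero suggests different people were sampled in each wave, so the data cannot support within-person change analyses.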

**Sample:**
- Is the sample large enough for your analysis? (rule of thumb: logistic regression needs roughly 10 outcome events per predictor)
- Does the population match? (age range, geography, clinical characteristics)
- Are there subgroups you need? (stratified by sex, race, disease status)
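
The events-per-predictor rule of thumb can be checked directly once the outcome counts are known. In this sketch the function name and the example counts are hypothetical, and "events" is taken as the smaller of the two outcome classes, which is the usual convention for this heuristic.

```python
def events_per_predictor(n_events: int, n_non_events: int, n_predictors: int) -> float:
    """Events per predictor, using the less frequent outcome class as 'events'."""
    if n_predictors <= 0:
        raise ValueError("n_predictors must be positive")
    return min(n_events, n_non_events) / n_predictors


# Hypothetical dataset: 120 cases, 880 non-cases, 8 candidate predictors.
epv = events_per_predictor(n_events=120, n_non_events=880, n_predictors=8)
print(f"EPV = {epv:.1f}; meets the ~10 heuristic: {epv >= 10}")
```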

**Access:**
- Publicly downloadable (best) vs registration required (days) vs collaboration agreement (months) vs restricted (may be impossible)
- Data format: CSV/TSV (easy), XPT/SAS (need conversion), proprietary database (may need special software)

**Quality:**
- Is it from a well-known study with published methods? (NHANES, HRS, UK Biobank = high quality)
- Has it been used in peer-reviewed publications? (indicates data is usable)
- What's the response rate / missingness pattern?
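
Missingness can be summarized per variable and as the fraction of complete cases across the analysis variables. A minimal pandas sketch, with hypothetical helper and column names and toy values:

```python
import pandas as pd


def missingness_report(df: pd.DataFrame) -> pd.Series:
    """Fraction of missing values per column, highest first."""
    return df.isna().mean().sort_values(ascending=False)


def complete_case_fraction(df: pd.DataFrame, columns: list[str]) -> float:
    """Fraction of rows with no missing value in any of `columns`."""
    return float(df[columns].notna().all(axis=1).mean())


# Toy data for illustration only.
survey = pd.DataFrame({
    "age": [34, 51, 47, 62],
    "hba1c": [5.4, None, 6.1, None],
    "sleep_hours": [7.0, 6.5, None, 8.0],
})

print(missingness_report(survey))
print(complete_case_fraction(survey, ["age", "hba1c", "sleep_hours"]))
```

The complete-case fraction is often much lower than any single column's completeness, which matters when the planned model drops rows with missing values.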

## Step 4: Download and Analyze

Don't stop at finding datasets — download and analyze them. Write and run Python code via Bash. Never describe what you "would do" — execute it.

### Data Loading Cookbook

Choose the loader that matches your data source. When unsure of the format, download a small sample first and inspect.

```python
import requests, io, pandas as pd

# --- Tabular files (most common) ---
df = pd.read_csv("data.csv")                                # CSV / TSV (use sep="\t" for TSV)
df = pd.read_excel("data.xlsx")                              # Excel
df = pd.read_stata("data.dta")                               # Stata
df = pd.read_sas("data.xpt", format="xport")                # SAS transport (XPT)
df = pd.read_sas("data.sas7bdat", format="sas7bdat")        # SAS native
df = pd.read_parquet("data.parquet")                         # Parquet
df = pd.read_json("data.json")                               # JSON (records or columnar)
df = pd.read_fwf("data.dat")                                 # Fixed-width (some legacy surveys)

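# --- Download from URL first, then parse ---
# `url` is a placeholder for the dataset's direct download link.
resp = requests.get(url, timeout=120)
resp.raise_for_status()                       # fail on 4xx/5xx instead of parsing an error page
content = resp.content
name = url.lower().split("?")[0]              # ignore case and any query string
# A BytesIO buffer has no filename, so pandas cannot infer gzip; pass it explicitly
compression = "gzip" if name.endswith(".gz") else None
# Detect format from the URL suffix
if name.endswith(".xpt"):
    df = pd.read_sas(io.BytesIO(content), format="xport")
elif name.endswith((".csv", ".csv.gz")):
    df = pd.read_csv(io.BytesIO(content), compression=compression)
elif name.endswith((".tsv", ".tsv.gz")):
    df = pd.read_csv(io.BytesIO(content), sep="\t", compression=compression)
elif name.endswith(".json"):
    df = pd.read_json(io.BytesIO(content))
else:
    # Unknown suffix: try CSV, then inspect df.head() to confirm it parsed sensibly
    df = pd.read_csv(io.BytesIO(content))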

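# --- REST API pagination (common for GDC, ClinicalTrials.gov, etc.) ---
# `api_url` is a placeholder; parameter names and the "data" key vary by API.
all_records = []
offset = 0
while True:
    resp = requests.get(api_url, params={"offset": offset, "limit": 100}, timeout=30)
    resp.raise_for_status()
    batch = resp.json().get("data", [])
    if not batch:
        break
    all_records.extend(batch)
    offset += len(batch)                      # advance past the records just fetched
```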
