# tooluniverse-epidemiological-analysis

Complete workflow for observational epidemiology studies that generates executable Python code for every analytical step, from defining research questions using the PECO framework (Population, Exposure, Comparator, Outcome) through publication-ready statistical reports. Covers cohort, case-control, and cross-sectional designs with regression analysis, confounding adjustment, propensity score methods, and sensitivity testing. Use this skill when analyzing real epidemiological datasets like NHANES or UK Biobank, or when requiring end-to-end observational study analysis with full code implementation rather than conceptual guidance.

Repository: ToolUniverse

Install in Claude Code

git clone --depth 1 https://github.com/mims-harvard/ToolUniverse /tmp/tooluniverse-epidemiological-analysis && cp -r /tmp/tooluniverse-epidemiological-analysis/plugin/skills/tooluniverse-epidemiological-analysis ~/.claude/skills/tooluniverse-epidemiological-analysis

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Epidemiological Data Analysis

Complete workflow for observational epidemiology — from research question to publication-ready report. Write and run Python code for every step. Never describe what you "would do" — do it.

## Step 1: Formulate the Research Question (PECO Framework)

Define **P**opulation, **E**xposure, **C**omparator, **O**utcome before touching data.

- **Population**: Who? (e.g., adults aged 20-79, cancer patients stage III+, ICU admissions)
- **Exposure**: What factor? (e.g., nutrient intake, drug treatment, gene mutation, environmental pollutant)
- **Comparator**: Vs. what? (e.g., lowest tertile, unexposed, wild-type, placebo)
- **Outcome**: What health event? (e.g., disease incidence, survival time, biomarker level, mortality)

**Study design check**: Does the question require temporality?
- Cross-sectional: prevalence, associations at one time point
- Longitudinal/cohort: incidence, causal inference, temporal relationships
- Case-control: rare outcomes, odds ratios (can be nested within a cohort)
- Clinical trial: intervention effects with randomized controls

If the question implies causation ("does X cause Y?") but only cross-sectional data is available, state the limitation explicitly and proceed with association language.
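One way to pin the question down before touching data is to record it as a small structure. The sketch below is illustrative only: the `PECO` class, its field values, and the `requires_association_language` helper are hypothetical names, and the rule it encodes is just the cross-sectional caveat stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PECO:
    """Hypothetical container for a PECO research question."""
    population: str
    exposure: str
    comparator: str
    outcome: str
    design: str  # e.g. "cross-sectional", "cohort", "case-control", "trial"

    def requires_association_language(self) -> bool:
        # Cross-sectional data cannot establish temporality,
        # so findings are reported as associations only.
        return self.design == "cross-sectional"

    def summary(self) -> str:
        verb = "associated with" if self.requires_association_language() else "related over time to"
        return (f"In {self.population}, is {self.exposure} "
                f"(vs. {self.comparator}) {verb} {self.outcome}?")


question = PECO(
    population="adults aged 20-79",
    exposure="high nutrient intake",
    comparator="lowest tertile",
    outcome="biomarker level",
    design="cross-sectional",
)
print(question.summary())
```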

## Step 2: Find and Evaluate Data

Use ToolUniverse to discover datasets and find what prior studies used:

```python
# Search for relevant datasets — use find_tools to discover what's available
find_tools("dataset search")
find_tools("your domain keywords")  # e.g., "cancer genomics", "clinical trial", "survey health"

# Search literature for study precedents — papers cite their data sources
execute_tool("PubMed_search_articles", {"query": "[exposure] [outcome] [study design]", "max_results": 5})
execute_tool("EuropePMC_search_articles", {"query": "[exposure] [outcome] cohort", "limit": 5})
```

**Evaluate dataset fitness**: Does it have the exposure variable? The outcome? Key confounders (age, sex, plus domain-specific)? Adequate sample size?
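The fitness questions above can be turned into a quick programmatic check. This is a minimal sketch under assumptions: `check_dataset_fitness`, the column names, and the `min_n` threshold are placeholders to adapt to your dataset and power calculation.

```python
import pandas as pd


def check_dataset_fitness(df, exposure, outcome, confounders, min_n):
    """Report missing required columns and complete-case sample size."""
    required = [exposure, outcome, *confounders]
    missing_cols = [c for c in required if c not in df.columns]
    present = [c for c in required if c in df.columns]
    # Complete cases are counted over the required columns that do exist
    n_complete = int(df[present].dropna().shape[0]) if present else 0
    return {
        "missing_columns": missing_cols,
        "n_complete_cases": n_complete,
        "adequate_n": n_complete >= min_n and not missing_cols,
    }


# Toy data: the 'sex' confounder is absent and one outcome value is missing
toy = pd.DataFrame({
    "exposure": [0, 1, 1, 0],
    "outcome": [0, 1, 0, None],
    "age": [30, 45, 52, 61],
})
report = check_dataset_fitness(toy, "exposure", "outcome", ["age", "sex"], min_n=3)
print(report)
```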

**Power analysis** (run before committing to a dataset):

```python
from scipy.stats import norm
import numpy as np

def sample_size_logistic(p0, OR, alpha=0.05, power=0.80):
    """Approximate N per group (equal-sized exposed and unexposed groups) to detect OR.

    p0 is the outcome prevalence in the unexposed group; uses the Wald variance of log(OR).
    """
    p1 = (p0 * OR) / (1 - p0 + p0 * OR)  # outcome prevalence implied in the exposed group
    z_a, z_b = norm.ppf(1 - alpha/2), norm.ppf(power)
    n = ((z_a + z_b)**2 * (1/(p0*(1-p0)) + 1/(p1*(1-p1)))) / (np.log(OR))**2
    return int(np.ceil(n))

print(f"Need N={sample_size_logistic(0.10, 1.5)} per group for OR=1.5 with 10% baseline prevalence")
```
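When the dataset is already fixed, the same approximation can be inverted to ask what power a given sample size provides. The sketch below assumes the formula above is read as the size of each of two equal groups; `power_for_n` is a hypothetical helper, not part of any library.

```python
from scipy.stats import norm
import numpy as np


def power_for_n(n_per_group, p0, OR, alpha=0.05):
    """Approximate power to detect OR with n_per_group in each of two equal groups."""
    p1 = (p0 * OR) / (1 - p0 + p0 * OR)
    # Wald standard error of log(OR) for two groups of size n_per_group
    se = np.sqrt((1 / (p0 * (1 - p0)) + 1 / (p1 * (1 - p1))) / n_per_group)
    z_a = norm.ppf(1 - alpha / 2)
    return float(norm.cdf(abs(np.log(OR)) / se - z_a))


for n in (200, 500, 921):
    print(f"n per group={n}: power={power_for_n(n, 0.10, 1.5):.2f}")
```

If the achievable power falls well below the target, consider a different dataset or a larger minimum detectable effect before proceeding.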

## Step 3: Download and Prepare Data

Download data programmatically. Adapt the loading code to your data source's format.

```python
import pandas as pd
import requests, io

# Generic download helper — adapt URL and format to your source
def download_and_parse(url, fmt="csv"):
    r = requests.get(url, timeout=120)
    r.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
    content = io.BytesIO(r.content)
    if fmt == "xpt":
        return pd.read_sas(content, format="xport")
    elif fmt == "csv":
        return pd.read_csv(content)
    elif fmt == "tsv":
        return pd.read_csv(content, sep="\t")
    elif fmt == "stata":
        return pd.read_stata(content)
    elif fmt == "json":
        return pd.read_json(content)
    else:
        return pd.read_csv(content)  # default fallback

# Load and merge multiple files on shared ID column
df1 = download_and_parse(url1, fmt="xpt")
df2 = download_and_parse(url2, fmt="xpt")
df = df1.merge(df2, on="id_col", how="inner")

# Filter population (inclusion/exclusion criteria)
df = df[(df['age'] >= 20) & (df['age'] < 80)]

# Handle missing data
missing_pct = df.isnull().mean() * 100
print("Missing % per variable:\n", missing_pct[missing_pct > 0].sort_values(ascending=False))
# Decision: complete case if <5% missing; multiple imputation if 5-20%; drop variable if >20%

# Variable coding (adapt to your data)
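# right=False gives [20,40), [40,60), [60,80) so age 20 is kept and age 40 falls in '40-59'
df['age_group'] = pd.cut(df['age'], bins=[20,40,60,80], right=False, labels=['20-39','40-59','60-79'])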
df['outcome_binary'] = (df['outcome_continuous'] >= threshold).astype(int)
```
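The missing-data rule of thumb in the comment above (complete case below 5%, multiple imputation for 5-20%, drop above 20%) can be applied per variable. This is a sketch only: `missing_data_plan` and the toy columns are hypothetical, and the thresholds are heuristics rather than fixed rules.

```python
import pandas as pd


def missing_data_plan(df, low=5.0, high=20.0):
    """Suggest a handling strategy per column from its missing percentage."""
    pct = df.isnull().mean() * 100

    def decide(p):
        if p == 0:
            return "no action"
        if p < low:
            return "complete-case"
        if p <= high:
            return "multiple imputation"
        return "consider dropping"

    return pd.DataFrame({"missing_pct": pct.round(1), "strategy": pct.map(decide)})


# Toy data with 0%, 2.5%, 10%, and 30% missingness
toy = pd.DataFrame({
    "age": range(40),
    "bmi": [None] * 1 + [25.0] * 39,
    "smoking": [None] * 4 + [1.0] * 36,
    "income": [None] * 12 + [3.0] * 28,
})
plan = missing_data_plan(toy)
print(plan)
```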

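**Survey weights**: Some surveys (NHANES, BRFSS, MEPS) require sampling weights for valid inference. Check the survey documentation for the correct weight variable. `statsmodels.stats.weightstats` (`DescrStatsW`) covers weighted descriptive statistics; for weighted regression, pass the weights to a `statsmodels` GLM (`freq_weights` or `var_weights`) or use linearmodels. Weights correct the point estimates; standard errors under a complex design (strata, PSUs) may need additional design-based handling.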
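To see why weights matter, compare a weighted and an unweighted prevalence. The sketch below uses plain NumPy on made-up data; the `outcome` and `weight` columns are hypothetical, and it covers the point estimate only, not design-based variance.

```python
import numpy as np
import pandas as pd

# Toy survey: the two cases come from an oversampled stratum with small weights
toy = pd.DataFrame({
    "outcome": [1, 1, 0, 0, 0, 0],
    "weight": [500, 500, 3000, 3000, 1500, 1500],
})

unweighted = toy["outcome"].mean()
# Each respondent counts in proportion to the population they represent
weighted = np.average(toy["outcome"], weights=toy["weight"])

print(f"Unweighted prevalence: {unweighted:.3f}")
print(f"Weighted prevalence:   {weighted:.3f}")
```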

**REST API data**: For sources like GDC (TCGA), ClinicalTrials.gov, or OpenTargets, paginate through the API:
```python
all_records = []
offset = 0
while True:
    resp = requests.get(f"{api_url}?offset={offset}&limit=500", timeout=30)
    resp.raise_for_status()  # stop on HTTP errors rather than looping on bad responses
    batch = resp.json().get("data", [])
    if not batch:
        break
    all_records.extend(batch)
    offset += len(batch)
df = pd.DataFrame(all_records)
```
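The same offset loop can be separated from the HTTP call so it is testable offline and cannot run forever. This is a sketch under assumptions: `paginate`, `fetch_page`, and `max_pages` are hypothetical names, and a real API may use cursors or page numbers instead of offsets.

```python
from typing import Callable, Dict, List


def paginate(fetch_page: Callable[[int, int], List[Dict]],
             limit: int = 500, max_pages: int = 1000) -> List[Dict]:
    """Collect records by calling fetch_page(offset, limit) until it returns nothing."""
    records: List[Dict] = []
    offset = 0
    for _ in range(max_pages):  # hard cap guards against an API that never returns empty
        batch = fetch_page(offset, limit)
        if not batch:
            break
        records.extend(batch)
        offset += len(batch)
    return records


# Stand-in for a real API: serves slices of an in-memory list
_FAKE_DATA = [{"id": i} for i in range(1203)]


def fake_fetch(offset, limit):
    return _FAKE_DATA[offset:offset + limit]


rows = paginate(fake_fetch)
print(len(rows))
```

In practice `fetch_page` would wrap the `requests.get` call and return the list of records from the JSON response.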

## Step 4: Descriptive Statistics (Table 1)

```python
# Table 1: mean +/- SD for continuous, N(%) for categorical, by exposure group
continuous_vars = ['age', 'bmi']  # adapt to your variables
for var in continuous_vars:
    print(df.groupby('exposure_group')[var].agg(['mean', 'std', 'count']))

categorical_vars = ['sex', 'race']  # adapt to your variables
for var in categorical_vars:
    print(pd.crosstab(df['exposure_group'], df[var], normalize='index') * 100)
```
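The printed summaries above can be assembled into a single Table 1 with one column per exposure group. A minimal sketch, assuming a hypothetical `table_one` helper and toy column names; adapt the formatting to your journal's style.

```python
import pandas as pd


def table_one(df, group, continuous, categorical):
    """Build a Table 1: mean +/- SD for continuous, n (%) for categorical, by group."""
    rows = {}
    grouped = df.groupby(group)
    for var in continuous:
        stats = grouped[var].agg(["mean", "std"])
        rows[f"{var}, mean +/- SD"] = stats.apply(
            lambda r: f"{r['mean']:.1f} +/- {r['std']:.1f}", axis=1)
    for var in categorical:
        counts = pd.crosstab(df[var], df[group])
        # Percentages are within each exposure group (column-wise)
        pcts = pd.crosstab(df[var], df[group], normalize="columns") * 100
        for level in counts.index:
            rows[f"{var} = {level}, n (%)"] = pd.Series(
                {g: f"{counts.loc[level, g]} ({pcts.loc[level, g]:.1f})"
                 for g in counts.columns})
    return pd.DataFrame(rows).T


toy = pd.DataFrame({
    "exposure_group": ["low", "low", "high", "high"],
    "age": [30, 40, 50, 70],
    "sex": ["F", "M", "F", "F"],
})
t1 = table_one(toy, "exposure_group", ["age"], ["sex"])
print(t1)
```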

Check distributions: `df[var].skew()`, `scipy.stats.shapiro()`, histograms for outliers.
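These checks can be bundled into one report per variable. The sketch below is illustrative: `distribution_report` is a hypothetical helper, and the 1.5 x IQR rule is swapped in as a numeric stand-in for eyeballing outliers on a histogram.

```python
import numpy as np
import pandas as pd
from scipy import stats


def distribution_report(series):
    """Skewness, Shapiro-Wilk p-value, and IQR-rule outlier count for one variable."""
    x = series.dropna()
    q1, q3 = x.quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]
    _, p_value = stats.shapiro(x)
    return {
        "skew": float(x.skew()),
        "shapiro_p": float(p_value),
        "n_outliers": int(len(outliers)),
    }


# Toy right-skewed variable (log-normal), e.g. a biomarker concentration
rng = np.random.default_rng(0)
biomarker = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=200))
report = distribution_report(biomarker)
print(report)
```

A strongly skewed variable like this one is a candidate for log transformation or for categorization before regression.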

## Step 5: Regression Analysis

**Sequential adjustment strategy** (build evidence for confounding):

```python
import statsmodels.formula.api as smf
import numpy as np

# Model 1: Unadjusted
m1 = smf.logit('outcome ~ exposure', data=df).fit(disp=0)

# Model 2: + demographics
m2 = smf.logit('outcome ~ exposure + age + sex + race', data=df).fit(disp=0)

# Model 3: + clinical factors
m3 = smf.logit('outcome ~ exposure + age + sex + race + bmi + smoking + alcohol', data=df).fit(disp=0)

# Report ORs with 95% CI
for name, model in [('Unadjusted', m1), ('Demographics', m2), ('Fully adjusted', m3)]:
    or_val = np.exp(model.params['exposure'])
    ci = np.exp(model.conf_int().loc['exposure'])
    print(f"{name}: OR={or_val:.2f} (95% CI: {ci[0]:.2f}-{ci[1]:.2f}), p={model.pvalues['exposure']:.4f}")
```
