tooluniverse-epigenomics

The tooluniverse-epigenomics skill processes genome-wide epigenomic datasets including DNA methylation, ChIP-seq peaks, ATAC-seq accessibility, and histone modifications using pandas, scipy, and pysam combined with ToolUniverse annotation tools. Use it for methylation analysis, chromatin state classification, multi-omics integration, and epigenomic statistics, particularly when working with long-format methylation data where row counts rather than unique positions matter for filtering questions.

Install in Claude Code
git clone --depth 1 https://github.com/mims-harvard/ToolUniverse /tmp/tooluniverse-epigenomics && cp -r /tmp/tooluniverse-epigenomics/plugin/skills/tooluniverse-epigenomics ~/.claude/skills/tooluniverse-epigenomics
Then open a new Claude Code session; the skill loads automatically.

SKILL.md

# Genomics and Epigenomics Data Processing

## ⚠️ TOP-OF-MIND RULE: long-format methylation CSV — count ROWS, not unique positions

When the input is a long-format methylation CSV (one row per `(sample, CpG_position)`
e.g. columns `Pos, Chromosome, MethylationPercentage`), "how many sites are
removed when filtering" almost always means **rows removed**, NOT unique-position
removals. The two answers differ by a factor of ≈ `n_samples`.

| Question phrasing | What it means |
|---|---|
| "how many sites are removed when filtering …" | **rows removed** (= samples × positions failing the filter) |
| "how many unique CpG sites pass filter" | **unique positions** (dedupe by `Pos` then filter) |

❌ WRONG: `filtered = df.drop_duplicates(["Pos"]).query("MethylationPercentage<10 or MethylationPercentage>90")` then `len(filtered)` → counts unique positions (typically 100–1500)

✅ RIGHT: `filtered = df.query("MethylationPercentage<10 or MethylationPercentage>90")` then `len(df) - len(filtered)` → counts rows (typically 10k–30k)

If your answer is < 2000 when the data has 1000+ positions × 20+ samples, you
deduplicated too early. Re-read the question's noun before reporting.
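
The gap between the two counts can be seen on a toy table. This is a minimal sketch with made-up values: the `Sample` column and the data are assumptions for illustration, the filter expression is the one from the RIGHT example above, and the unique-position count dedupes on `(Chromosome, Pos)` before filtering.

```python
import pandas as pd

# Synthetic long-format table: 3 samples x 4 CpG positions (values are made up).
df = pd.DataFrame({
    "Sample": ["s1"] * 4 + ["s2"] * 4 + ["s3"] * 4,  # hypothetical column
    "Chromosome": ["chr1", "chr1", "chr2", "chr2"] * 3,
    "Pos": [100, 200, 300, 400] * 3,
    "MethylationPercentage": [5, 50, 95, 40,
                              8, 55, 92, 45,
                              3, 60, 97, 50],
})

expr = "MethylationPercentage < 10 or MethylationPercentage > 90"

# Row-level answer: filter every (sample, position) row, then count the difference.
filtered = df.query(expr)
rows_removed = len(df) - len(filtered)

# Unique-position answer: dedupe first, then filter (one row per position survives).
unique_pos = df.drop_duplicates(["Chromosome", "Pos"])
unique_pos_removed = len(unique_pos) - len(unique_pos.query(expr))

print(rows_removed, unique_pos_removed)  # 6 2
```

Here the row-level count is three times the unique-position count, which matches the number of samples in the toy table.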

---

## RULE ZERO — Check for pre-computed results FIRST

Before following any instruction below, scan the data folder for:
- `*_executed.ipynb` → read with `tu run read_executed_notebook '{"data_folder":"<path>","search":"<keyword>"}'` and cite its cell outputs as the authoritative answer
- Pre-computed result files (CSV/TSV with names like `*results*`, `*deseq*`, `*enrich*`, `*stats*`, `*_simplified.csv`) → read directly and report the requested value
- Canonical analysis scripts (`analysis.R`, `run_*.py`, `find_*.R`, `*.Rmd`) → execute as-is and read the output

Only follow this skill's re-analysis recipe below if **none** of the above exist. Re-running from raw data can produce numbers that differ from the published answer and is much slower (often 5–10× the turn count).
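
The folder scan can be sketched in plain Python. The helper name `find_precomputed` and the category labels are hypothetical; the glob patterns are the ones listed above, with the result-file patterns restricted to `.csv`/`.tsv` as the checklist describes.

```python
from fnmatch import fnmatch
from pathlib import Path

# Patterns copied from the checklist above.
RESULT_PATTERNS = ["*results*", "*deseq*", "*enrich*", "*stats*", "*_simplified.csv"]
SCRIPT_PATTERNS = ["analysis.R", "run_*.py", "find_*.R", "*.Rmd"]


def classify(path):
    """Return the Rule Zero category for one file, or None if it matches nothing."""
    name = path.name
    if fnmatch(name, "*_executed.ipynb"):
        return "executed_notebook"
    is_table = path.suffix.lower() in {".csv", ".tsv"}
    if is_table and any(fnmatch(name.lower(), pat) for pat in RESULT_PATTERNS):
        return "precomputed_result"
    if any(fnmatch(name, pat) for pat in SCRIPT_PATTERNS):
        return "analysis_script"
    return None


def find_precomputed(data_folder):
    """Walk data_folder recursively and group matching files by category."""
    hits = {"executed_notebook": [], "precomputed_result": [], "analysis_script": []}
    for path in sorted(Path(data_folder).rglob("*")):
        if path.is_file():
            category = classify(path)
            if category is not None:
                hits[category].append(path)
    return hits
```

If every list in the returned dictionary is empty, none of the shortcuts apply and the re-analysis recipe is the fallback.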

---

Production-ready skill combining Python computation (pandas, scipy, numpy, pysam, statsmodels) with ToolUniverse annotation tools for epigenomics analysis.

## LOOK UP, DON'T GUESS
When uncertain about any scientific fact, SEARCH databases first.

## When to Use

Methylation data, ChIP-seq peaks, ATAC-seq, multi-omics integration, genome-wide epigenomic statistics. Keywords: methylation, CpG, ChIP-seq, ATAC-seq, histone, chromatin, epigenetic.

**NOT for**: RNA-seq DEG, variant calling, gene enrichment, protein structure.

---

## Key Principles

1. **Data-first** - Load/inspect before analysis
2. **Question-driven** - Extract the specific numeric answer the question asks for
3. **Coordinate system awareness** - Track genome build (hg19/hg38/mm10), chr prefix
4. **Statistical rigor** - FDR correction, effect size filtering
5. **CpG identification** - Parse Illumina probe IDs, genomic coordinates

## PRIMARY SCRIPT — methylation_density.py (use FIRST for CpG-density questions)

For long-format methylation CSVs (`Pos, Chromosome, MethylationPercentage`)
paired with chromosome-length CSVs, ALWAYS run the bundled script before
hand-rolling pandas. It deterministically computes every common metric in one
pass and avoids the rows-vs-sites pitfall that produces silently-wrong answers.

```bash
python skills/tooluniverse-epigenomics/scripts/methylation_density.py \
  --cpg <CpG csv> --chr-lengths <chr lengths csv> \
  --filter-meth-extremes 90 10
```

The full JSON output contains every metric. Pick the one that matches the
question's wording (NOT a similar-looking one):

| Question phrasing                                              | Script field              |
|----------------------------------------------------------------|---------------------------|
| "how many sites are removed when filtering …"                  | `rows_removed`            |
| "how many unique CpG sites pass filter"                        | `unique_pos_after_filter` |
| "genome-wide AVERAGE chromosomal density"                      | `density_avg_per_chr`     |
| "density on chromosome X"                                      | `density_chromosome` (pass `--chromosome X`) |
| "total density across the genome"                              | `density_total_over_genome` |

The two density numbers (`density_avg_per_chr` vs `density_total_over_genome`)
typically differ by ~2× because CpGs are not uniformly distributed across
chromosomes; reporting one when the question asks for the other is the most
common failure mode here.
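
A small numeric illustration of why the two densities diverge. The counts and chromosome lengths below are invented, and the variable names only mirror the script's field names; this is not the bundled script's implementation.

```python
import pandas as pd

# Made-up unique CpG counts and chromosome lengths (bp), for illustration only.
per_chr = pd.DataFrame({
    "Chromosome": ["chr1", "chr2", "chr3"],
    "n_unique_cpg": [400, 300, 200],
    "length_bp": [2_000_000, 1_000_000, 200_000],
}).set_index("Chromosome")

# Density per chromosome, then the unweighted mean across chromosomes.
density = per_chr["n_unique_cpg"] / per_chr["length_bp"]
density_avg_per_chr = density.mean()

# Pooled density: all CpGs divided by the summed chromosome lengths.
density_total_over_genome = per_chr["n_unique_cpg"].sum() / per_chr["length_bp"].sum()

print(density_avg_per_chr, density_total_over_genome)
```

The short, CpG-dense `chr3` pulls the per-chromosome mean up, while the pooled figure is dominated by the long chromosomes, so the mean comes out noticeably larger than the pooled value.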

For "sites removed" questions, the long-format CSV has multiple rows per CpG
position (one per sample), so `rows_removed` is in the tens of thousands while
`unique_pos_removed` is in the hundreds. Match the granularity to the question.

## Distinguish "rows" vs "unique sites" — methylation CSVs are usually long-format

CpG methylation CSVs typically have ONE ROW PER (sample × CpG site) — so `len(df) >> n_unique_sites`. Before computing anything, decide which axis the question is asking about:

| Question phrasing | Axis | Operation |
|-------------------|------|-----------|
| "how many sites are removed when filtering" | sample-rows | filter then count rows; do NOT dedupe by `Pos`. The CSV is in long format; "sites" here is row-shaped. Subtract `len(df_filtered)` from `len(df)`. |
| "how many unique CpG sites pass filter" | unique positions | dedupe by position (or `Pos` column), then filter |
| **"genome-wide average chromosomal density"** | per-chromosome density | MEAN of per-chromosome densities: `(n_unique_per_chr / chr_length).mean()`. NOT `total_unique / total_genome` — that gives a different answer (typically ≈ ½ of the per-chr mean for unevenly distributed CpGs). |
| **"density on chromosome X"** | single chromosome | unique positions on X / length(X). Be careful which species — check the question text for "Zebra Finch" vs "Jackdaw". |
| "chi-square for uniform distribution across chromosomes" | unique positions per chromosome | filter rows first, then dedupe by `(Chromosome, Pos)`, then count per-chromosome unique positions for chi-sq |
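
The chi-square setup in the last row can be sketched as follows. The data, the `Sample` column, and the chromosome lengths are invented. Taking expected counts proportional to chromosome length is an assumption about what "uniform" means; a question may instead intend equal counts per chromosome.

```python
import pandas as pd

# Synthetic long-format rows: 2 samples, 3 positions on chr1 and 2 on chr2.
df = pd.DataFrame({
    "Sample": ["s1", "s2"] * 5,  # hypothetical column
    "Chromosome": ["chr1"] * 6 + ["chr2"] * 4,
    "Pos": [10, 10, 20, 20, 30, 30, 10, 10, 20, 20],
    "MethylationPercentage": [5, 95, 50, 8, 96, 97, 2, 50, 45, 55],
})
chr_lengths = pd.Series({"chr1": 2000, "chr2": 2000})  # made-up lengths in bp

# 1. Filter rows first, 2. dedupe by (Chromosome, Pos), 3. count per chromosome.
filtered = df.query("MethylationPercentage < 10 or MethylationPercentage > 90")
unique_sites = filtered.drop_duplicates(["Chromosome", "Pos"])
observed = (unique_sites["Chromosome"].value_counts()
            .reindex(chr_lengths.index, fill_value=0))

# Assumption: "uniform" means expected counts proportional to chromosome length.
expected = observed.sum() * chr_lengths / chr_lengths.sum()
chi2 = ((observed - expected) ** 2 / expected).sum()
dof = len(observed) - 1

print(observed.tolist(), chi2, dof)  # [3, 1] 1.0 1
```

For a p-value, the same `observed` and `expected` arrays can be passed to `scipy.stats.chisquare`.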