ClaudeWave
Skill · 1.4k repo stars · updated today

tooluniverse-data-wrangling

The tooluniverse-data-wrangling skill provides Python code patterns for downloading and parsing scientific data across multiple formats and APIs when dedicated ToolUniverse tools are unavailable, return only metadata, or cannot handle bulk operations. Use it to process genomic formats such as VCF, BAM, h5ad, and GCT, run multi-step API workflows from search through parsing, retrieve thousands of records in bulk, or reach data sources that have no tool coverage.

Install in Claude Code
git clone --depth 1 https://github.com/mims-harvard/ToolUniverse /tmp/tooluniverse-data-wrangling && cp -r /tmp/tooluniverse-data-wrangling/plugin/skills/tooluniverse-data-wrangling ~/.claude/skills/tooluniverse-data-wrangling
Then open a new Claude Code session; the skill loads automatically.

SKILL.md

# Data Wrangling: Universal Access Patterns

Reference for downloading and parsing scientific data from any source. Write and run Python code via Bash for every step.

## When to Use

- ToolUniverse tool returned metadata/search results but you need **raw or bulk data**
- Data is in a format tools don't parse (VCF, h5ad, BAM, SDF, GCT)
- You need a **multi-step API workflow** (search -> filter -> download -> parse)
- The data source has **no ToolUniverse tool** at all
- You need **thousands of records**, not the 10-100 a tool returns
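
The multi-step workflow in the list above (search -> filter -> download -> parse) can be sketched as one function that takes each stage as a callable. This is a minimal illustration, not part of ToolUniverse: `run_workflow`, `CATALOG`, `FILES`, and `fake_search` are hypothetical names, and the in-memory stand-ins replace what would be real HTTP calls.

```python
import csv
import io
from typing import Callable, Iterable, List


def run_workflow(
    search: Callable[[str], Iterable[dict]],
    keep: Callable[[dict], bool],
    download: Callable[[dict], str],
    parse: Callable[[str], list],
    query: str,
) -> List[list]:
    """Search, keep matching hits, download each one, and parse the payload."""
    parsed = []
    for hit in search(query):
        if not keep(hit):
            continue  # filter step: skip hits we cannot or need not parse
        parsed.append(parse(download(hit)))
    return parsed


# Hypothetical in-memory stand-ins for a search API and a file server.
CATALOG = [
    {"id": "A1", "format": "csv"},
    {"id": "B2", "format": "bam"},
    {"id": "C3", "format": "csv"},
]
FILES = {"A1": "gene,count\nTP53,4\n", "C3": "gene,count\nBRCA1,7\n"}


def fake_search(query: str) -> list:
    """Return every catalog entry for '*', else entries whose id contains the query."""
    return [hit for hit in CATALOG if query == "*" or query in hit["id"]]


def parse_csv(text: str) -> list:
    """Parse CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))


results = run_workflow(
    search=fake_search,
    keep=lambda hit: hit["format"] == "csv",
    download=lambda hit: FILES[hit["id"]],
    parse=parse_csv,
    query="*",
)
```

Keeping the stages separate means the download and parse steps can be swapped for any format in the cookbook without touching the loop.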

## Decision: Tool vs Code

| Situation | Use |
|-----------|-----|
| Single record lookup, simple search, <100 results | ToolUniverse tool (`execute_tool`) |
| Bulk download, custom filtering, format conversion | Write Python code |
| Tool exists but returns truncated results | Write code using the same API the tool wraps |
| No tool exists for this source | Write code directly |
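
When a tool returns truncated results, the code path usually comes down to paging through the same API until it runs dry. A minimal sketch, assuming an offset/limit style endpoint; `fetch_all` and `fake_fetch_page` are hypothetical names, and a real `fetch_page` would wrap an HTTP request whose parameter names depend on the API.

```python
from typing import Callable, Iterator, List


def fetch_all(
    fetch_page: Callable[[int, int], List[dict]],
    page_size: int = 100,
    max_records: int = 100_000,
) -> Iterator[dict]:
    """Yield records page by page until a short page or max_records is reached."""
    offset = 0
    while offset < max_records:
        page = fetch_page(offset, page_size)
        # Never yield more than the caller's overall cap.
        yield from page[: max_records - offset]
        if len(page) < page_size:
            break  # a short (or empty) page means there is nothing left
        offset += len(page)


# Hypothetical stand-in for a paginated API holding 250 records.
DATA = [{"id": i} for i in range(250)]


def fake_fetch_page(offset: int, limit: int) -> List[dict]:
    return DATA[offset : offset + limit]


records = list(fetch_all(fake_fetch_page, page_size=100))
capped = list(fetch_all(fake_fetch_page, page_size=100, max_records=120))
```

The `max_records` cap is a safety valve so that a misbehaving endpoint cannot keep the loop running indefinitely.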

---

## Section A: Format Cookbook

### Tabular
```python
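import io
import pandas as pd

# `content` (used below) stands for raw bytes already in memory, e.g. requests.get(url).content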
df = pd.read_csv("data.csv")                                # CSV
df = pd.read_csv("data.tsv", sep="\t")                      # TSV
df = pd.read_sas(io.BytesIO(content), format="xport")       # SAS Transport (XPT) — NHANES, CDC
df = pd.read_sas("data.sas7bdat", format="sas7bdat")        # SAS native
df = pd.read_stata("data.dta")                               # Stata — ICPSR, HRS
df = pd.read_parquet("data.parquet")                         # Parquet — MIMIC-IV
df = pd.read_excel("data.xlsx")                              # Excel
df = pd.read_spss("data.sav")                                # SPSS
df = pd.read_fwf("data.dat")                                 # Fixed-width — legacy surveys
```

### Genomics
```python
import io
import pandas as pd
from Bio import SeqIO
records = list(SeqIO.parse("seqs.fasta", "fasta"))           # FASTA
records = list(SeqIO.parse("reads.fastq", "fastq"))          # FASTQ

# VCF (no cyvcf2 needed): drop "##" meta lines, keep the "#CHROM" header row
with open("vars.vcf") as fh:
    vcf_lines = [l for l in fh if not l.startswith("##")]
df = pd.read_csv(io.StringIO("".join(vcf_lines)), sep="\t")

df = pd.read_csv("genes.gff3", sep="\t", comment="#",        # GFF/GTF
     names=["seqid","source","type","start","end","score","strand","phase","attrs"])
df = pd.read_csv("regions.bed", sep="\t", header=None,       # BED
     names=["chrom","start","end","name","score","strand"])

import pysam                                                  # BAM (requires pysam)
bam = pysam.AlignmentFile("aligned.bam", "rb")
for read in bam.fetch("chr1", 1000, 2000):                    # region fetch needs a .bai index
    print(read.query_name)
```
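
The VCF pattern above leaves the INFO column as one semicolon-delimited string. The sketch below splits it into a dict per variant using only the standard library. `parse_info`, `read_vcf`, and the sample VCF text are illustrative assumptions, and values are kept as strings rather than typed according to the header.

```python
import csv
import io


def parse_info(info: str) -> dict:
    """Split a VCF INFO string like 'DP=14;AF=0.5;DB' into a dict.

    Flags without '=' map to True; '.' or an empty string means no INFO.
    """
    if info in ("", "."):
        return {}
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def read_vcf(handle):
    """Yield one dict per variant, skipping '##' meta lines."""
    lines = (line for line in handle if not line.startswith("##"))
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        row["INFO"] = parse_info(row["INFO"])
        yield row


# Invented two-variant example; the column names follow the VCF header row.
SAMPLE_VCF = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t12345\trs1\tA\tG\t50\tPASS\tDP=14;AF=0.5;DB\n"
    "chr2\t678\t.\tC\tT\t.\t.\t.\n"
)
variants = list(read_vcf(io.StringIO(SAMPLE_VCF)))
```

For a real file, pass an open file handle instead of the `io.StringIO` wrapper.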

### Structural
```python
from Bio.PDB import PDBParser, MMCIFParser
parser = PDBParser(QUIET=True)
structure = parser.get_structure("prot", "structure.pdb")     # PDB

parser = MMCIFParser(QUIET=True)
structure = parser.get_structure("prot", "structure.cif")     # mmCIF

from rdkit import Chem                                        # SDF/MOL (requires rdkit)
supplier = Chem.SDMolSupplier("compounds.sdf")
mols = [m for m in supplier if m is not None]
```

### Omics Matrices
```python
import anndata
import pandas as pd
adata = anndata.read_h5ad("expression.h5ad")                 # AnnData (scRNA-seq, spatial)

import scipy.io
mat = scipy.io.mmread("matrix.mtx")                          # 10X Genomics MTX
barcodes = pd.read_csv("barcodes.tsv", header=None)[0].tolist()
features = pd.read_csv("features.tsv", sep="\t", header=None)[1].tolist()

df = pd.read_csv("expression.gct", sep="\t", skiprows=2)     # GCT (gene expression)

import loompy                                                 # Loom (legacy single-cell)
ds = loompy.connect("data.loom")
```

### Mass Spectrometry & Flow Cytometry
```python
from pyteomics import mzml                                    # mzML (proteomics, requires pyteomics)
spectra = list(mzml.read("spectra.mzML"))

import fcsparser                                              # FCS (flow cytometry, requires fcsparser)
meta, data = fcsparser.parse("sample.fcs", reformat_meta=True)
```

### Neuroimaging
```python
import nibabel as nib                                         # NIfTI (requires nibabel)
img = nib.load("brain.nii.gz")
data = img.get_fdata()  # 3D/4D numpy array

# DICOM (requires pydicom)
import pydicom
dcm = pydicom.dcmread("scan.dcm")
pixel_data = dcm.pixel_array
```

### Phylogenetics & Systems Biology
```python
from Bio import Phylo                                         # Newick/Nexus (BioPython)
tree = Phylo.read("tree.nwk", "newick")
tree = Phylo.read("tree.nex", "nexus")

import libsbml                                                # SBML (systems biology, requires python-libsbml)
reader = libsbml.SBMLReader()
doc = reader.readSBML("model.xml")
model = doc.getModel()
```

### Serialized
```python
import json, xml.etree.ElementTree as ET, h5py
import pandas as pd

data = json.load(open("data.json"))                           # JSON
df = pd.read_json("records.json")                             # JSON -> DataFrame
tree = ET.parse("data.xml"); root = tree.getroot()            # XML
f = h5py.File("data.h5", "r"); dataset = f["group/data"][:]   # HDF5
```

### Compressed
```python
import io
import pandas as pd

df = pd.read_csv("data.csv.gz")                              # gzip (pandas auto-detects)
df = pd.read_csv("data.tsv.gz", sep="\t")                    # gzip TSV

import zipfile
with zipfile.ZipFile(io.BytesIO(content)) as z:               # ZIP
    df = pd.read_csv(z.open(z.namelist()[0]))

import tarfile
with tarfile.open("archive.tar.gz") as t:                     # tar.gz
    f = t.extractfile(t.getnames()[0])
    df = pd.read_csv(f)
```
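
In the tar.gz pattern above, `t.getnames()[0]` can be a directory entry, in which case `extractfile` returns `None`. A minimal sketch that instead picks the first regular `.csv` member, using only the standard library; `read_first_csv` and the in-memory archive are illustrative assumptions.

```python
import csv
import io
import tarfile


def read_first_csv(fileobj) -> list:
    """Return the rows of the first regular .csv member in a tar archive."""
    with tarfile.open(fileobj=fileobj, mode="r:*") as archive:
        for member in archive:
            if member.isfile() and member.name.endswith(".csv"):
                raw = archive.extractfile(member).read()
                return list(csv.DictReader(io.StringIO(raw.decode("utf-8"))))
    raise FileNotFoundError("no .csv member in archive")


# Build a small in-memory tar.gz (invented contents) whose first entry is a directory.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
    folder = tarfile.TarInfo("data")
    folder.type = tarfile.DIRTYPE
    archive.addfile(folder)
    payload = b"sample,value\ns1,3\ns2,5\n"
    info = tarfile.TarInfo("data/table.csv")
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))
buffer.seek(0)

rows = read_first_csv(buffer)
```

With a file on disk, `tarfile.open("archive.tar.gz")` works the same way; the `fileobj` form is used here so the example needs no files.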

---

## Section B: API Patterns by Domain

Each category shows: which ToolUniverse tools exist, and how to go beyond them with direct API calls.

### 1. NCBI E-utilities (Gene, Nucleotide, Protein, SRA, GEO)
Tools: `NCBIGene_search`, `NCBI_search_nucleotide`, `SRA_search_experiments`, `geo_search_datasets`
```python
import requests
base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
# Search -> get IDs -> fetch records in batches
ids = requests.get(f"{base}/esearch.fcgi?db=gene&term=BRCA
```

setup-tooluniverse (Skill)

Install and configure ToolUniverse for any use case — MCP server (chat-based), CLI (command line with 9 subcommands), or Python SDK (Coding API with 3 calling patterns). Covers uv/uvx setup, MCP configuration for 12+ AI clients (Cursor, Claude Desktop, Windsurf, VS Code, Codex, Gemini CLI, Trae, Cline, etc.), full CLI reference (tu list/grep/find/info/run/test/status/build/serve), Coding API quickstart, agentic tools, code executor, API key walkthrough, skill installation, and upgrading. Use when user asks how to set up ToolUniverse, which access mode to use (MCP vs CLI vs SDK), configuring MCP servers, using the CLI, troubleshooting installation, upgrading, or mentions installing ToolUniverse or setting up scientific tools. Also triggers for "how do I use ToolUniverse", "what's the best way to access tools", "command line", "tu command", "coding API", "tu build".

tooluniverse-acmg-variant-classification (Skill)

Systematic ACMG/AMP germline variant classification with all 28 criteria (PVS1, PS1-4, PM1-6, PP1-5, BA1, BS1-4, BP1-7) for clinical significance. Produces 5-tier verdict (Pathogenic / Likely Pathogenic / VUS / Likely Benign / Benign) with cited evidence per criterion. Use for variant interpretation, VUS resolution, and pathogenicity assessment. Combines ClinVar, gnomAD, computational predictors, and gene-mechanism context.

tooluniverse-admet-prediction (Skill)

Comprehensive ADMET (Absorption, Distribution, Metabolism, Excretion, Toxicity) profiling for drug candidates. Integrates ADMET-AI predictions, SwissADME drug-likeness, PubChemTox experimental toxicity, ChEMBL clinical data, Lipinski rule-of-five, and CYP interaction data. Use for drug-likeness assessment, BBB penetration, bioavailability, hepatotoxicity prediction, ADME/PK profiling, or screening compound libraries before lab testing.

tooluniverse-adverse-event-detection (Skill)

Detect and analyze adverse drug event signals using FDA FAERS reports, drug labels, and disproportionality statistics (PRR, ROR, IC). Generates quantitative safety signal scores (0-100) with evidence grading. Use for post-market surveillance, pharmacovigilance, drug safety assessment, regulatory submissions, and detecting rare AE signals not visible in clinical trials.

tooluniverse-adverse-outcome-pathway (Skill)

Map environmental and industrial chemicals to adverse outcome pathways (AOPs) — molecular initiating event to organ-level toxicity. Uses AOPWiki, GHS classification, IARC carcinogen status, and LD50 data. Use for environmental/industrial chemical risk assessment, regulatory-grade hazard characterization, and AOP stressor mapping. Distinct from drug-safety analysis (use tooluniverse-pharmacovigilance for drugs).

tooluniverse-aging-senescence (Skill)

Aging biology, cellular senescence, and longevity research. Covers senescence markers (p16/CDKN2A, SASP, SA-beta-gal), aging hallmarks, senolytic drug discovery (dasatinib+quercetin, fisetin, navitoclax), epigenetic clocks, telomere biology, and longevity GWAS. Use for senescence-pathway analysis, age-related disease genetics, senolytic-target discovery, and centenarian-genetics queries. Distinguishes correlative vs causal evidence (knockout, intervention).

tooluniverse-antibody-engineering (Skill)

Therapeutic antibody engineering and optimization, lead-to-clinical-candidate. Covers sequence humanization (germline alignment, framework retention), affinity maturation, developability (aggregation, stability, PTMs), structure modeling (AlphaFold/PDB CDR analysis), immunogenicity prediction, and manufacturing feasibility. Use for biologic-drug optimization, mAb design review, biosimilar engineering, and clinical-precedent comparison.

tooluniverse-binder-discovery (Skill)

Discover novel small-molecule binders for protein targets using structure-based and ligand-based screening. Covers druggability assessment, known-ligand mining (ChEMBL, BindingDB), similarity expansion, ADMET filtering, and synthesis feasibility. Use for hit identification, virtual screening, target-to-compounds workflows, and lead-finding before commit-to-medchem.