tooluniverse-natural-product-dereplication
This skill dereplicates putative natural products against the NPAtlas database of microbial metabolites to determine whether a compound is already known, which microorganism produces it, and its chemical taxonomy via ClassyFire. Use it to answer questions about whether a molecule is a documented natural product, identify its microbial producer and literature reference, classify it into the ChemOnt hierarchy, or resolve chemical identities from formulas, exact masses, InChIKeys, or SMILES strings.
git clone --depth 1 https://github.com/mims-harvard/ToolUniverse /tmp/tooluniverse-natural-product-dereplication && cp -r /tmp/tooluniverse-natural-product-dereplication/plugin/skills/tooluniverse-natural-product-dereplication ~/.claude/skills/tooluniverse-natural-product-dereplication
# Natural Product Dereplication & Chemotaxonomy
Decide whether a putative natural product is **already known**, identify the **microbe that produces it**, attach the **literature reference**, and assign its **ChemOnt chemical class**. This is the dereplication question every NP chemist and metabolomics analyst asks of a new feature: *"have we seen this before, and what makes it?"*
**LOOK UP, DON'T GUESS**: Never assume an NPAID, producing organism, exact mass, or chemical class. Every identity, provenance, and taxonomy claim must come from a live tool call.
**Scope (microbial NPs only)**: NPAtlas covers natural products from **bacteria and fungi**. It does NOT cover plant, animal, or marine-invertebrate metabolites unless a microbial producer was reported. A "no NPAtlas hit" therefore means *not a known microbial NP* — it does not prove the molecule is novel in an absolute sense.
---
## Backing Tools (all keyless; verify before quoting)
| Tool | Input | Returns |
|------|-------|---------|
| `NPAtlas_search_compounds` | `name` / `inchikey` / `formula` / `smiles`, `limit` | list of {npaid, name, molecular_formula, molecular_weight, exact_mass, inchikey, smiles}. (origin_organism is `null` here — fetch the full record for provenance) |
| `NPAtlas_get_compound` | `npaid` (e.g. `NPA014588`) | full record incl. `origin_organism` (producing microbe + taxonomic lineage) and `origin_reference` (title/doi/journal/year) |
| `ClassyFire_classify_by_inchikey` | `inchikey` (full 27-char) | ChemOnt kingdom→superclass→class→subclass→direct_parent, molecular_framework, substituents. `classified:false` if not in cache |
| `OPSIN_name_to_structure` | `name` (systematic IUPAC) | smiles / inchi / inchikey. `parsed:false` for trade/trivial names |
| `PubChem_get_CID_by_compound_name` | `name` | `{IdentifierList:{CID:[...]}}` |
| `PubChem_get_compound_properties_by_CID` | `cid`, `properties` (e.g. `["MolecularFormula","MolecularWeight","InChIKey","IUPACName"]`) | property table — use to obtain an InChIKey for arbitrary compounds |
---
## Workflow
```
Phase 0: Classify input — name / formula / exact mass / InChIKey / SMILES?
Phase 1: Obtain an InChIKey (the universal key for ClassyFire & precise NPAtlas match)
Phase 2: Dereplicate against NPAtlas (known microbial NP? which organism? which paper?)
Phase 3: Assign ChemOnt chemical class via ClassyFire
Phase 4: Cross-reference identity in PubChem
Phase 5: Report — known/novel call + provenance + class hierarchy + interpretation note
```
### Phase 0 — Classify the input
- **Full InChIKey** (27 chars, `XXXXXXXXXXXXXX-XXXXXXXXXX-X`) → skip to Phase 2; it is already the universal key.
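- **Molecular formula / exact mass** → go straight to the NPAtlas formula search (Phase 2); these are the rawest dereplication inputs (typical of an untargeted MS feature). `NPAtlas_search_compounds` takes a formula rather than a mass, so an exact mass must first be assigned a candidate formula; the `exact_mass` returned with each hit can then be checked against the measurement.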
- **SMILES** → usable directly in `NPAtlas_search_compounds(smiles=...)`; also feed to PubChem for an InChIKey.
- **Systematic IUPAC name** (e.g. `2-acetyloxybenzoic acid`) → Phase 1 via OPSIN.
- **Trivial / trade / common name** (e.g. `staurosporine`, `penicillin`) → Phase 1 via PubChem (OPSIN will return `parsed:false` for these).
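The routing above can be sketched as a small classifier. This is a heuristic illustration only: the function name `classify_input` and the regular expressions are assumptions, not part of ToolUniverse, and the SMILES test only recognises the common organic subset (C, H, N, O, P, S, B, F, I, Cl, Br). Systematic and trivial names are both returned as `name`, since OPSIN's `parsed` flag is what actually separates them in Phase 1.

```python
import re

# Full InChIKey: 14 letters, hyphen, 10 letters, hyphen, 1 letter (27 chars).
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")
# Plain decimal number, treated as an exact mass.
MASS_RE = re.compile(r"\d+(?:\.\d+)?")
# Element symbols with optional counts, e.g. C28H26N4O3.
FORMULA_RE = re.compile(r"(?:[A-Z][a-z]?\d*)+")
# Organic-subset SMILES characters only (heuristic, not a SMILES parser).
SMILES_RE = re.compile(r"(?:Cl|Br|[BCNOPSFIH]|[bcnops]|[0-9@+\-\[\]()=#$%/\\.:*])+")


def classify_input(query: str) -> str:
    """Return 'inchikey', 'exact_mass', 'formula', 'smiles', or 'name'."""
    q = query.strip()
    if INCHIKEY_RE.fullmatch(q):
        return "inchikey"
    if MASS_RE.fullmatch(q):
        return "exact_mass"
    if FORMULA_RE.fullmatch(q) and any(ch.isdigit() for ch in q):
        # A formula lists each element once; repeats suggest a SMILES string.
        elements = re.findall(r"[A-Z][a-z]?", q)
        if len(set(elements)) == len(elements):
            return "formula"
    if SMILES_RE.fullmatch(q):
        return "smiles"
    return "name"
```

Ambiguous strings (for example a short lowercase word made only of `b c n o p s`) can be misrouted, so treat the result as a first guess and fall back to a name lookup when a search returns nothing.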
### Phase 1 — Obtain an InChIKey
```python
# Assumed setup (standard ToolUniverse SDK pattern); skip if `tu` already exists.
from tooluniverse import ToolUniverse
tu = ToolUniverse()
tu.load_tools()

# Systematic IUPAC name → structure (OPSIN). parsed:false ⇒ fall through to PubChem.
op = tu.tools.OPSIN_name_to_structure(name="2-acetyloxybenzoic acid")
if op["data"]["parsed"]:
    inchikey = op["data"]["inchikey"]

# Trivial/common name → PubChem CID → properties (incl. InChIKey)
cid = tu.tools.PubChem_get_CID_by_compound_name(name="staurosporine")["data"]["IdentifierList"]["CID"][0]
props = tu.tools.PubChem_get_compound_properties_by_CID(
    cid=cid, properties=["MolecularFormula", "MolecularWeight", "InChIKey", "IUPACName"])
inchikey = props["data"]["PropertyTable"]["Properties"][0]["InChIKey"]
```
The InChIKey is what makes dereplication exact: an InChIKey match is a structure match; a name match is not (synonyms, analogs, and salts share names).
### Phase 2 — Dereplicate against NPAtlas
Search by the most specific key available. Prefer **InChIKey** (exact structure), then **formula** (catches isomers — useful for an MS feature with only a formula), then **name** (loosest — returns analogs).
```python
# Exact, structure-level
hits = tu.tools.NPAtlas_search_compounds(inchikey="HKSZLNNOFSGOKW-FYTWVXJKSA-N", limit=5)
# MS-feature style (formula, e.g. one assigned from an exact mass): expect multiple isomeric hits
hits = tu.tools.NPAtlas_search_compounds(formula="C28H26N4O3", limit=10)
```
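Because a formula search returns every isomer, an MS feature is usually narrowed further by mass accuracy. The sketch below is one way to do that, assuming each hit is a dict carrying the `exact_mass` field listed in the tool table and that the observed value is already a neutral monoisotopic mass (adduct correction is out of scope). The helper names and the 5 ppm default are illustrative assumptions, not part of NPAtlas or ToolUniverse.

```python
def ppm_error(observed: float, theoretical: float) -> float:
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def filter_by_exact_mass(hits, neutral_mass, tol_ppm=5.0):
    """Keep hits whose exact_mass lies within tol_ppm of the observed mass."""
    kept = []
    for hit in hits:
        mass = hit.get("exact_mass")
        if mass is None:
            continue  # no mass reported for this hit, so it cannot be checked
        err = ppm_error(neutral_mass, float(mass))
        if abs(err) <= tol_ppm:
            kept.append({**hit, "ppm_error": round(err, 2)})
    # Best mass agreement first.
    return sorted(kept, key=lambda h: abs(h["ppm_error"]))
```

Hits that survive the filter are still only candidates: isomers share an exact mass, so structure-level confirmation needs an InChIKey match.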
For each candidate NPAID, fetch the **full record** to get the producing organism and reference (search results carry `origin_organism: null`):
```python
rec = tu.tools.NPAtlas_get_compound(npaid="NPA014588")["data"]
organism = rec["origin_organism"]["name"] # e.g. "Streptomyces"
lineage = rec["origin_organism"]["ancestors"] # domain→...→family
reference = rec["origin_reference"] # title, doi, journal, year
```
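The per-candidate lookup can be wrapped in a loop. The following is a minimal sketch under stated assumptions: `attach_provenance` is a hypothetical helper, the record layout (`origin_organism.name`, `origin_organism.ancestors`, `origin_reference` with title/doi/journal/year) is taken from the description above rather than verified against the live API, and the lookup function is passed in so the helper can be exercised without network access.

```python
def attach_provenance(hits, get_compound):
    """Fetch the full record for each hit and attach organism and reference.

    get_compound is any callable accepting npaid=... and returning
    {"data": record}, e.g. tu.tools.NPAtlas_get_compound (assumed shape).
    """
    enriched = []
    for hit in hits:
        rec = get_compound(npaid=hit["npaid"])["data"]
        # Either field may be missing or null; fall back to empty dicts.
        organism = rec.get("origin_organism") or {}
        reference = rec.get("origin_reference") or {}
        enriched.append({
            **hit,
            "organism": organism.get("name"),
            "lineage": organism.get("ancestors") or [],
            "reference": {k: reference.get(k)
                          for k in ("title", "doi", "journal", "year")},
        })
    return enriched
```

With a loaded ToolUniverse instance this would be called as `attach_provenance(hits, tu.tools.NPAtlas_get_compound)`, assuming `hits` is the list of search-result dicts. Missing fields come back as `None` rather than being guessed.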
### Phase 3 — Assign ChemOnt chemical class
```python
cf = tu.tools.ClassyFire_classify_by_inchikey(inchikey=inchikey)["data"]
# cf["kingdom"], cf["superclass"], cf["class"], cf["subclass"], cf["direct_parent"]
# cf["molecular_framework"], cf["substituents"]
```
If `classified:false`, the InChIKey is not in the ClassyFire cache — report the class as *unavailable* (do not invent one). A correct InChIKey is required; a wrong stereo/protonation layer will miss the cache.
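One way to enforce the "report as unavailable" rule is to format the hierarchy through a helper that never fabricates a level. This sketch assumes the `classified` flag sits alongside the hierarchy keys in the returned payload and that each level is either a plain string or a dict with a `name` key; both are assumptions about the response shape, and `chemont_path` is a hypothetical name.

```python
LEVELS = ("kingdom", "superclass", "class", "subclass", "direct_parent")


def chemont_path(cf: dict) -> str:
    """Render the ChemOnt hierarchy as 'A > B > C', or 'unavailable'."""
    if not cf or cf.get("classified") is False:
        return "unavailable"
    names = []
    for level in LEVELS:
        value = cf.get(level)
        if isinstance(value, dict):
            value = value.get("name")  # tolerate {"name": ...} style entries
        if value:
            names.append(str(value))
    return " > ".join(names) if names else "unavailable"
```

Levels that are absent are simply skipped, so a partially classified compound yields a shorter path instead of a placeholder class.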
### Phase 4 — Cross-reference identity in PubChem
Confirm the same molecule exists in PubChem (CID, IUPAC name, formula, MW) so the identity is anchored to a second independent database. Disagreement in molecular formula between NPAtlas and PubChem is a red flag that the name/structure resolution went astray.
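The formula comparison is easy to get wrong with a plain string test, because two databases may order elements differently. A small sketch of an order-insensitive check follows; `parse_formula` and `formulas_agree` are hypothetical helpers, and charges, isotope labels, and hydrate dots are deliberately ignored.

```python
import re
from collections import Counter


def parse_formula(formula: str) -> Counter:
    """Count atoms per element in a simple formula such as C28H26N4O3."""
    counts = Counter()
    for element, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(n) if n else 1
    return counts


def formulas_agree(npatlas_formula: str, pubchem_formula: str) -> bool:
    """True when both formulas contain the same atoms, whatever the order."""
    return parse_formula(npatlas_formula) == parse_formula(pubchem_formula)
```

A `False` result is the red flag described above and is a cue to re-check how the name or structure was resolved before reporting anything.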
### Phase 5 — Report
Deliver:
1. **Dereplication call** — *Known microbial NP* (with NPAID) **or** *No NPAtlas match (possibly novel / non-microbial)*.
2. **Provenance** — producing organism + taxonomic lineage + literature reference (title,