tooluniverse-chemical-compound-retrieval
This Claude Code skill retrieves and disambiguates chemical compound data from PubChem and ChEMBL databases, resolving compound names to standardized identifiers (CID, ChEMBL ID, SMILES, InChI) and molecular properties. Use it when you need to identify specific chemical compounds, distinguish between isomers or multiple forms of a generic name like Vitamin D, validate compound identity across databases, or gather bioactivity and molecular property data for research or drug development workflows.
git clone --depth 1 https://github.com/mims-harvard/ToolUniverse /tmp/tooluniverse-chemical-compound-retrieval && cp -r /tmp/tooluniverse-chemical-compound-retrieval/plugin/skills/tooluniverse-chemical-compound-retrieval ~/.claude/skills/tooluniverse-chemical-compound-retrieval
# Chemical Compound Information Retrieval
Retrieve comprehensive chemical compound data with proper disambiguation and cross-database validation.
**LOOK UP DON'T GUESS**: Never assume a CID, ChEMBL ID, or molecular property value. Always retrieve from PubChem/ChEMBL.
**English-first**: Always use English compound names in tool calls. Respond in the user's language.
## Domain Reasoning: Disambiguation
"Aspirin" = one compound. "Vitamin D" = multiple forms (D2/D3/active metabolite). For generic class names (steroids, vitamins, acids), present candidates and confirm before proceeding.
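The confirm-before-proceeding rule can be sketched as a small gate. This is an illustration only: `pick_or_ask` is a hypothetical helper, not part of the skill, and in practice the candidate list would come from the PubChem/ChEMBL lookups rather than being typed in.

```python
def pick_or_ask(query, candidates):
    """Return ("proceed", candidate) for an unambiguous name, else ("ask", prompt).

    `candidates` is a list of compound labels found for `query`.
    Hypothetical helper for illustration; not a ToolUniverse function.
    """
    unique = list(dict.fromkeys(candidates))  # drop duplicates, keep order
    if len(unique) == 1:
        return ("proceed", unique[0])
    if not unique:
        return ("ask", f"No compound found for '{query}'. Can you give another name or an identifier?")
    options = "; ".join(f"{i}. {c}" for i, c in enumerate(unique, start=1))
    return ("ask", f"'{query}' matches several compounds: {options}. Which one do you mean?")


print(pick_or_ask("aspirin", ["acetylsalicylic acid"]))
print(pick_or_ask(
    "vitamin D",
    ["ergocalciferol (D2)", "cholecalciferol (D3)", "calcitriol (active metabolite)"],
)[1])
```

Returning a tagged tuple keeps the "ask" branch explicit, so the caller cannot silently continue with the first candidate.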
---
## Workflow
```
Phase 0: Clarify (only if highly ambiguous -- skip for unambiguous names or specific IDs)
Phase 1: Disambiguate → resolve PubChem CID + ChEMBL ID
Phase 2: Retrieve data (silent)
Phase 3: Report compound profile
```
### Phase 1: Disambiguation
```python
# Setup: assumes the ToolUniverse Python SDK is installed
from tooluniverse import ToolUniverse

tu = ToolUniverse()
tu.load_tools()

name = "aspirin"                     # example input
smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"  # example input (aspirin)

# By name
result = tu.tools.PubChem_get_CID_by_compound_name(compound_name=name)
# By SYSTEMATIC (IUPAC) name -> structure, deterministic parser (no DB lookup)
opsin = tu.tools.OPSIN_name_to_structure(name="2-acetoxybenzoic acid")
# Returns {parsed, smiles, inchi, inchikey}; use the SMILES/InChIKey to anchor a
# PubChem_get_CID_by_SMILES lookup. Trade/trivial names give parsed=false -> fall
# back to PubChem_get_CID_by_compound_name for those.
# By SMILES
result = tu.tools.PubChem_get_CID_by_SMILES(smiles=smiles)
# Cross-reference
chembl_result = tu.tools.ChEMBL_search_molecules(query=name, limit=5)
```
Verify: CID + ChEMBL ID + canonical SMILES + stereochemistry + salt forms.
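One way to check that the PubChem and ChEMBL records describe the same structure is to compare their standard InChIKeys. The sketch below is an assumption about how that check could look, not the skill's own code; `compare_inchikeys` is a hypothetical helper, and the keys passed to it would come from the retrieved records.

```python
def compare_inchikeys(key_a, key_b):
    """Classify how two standard InChIKeys relate.

    A standard InChIKey has three hyphen-separated blocks: a 14-character
    connectivity hash, a 10-character block covering stereochemistry and
    other layers, and a 1-character protonation flag.
    """
    a, b = key_a.strip().upper(), key_b.strip().upper()
    if a == b:
        return "identical"
    if a.split("-")[0] == b.split("-")[0]:
        # Same skeleton; the records differ in stereochemistry, isotopes,
        # or protonation state.
        return "same_connectivity"
    return "different"


aspirin_key = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
print(compare_inchikeys(aspirin_key, aspirin_key.lower()))
# Placeholder keys (not real compounds) that share only the first block:
print(compare_inchikeys("AAAAAAAAAAAAAA-BBBBBBBBSA-N", "AAAAAAAAAAAAAA-CCCCCCCCSA-N"))
```

A `same_connectivity` result is the cue to look at stereochemistry before calling the match confirmed. A salt form includes its counterion in the connectivity, so it shows up as `different` from the parent compound and needs a separate check.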
### Phase 2: Data Retrieval
**PubChem**: `PubChem_get_compound_properties_by_CID`, `PubChemBioAssay_get_assay_summary`, `PubChemTox_get_acute_effects`, `PubChem_get_compound_2D_image_by_CID`
**ChEMBL**: `ChEMBL_get_compound_record_activities`, `ChEMBL_get_molecule_targets`, `ChEMBL_get_assay_activities`
**Optional**: `PubChem_get_associated_patents_by_CID`, `PubChem_search_compounds_by_similarity`
### Phase 3: Report
Compound Profile with: Identity (CID, ChEMBL ID, IUPAC, SMILES), Chemical Properties (MW, LogP, HBD, HBA, PSA, Lipinski), Bioactivity (targets, IC50/Ki), Drug Info (if approved), Data Sources.
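The Lipinski entry in the report can be derived from the retrieved properties. A minimal sketch, assuming the usual rule-of-five cut-offs (MW ≤ 500, LogP ≤ 5, HBD ≤ 5, HBA ≤ 10); `lipinski_violations` is a hypothetical helper and the input values below are made up for illustration, not retrieved data.

```python
def lipinski_violations(mw, logp, hbd, hba):
    """Return the rule-of-five criteria that a compound violates."""
    checks = [
        ("MW > 500", mw > 500),
        ("LogP > 5", logp > 5),
        ("HBD > 5", hbd > 5),
        ("HBA > 10", hba > 10),
    ]
    return [label for label, violated in checks if violated]


# Illustrative numbers only; real values must come from PubChem/ChEMBL.
violations = lipinski_violations(mw=612.7, logp=5.8, hbd=2, hba=9)
print(len(violations), violations)
```

Reporting the violated criteria by name, rather than only a count, lets the reader see which property drives the flag.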
---
## Fallback Chains
| Primary | Fallback |
|---------|----------|
| PubChem name lookup (systematic name) | `OPSIN_name_to_structure` → SMILES/InChIKey → PubChem_get_CID_by_SMILES |
| PubChem name lookup | ChEMBL search → SMILES → PubChem_get_CID_by_SMILES |
| ChEMBL bioactivity | PubChem bioassay summary |
| Drug label | Note "unavailable" |
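The primary-then-fallback pattern in the table can be sketched generically. This is an assumption about one reasonable structure, not the skill's implementation: `first_success` is hypothetical, and the two resolvers are stand-in stubs where real code would call the ToolUniverse tools named in the table.

```python
def first_success(query, resolvers):
    """Try each (label, resolver) in order; return (label, result) for the first hit.

    A resolver signals failure by returning None or raising an exception.
    If every resolver fails, return (None, list_of_failure_notes).
    """
    failures = []
    for label, resolver in resolvers:
        try:
            result = resolver(query)
        except Exception as exc:
            failures.append(f"{label}: {exc}")
            continue
        if result is not None:
            return label, result
        failures.append(f"{label}: no result")
    return None, failures


# Stand-in stubs for illustration; they do not contact any database.
def pubchem_name_lookup(query):
    return None  # pretend the name lookup found nothing


def chembl_then_smiles(query):
    return {"source": "stub", "query": query}


label, result = first_success(
    "some compound",
    [("PubChem name lookup", pubchem_name_lookup), ("ChEMBL search", chembl_then_smiles)],
)
print(label, result)
```

Keeping the failure notes makes it possible to report which sources were tried when nothing resolves, as the "Note unavailable" row suggests.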
---
## Evidence Grading
| Grade | Criteria |
|-------|----------|
| **Confirmed** | CID + ChEMBL cross-match, InChI/SMILES agree |
| **Probable** | CID found, partial ChEMBL match |
| **Uncertain** | Single database only, or multiple CIDs |
| **Unverified** | No cross-reference, single-source |
**Bioactivity**: Prefer ChEMBL over PubChem BioAssay for curated data. IC50/Ki < 100 nM = potent, 100 nM-1 µM = moderate, > 10 µM = weak. Lipinski violations suggest reduced oral bioavailability but do not disqualify a compound.
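The potency bands can be applied mechanically once values are converted to nM. A minimal sketch with a hypothetical `potency_band` helper; the text assigns no label between 1 and 10 µM, so the sketch returns `None` there rather than guessing a band.

```python
def potency_band(value_nm):
    """Map an IC50/Ki in nM to the bands given in the text.

    Returns None for values the text leaves unlabeled (above 1 uM up to 10 uM).
    """
    if value_nm <= 0:
        raise ValueError("IC50/Ki must be positive")
    if value_nm < 100:
        return "potent"
    if value_nm <= 1_000:
        return "moderate"
    if value_nm > 10_000:
        return "weak"
    return None


for value in (50, 100, 5_000, 20_000):
    print(value, potency_band(value))
```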
---
## SMILES Verification
Always verify novel SMILES: `python3 src/tooluniverse/tools/smiles_verifier.py --smiles "SMILES_STRING"`. Invalid SMILES produce wrong results or cryptic errors.
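Before running the verifier, a cheap syntactic pre-check can catch truncated strings. The sketch below is a different and much weaker technique than the verifier script: it only checks that parentheses and square brackets are balanced, and it does not validate chemistry. `brackets_balanced` is a hypothetical helper.

```python
def brackets_balanced(smiles):
    """Return True if branch parentheses and atom brackets are balanced.

    A pre-filter only: a string that passes can still be an invalid SMILES.
    """
    if not smiles or any(ch.isspace() for ch in smiles):
        return False
    depth = 0          # open branch parentheses
    in_atom = False    # inside a [...] bracket atom
    for ch in smiles:
        if ch == "[":
            if in_atom:
                return False  # bracket atoms cannot nest
            in_atom = True
        elif ch == "]":
            if not in_atom:
                return False
            in_atom = False
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # closing a branch that was never opened
    return depth == 0 and not in_atom


print(brackets_balanced("CC(=O)OC1=CC=CC=C1C(=O)O"))  # aspirin
print(brackets_balanced("CC(=O"))                      # truncated
```

Strings that pass this check should still go through the verifier.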
---
## Tool Reference
**PubChem**: `PubChem_get_CID_by_compound_name`, `PubChem_get_CID_by_SMILES`, `PubChem_get_compound_properties_by_CID`, `PubChem_get_compound_2D_image_by_CID`, `PubChemBioAssay_get_assay_summary`, `PubChemTox_get_acute_effects`, `PubChem_get_associated_patents_by_CID`, `PubChem_search_compounds_by_similarity`, `PubChem_search_compounds_by_substructure`
**ChEMBL**: `ChEMBL_search_drugs`, `ChEMBL_get_molecule`, `ChEMBL_get_activity`, `ChEMBL_get_target`, `ChEMBL_search_targets`, `ChEMBL_search_assays`
**Name parsing**: `OPSIN_name_to_structure` (param `name`): deterministic IUPAC/systematic-name → SMILES/InChI/InChIKey parser; the go-to for resolving a systematic name to structure without a DB round-trip. Trade/trivial names return `parsed=false` (use PubChem name lookup for those).