bioservices-multi-database

BioServices Multi-Database Access provides a unified Python interface to query 40+ bioinformatics web services including UniProt, KEGG, ChEMBL, ChEBI, and BLAST through a consistent object-oriented API. Use this skill when retrieving protein information, discovering biological pathways, cross-referencing compounds across databases, running sequence similarity searches, mapping identifiers between databases, or finding protein-protein interactions across multiple sources.

Repository: SciAgent-Skills

Install in Claude Code

git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/bioservices-multi-database && cp -r /tmp/bioservices-multi-database/skills/genomics-bioinformatics/databases/bioservices-multi-database ~/.claude/skills/bioservices-multi-database

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# BioServices Multi-Database Access

## Overview

BioServices provides a unified Python interface to 40+ bioinformatics web services including UniProt, KEGG, ChEMBL, ChEBI, PubChem, UniChem, PSICQUIC, QuickGO, and BLAST. Each service is accessed through a consistent object-oriented API with built-in caching, rate limiting, and output format handling.

## When to Use

- Querying protein information from UniProt (search, retrieve, ID mapping)
- Discovering KEGG pathways and extracting gene/interaction networks
- Cross-referencing compounds across ChEMBL, ChEBI, PubChem, and KEGG
- Running BLAST sequence similarity searches against UniProtKB
- Mapping identifiers between biological databases (UniProt, Ensembl, KEGG, RefSeq, PDB)
- Retrieving Gene Ontology annotations via QuickGO
- Finding protein-protein interactions via PSICQUIC (IntAct, MINT, BioGRID)
- Batch converting thousands of biological identifiers with error handling
- For single-database deep queries → use gget (Ensembl), pubchempy (PubChem), or chembl-database-bioactivity skill
- For pathway visualization → use pathway analysis tools (Cytoscape, NetworkX) after retrieving data with bioservices

## Prerequisites

```bash
pip install bioservices
# Optional: pandas for tabular output, matplotlib for visualization
pip install pandas matplotlib
```

**API Rate Limits**: Most services have rate limits. bioservices handles basic throttling internally, but for batch operations add explicit delays:
- UniProt mapping: ~1 request/second for batch jobs
- KEGG: 10 requests/second (be conservative with pathway parsing)
- ChEMBL/ChEBI: 5-10 requests/second
- BLAST: 1 job at a time (async polling, ~30-300s per job)
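
The explicit delays mentioned above can be enforced with a small client-side throttle. The following is a minimal sketch and not part of bioservices: `Throttle` and `min_interval` are hypothetical names, and the intervals only restate the approximate limits listed above rather than official quotas.

```python
import time


class Throttle:
    """Block so that consecutive calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last_call = None  # monotonic timestamp of the previous call

    def wait(self):
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


# Intervals derived from the approximate limits above (assumptions, not official quotas)
uniprot_throttle = Throttle(1.0)  # ~1 request/second
kegg_throttle = Throttle(0.1)     # ~10 requests/second
```

Calling `uniprot_throttle.wait()` immediately before each request in a batch loop keeps the request rate under the chosen interval without sleeping longer than necessary.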

## Quick Start

```python
from bioservices import UniProt, KEGG
import time

# Protein lookup
u = UniProt(verbose=False)
result = u.search("ABL1_HUMAN", frmt="tsv", columns="accession,gene_names,organism_name,length")
print(result[:200])

# Pathway discovery
k = KEGG(verbose=False)
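# Pass the bare gene ID; the method prepends the organism code itself.
# It may return None when no pathway is found, hence the fallback to {}.
pathways = k.get_pathway_by_gene("25", "hsa") or {}  # ABL1 (KEGG gene hsa:25)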
print(f"ABL1 participates in {len(pathways)} pathways")
for pid, name in list(pathways.items())[:3]:
    print(f"  {pid}: {name}")
```

## Core API

### 1. Protein Analysis (UniProt)

```python
from bioservices import UniProt
u = UniProt(verbose=False)

# Search by protein name or gene
result = u.search("BRCA1 AND organism_id:9606", frmt="tsv",
                  columns="accession,gene_names,protein_name,length,go_p")
print(result[:300])

# Retrieve full entry
entry = u.retrieve("P38398", frmt="txt")  # Swiss-Prot flat file
fasta = u.retrieve("P38398", frmt="fasta")
print(fasta[:200])
```

```python
# ID mapping: gene names → UniProt accessions
result = u.mapping(fr="Gene_Name", to="UniProtKB", query="BRCA1 TP53 ABL1", taxId=9606)
print(f"Mapped {len(result['results'])} entries")
for r in result['results']:
    print(f"  {r['from']} → {r['to']['primaryAccession']}")
```
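
For downstream use it is often handier to group a mapping result by query ID. The sketch below assumes the result has the shape printed above (a `results` list of `from`/`to` records) and that unmapped IDs, if any, arrive under a `failedIds` key; both the helper name `group_mappings` and the sample data are hypothetical, and the sample is hand-written rather than real query output.

```python
from collections import defaultdict


def group_mappings(result):
    """Group a mapping result into {query_id: [target_id, ...]}."""
    grouped = defaultdict(list)
    for row in result.get("results", []):
        target = row["to"]
        # Assumption: UniProtKB targets are nested records, other targets plain strings
        if isinstance(target, dict):
            target = target.get("primaryAccession")
        grouped[row["from"]].append(target)
    return dict(grouped)


# Hypothetical sample shaped like the output above (not real query output)
sample = {
    "results": [
        {"from": "BRCA1", "to": {"primaryAccession": "P38398"}},
        {"from": "TP53", "to": {"primaryAccession": "P04637"}},
    ],
    "failedIds": ["NOT_A_GENE"],
}
mapped = group_mappings(sample)
unmapped = sample.get("failedIds", [])
print(mapped)
print(f"Unmapped: {unmapped}")
```

Keeping a list per query ID preserves one-to-many mappings, which are common when a gene name matches several entries.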

### 2. Pathway Discovery (KEGG)

```python
from bioservices import KEGG
k = KEGG(verbose=False)

# List pathways for an organism
k.organism = "hsa"  # pathwayIds needs an organism to be set first
pathways = k.pathwayIds  # Pathway IDs for the selected organism
human_pathways = k.list("pathway", "hsa")
print(f"Human pathways: {len(human_pathways.strip().splitlines())}")

# Get pathway details
pathway_data = k.get("hsa04110")  # Cell cycle
parsed = k.parse(pathway_data)
print(f"Pathway: {parsed.get('NAME', 'Unknown')}")
print(f"Genes: {len(parsed.get('GENE', {}))}")
```

```python
# KGML parsing for interaction networks
import xml.etree.ElementTree as ET
from collections import Counter

from bioservices import KEGG

k = KEGG(verbose=False)

kgml = k.get("hsa04110", "kgml")  # XML pathway representation

# Parse KGML for entries and relations
root = ET.fromstring(kgml)
entries = root.findall("entry")
relations = root.findall("relation")
print(f"Entries: {len(entries)}, Relations: {len(relations)}")

# Count interaction subtypes across all relations
rel_types = Counter(
    subtype.get("name")
    for rel in relations
    for subtype in rel.findall("subtype")
)
print(f"Interaction types: {dict(rel_types)}")
```
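
To hand the parsed relations to a network tool, they can be flattened into an edge list. The sketch below is one possible approach using only the standard library: `kgml_edges` is a hypothetical helper, and `SAMPLE_KGML` is a hand-written fragment with placeholder gene names, not a real KEGG response.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written KGML fragment (placeholder names, not real KEGG data)
SAMPLE_KGML = """<pathway name="path:example" org="hsa" number="00000">
  <entry id="1" name="hsa:geneA" type="gene"/>
  <entry id="2" name="hsa:geneB hsa:geneC" type="gene"/>
  <relation entry1="1" entry2="2" type="PPrel">
    <subtype name="activation" value="--&gt;"/>
  </relation>
</pathway>"""


def kgml_edges(kgml_text):
    """Return (source_name, target_name, interaction) tuples from KGML text."""
    root = ET.fromstring(kgml_text)
    # Relations reference entries by id, so build an id -> name lookup first
    names = {e.get("id"): e.get("name") for e in root.findall("entry")}
    edges = []
    for rel in root.findall("relation"):
        source = names.get(rel.get("entry1"))
        target = names.get(rel.get("entry2"))
        subtypes = [s.get("name") for s in rel.findall("subtype")]
        # Fall back to the relation type (e.g. PPrel) when no subtype is given
        for interaction in subtypes or [rel.get("type")]:
            edges.append((source, target, interaction))
    return edges


print(kgml_edges(SAMPLE_KGML))
```

An entry name can hold several space-separated gene IDs, as in the second sample entry; splitting those into separate nodes is left to the caller.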

### 3. Compound Databases (ChEMBL, ChEBI, UniChem, PubChem)

```python
from bioservices import ChEMBL, ChEBI, UniChem
import time

# ChEMBL compound lookup
chembl = ChEMBL(verbose=False)
result = chembl.get_molecule("CHEMBL25")  # Aspirin
print(f"Name: {result['pref_name']}")
print(f"MW: {result['molecule_properties']['full_mwt']}")
print(f"SMILES: {result['molecule_structures']['canonical_smiles']}")

time.sleep(0.2)

# ChEBI entity lookup
chebi = ChEBI(verbose=False)
entity = chebi.getCompleteEntity("CHEBI:15365")  # Aspirin
print(f"ChEBI Name: {entity.chebiAsciiName}")
print(f"Formula: {entity.formulae[0].data if entity.formulae else 'N/A'}")
```

```python
# Cross-database compound mapping via UniChem
from bioservices import UniChem
uc = UniChem()

# Map ChEMBL ID to other databases
# Source IDs: 1=ChEMBL, 2=DrugBank, 3=PDB, 4=IUPHAR, 7=ChEBI, 22=PubChem
mappings = uc.get_mapping("CHEMBL25", 1)  # From ChEMBL
for m in mappings[:5]:
    print(f"  Source {m['src_id']}: {m['src_compound_id']}")
```

### 4. Sequence Analysis (BLAST)

```python
from bioservices import NCBIblast
import time

blast = NCBIblast(verbose=False)

sequence = ">query\nMKTAYIAKQRQISFVKSHFSRQLE..."  # Truncated for brevity
job_id = blast.run(
    program="blastp",
    database="uniprotkb_swissprot",
    sequence=sequence,
    stype="protein",
    email="user@example.com"  # Required by the EMBL-EBI service that NCBIblast wraps
)
print(f"Job submitted: {job_id}")

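# Poll for results (async); QUEUED/PENDING are assumed to be possible waiting states
status = blast.getStatus(job_id)
while status in ("RUNNING", "QUEUED", "PENDING"):
    print("Waiting...")
    time.sleep(10)
    status = blast.getStatus(job_id)

# Stop early if the job did not complete successfully
if status != "FINISHED":
    raise RuntimeError(f"BLAST job {job_id} ended with status {status}")

result_types = blast.getResultTypes(job_id)  # Available output formats
alignment = blast.getResult(job_id, "out")  # Text alignment
print(alignment[:500])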
```
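
Because a job can take anywhere from about 30 to 300 seconds, it helps to bound the polling loop with a timeout. The helper below is a generic sketch, not a bioservices function: `wait_for_job` and its parameters are hypothetical, and the set of waiting states (`RUNNING`, `QUEUED`, `PENDING`) is an assumption about what the service may report.

```python
import time


def wait_for_job(get_status, job_id, interval=10, timeout=600,
                 pending=("RUNNING", "QUEUED", "PENDING")):
    """Poll get_status(job_id) until the job leaves a waiting state or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status(job_id)
        if status not in pending:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job {job_id} still {status} after {timeout}s")
        time.sleep(interval)


# Stand-in status function for demonstration; with bioservices, pass blast.getStatus
_fake_statuses = iter(["RUNNING", "RUNNING", "FINISHED"])
final = wait_for_job(lambda _job: next(_fake_statuses), "demo-job",
                     interval=0.01, timeout=5)
print(final)
```

Passing the status function as an argument keeps the helper independent of any one service, so the same loop can serve other asynchronous jobs.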

### 5. Identifier Mapping

```python
from bioservices import UniProt
u = UniProt(verbose=False)

# Batch mapping: UniProt → multiple databases
accessions = "P00520 P12931 P04637 P38398"

# UniProt → PDB
result = u.mapping(fr="UniProtKB_AC-ID", to="PDB", query=accessions)
for r