gget-genomic-databases
Unified CLI/Python interface to 20+ genomic databases. Gene lookups (Ensembl search/info/seq), BLAST/BLAT, AlphaFold, Enrichr enrichment, OpenTargets disease/drug, CELLxGENE single-cell, cBioPortal/COSMIC cancer, ARCHS4 expression. Spans genomics, proteomics, disease. For batch/advanced BLAST use biopython; for multi-DB Python SDK use bioservices.
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/gget-genomic-databases && cp -r /tmp/gget-genomic-databases/skills/genomics-bioinformatics/databases/gget-genomic-databases ~/.claude/skills/gget-genomic-databases
SKILL.md
# gget — Unified Genomic Database Access
## Overview
gget is a command-line and Python package providing unified access to 20+ genomic databases and analysis methods. Query gene information, sequences, protein structures, expression data, and disease associations through a consistent interface. All modules work as both CLI tools and Python functions, returning DataFrames (Python) or JSON/CSV (CLI).
## When to Use
- Looking up gene information (names, IDs, descriptions) across species from Ensembl
- Retrieving nucleotide or protein sequences for Ensembl gene/transcript IDs
- Running BLAST or BLAT searches against standard reference databases
- Predicting protein 3D structures with AlphaFold2 from amino acid sequences
- Performing gene set enrichment analysis (GO, KEGG, disease terms) via Enrichr
- Querying single-cell RNA-seq datasets from CELLxGENE Census
- Finding disease and drug associations for a gene target via OpenTargets
- Downloading Ensembl reference genomes and annotations for a species
- Finding cancer mutations and genomic alterations via cBioPortal or COSMIC
- Getting tissue expression and correlated genes from ARCHS4
- For batch processing or advanced BLAST parameters, use `biopython` instead
- For programmatic multi-database workflows with rate limiting, use `bioservices` instead
## Prerequisites
- **Python packages**: `gget`
- **Optional setup**: Some modules require `gget setup <module>` before first use (alphafold, cellxgene, elm, gpt)
- **Environment**: Clean virtual environment recommended to avoid dependency conflicts
- **API notes**: gget queries remote databases — rate-limit large batch queries with `time.sleep()`. Databases update biweekly; keep gget updated. Max ~1000 Ensembl IDs per `gget.info()` call
```bash
pip install gget
# Optional: setup modules that need additional dependencies
gget setup alphafold # ~4GB model parameters, requires OpenMM
gget setup cellxgene # cellxgene-census package
gget setup elm # local ELM database
```
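The API notes above suggest two habits for large jobs: keep each `gget.info()` call to roughly 1000 IDs or fewer, and pause between remote requests. A minimal sketch of that pattern follows. The names `chunked` and `batched_lookup`, the `fetch` parameter, and the 1-second default delay are hypothetical choices for illustration, not part of gget.

```python
import time
from typing import Callable, Iterator, List, Sequence


def chunked(ids: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids` holding at most `size` items each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def batched_lookup(ids: Sequence[str], fetch: Callable, batch_size: int = 1000,
                   delay_s: float = 1.0) -> list:
    """Call `fetch` once per batch and sleep between batches (not after the last)."""
    batches = list(chunked(ids, batch_size))
    results = []
    for i, batch in enumerate(batches):
        results.append(fetch(batch))
        if i < len(batches) - 1:
            time.sleep(delay_s)  # simple fixed delay between remote calls
    return results
```

With gget installed, `fetch` could be `gget.info`, and the per-batch DataFrames could then be combined with `pandas.concat`.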
## Quick Start
```python
import gget
# Search for genes by keyword
results = gget.search(["BRCA1", "tumor suppressor"], species="homo_sapiens")
print(f"Found {len(results)} genes")
# Get detailed gene information (Ensembl + UniProt + NCBI)
info = gget.info(["ENSG00000012048"])
print(f"Gene: {info.iloc[0]['primary_gene_name']}")
# Enrichment analysis on a gene list
enrichment = gget.enrichr(["ACE2", "AGT", "AGTR1"], database="ontology")
print(f"Enriched terms: {len(enrichment)}")
```
## Core API
### Module 1: Reference & Gene Search (ref, search, info, seq)
Query Ensembl for gene references, search by keywords, retrieve gene metadata, and fetch sequences.
```python
import gget
# Search for genes by keyword
results = gget.search(["BRCA1", "tumor suppressor"], species="homo_sapiens")
print(f"Found {len(results)} genes")
print(results[["ensembl_id", "gene_name", "biotype"]].head())
# Get detailed gene information (Ensembl + UniProt + NCBI)
info = gget.info(["ENSG00000012048", "ENSG00000139618"])
print(f"Gene info columns: {list(info.columns)}")
```
```python
import gget
# Retrieve sequences
nucleotide_seqs = gget.seq(["ENSG00000012048"])
protein_seqs = gget.seq(["ENSG00000012048"], translate=True, isoforms=True)
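# gget.seq is assumed to return a FASTA-style list (header, sequence, header, ...),
# so the number of sequences is half the list length
print(f"Retrieved {len(protein_seqs) // 2} isoform sequences")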
# Download reference genome files (specify release for reproducibility)
ref_links = gget.ref("homo_sapiens", which="gtf", release=112)
print(f"GTF download link: {ref_links}")
```
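The sequence results above are easier to work with as a mapping from header to sequence. The sketch below assumes `gget.seq` returns a flat list that alternates FASTA header lines (starting with `>`) and sequence strings, so check the return type in your installed version. The helper name `fasta_list_to_dict` and the toy input are hypothetical.

```python
def fasta_list_to_dict(lines):
    """Pair each '>'-prefixed header with the sequence lines that follow it."""
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is None:
            raise ValueError("sequence line found before any header")
        else:
            records[header] += line  # join sequences split over several lines
    return records


# Toy input with made-up headers, standing in for a gget.seq result
example = [">isoform_1 demo", "MKT", "AYI", ">isoform_2 demo", "MSS"]
records = fasta_list_to_dict(example)
print(records)
```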
### Module 2: Sequence Alignment (blast, blat, muscle, diamond)
BLAST/BLAT remote searches, multiple sequence alignment, and fast local alignment.
```python
import gget
import time
# BLAST against SwissProt (remote API — add delay for batch queries)
blast_results = gget.blast(
    "MKWMFKEDHSLEHRCVESAKIRAKYPDRVPVIVEKVSGSQIVDIDKRKYLVPSDITVAQFMWIIRKRIQLPSEKAIFLFVDKTVPQSR",
    database="swissprot",
    limit=10,
)
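# Column names are assumed to follow the NCBI results table ("E value"); check blast_results.columns
print(f"Top hit: {blast_results.iloc[0]['Description']}, E-value: {blast_results.iloc[0]['E value']}")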
time.sleep(2) # Rate-limit between BLAST queries
# BLAT — find genomic position (UCSC)
blat_results = gget.blat("ATCGATCGATCGATCGATCG", assembly="human")
print(f"Genomic location: chr{blat_results.iloc[0]['chromosome']}:{blat_results.iloc[0]['start']}")
```
```python
import gget
# Multiple sequence alignment with Muscle5
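# `out` is assumed to be the output-path argument of gget.muscle; confirm with help(gget.muscle)
aligned = gget.muscle("sequences.fasta", out="aligned.afa")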
# Fast local alignment with DIAMOND (local, no rate limit needed)
diamond_results = gget.diamond(
    "GGETISAWESQME",
    reference="reference.fasta",
    sensitivity="very-sensitive",
    threads=4,
)
print(f"Alignments found: {len(diamond_results)}")
```
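The `gget.muscle` and `gget.diamond` calls above read from FASTA files on disk. A minimal sketch for producing such a file from in-memory sequences is shown here. `write_fasta`, the record names, and the line width are illustrative choices, not part of gget.

```python
import os
import tempfile


def write_fasta(records, path, width=60):
    """Write a {name: sequence} mapping to `path` in FASTA format."""
    with open(path, "w") as handle:
        for name, seq in records.items():
            handle.write(f">{name}\n")
            # Wrap each sequence at `width` characters per line
            for start in range(0, len(seq), width):
                handle.write(seq[start:start + width] + "\n")
    return path


# Demo: write two short toy peptides to a temporary sequences.fasta
demo_path = os.path.join(tempfile.mkdtemp(), "sequences.fasta")
write_fasta({"seq1": "MKWMFKEDHSLEHR", "seq2": "MKWMFKEDHALEHR"}, demo_path, width=10)
print(open(demo_path).read())
```

The resulting path can be passed wherever the examples above use `"sequences.fasta"` or `"reference.fasta"`.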
### Module 3: Protein Structure (pdb, alphafold, elm)
Download PDB structures, predict structures with AlphaFold2, find linear motifs.
```python
import gget
# Download PDB structure
pdb_data = gget.pdb("7S7U", save=True)
# Predict structure with AlphaFold2 (requires gget setup alphafold)
structure = gget.alphafold(
    "MKWMFKEDHSLEHRCVESAKIRAKYPDRVPVIVEKVSGSQIVDIDKRKYLVPSDITVAQFMWIIRKRIQLPSEKAIFLFVDKTVPQSR",
    plot=True,
    show_sidechains=True,
)
print("Structure prediction complete, PDB file saved")
```
```python
import gget
# Find Eukaryotic Linear Motifs (requires gget setup elm)
ortholog_df, regex_df = gget.elm("LIAQSIGQASFV")
print(f"Ortholog motifs: {len(ortholog_df)}, Regex motifs: {len(regex_df)}")
```
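Both `gget.alphafold` and `gget.elm` above take raw amino acid strings, and a structure prediction can run for a long time. A quick sanity check before submitting can catch stray characters. This is only a sketch: `invalid_residues` is a hypothetical helper, and restricting input to the 20 standard one-letter codes is an assumption (it also flags ambiguity codes such as X or B, which some tools accept).

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")  # 20 standard amino acid letters


def invalid_residues(seq):
    """Return (1-based position, character) pairs for non-standard letters."""
    return [
        (pos, ch)
        for pos, ch in enumerate(seq.upper(), start=1)
        if ch not in STANDARD_AA
    ]


clean = invalid_residues("LIAQSIGQASFV")   # motif query used above
flagged = invalid_residues("MKT*XAYI")     # toy string with a stop symbol and an X
print(clean, flagged)
```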
### Module 4: Expression & Correlation (archs4, cellxgene, bgee)
Gene expression, tissue expression, correlated genes, single-cell data.
```python
import gget
# Tissue expression from ARCHS4
tissue_expr = gget.archs4("ACE2", which="tissue")
print(f"Expression across {len(tissue_expr)} tissues")
# Correlated genes from ARCHS4
correlated = gget.archs4("ACE2", which="correlation")
print(f"Top correlated gene: {correlated.iloc[0]['gene_symbol']}")
```
```python
import gget
# Single-cell data from CELLxGENE (requires gget setup cellxgene)
adata = gget.cellxgene(
gene=["ACE2", "TMPRSS2"],
tissue="lung",
c|