ClaudeWave
Skill · 199 repo stars · updated 16d ago

gget-genomic-databases

Unified CLI/Python interface to 20+ genomic databases. Gene lookups (Ensembl search/info/seq), BLAST/BLAT, AlphaFold, Enrichr enrichment, OpenTargets disease/drug, CELLxGENE single-cell, cBioPortal/COSMIC cancer, ARCHS4 expression. Spans genomics, proteomics, disease. For batch/advanced BLAST use biopython; for multi-DB Python SDK use bioservices.

Install in Claude Code
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/gget-genomic-databases && cp -r /tmp/gget-genomic-databases/skills/genomics-bioinformatics/databases/gget-genomic-databases ~/.claude/skills/gget-genomic-databases
Then open a new Claude Code session; the skill loads automatically.

SKILL.md

# gget — Unified Genomic Database Access

## Overview

gget is a command-line and Python package providing unified access to 20+ genomic databases and analysis methods. Query gene information, sequences, protein structures, expression data, and disease associations through a consistent interface. All modules work as both CLI tools and Python functions, returning DataFrames (Python) or JSON/CSV (CLI).

## When to Use

- Looking up gene information (names, IDs, descriptions) across species from Ensembl
- Retrieving nucleotide or protein sequences for Ensembl gene/transcript IDs
- Running BLAST or BLAT searches against standard reference databases
- Predicting protein 3D structures with AlphaFold2 from amino acid sequences
- Performing gene set enrichment analysis (GO, KEGG, disease terms) via Enrichr
- Querying single-cell RNA-seq datasets from CELLxGENE Census
- Finding disease and drug associations for a gene target via OpenTargets
- Downloading Ensembl reference genomes and annotations for a species
- Finding cancer mutations and genomic alterations via cBioPortal or COSMIC
- Getting tissue expression and correlated genes from ARCHS4
- For batch processing or advanced BLAST parameters, use `biopython` instead
- For programmatic multi-database workflows with rate limiting, use `bioservices` instead

## Prerequisites

- **Python packages**: `gget`
- **Optional setup**: Some modules require `gget setup <module>` before first use (alphafold, cellxgene, elm, gpt)
- **Environment**: Clean virtual environment recommended to avoid dependency conflicts
- **API notes**: gget queries remote databases — rate-limit large batch queries with `time.sleep()`. Databases update biweekly; keep gget updated. Max ~1000 Ensembl IDs per `gget.info()` call

```bash
pip install gget

# Optional: setup modules that need additional dependencies
gget setup alphafold   # ~4GB model parameters, requires OpenMM
gget setup cellxgene   # cellxgene-census package
gget setup elm         # local ELM database
```
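The rate-limiting and batch-size notes above can be sketched as a small helper. This is a minimal sketch, not part of gget: `chunked` and `run_batched` are hypothetical names, and the default chunk size of 1000 simply mirrors the approximate `gget.info()` limit mentioned above.

```python
import time
from typing import Callable, Iterator, List, Sequence


def chunked(ids: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids`, each at most `size` items long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def run_batched(
    fetch: Callable[[List[str]], object],
    ids: Sequence[str],
    chunk_size: int = 1000,  # assumption: mirrors the ~1000-ID note above
    delay: float = 1.0,      # assumption: seconds to wait between requests
) -> list:
    """Call `fetch` once per chunk, sleeping `delay` seconds between calls."""
    results = []
    for i, chunk in enumerate(chunked(ids, chunk_size)):
        if i > 0:
            time.sleep(delay)  # pause between remote requests, not before the first
        results.append(fetch(chunk))
    return results
```

Passing the query function as an argument keeps the helper independent of any one module. For example, `run_batched(gget.info, ensembl_ids)` would return one result per chunk, which could then be combined with `pandas.concat` if each result is a DataFrame.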

## Quick Start

```python
import gget

# Search for genes by keyword
results = gget.search(["BRCA1", "tumor suppressor"], species="homo_sapiens")
print(f"Found {len(results)} genes")

# Get detailed gene information (Ensembl + UniProt + NCBI)
info = gget.info(["ENSG00000012048"])
print(f"Gene: {info.iloc[0]['primary_gene_name']}")

# Enrichment analysis on a gene list
enrichment = gget.enrichr(["ACE2", "AGT", "AGTR1"], database="ontology")
print(f"Enriched terms: {len(enrichment)}")
```

## Core API

### Module 1: Reference & Gene Search (ref, search, info, seq)

Query Ensembl for gene references, search by keywords, retrieve gene metadata, and fetch sequences.

```python
import gget

# Search for genes by keyword
results = gget.search(["BRCA1", "tumor suppressor"], species="homo_sapiens")
print(f"Found {len(results)} genes")
print(results[["ensembl_id", "gene_name", "biotype"]].head())

# Get detailed gene information (Ensembl + UniProt + NCBI)
info = gget.info(["ENSG00000012048", "ENSG00000139618"])
print(f"Gene info columns: {list(info.columns)}")
```

```python
import gget

# Retrieve sequences
nucleotide_seqs = gget.seq(["ENSG00000012048"])
protein_seqs = gget.seq(["ENSG00000012048"], translate=True, isoforms=True)
# gget.seq returns a FASTA-style list (header, sequence, header, ...), so halve the length
print(f"Retrieved {len(protein_seqs) // 2} isoform sequences")

# Download reference genome files (specify release for reproducibility)
ref_links = gget.ref("homo_sapiens", which="gtf", release=112)
print(f"GTF download link: {ref_links}")
```
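Sequence results are often easier to work with when keyed by their header. The sketch below assumes that `gget.seq` returns a flat FASTA-style list in which each `>` header is followed by its sequence; `fasta_list_to_dict` is a hypothetical helper, not a gget function.

```python
from typing import Dict, List


def fasta_list_to_dict(records: List[str]) -> Dict[str, str]:
    """Pair each '>' header with the sequence lines that follow it."""
    sequences: Dict[str, str] = {}
    header = None
    for line in records:
        line = line.strip()
        if not line:
            continue  # ignore blank entries
        if line.startswith(">"):
            header = line[1:]  # drop the leading '>'
            sequences[header] = ""
        elif header is None:
            raise ValueError("sequence line found before any '>' header")
        else:
            sequences[header] += line  # join wrapped sequence lines
    return sequences
```

With this in place, `len(fasta_list_to_dict(protein_seqs))` would give the number of distinct records rather than the raw list length.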

### Module 2: Sequence Alignment (blast, blat, muscle, diamond)

BLAST/BLAT remote searches, multiple sequence alignment, and fast local alignment.

```python
import gget
import time

# BLAST against SwissProt (remote API — add delay for batch queries)
blast_results = gget.blast(
    "MKWMFKEDHSLEHRCVESAKIRAKYPDRVPVIVEKVSGSQIVDIDKRKYLVPSDITVAQFMWIIRKRIQLPSEKAIFLFVDKTVPQSR",
    database="swissprot", limit=10
)
print(f"Top hit: {blast_results.iloc[0]['Description']}, E-value: {blast_results.iloc[0]['E value']}")
time.sleep(2)  # Rate-limit between BLAST queries

# BLAT — find genomic position (UCSC)
blat_results = gget.blat("ATCGATCGATCGATCGATCG", assembly="human")
# The chromosome column is expected to already carry the "chr" prefix (e.g. "chr8")
print(f"Genomic location: {blat_results.iloc[0]['chromosome']}:{blat_results.iloc[0]['start']}")
```

```python
import gget

# Multiple sequence alignment with Muscle5
# out= sets where the aligned FASTA is written; the file name here is only an example
gget.muscle("sequences.fasta", out="sequences_aligned.afa")

# Fast local alignment with DIAMOND (local, no rate limit needed)
diamond_results = gget.diamond(
    "GGETISAWESQME",
    reference="reference.fasta",
    sensitivity="very-sensitive",
    threads=4
)
print(f"Alignments found: {len(diamond_results)}")
```
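The alignment examples above read their input from FASTA files such as `sequences.fasta` and `reference.fasta`. A minimal sketch for producing such a file from in-memory sequences follows; `write_fasta` is a hypothetical helper rather than part of gget, and the 60-character line width is only a common convention.

```python
from pathlib import Path
from typing import Mapping


def write_fasta(sequences: Mapping[str, str], path: str, width: int = 60) -> Path:
    """Write name -> sequence pairs as FASTA, wrapping sequence lines at `width`."""
    lines = []
    for name, seq in sequences.items():
        lines.append(f">{name}")
        # split each sequence into fixed-width lines
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    out = Path(path)
    out.write_text("\n".join(lines) + "\n")
    return out
```

For example, `write_fasta({"query_1": "GGETISAWESQME"}, "sequences.fasta")` would create a one-record file that could be passed to the calls above.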

### Module 3: Protein Structure (pdb, alphafold, elm)

Download PDB structures, predict structures with AlphaFold2, find linear motifs.

```python
import gget

# Download PDB structure
pdb_data = gget.pdb("7S7U", save=True)

# Predict structure with AlphaFold2 (requires gget setup alphafold)
structure = gget.alphafold(
    "MKWMFKEDHSLEHRCVESAKIRAKYPDRVPVIVEKVSGSQIVDIDKRKYLVPSDITVAQFMWIIRKRIQLPSEKAIFLFVDKTVPQSR",
    plot=True, show_sidechains=True
)
print("Structure prediction complete, PDB file saved")
```

```python
import gget

# Find Eukaryotic Linear Motifs (requires gget setup elm)
ortholog_df, regex_df = gget.elm("LIAQSIGQASFV")
print(f"Ortholog motifs: {len(ortholog_df)}, Regex motifs: {len(regex_df)}")
```

### Module 4: Expression & Correlation (archs4, cellxgene, bgee)

Gene expression, tissue expression, correlated genes, single-cell data.

```python
import gget

# Tissue expression from ARCHS4
tissue_expr = gget.archs4("ACE2", which="tissue")
print(f"Expression across {len(tissue_expr)} tissues")

# Correlated genes from ARCHS4
correlated = gget.archs4("ACE2", which="correlation")
print(f"Top correlated gene: {correlated.iloc[0]['gene_symbol']}")
```

```python
import gget

# Single-cell data from CELLxGENE (requires gget setup cellxgene)
adata = gget.cellxgene(
    gene=["ACE2", "TMPRSS2"],
    tissue="lung",
    c