geo-database
NCBI GEO access via GEOparse and E-utilities. Search by keyword/organism/platform, download GSE series matrices, parse GPL annotations, extract GSM metadata, load expression matrices into pandas. For single-cell use cellxgene-census; for multi-DB access use gget-genomic-databases.
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/geo-database && cp -r /tmp/geo-database/skills/genomics-bioinformatics/databases/geo-database ~/.claude/skills/geo-database
# GEO Gene Expression Omnibus Database
## Overview
GEO (Gene Expression Omnibus) is NCBI's public repository for high-throughput functional genomics data, containing 200,000+ datasets (series) from microarray, RNA-seq, ChIP-seq, methylation, and proteomics experiments. GEOparse provides a Python interface for downloading and parsing GEO records (GSE series, GPL platforms, GSM samples), while NCBI E-utilities enable programmatic search across GEO's metadata.
## When to Use
- Searching for publicly available gene expression datasets by organism, tissue, disease, or experimental condition
- Downloading and parsing a specific GEO series (GSE) with its expression matrix and sample metadata
- Extracting sample annotation tables (e.g., treatment groups, clinical covariates) for meta-analysis
- Loading microarray expression data (GPL platform-annotated probes) into a tidy DataFrame
- Retrieving all GEO experiments associated with a gene or pathway of interest
- Building automated pipelines that download and process GEO datasets for downstream analysis
- For single-cell RNA-seq data at scale, use `cellxgene-census`; for raw sequencing reads, download FASTQ from ENA/SRA instead
## Prerequisites
- **Python packages**: `GEOparse`, `requests`, `pandas`
- **Data requirements**: GSE/GPL/GSM accession numbers, or search terms
- **Environment**: internet connection; write access to local directory for downloads
- **Rate limits**: E-utilities allows 3 requests/s without an API key and 10 requests/s with one; GEO FTP downloads are not subject to the E-utilities limits
```bash
pip install GEOparse requests pandas
```
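The E-utilities rate limit listed above is easy to exceed in a loop. The following is a minimal sketch of a client-side throttle, assuming one process and sequential calls. The `Throttle` class and its `clock`/`sleep` parameters are hypothetical helpers written for this example; they are not part of GEOparse or `requests`.

```python
import time


class Throttle:
    """Space successive calls at least 1/rate_per_sec seconds apart."""

    def __init__(self, rate_per_sec, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate_per_sec
        self._clock = clock      # injectable so the class can be tested offline
        self._sleep = sleep
        self._last = None        # time of the previous call; None before the first

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# 3 requests/s is the limit quoted above for requests without an API key
throttle = Throttle(rate_per_sec=3)
```

Calling `throttle.wait()` immediately before each E-utilities request keeps a sequential script under the limit; it does not coordinate across threads or processes.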
## Quick Start
```python
import GEOparse
# Download a GEO series (cached in ./geo_data/)
gse = GEOparse.get_GEO("GSE2553", destdir="./geo_data/")
print(f"Title: {gse.metadata['title'][0]}")
print(f"Samples: {len(gse.gsms)}")
print(f"Platform: {list(gse.gpls.keys())}")
# Sample metadata
meta = gse.phenotype_data
print(meta.head())
```
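`get_GEO` is driven by the accession string, so a pipeline may want to check the accession type before downloading. Here is a small sketch that maps the standard GEO prefixes (GSE, GSM, GPL, GDS) to record types. The function name `classify_accession` is hypothetical and the check is purely syntactic; it does not confirm that the record exists.

```python
import re

# Standard GEO accession prefixes and the record type each denotes
_PREFIXES = {"GSE": "series", "GSM": "sample", "GPL": "platform", "GDS": "dataset"}
_ACCESSION = re.compile(r"^(GSE|GSM|GPL|GDS)\d+$")


def classify_accession(accession):
    """Return the GEO record type for an accession, or raise ValueError."""
    match = _ACCESSION.match(accession.strip().upper())
    if match is None:
        raise ValueError(f"Not a GEO accession: {accession!r}")
    return _PREFIXES[match.group(1)]


print(classify_accession("GSE2553"))   # series
print(classify_accession("GSM45553"))  # sample
```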
## Core API
### Query 1: Search GEO Datasets via E-utilities
Find GEO series (GSE) by keyword, organism, or dataset type.
```python
import requests
EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
def geo_search(query, retmax=20):
    """Run an ESearch query against the GEO DataSets (gds) database."""
    r = requests.get(f"{BASE}/esearch.fcgi",
                     params={"db": "gds", "term": query,
                             "retmax": retmax, "retmode": "json", "email": EMAIL})
    r.raise_for_status()
    return r.json()["esearchresult"]

# Search for human breast cancer series (GSE entries)
result = geo_search(
    "breast cancer[title] AND Homo sapiens[organism] AND gse[entry type]",
    retmax=10
)
print(f"Found {result['count']} matching GEO datasets")
print(f"First UIDs: {result['idlist']}")
```
```python
# Search for a specific platform (e.g., Illumina HumanHT-12); reuses geo_search() from above
result = geo_search(
    "Illumina HumanHT-12[platform] AND Homo sapiens[organism] AND gse[entry type]",
    retmax=5
)
print(f"Illumina HumanHT-12 human datasets: {result['count']}")
```
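The query strings above follow a regular pattern: each clause is `value[field]`, and clauses are joined with `AND`. A sketch of a helper that assembles such terms from keyword arguments follows. `build_geo_term` is a hypothetical name, and the helper only formats text; it does not check that a field tag is one the gds database recognizes.

```python
def build_geo_term(entry_type="gse", **fields):
    """Join fielded clauses with AND, e.g. title='x' becomes 'x[title]'.

    Underscores in keyword names are turned into spaces so that a
    multi-word field tag can be passed as a Python keyword.
    """
    clauses = [
        f"{value}[{field.replace('_', ' ')}]"
        for field, value in fields.items()
        if value
    ]
    if entry_type:
        clauses.append(f"{entry_type}[entry type]")
    return " AND ".join(clauses)


term = build_geo_term(title="breast cancer", organism="Homo sapiens")
print(term)
# breast cancer[title] AND Homo sapiens[organism] AND gse[entry type]
```

The resulting string can be passed as the `term` parameter of an ESearch request.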
### Query 2: Fetch Dataset Summary Metadata
Retrieve title, accession, and organism for search results.
```python
import requests
EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
def geo_summary(uids):
    """Fetch ESummary records for a list of GEO DataSets UIDs."""
    r = requests.post(f"{BASE}/esummary.fcgi",
                      data={"db": "gds", "id": ",".join(uids),
                            "retmode": "json", "email": EMAIL})
    r.raise_for_status()
    return r.json()["result"]

# Search for a few series, then fetch their summaries
search = requests.get(
    f"{BASE}/esearch.fcgi",
    params={"db": "gds", "term": "lung cancer[title] AND gse[entry type]",
            "retmax": 3, "retmode": "json", "email": EMAIL}
)
search.raise_for_status()
uids = search.json()["esearchresult"]["idlist"]

summaries = geo_summary(uids)
for uid in summaries.get("uids", []):
    s = summaries[uid]
    print(f"\nAccession: {s.get('accession')} | {s.get('title')}")
    print(f"  Organism: {s.get('taxon')}")
    print(f"  Samples: {s.get('n_samples')}")
    print(f"  Type: {s.get('gdstype')}")
```
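For anything beyond printing, it may help to flatten the ESummary result into one row per dataset. The sketch below assumes the JSON layout used in the loop above (a `uids` list plus one record per UID). `summaries_to_rows` is a hypothetical helper, and the example payload is made up for illustration rather than taken from a real response.

```python
def summaries_to_rows(result, fields=("accession", "title", "taxon", "n_samples", "gdstype")):
    """Flatten an ESummary 'result' dict into a list of row dicts."""
    rows = []
    for uid in result.get("uids", []):
        record = result.get(uid, {})
        row = {"uid": uid}
        for field in fields:
            row[field] = record.get(field)  # None when the field is absent
        rows.append(row)
    return rows


# Made-up payload that mimics the structure only
example = {
    "uids": ["1", "2"],
    "1": {"accession": "GSE0001", "title": "Example A", "taxon": "Homo sapiens",
          "n_samples": 12, "gdstype": "Expression profiling by array"},
    "2": {"accession": "GSE0002", "title": "Example B", "taxon": "Mus musculus"},
}
rows = summaries_to_rows(example)
print(rows[0]["accession"], rows[1]["n_samples"])
```

The list of dicts can be handed to `pandas.DataFrame(rows)` to obtain a table for filtering or sorting.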
### Query 3: Download and Parse a GEO Series
Use GEOparse to download a full GSE record with expression matrix and sample metadata.
```python
import GEOparse
# Download GSE (cached in destdir; the download is skipped if the file is already present)
gse = GEOparse.get_GEO("GSE2553", destdir="./geo_data/", silent=True)
# Series metadata
print(f"Title : {gse.metadata['title'][0]}")
print(f"Summary : {gse.metadata['summary'][0][:200]}...")
print(f"Samples : {len(gse.gsms)} GSMs")
print(f"Platforms: {list(gse.gpls.keys())}")
```
```python
# Sample metadata table (phenotype data)
meta = gse.phenotype_data
print(f"Metadata columns: {list(meta.columns)}")
print(meta.head())
```
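Treatment groups and clinical covariates usually live in each sample's characteristics, which submitters commonly write as `key: value` strings. The sketch below parses such strings into a dict. It assumes that convention holds for the series at hand (entries without a colon are skipped), and `parse_characteristics` is a hypothetical helper, not a GEOparse function.

```python
def parse_characteristics(entries):
    """Turn ['tissue: lung', 'age: 45'] into {'tissue': 'lung', 'age': '45'}."""
    parsed = {}
    for entry in entries:
        key, sep, value = entry.partition(":")
        if not sep:
            continue  # not in "key: value" form; skip rather than guess
        parsed[key.strip().lower()] = value.strip()
    return parsed


# Illustrative input, not taken from a real sample
print(parse_characteristics(["tissue: lung", "Age: 45", "free text note"]))
# {'tissue': 'lung', 'age': '45'}
```

With a loaded series, one possible use is `{name: parse_characteristics(gsm.metadata.get("characteristics_ch1", [])) for name, gsm in gse.gsms.items()}`, assuming the samples store their characteristics under the `characteristics_ch1` key.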
### Query 4: Extract Expression Matrix
Parse probe-level expression data and optionally merge with platform gene annotations.
```python
import GEOparse
gse = GEOparse.get_GEO("GSE2553", destdir="./geo_data/", silent=True)
# Pivot to a probe-level expression matrix (probes × samples).
# pivot_samples takes the value column; its second argument is the index
# column (default "ID_REF"), not a platform ID.
gpl_id = list(gse.gpls.keys())[0]
pivot = gse.pivot_samples("VALUE")
print(f"Expression matrix shape: {pivot.shape}") # (probes, samples)
print(pivot.iloc[:5, :3])
```
```python
# Annotate probes with gene symbols from the GPL platform.
# Column names vary by platform; inspect gpl.table.columns if these are missing.
gpl = gse.gpls[gpl_id]
annot = gpl.table[["ID", "Gene Symbol", "Gene Title"]].copy()
annot.columns = ["ID", "gene_symbol", "gene_title"]
annot = annot.dropna(subset=["gene_symbol"])
annot = annot[annot["gene_symbol"] != ""]
expr_annotated = pivot.join(annot.set_index("ID"), how="inner")
print(f"Annotated expression matrix: {expr_annotated.shape}")
print(expr_annotated[["gene_symbol", "gene_title"]].head())
```
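One caveat when working with the annotated matrix: on Affymetrix-style platform tables, a probe that maps to several genes typically lists them in a single cell separated by `///`. The sketch below splits such a cell into individual symbols. `split_gene_symbols` is a hypothetical helper, and the separator is an assumption that should be checked against the platform table in use.

```python
def split_gene_symbols(raw, sep="///"):
    """Split a multi-gene annotation cell into a list of clean symbols."""
    if not isinstance(raw, str):
        return []  # covers NaN/None cells from the annotation table
    return [symbol.strip() for symbol in raw.split(sep) if symbol.strip()]


# Illustrative cell values
print(split_gene_symbols("DDR1 /// MIR4640"))  # ['DDR1', 'MIR4640']
print(split_gene_symbols("TP53"))              # ['TP53']
```

Whether to keep multi-gene probes, drop them, or assign them to the first symbol is an analysis decision; this helper only exposes the ambiguity.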
### Query 5: Download Individual Sample (GSM)
Retrieve expression values and metadata for a single sample.
```python
import GEOparse
gsm = GEOparse.get_GEO("GSM45553", destdir="./geo_data/", silent=True)
```