Skill286 estrellas del repoactualizado 5d ago

geo-database

The geo-database skill provides programmatic access to NCBI's Gene Expression Omnibus (GEO) repository, enabling users to search for publicly available genomic datasets by keyword, organism, or platform, download GEO series (GSE) with expression matrices and metadata, and parse sample annotations into pandas DataFrames. Use this skill when retrieving microarray or bulk RNA-seq expression data, extracting sample metadata for meta-analysis, or building automated pipelines that integrate GEO datasets into downstream analysis workflows.

Ver fuente Repositorio: SciAgent-Skills

Instalar en Claude Code

Copiar

git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/geo-database && cp -r /tmp/geo-database/skills/genomics-bioinformatics/databases/geo-database ~/.claude/skills/geo-database

Después abre una sesión nueva de Claude Code; el skill carga automáticamente.

Definición

SKILL.md

# GEO Gene Expression Omnibus Database

## Overview

GEO (Gene Expression Omnibus) is NCBI's public repository for high-throughput functional genomics data, containing 200,000+ datasets (series) from microarrays, RNA-seq, ChIP-seq, methylation, and proteomics experiments. GEOparse provides a Python interface for downloading and parsing GEO records (GSE series, GPL platforms, GSM samples) while NCBI E-utilities enables programmatic search across GEO's metadata.

## When to Use

- Searching for publicly available gene expression datasets by organism, tissue, disease, or experimental condition
- Downloading and parsing a specific GEO series (GSE) with its expression matrix and sample metadata
- Extracting sample annotation tables (e.g., treatment groups, clinical covariates) for meta-analysis
- Loading microarray expression data (GPL platform-annotated probes) into a tidy DataFrame
- Retrieving all GEO experiments associated with a gene or pathway of interest
- Building automated pipelines that download and process GEO datasets for downstream analysis
- For single-cell RNA-seq data at scale, use `cellxgene-census`; for aligned reads, download FASTQ from ENA/SRA instead

## Prerequisites

- **Python packages**: `GEOparse`, `requests`, `pandas`
- **Data requirements**: GSE/GPL/GSM accession numbers, or search terms
- **Environment**: internet connection; write access to local directory for downloads
- **Rate limits**: E-utilities: 3 req/s unauthenticated, 10 req/s with API key; GEO FTP is unlimited

```bash
pip install GEOparse requests pandas
```

## Quick Start

```python
import GEOparse

# Download a GEO series (caches in current directory)
gse = GEOparse.get_GEO("GSE2553", destdir="./geo_data/")
print(f"Title: {gse.metadata['title'][0]}")
print(f"Samples: {len(gse.gsms)}")
print(f"Platform: {list(gse.gpls.keys())}")

# Sample metadata
meta = gse.phenotype_data
print(meta.head())
```

## Core API

### Query 1: Search GEO Datasets via E-utilities

Find GEO series (GSE) by keyword, organism, or dataset type.

```python
import requests

EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def geo_search(query, retmax=20):
    r = requests.get(f"{BASE}/esearch.fcgi",
                     params={"db": "gds", "term": query,
                             "retmax": retmax, "retmode": "json", "email": EMAIL})
    r.raise_for_status()
    return r.json()["esearchresult"]

# Search for human breast cancer RNA-seq datasets
result = geo_search(
    "breast cancer[title] AND Homo sapiens[organism] AND gse[entry type]",
    retmax=10
)
print(f"Found {result['count']} matching GEO datasets")
print(f"First accessions (UIDs): {result['idlist']}")
```

```python
# Search for specific platform (e.g., Illumina HumanHT-12)
result = geo_search(
    "Illumina HumanHT-12[platform] AND Homo sapiens[organism] AND gse[entry type]",
    retmax=5
)
print(f"Illumina HumanHT-12 human datasets: {result['count']}")
```

### Query 2: Fetch Dataset Summary Metadata

Retrieve title, accession, and organism for search results.

```python
import requests

EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def geo_summary(uids):
    r = requests.post(f"{BASE}/esummary.fcgi",
                      data={"db": "gds", "id": ",".join(uids),
                            "retmode": "json", "email": EMAIL})
    r.raise_for_status()
    return r.json()["result"]

# Get metadata for search results
result = geo_search_func = lambda q: requests.get(
    f"{BASE}/esearch.fcgi",
    params={"db": "gds", "term": q, "retmax": 3, "retmode": "json", "email": EMAIL}
).json()["esearchresult"]["idlist"]

uids = requests.get(
    f"{BASE}/esearch.fcgi",
    params={"db": "gds", "term": "lung cancer[title] AND gse[entry type]",
            "retmax": 3, "retmode": "json", "email": EMAIL}
).json()["esearchresult"]["idlist"]

summaries = geo_summary(uids)
for uid in summaries.get("uids", []):
    s = summaries[uid]
    print(f"\nAccession: {s.get('accession')} | {s.get('title')}")
    print(f"  Organism: {s.get('taxon')}")
    print(f"  Samples: {s.get('n_samples')}")
    print(f"  Type: {s.get('gdstype')}")
```

### Query 3: Download and Parse a GEO Series

Use GEOparse to download a full GSE record with expression matrix and sample metadata.

```python
import GEOparse

# Download GSE (auto-caches; skip download if already present)
gse = GEOparse.get_GEO("GSE2553", destdir="./geo_data/", silent=True)

# Series metadata
print(f"Title   : {gse.metadata['title'][0]}")
print(f"Summary : {gse.metadata['summary'][0][:200]}...")
print(f"Samples : {len(gse.gsms)} GSMs")
print(f"Platforms: {list(gse.gpls.keys())}")
```

```python
# Sample metadata table (phenotype data)
meta = gse.phenotype_data
print(f"Metadata columns: {list(meta.columns)}")
print(meta.head())
```

### Query 4: Extract Expression Matrix

Parse probe-level expression data and optionally merge with platform gene annotations.

```python
import GEOparse, pandas as pd

gse = GEOparse.get_GEO("GSE2553", destdir="./geo_data/", silent=True)

# Pivot to gene expression matrix (probes × samples)
gpl_id = list(gse.gpls.keys())[0]
pivot = gse.pivot_samples("VALUE", gpl_id)
print(f"Expression matrix shape: {pivot.shape}")  # (probes, samples)
print(pivot.iloc[:5, :3])
```

```python
# Annotate probes with gene symbols from the GPL platform
gpl = gse.gpls[gpl_id]
annot = gpl.table[["ID", "Gene Symbol", "Gene Title"]].copy()
annot.columns = ["ID", "gene_symbol", "gene_title"]
annot = annot.dropna(subset=["gene_symbol"])
annot = annot[annot["gene_symbol"] != ""]

expr_annotated = pivot.join(annot.set_index("ID"), how="inner")
print(f"Annotated expression matrix: {expr_annotated.shape}")
print(expr_annotated[["gene_symbol", "gene_title"]].head())
```

### Query 5: Download Individual Sample (GSM)

Retrieve expression values and metadata for a single sample.

```python
import GEOparse

gsm = GEOparse.get_GEO("GSM45553", destdir="./geo_data/", silent=Tru

Del mismo repositorio

sciagent-skill-creatorSkill

opentrons-integrationSkill

Opentrons Protocol API v2 for OT-2/Flex: Python protocols for pipetting, serial dilutions, PCR, plate replication; control thermocycler, heater-shaker, magnetic, temperature modules. Use pylabrobot for multi-vendor.

plotly-interactive-visualizationSkill

Interactive visualization with Plotly. 40+ chart types (scatter, line, heatmap, 3D, geographic) with hover, zoom, pan. Two APIs: Plotly Express (DataFrame) and Graph Objects (fine control). For static publication figures use matplotlib; for statistical grammar use seaborn.

seaborn-statistical-visualizationSkill

Statistical visualization on matplotlib + pandas. Distributions (histplot, kdeplot, violin, box), relational (scatter, line), categorical, regression, correlation heatmaps. Auto aggregation/CIs. Use plotly for interactive; matplotlib for low-level.

single-cell-annotationSkill

Best practices for single-cell RNA-seq cell type annotation including marker-based, reference-based, and automated classification approaches.

pymc-bayesian-modelingSkill

Bayesian modeling with PyMC 5: priors, likelihood, NUTS/ADVI sampling, diagnostics (R-hat, ESS), LOO/WAIC comparison, prediction. Hierarchical, logistic, GP variants; predictive checks.

scikit-survival-analysisSkill

Time-to-event modeling with scikit-survival: Cox PH (elastic net), Random Survival Forests, Boosting, SVMs for censored data. C-index, Brier, time-dependent AUC; Kaplan-Meier, Nelson-Aalen, competing risks. Pipeline/GridSearchCV compatible. Use statsmodels for frequentist, pymc for Bayesian, lifelines for parametric.

statistical-analysisSkill