
archs4-database

Query ARCHS4 REST API for uniformly processed RNA-seq expression, tissue patterns, co-expression across 1M+ human/mouse samples. Retrieve z-scores, co-expressed genes, samples by metadata, HDF5 matrices. For variant population genetics use gnomad-database; for pathway enrichment use gget-genomic-databases (Enrichr).

Install in Claude Code
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/archs4-database && cp -r /tmp/archs4-database/skills/genomics-bioinformatics/databases/archs4-database ~/.claude/skills/archs4-database
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# ARCHS4 Database

## Overview

ARCHS4 (All RNA-seq and ChIP-seq Sample and Signature Search) is a resource of uniformly aligned and processed human and mouse RNA-seq data from NCBI GEO and SRA, covering 1 million+ samples. The REST API at `https://maayanlab.cloud/archs4/api/` provides gene-level expression profiles, z-score normalized tissue expression, co-expression networks, and sample metadata search — all without authentication. Large-scale bulk queries can also use the downloadable HDF5 expression matrices.

## When to Use

- Retrieving tissue-specific or cell-type-specific expression z-scores for a gene of interest across hundreds of tissue types
- Finding genes co-expressed with a query gene (co-expression network construction or guilt-by-association analysis)
- Searching for RNA-seq samples by tissue, disease, or metadata keyword to identify candidate datasets for reanalysis
- Comparing expression profiles of multiple genes across tissues to prioritize candidates for wet-lab follow-up
- Accessing uniformly processed gene expression matrices (HDF5 format) for large-scale cross-study analysis
- Validating differential expression results by checking whether a gene's expression direction matches population-level tissue profiles
- For variant-level population allele frequencies use `gnomad-database`; ARCHS4 provides expression evidence only
- For Enrichr pathway enrichment from a gene list use `gget-genomic-databases` (`gget enrichr`); ARCHS4 is for expression lookups

## Prerequisites

- **Python packages**: `requests`, `pandas`, `matplotlib`, `seaborn`
- **Data requirements**: gene symbols (HGNC format, e.g., `TP53`, `BRCA1`); sample GEO/SRA IDs for direct sample queries
- **Environment**: internet connection; no API key or account required
- **Rate limits**: ~10 requests/second; add `time.sleep(0.1)` between sequential gene queries to avoid throttling

```bash
pip install requests pandas matplotlib seaborn
```
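The rate-limit guidance above can be sketched as a small helper that pauses between sequential calls. `throttled_map` and `fake_query` are hypothetical names introduced here for illustration; the stand-in function replaces a real API call so the example runs offline.

```python
import time
from typing import Callable, Dict, Iterable, TypeVar

T = TypeVar("T")

def throttled_map(func: Callable[[str], T], genes: Iterable[str],
                  delay: float = 0.1) -> Dict[str, T]:
    """Call func(gene) for each gene, sleeping `delay` seconds between calls."""
    results: Dict[str, T] = {}
    for i, gene in enumerate(genes):
        if i > 0:
            time.sleep(delay)  # stay near the ~10 requests/second guideline
        results[gene] = func(gene)
    return results

# Offline demonstration: fake_query is a stand-in for a real ARCHS4 request.
def fake_query(gene: str) -> int:
    return len(gene)

out = throttled_map(fake_query, ["TP53", "MYC", "BRCA1"], delay=0.01)
print(out)
# {'TP53': 4, 'MYC': 3, 'BRCA1': 5}
```

In practice, `func` would be one of the query functions defined later in this document, and `delay` would stay at 0.1 or higher.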

## Quick Start

```python
from typing import Optional

import requests

ARCHS4_BASE = "https://maayanlab.cloud/archs4/api/v1"

def archs4_get(endpoint: str, params: Optional[dict] = None) -> dict:
    """Send a GET request to the ARCHS4 API and return parsed JSON."""
    r = requests.get(f"{ARCHS4_BASE}/{endpoint}", params=params, timeout=30)
    r.raise_for_status()
    return r.json()

# Quick check: top tissues expressing TP53
data = archs4_get("meta/genes/TP53/zscore")
tissues = data.get("values", [])
print(f"TP53 tissue expression entries: {len(tissues)}")
top5 = sorted(tissues, key=lambda x: x.get("zscore", 0), reverse=True)[:5]
for t in top5:
    print(f"  {t['tissue']:<40}  z={t['zscore']:.2f}")
# TP53 tissue expression entries: 200
#   thymus                                   z=2.81
#   testis                                   z=2.44
```

## Core API

### Query 1: Gene Expression Z-Scores Across Tissues

Retrieve z-score normalized expression for a gene across all available tissue types. Z-scores are computed per-sample relative to the population distribution; positive values indicate above-average expression.

```python
import requests
import pandas as pd

ARCHS4_BASE = "https://maayanlab.cloud/archs4/api/v1"

def get_gene_tissue_zscore(gene_symbol: str, species: str = "human") -> pd.DataFrame:
    """Return tissue z-score expression profile for a gene.

    Parameters
    ----------
    gene_symbol : str
        HGNC gene symbol (e.g., 'TP53').
    species : str
        'human' or 'mouse' (default: 'human').
    """
    endpoint = f"meta/genes/{gene_symbol}/zscore"
    r = requests.get(
        f"{ARCHS4_BASE}/{endpoint}",
        params={"species": species},
        timeout=30
    )
    r.raise_for_status()
    data = r.json()
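    records = data.get("values", [])
    if not records:
        # No records (e.g., unknown symbol): return an empty frame with the
        # expected columns so sorting and column selection do not raise KeyError
        return pd.DataFrame(columns=["tissue", "zscore"])
    df = pd.DataFrame(records)
    return df.sort_values("zscore", ascending=False).reset_index(drop=True)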

df = get_gene_tissue_zscore("MYC")
print(f"MYC tissue z-scores: {len(df)} tissue types")
print(df[["tissue", "zscore"]].head(10).to_string(index=False))
# MYC tissue z-scores: 200 tissue types
#                     tissue  zscore
#                      colon    3.12
#             small intestine    2.98
#                    placenta    2.74
```

```python
# Query mouse tissues for a gene
df_mouse = get_gene_tissue_zscore("Myc", species="mouse")
print("Mouse Myc: top 5 tissues")
print(df_mouse[["tissue", "zscore"]].head(5).to_string(index=False))
```
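To make the sign convention concrete, here is a minimal sketch of a textbook z-score on toy numbers. It is an illustration only: the source does not describe ARCHS4's exact normalization procedure, and the tissue names and values below are made up.

```python
from statistics import mean, pstdev

def zscores(values: dict) -> dict:
    """Standardize values: subtract the mean, divide by the population std dev."""
    nums = list(values.values())
    mu = mean(nums)
    sigma = pstdev(nums)
    if sigma == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return {name: (v - mu) / sigma for name, v in values.items()}

# Hypothetical expression levels for three tissues (not real ARCHS4 data).
expr = {"tissue_a": 2.0, "tissue_b": 4.0, "tissue_c": 6.0}
z = zscores(expr)
for name, score in z.items():
    print(f"{name}: z={score:+.2f}")
# tissue_a: z=-1.22
# tissue_b: z=+0.00
# tissue_c: z=+1.22
```

A tissue at the mean scores zero, and tissues above the mean score positive, which matches the reading of the API's `zscore` field given in this section.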

### Query 2: Co-expressed Genes

Find genes whose expression is most correlated with a query gene across all ARCHS4 samples. Useful for identifying pathway partners, regulators, or candidate targets.

```python
import requests
import pandas as pd

ARCHS4_BASE = "https://maayanlab.cloud/archs4/api/v1"

def get_coexpressed_genes(gene_symbol: str, top_n: int = 50,
                           species: str = "human") -> pd.DataFrame:
    """Return genes co-expressed with the query gene.

    Parameters
    ----------
    gene_symbol : str
        HGNC gene symbol.
    top_n : int
        Number of correlated genes to return (default: 50).
    species : str
        'human' or 'mouse' (default: 'human').
    """
    r = requests.get(
        f"{ARCHS4_BASE}/meta/genes/{gene_symbol}/correlations",
        params={"species": species, "limit": top_n},
        timeout=30
    )
    r.raise_for_status()
    data = r.json()
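    records = data.get("values", [])
    if not records:
        # No records (e.g., unknown symbol): return an empty frame with the
        # expected columns so sorting and column selection do not raise KeyError
        return pd.DataFrame(columns=["gene", "correlation"])
    df = pd.DataFrame(records)
    return df.sort_values("correlation", ascending=False).reset_index(drop=True)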

coexp = get_coexpressed_genes("PCNA", top_n=20)
print(f"Top co-expressed genes with PCNA (n={len(coexp)}):")
print(coexp[["gene", "correlation"]].head(10).to_string(index=False))
# Top co-expressed genes with PCNA (n=20):
#   gene  correlation
#   RFC4         0.91
#   RFC2         0.89
#   MCM6         0.87
```

```python
# Extract gene list for downstream enrichment
gene_list = coexp["gene"].tolist()
print(f"Co-expression gene list: {gene_list[:10]}")
# Pass gene_list to Enrichr or pathway analysis tools
```
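For intuition about what a co-expression ranking measures, the sketch below ranks toy genes by Pearson correlation with a query gene. This assumes Pearson correlation for illustration; the source does not state which correlation measure ARCHS4 uses. The gene names, profiles, and the helpers `pearson` and `rank_coexpressed` are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(vx * vy)

def rank_coexpressed(query, profiles, top_n=2):
    """Rank all other genes by correlation with the query gene, highest first."""
    target = profiles[query]
    scores = [(g, pearson(target, v)) for g, v in profiles.items() if g != query]
    return sorted(scores, key=lambda gv: gv[1], reverse=True)[:top_n]

# Toy expression profiles across four samples (not real ARCHS4 data).
profiles = {
    "GENE_Q": [1.0, 2.0, 3.0, 4.0],
    "GENE_A": [2.0, 4.0, 6.0, 8.0],  # rises with GENE_Q
    "GENE_B": [4.0, 3.0, 2.0, 1.0],  # falls as GENE_Q rises
    "GENE_C": [1.0, 3.0, 2.0, 4.0],  # partially tracks GENE_Q
}
ranked = rank_coexpressed("GENE_Q", profiles, top_n=2)
for gene, r in ranked:
    print(f"{gene}  r={r:.2f}")
# GENE_A  r=1.00
# GENE_C  r=0.80
```

The anti-correlated gene falls to the bottom of the ranking, so a top-N list like the one returned by `get_coexpressed_genes` contains only positively associated genes.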

### Query 3: Sample Search

Search fo