archs4-database

The archs4-database Claude Code skill queries the ARCHS4 REST API to retrieve uniformly processed RNA-seq expression data across over 1 million human and mouse samples from GEO and SRA. Use it to obtain tissue-specific z-scores for genes, identify co-expressed genes, search samples by metadata, access expression matrices, and validate expression patterns across populations without requiring authentication or API keys.

Repository: SciAgent-Skills

Install in Claude Code

git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/archs4-database && cp -r /tmp/archs4-database/skills/genomics-bioinformatics/databases/archs4-database ~/.claude/skills/archs4-database

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# ARCHS4 Database

## Overview

ARCHS4 (All RNA-seq and ChIP-seq Sample and Signature Search) is a resource of uniformly aligned and processed human and mouse RNA-seq data from NCBI GEO and SRA, covering 1 million+ samples. The REST API at `https://maayanlab.cloud/archs4/api/` provides gene-level expression profiles, z-score normalized tissue expression, co-expression networks, and sample metadata search — all without authentication. Large-scale bulk queries can also use the downloadable HDF5 expression matrices.

## When to Use

- Retrieving tissue-specific or cell-type-specific expression z-scores for a gene of interest across hundreds of tissue types
- Finding genes co-expressed with a query gene (co-expression network construction or guilt-by-association analysis)
- Searching for RNA-seq samples by tissue, disease, or metadata keyword to identify candidate datasets for reanalysis
- Comparing expression profiles of multiple genes across tissues to prioritize candidates for wet-lab follow-up
- Accessing uniformly processed gene expression matrices (HDF5 format) for large-scale cross-study analysis
- Validating differential expression results by checking whether a gene's expression direction matches population-level tissue profiles
- Not for variant-level population allele frequencies: use `gnomad-database` instead; ARCHS4 provides expression evidence only
- Not for Enrichr pathway enrichment from a gene list: use `gget-genomic-databases` (`gget enrichr`) instead; ARCHS4 is for expression lookups

## Prerequisites

- **Python packages**: `requests`, `pandas`, `matplotlib`, `seaborn`
- **Data requirements**: gene symbols (HGNC format, e.g., `TP53`, `BRCA1`); sample GEO/SRA IDs for direct sample queries
- **Environment**: internet connection; no API key or account required
- **Rate limits**: ~10 requests/second; add `time.sleep(0.1)` between sequential gene queries to avoid throttling

```bash
pip install requests pandas matplotlib seaborn
```
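
The rate-limit guidance above can be sketched as a small helper that spaces out sequential calls. This is a minimal sketch, not part of the skill: `throttled_map` and its `min_interval` parameter are hypothetical names, and the 0.1 s default simply mirrors the ~10 requests/second figure quoted above. The fetch function is passed in as an argument, so any per-gene query can be plugged in.

```python
import time
from typing import Callable, Dict, Iterable, TypeVar

T = TypeVar("T")

def throttled_map(fetch: Callable[[str], T], genes: Iterable[str],
                  min_interval: float = 0.1) -> Dict[str, T]:
    """Call fetch(gene) for each gene, keeping at least min_interval seconds
    between the start of consecutive calls (hypothetical helper)."""
    results: Dict[str, T] = {}
    last_start = None
    for gene in genes:
        if last_start is not None:
            # Sleep only for the part of the interval that has not yet elapsed
            wait = min_interval - (time.monotonic() - last_start)
            if wait > 0:
                time.sleep(wait)
        last_start = time.monotonic()
        results[gene] = fetch(gene)
    return results
```

With the `archs4_get` helper from Quick Start, a call such as `throttled_map(lambda g: archs4_get(f"meta/genes/{g}/zscore"), ["TP53", "MYC"])` would issue one request per gene at no more than roughly ten per second. Measuring from the start of each call means a slow response does not add an extra fixed delay on top.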

## Quick Start

```python
from typing import Optional

import requests

ARCHS4_BASE = "https://maayanlab.cloud/archs4/api/v1"

def archs4_get(endpoint: str, params: Optional[dict] = None) -> dict:
    """Send a GET request to the ARCHS4 API and return parsed JSON."""
    r = requests.get(f"{ARCHS4_BASE}/{endpoint}", params=params, timeout=30)
    r.raise_for_status()
    return r.json()

# Quick check: top tissues expressing TP53
data = archs4_get("meta/genes/TP53/zscore")
tissues = data.get("values", [])
print(f"TP53 tissue expression entries: {len(tissues)}")
top5 = sorted(tissues, key=lambda x: x.get("zscore", 0), reverse=True)[:5]
for t in top5:
    print(f"  {t['tissue']:<40}  z={t['zscore']:.2f}")
# TP53 tissue expression entries: 200
#   thymus                                   z=2.81
#   testis                                   z=2.44
```

## Core API

### Query 1: Gene Expression Z-Scores Across Tissues

Retrieve z-score normalized expression for a gene across all available tissue types. Z-scores are computed per-sample relative to the population distribution; positive values indicate above-average expression.

```python
import requests
import pandas as pd

ARCHS4_BASE = "https://maayanlab.cloud/archs4/api/v1"

def get_gene_tissue_zscore(gene_symbol: str, species: str = "human") -> pd.DataFrame:
    """Return tissue z-score expression profile for a gene.

    Parameters
    ----------
    gene_symbol : str
        HGNC gene symbol (e.g., 'TP53').
    species : str
        'human' or 'mouse' (default: 'human').
    """
    endpoint = f"meta/genes/{gene_symbol}/zscore"
    r = requests.get(
        f"{ARCHS4_BASE}/{endpoint}",
        params={"species": species},
        timeout=30
    )
    r.raise_for_status()
    data = r.json()
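    records = data.get("values", [])
    df = pd.DataFrame(records)
    # Guard against an empty or unexpected response before sorting
    if df.empty or "zscore" not in df.columns:
        return pd.DataFrame(columns=["tissue", "zscore"])
    return df.sort_values("zscore", ascending=False).reset_index(drop=True)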

df = get_gene_tissue_zscore("MYC")
print(f"MYC tissue z-scores: {len(df)} tissue types")
print(df[["tissue", "zscore"]].head(10).to_string(index=False))
# MYC tissue z-scores: 200
#                     tissue  zscore
#                      colon    3.12
#             small intestine    2.98
#                    placenta    2.74
```

```python
# Query mouse tissues for a gene
df_mouse = get_gene_tissue_zscore("Myc", species="mouse")
print("Mouse Myc: top 5 tissues")
print(df_mouse[["tissue", "zscore"]].head(5).to_string(index=False))
```
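
To make the z-score interpretation concrete, here is a minimal sketch of a standard z-score, `(value - mean) / standard deviation`, computed over a handful of made-up expression values. The exact normalization ARCHS4 applies is not specified here, so treat this as the textbook definition rather than the API's implementation. The `zscores` helper, the tissue names, and the values are all invented for illustration.

```python
from statistics import mean, pstdev

def zscores(values: dict) -> dict:
    """Standard z-score: (value - mean) / population standard deviation."""
    vals = list(values.values())
    mu = mean(vals)
    sigma = pstdev(vals)
    if sigma == 0:
        # All values identical: nothing is above or below average
        return {k: 0.0 for k in values}
    return {k: (v - mu) / sigma for k, v in values.items()}

# Invented expression values, for illustration only
expr = {"tissue_a": 2.0, "tissue_b": 4.0, "tissue_c": 6.0}
z = zscores(expr)
for tissue, score in sorted(z.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tissue:<10} z={score:+.2f}")
```

Here `tissue_c` sits above the mean and gets a positive score, `tissue_b` equals the mean and scores zero, and `tissue_a` gets the mirror-image negative score.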

### Query 2: Co-expressed Genes

Find genes whose expression is most correlated with a query gene across all ARCHS4 samples. Useful for identifying pathway partners, regulators, or candidate targets.

```python
import requests
import pandas as pd

ARCHS4_BASE = "https://maayanlab.cloud/archs4/api/v1"

def get_coexpressed_genes(gene_symbol: str, top_n: int = 50,
                           species: str = "human") -> pd.DataFrame:
    """Return genes co-expressed with the query gene.

    Parameters
    ----------
    gene_symbol : str
        HGNC gene symbol.
    top_n : int
        Number of correlated genes to return (default: 50).
    species : str
        'human' or 'mouse' (default: 'human').
    """
    r = requests.get(
        f"{ARCHS4_BASE}/meta/genes/{gene_symbol}/correlations",
        params={"species": species, "limit": top_n},
        timeout=30
    )
    r.raise_for_status()
    data = r.json()
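    records = data.get("values", [])
    df = pd.DataFrame(records)
    # Guard against an empty or unexpected response before sorting
    if df.empty or "correlation" not in df.columns:
        return pd.DataFrame(columns=["gene", "correlation"])
    return df.sort_values("correlation", ascending=False).reset_index(drop=True)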

coexp = get_coexpressed_genes("PCNA", top_n=20)
print(f"Top co-expressed genes with PCNA (n={len(coexp)}):")
print(coexp[["gene", "correlation"]].head(10).to_string(index=False))
# Top co-expressed genes with PCNA (n=20):
#   gene  correlation
#   RFC4         0.91
#   RFC2         0.89
#   MCM6         0.87
```

```python
# Extract gene list for downstream enrichment
gene_list = coexp["gene"].tolist()
print(f"Co-expression gene list: {gene_list[:10]}")
# Pass gene_list to Enrichr or pathway analysis tools
```
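
As a rough illustration of what "most correlated" means, the sketch below ranks genes by Pearson correlation with a query gene over a tiny invented expression table. Pearson correlation is an assumption here, since the metric behind the endpoint is not stated above. `pearson` and `rank_coexpressed` are hypothetical helpers, and the gene names and values are made up.

```python
from math import sqrt

def pearson(x: list, y: list) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant profile has no defined correlation
    return cov / (sx * sy)

def rank_coexpressed(query: str, expr: dict, top_n: int = 50) -> list:
    """Return (gene, correlation) pairs sorted from most to least correlated."""
    scores = [(g, pearson(expr[query], vals))
              for g, vals in expr.items() if g != query]
    return sorted(scores, key=lambda gv: gv[1], reverse=True)[:top_n]

# Invented expression values across four samples, for illustration only
expr = {
    "GENE_Q": [1.0, 2.0, 3.0, 4.0],
    "GENE_A": [2.0, 4.0, 6.0, 8.0],   # rises with GENE_Q
    "GENE_B": [4.0, 3.0, 2.0, 1.0],   # falls as GENE_Q rises
    "GENE_C": [1.0, 3.0, 2.0, 4.0],   # loosely follows GENE_Q
}
ranked = rank_coexpressed("GENE_Q", expr)
for gene, r in ranked:
    print(f"{gene}  r={r:+.2f}")
```

Sorting in descending order puts positively correlated genes first and anti-correlated genes last, which matches the ordering used by `get_coexpressed_genes` above.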

### Query 3: Sample Search

Search fo