ClaudeWave
Skill · 199 repo stars · updated 16d ago

regulomedb-database

Query RegulomeDB v2 GET REST API to score variants for regulatory function and retrieve overlapping evidence (TF binding, histone marks, DNase peaks, footprints, motifs, eQTLs, chromatin state). Scores range 1a (strongest) to 7 (none). Use for GWAS hit prioritization, regulatory variant annotation, cis-regulatory discovery. Use clinvar-database for pathogenicity; gwas-database for trait associations.

Install in Claude Code
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/regulomedb-database && cp -r /tmp/regulomedb-database/skills/genomics-bioinformatics/databases/regulomedb-database ~/.claude/skills/regulomedb-database
Then open a new Claude Code session; the skill loads automatically.

SKILL.md

# RegulomeDB Database

## Overview

RegulomeDB integrates large-scale functional genomics data (ENCODE, Roadmap Epigenomics) to score genetic variants for regulatory potential. Each variant receives a ranking from 1a (strongest support: eQTL/caQTL plus TF binding, a matched TF motif, a matched footprint, and a chromatin accessibility peak) to 7 (no known regulatory function). The v2 API is exposed as **GET** `https://regulomedb.org/regulome-search/`. The legacy JSON endpoints (POST `/regulome-search/`, POST `/regulome-summary/`, and GET `/regulome-datasets/`) are no longer functional: they return `regulome-notfound` stubs or HTTP 500. Access is free and requires no authentication.
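
Ranking strings such as `1a` and `7` do not sort correctly as plain text once they are mixed with missing values, so prioritization workflows usually need an explicit sort key. The sketch below is one way to do that; `rank_key` is a hypothetical helper name, and the choice to sort unknown rankings last is an assumption, not RegulomeDB behavior.

```python
import re

def rank_key(ranking):
    """Map a ranking string such as '1a' or '7' to a sortable (tier, letter) tuple."""
    m = re.fullmatch(r"(\d+)([a-z]?)", str(ranking).strip().lower())
    if m is None:
        # Assumption: unknown or missing rankings sort after every real ranking.
        return (99, "")
    return (int(m.group(1)), m.group(2))

rankings = ["7", "2b", "1f", "1a", "4"]
print(sorted(rankings, key=rank_key))
# ['1a', '1f', '2b', '4', '7']
```

The same key works with pandas, for example `df.sort_values("ranking", key=lambda s: s.map(rank_key))`.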

## When to Use

- Prioritizing GWAS hits for regulatory follow-up — identify which SNPs land in active regulatory elements
- Annotating a VCF or variant list with regulatory scores to filter to functionally relevant variants
- Identifying which transcription factors bind near a variant of interest (via the `@graph` evidence rows)
- Checking whether a non-coding variant overlaps a QTL and active chromatin simultaneously (`features.QTL`)
- Retrieving all annotated rsIDs in a genomic region for cis-regulatory analysis (region query with `nearby_snps`)
- Use `clinvar-database` instead when you need clinical pathogenicity classifications; RegulomeDB scores regulatory function, not germline disease association
- Use `gwas-database` instead when you want published GWAS associations with traits

## Prerequisites

- **Python packages**: `requests`, `pandas`, `matplotlib`
- **Data requirements**: rsIDs (e.g., `rs4946036`), genomic positions (`chr1:1000000`), or region coordinates (`chr1:1000000-2000000`)
- **Genome build**: GRCh38 (default) or GRCh37; specify in all requests
- **Rate limits**: No published rate limits; use `time.sleep(0.3)` between requests in batch workflows

```bash
pip install requests pandas matplotlib
```
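
The pause between requests recommended above can be wrapped in a small batch helper. This is a minimal sketch: `score_batch` is a hypothetical name, the `fetch` argument stands for any single-variant function (such as `regulome_score` from the Quick Start), and `fake_fetch` is an offline stand-in used only so the example runs without network access.

```python
import time

def score_batch(variants, fetch, delay=0.3, sleep=time.sleep):
    """Call fetch(variant) for each variant, pausing `delay` seconds between calls."""
    results = []
    for i, variant in enumerate(variants):
        if i > 0:
            sleep(delay)  # throttle every call after the first
        try:
            results.append(fetch(variant))
        except Exception as exc:
            # Record the failure and continue with the remaining variants.
            results.append({"query": variant, "error": str(exc)})
    return results

def fake_fetch(variant):
    """Offline stand-in for a real scoring function (illustrative values only)."""
    if variant == "bad":
        raise ValueError("not found")
    return {"query": variant, "ranking": "7"}

print(score_batch(["rs1", "bad", "rs2"], fake_fetch, delay=0))
```

Collecting errors per variant instead of raising keeps one unresolvable ID from aborting a long batch.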

## Quick Start

```python
import requests

BASE = "https://regulomedb.org"

def regulome_score(variant, genome="GRCh38"):
    """Score a single variant (rsID or chr:pos-pos) via the GET /regulome-search/ endpoint."""
    r = requests.get(
        f"{BASE}/regulome-search/",
        params={"regions": variant, "genome": genome, "format": "json"},
        timeout=30,
    )
    r.raise_for_status()
    d = r.json()
    rs = d.get("regulome_score", {})
    vs = d.get("variants", [])
    return {
        "query": variant,
        "ranking": rs.get("ranking"),           # 1a / 1b / ... / 7
        "probability": float(rs.get("probability") or 0),  # tolerate a missing or null value
        "rsids": vs[0].get("rsids") if vs else [],
        "chrom": vs[0].get("chrom") if vs else None,
        "pos": vs[0].get("start") if vs else None,
    }

print(regulome_score("rs4946036"))
# {'query': 'rs4946036', 'ranking': '7', 'probability': 0.18412,
#  'rsids': ['rs4946036'], 'chrom': 'chr6', 'pos': 114819799}
```

## Core API

### Query 1: Score a Single Variant (rsID or position)

The GET `/regulome-search/` endpoint accepts an rsID or coordinate as `regions=`. Returns a `regulome_score` block (probability, ranking, tissue-specific scores) plus `features` flags and the per-dataset `@graph` evidence rows.

```python
import requests

BASE = "https://regulomedb.org"

def score_variant(variant, genome="GRCh38"):
    """Return the regulome_score block and resolved coordinates."""
    r = requests.get(
        f"{BASE}/regulome-search/",
        params={"regions": variant, "genome": genome, "format": "json"},
        timeout=30,
    )
    r.raise_for_status()
    d = r.json()
    rs = d.get("regulome_score", {})
    vs = d.get("variants", [])
    feats = d.get("features", {})
    print(f"Variant   : {variant}")
    if vs:  # the query may not resolve to any indexed variant
        print(f"Resolved  : {vs[0].get('chrom')}:{vs[0].get('start')} ({', '.join(vs[0].get('rsids', []))})")
    print(f"Ranking   : {rs.get('ranking')}  prob={rs.get('probability')}")
    # .get() avoids a KeyError if a feature flag is absent from the response
    print(f"Features  : ChIP={feats.get('ChIP')} Chromatin_accessibility={feats.get('Chromatin_accessibility')} "
          f"QTL={feats.get('QTL')} Footprint={feats.get('Footprint')} PWM_matched={feats.get('PWM_matched')}")
    return d

# Strong-regulatory locus example
score_variant("chr11:5226739-5226740")
# Ranking: 1a (HBB beta-globin promoter, multi-evidence)
```

```python
# Score by chromosomal position alone
score_variant("chr17:7670000-7670001")  # TP53 region
```
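
The `features` flags make it easy to test for combined evidence, for example a QTL together with accessible chromatin. The sketch below works on the dictionary returned by `score_variant`. The helper names are hypothetical, the flags are assumed to be truthy or falsy values, and `sample` is a hand-written illustration, not real API output.

```python
def supported_features(response):
    """Return the names of the feature flags that are truthy in a response dict."""
    feats = response.get("features") or {}
    return sorted(name for name, flag in feats.items() if flag)

def has_qtl_and_open_chromatin(response):
    """True when both the QTL and Chromatin_accessibility flags are set."""
    feats = response.get("features") or {}
    return bool(feats.get("QTL")) and bool(feats.get("Chromatin_accessibility"))

# Hand-written sample shaped like the `features` block (illustrative values only).
sample = {"features": {"ChIP": True, "Chromatin_accessibility": True,
                       "QTL": True, "Footprint": False, "PWM_matched": False}}

print(supported_features(sample))
print(has_qtl_and_open_chromatin(sample))
```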

### Query 2: Region Scan — List Annotated Variants in a Window

A range query returns up to `limit` resolved variants (`variants[]`) and all `@graph` evidence rows in the window, plus `nearby_snps` (rsIDs adjacent to the resolved hits).

```python
import requests, pandas as pd

BASE = "https://regulomedb.org"

def scan_region(chrom, start, end, genome="GRCh38", limit=200):
    """List variants in a region with their resolved positions and overlapping rsIDs."""
    r = requests.get(
        f"{BASE}/regulome-search/",
        params={"regions": f"{chrom}:{start}-{end}", "genome": genome,
                "format": "json", "limit": limit},
        timeout=60,
    )
    r.raise_for_status()
    d = r.json()
    variants = d.get("variants", [])
    print(f"Variants in {chrom}:{start}-{end}: {len(variants)} (total indexed = {d.get('total')})")
    rows = [{"rsids": ", ".join(v.get("rsids", [])),
             "chrom": v.get("chrom"),
             "start": v.get("start"),
             "end": v.get("end")} for v in variants]
    return pd.DataFrame(rows)

df = scan_region("chr11", 5226000, 5227000)
print(df.head(10).to_string(index=False))
```
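
Because a range query returns at most `limit` variants, a wide region may need to be scanned in smaller windows. The sketch below splits a coordinate range into consecutive windows that can each be passed to `scan_region`. `split_region` is a hypothetical helper, and treating adjacent windows as sharing a boundary coordinate is an assumption about how the API interprets ranges.

```python
def split_region(start, end, window):
    """Split [start, end) into consecutive (start, end) windows of at most `window` bases."""
    if window <= 0:
        raise ValueError("window must be a positive number of bases")
    return [(s, min(s + window, end)) for s in range(start, end, window)]

for w_start, w_end in split_region(5226000, 5227000, 400):
    print(f"chr11:{w_start}-{w_end}")
# chr11:5226000-5226400
# chr11:5226400-5226800
# chr11:5226800-5227000
```

If windows turn out to overlap at their boundaries, the resulting DataFrames can be concatenated and de-duplicated on `chrom`, `start`, and `end`.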

### Query 3: Full Evidence — Parse the `@graph` Rows

Each `@graph[i]` row is one experimental piece of evidence overlapping the query. Fields: `method, target_label, biosample_ontology{term_name, organ_slims, classification}, dataset, file, value, chrom, start, end, strand, ancestry, disease_term_name`.
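
Once the rows are available as a list of dictionaries with the fields listed above, a quick summary answers the common questions of which assay types and which transcription factors overlap the variant. This is a minimal sketch: `summarize_evidence` is a hypothetical helper, and the `sample_rows` values (method names and targets) are invented placeholders, not real RegulomeDB output.

```python
from collections import Counter

def summarize_evidence(rows):
    """Count evidence rows per method and collect the distinct target labels."""
    methods = Counter(row.get("method") for row in rows)
    targets = sorted({row["target_label"] for row in rows if row.get("target_label")})
    return methods, targets

# Invented placeholder rows that only mimic the documented field names.
sample_rows = [
    {"method": "ChIP-seq", "target_label": "CTCF"},
    {"method": "ChIP-seq", "target_label": "GATA1"},
    {"method": "DNase-seq", "target_label": None},
]

methods, targets = summarize_evidence(sample_rows)
print(dict(methods))
print(targets)
```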

```python
import requests, pandas as pd

BASE = "https://regulomedb.org"

def evidence_rows(variant, genome="GRCh38"):
    r = requests.get(
        f"{BASE}/regulome-search/",
        params={"regions": variant, "genome": genome, "format": "json"},
        timeout=30,
    )
    r.raise_for_status()
    # The original listing is cut off at this point. The line below is a minimal
    # completion (an assumption) that flattens the "@graph" evidence rows; nested
    # fields become dotted columns such as biosample_ontology.term_name.
    return pd.json_normalize(r.json().get("@graph", []))
```