
encode-database

ENCODE Portal REST API for regulatory genomics: TF ChIP-seq, ATAC-seq/DNase-seq peaks, histone marks, and RNA-seq across 1000+ cell types. Search experiments by assay/biosample/target; download BED/bigWig; retrieve SCREEN cCREs by region or gene. Use to annotate variants with regulatory tracks, find open chromatin in a cell type, or fetch peak files for ChIP/ATAC analysis. For regulatory variant scoring use regulomedb-database; for GWAS associations use gwas-database.

Install in Claude Code
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/encode-database && cp -r /tmp/encode-database/skills/genomics-bioinformatics/databases/encode-database ~/.claude/skills/encode-database
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# ENCODE Database

## Overview

The ENCODE (Encyclopedia of DNA Elements) Project has generated thousands of functional genomics experiments (TF ChIP-seq, ATAC-seq, DNase-seq, histone ChIP-seq, and RNA-seq) across 1000+ human and mouse cell types and tissues. The ENCODE Portal REST API provides structured JSON access to experiment metadata, file download URLs, and SCREEN cCRE (candidate cis-Regulatory Element) annotations. Released data is freely accessible without authentication.

## When to Use

- Downloading TF ChIP-seq peak files (BED) for a specific transcription factor and cell type to annotate regulatory regions
- Finding ATAC-seq or DNase-seq peaks in a cell type to identify open chromatin regions near a gene of interest
- Retrieving cCREs (candidate cis-Regulatory Elements) overlapping a genomic region from ENCODE SCREEN
- Building reference regulatory tracks for variant annotation pipelines (e.g., annotating VCF variants against ENCODE peak sets)
- Exploring which experiments are available for a biosample (cell line, tissue, developmental stage) before planning a wet-lab experiment
- Querying all ChIP-seq experiments for a transcription factor across multiple cell types for comparative regulatory analysis
- Use `regulomedb-database` instead when you want pre-computed regulatory scores for specific SNPs — RegulomeDB integrates ENCODE data with eQTL and motif evidence into a single score
- Use `deeptools-ngs-analysis` instead when you have your own BAM files and need to generate bigWig coverage tracks; ENCODE database is for retrieving existing deposited data
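
The variant-annotation use case in the list above comes down to an interval overlap between variant positions and peak intervals. The following is a minimal sketch of that step, assuming peaks have already been parsed from a BED file into `(chrom, start, end)` tuples. The function names `index_peaks` and `overlapping_peaks` and the example intervals are hypothetical and are not part of ENCODE or this skill.

```python
from bisect import bisect_right
from collections import defaultdict

def index_peaks(peaks):
    """Group (chrom, start, end) BED intervals by chromosome, sorted by start."""
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        by_chrom[chrom].append((start, end))
    for intervals in by_chrom.values():
        intervals.sort()
    return by_chrom

def overlapping_peaks(by_chrom, chrom, pos):
    """Return the peaks that contain a 1-based VCF position.

    BED intervals are 0-based and half-open, so a 1-based position `pos`
    maps to the 0-based coordinate pos - 1.
    """
    intervals = by_chrom.get(chrom, [])
    p = pos - 1
    # Only intervals with start <= p can contain p; bisect finds that prefix.
    hi = bisect_right(intervals, (p, float("inf")))
    # Linear scan of the prefix: simple, but O(n) per query in the worst case.
    return [(s, e) for s, e in intervals[:hi] if e > p]

# Illustrative intervals only (not real ENCODE peaks)
peaks = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 10, 20)]
idx = index_peaks(peaks)
print(overlapping_peaks(idx, "chr1", 151))
print(overlapping_peaks(idx, "chr1", 100))
```

For genome-scale peak sets, an interval tree or a dedicated tool such as bedtools would likely be a better fit than the linear scan used here.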

## Prerequisites

- **Python packages**: `requests`, `pandas`, `matplotlib`
- **Data requirements**: experiment accessions (e.g., `ENCSR000AKC`), biosample names (e.g., `K562`), TF target names (e.g., `CTCF`, `TP53`), or genomic regions (`chr7:117548628-117748628`)
- **Environment**: internet connection; no authentication required for public data; submitter access to unreleased data uses an ENCODE access key pair sent via HTTP Basic auth (e.g., `requests.get(url, auth=(key_id, key_secret))`)
- **Rate limits**: no published hard limit; add `time.sleep(0.5)` for large batch queries to avoid connection resets

```bash
pip install requests pandas matplotlib
```
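
The rate-limit note above suggests pausing between requests in large batches. One way to structure that is sketched below, assuming one search per parameter dict. `batch_search` and `fetch_json` are hypothetical helper names, and the 0.5 s default simply mirrors the suggestion above. It is not a documented ENCODE limit.

```python
import time

def fetch_json(url, params):
    """Default fetcher: GET a URL and decode the JSON body."""
    import requests  # imported lazily so the batching logic can be exercised offline
    r = requests.get(url, params=params, timeout=30)
    r.raise_for_status()
    return r.json()

def batch_search(param_sets, fetch=fetch_json, delay=0.5,
                 url="https://www.encodeproject.org/search/"):
    """Run one search per parameter dict, pausing `delay` seconds between calls."""
    results = []
    for i, params in enumerate(param_sets):
        if i > 0:
            time.sleep(delay)  # pause between requests, not before the first one
        results.append(fetch(url, params))
    return results
```

Passing the fetcher in as an argument keeps the throttling logic separate from the network call, so it can be checked with a stub before being pointed at the live portal.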

## Quick Start

```python
import requests

BASE = "https://www.encodeproject.org"

def search_experiments(assay="TF ChIP-seq", target="CTCF", biosample="K562", limit=5):
    """Find ENCODE experiments matching assay type, target, and biosample."""
    params = {
        "type": "Experiment",
        "assay_title": assay,
        "target.label": target,
        "biosample_ontology.term_name": biosample,   # `biosample_summary` is a verbose freetext string; filter by ontology term name
        "status": "released",
        "format": "json",
        "limit": limit,
    }
    r = requests.get(f"{BASE}/search/", params=params, timeout=30)
    r.raise_for_status()
    data = r.json()
    experiments = data.get("@graph", [])
    print(f"Found {data.get('total', 0)} {assay} experiments for {target} in {biosample}")
    for exp in experiments:
        print(f"  {exp['accession']}  {exp.get('biosample_summary', '')}  {exp.get('lab', {}).get('title', '')}")
    return experiments

exps = search_experiments(assay="TF ChIP-seq", target="CTCF", biosample="K562")
```
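
When a search returns nothing unexpected, it can help to look at the exact URL being requested and open it in a browser. The sketch below rebuilds the Quick Start query string with the standard library only. `build_search_url` is a hypothetical helper, and the defaults it sets (`type`, `status`, `format`) are copied from the Quick Start parameters.

```python
from urllib.parse import urlencode

BASE = "https://www.encodeproject.org"

def build_search_url(**filters):
    """Build an ENCODE /search/ URL from field=value filters without sending it."""
    params = {"type": "Experiment", "status": "released", "format": "json"}
    params.update(filters)
    # urlencode escapes values for the query string (spaces become '+')
    return f"{BASE}/search/?{urlencode(params)}"

# Dotted field names are not valid Python keywords, so pass them via a dict
url = build_search_url(**{
    "assay_title": "TF ChIP-seq",
    "target.label": "CTCF",
    "biosample_ontology.term_name": "K562",
})
print(url)
```

Removing `format=json` from the printed URL should show the same result set in the portal's HTML search view, which makes it easier to see which facet values actually exist.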

## Core API

### Query 1: Experiment Search — Find Experiments by Assay, Biosample, Target

Search the ENCODE Portal for experiments matching structured criteria.

```python
import requests, pandas as pd

BASE = "https://www.encodeproject.org"

def search_experiments(assay_title=None, target=None, biosample=None,
                       organism="Homo sapiens", status="released", limit=50):
    """
    Search ENCODE experiments with flexible filters.
    Returns: pd.DataFrame of matching experiments.
    """
    params = {
        "type": "Experiment",
        "status": status,
        "replicates.library.biosample.donor.organism.scientific_name": organism,
        "format": "json",
        "limit": limit,
    }
    if assay_title:
        params["assay_title"] = assay_title
    if target:
        params["target.label"] = target
    if biosample:
        params["biosample_ontology.term_name"] = biosample   # filter by ontology term, not the freetext `biosample_summary`

    r = requests.get(f"{BASE}/search/", params=params, timeout=30)
    r.raise_for_status()
    data = r.json()
    total = data.get("total", 0)
    print(f"Total matching experiments: {total} (showing {min(limit, total)})")

    records = []
    for exp in data.get("@graph", []):
        records.append({
            "accession": exp.get("accession"),
            "assay": exp.get("assay_title"),
            "biosample": exp.get("biosample_summary"),
            "target": exp.get("target", {}).get("label", ""),
            "lab": exp.get("lab", {}).get("title", ""),
            "date_released": exp.get("date_released", ""),
        })
    df = pd.DataFrame(records)
    print(df.to_string(index=False))
    return df

# CTCF ChIP-seq in HCT116 colon cancer cells
df = search_experiments(assay_title="TF ChIP-seq", target="CTCF", biosample="HCT116")
```

```python
# ATAC-seq experiments in multiple cell types
df_atac = search_experiments(assay_title="ATAC-seq", limit=20)
print(f"\nUnique cell types: {df_atac['biosample'].nunique()}")
```
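
Because `biosample_summary` is a verbose freetext string, counting its unique values can overstate the number of distinct cell types. A possible alternative is to tally the ontology term name from the raw `@graph` records, as sketched here. This assumes each record embeds `biosample_ontology` as an object with a `term_name` key. The helper name `tally_biosamples` and the sample records are hypothetical.

```python
from collections import Counter

def tally_biosamples(graph):
    """Count experiments per biosample term name in a search response's @graph list."""
    counts = Counter()
    for exp in graph:
        onto = exp.get("biosample_ontology")
        # Fall back to "unknown" if the field is missing or not embedded as an object
        term = onto.get("term_name", "unknown") if isinstance(onto, dict) else "unknown"
        counts[term] += 1
    return counts

# Made-up records shaped like @graph entries, for illustration only
sample_graph = [
    {"biosample_ontology": {"term_name": "K562"}},
    {"biosample_ontology": {"term_name": "K562"}},
    {"biosample_ontology": {"term_name": "HepG2"}},
]
for term, n in tally_biosamples(sample_graph).most_common():
    print(term, n)
```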

### Query 2: File Download — Get Metadata and Download URLs for BED/bigWig Files

Retrieve file metadata for a specific experiment and obtain download URLs.

```python
import requests, pandas as pd

BASE = "https://www.encodeproject.org"

def get_experiment_files(accession, file_format="bed", output_type="peaks",
                         assembly="GRCh38"):
    """
    Get file download URLs for a specific ENCODE experiment.
    accession: experiment accession, e.g. 'ENCSR000AKC'
    file_format: 'bed', 'bigWig', 'fastq', 'bam'
    output_type: 'peaks', 'signal', 'alignments', 'reads'
    Returns: pd.DataFrame of matching files with download URLs.
    """
    params = {
        "type": "F