ucsc-genome-browser

The UCSC Genome Browser REST API skill enables programmatic access to DNA sequences, gene annotations, and conservation scores across 100+ genome assemblies including hg38 and mm39. Use it to retrieve reference sequences for genomic regions, fetch RefSeq or GENCODE gene structures, query PhyloP conservation scores, list annotation tracks, and obtain chromosome sizes for genomic analyses.

Repository: SciAgent-Skills

Install in Claude Code

git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/ucsc-genome-browser && cp -r /tmp/ucsc-genome-browser/skills/genomics-bioinformatics/databases/ucsc-genome-browser ~/.claude/skills/ucsc-genome-browser

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# UCSC Genome Browser

## Overview

The UCSC Genome Browser REST API at `https://api.genome.ucsc.edu/` provides programmatic access to genome sequences, annotation tracks, and hub data for 100+ assemblies including hg38, mm39, and dm6. The API is free, requires no authentication, and returns JSON. Use it with the `requests` library to fetch DNA sequences for genomic regions, retrieve track data (genes, repeats, conservation), list available tracks, and query chromosome sizes for genome-scale coordinate arithmetic.

## When to Use

- Fetching the reference DNA sequence for any genomic region (e.g., promoter, exon, CRISPR target) across human, mouse, or other assemblies
- Retrieving RefSeq or GENCODE gene structure (exon coordinates, CDS boundaries, strand) for a locus of interest
- Looking up PhyloP or PhastCons conservation scores to assess evolutionary constraint at a variant site
- Listing and querying any of UCSC's 1000+ annotation tracks (repeats, regulatory elements, conservation) for a region
- Getting chromosome sizes for a genome assembly to set up bedtools, pysam, or coverage pipelines
- Accessing public UCSC track hubs (e.g., ENCODE, Roadmap Epigenomics) without downloading data locally
- Use `ensembl-database` instead when you need Ensembl stable IDs, VEP variant annotation, or cross-species comparative genomics via the Ensembl REST API
- For bulk local queries across millions of regions, use `bedtools-genomic-intervals` with pre-downloaded UCSC annotation files

## Prerequisites

- **Python packages**: `requests`, `matplotlib` (for visualization)
- **Data requirements**: genomic coordinates (chrom, start, end in 0-based half-open BED format), genome assembly name (e.g., `hg38`, `mm39`)
- **Environment**: internet connection; no authentication required
- **Rate limits**: no official published limit; add 0.5s delays for batch requests (>100 queries)

```bash
pip install requests matplotlib
```
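The 0.5 s delay suggested for batch requests can be applied with a small wrapper. This is a minimal sketch: `throttled_map` and its `fetch` argument are hypothetical names, not part of the UCSC API or `requests`. Here `fetch` stands for any function that performs one request, and the demonstration uses a stand-in so it runs without network access.

```python
import time

def throttled_map(fetch, regions, delay=0.5, sleep=time.sleep):
    """Call fetch(*region) for each region, pausing `delay` seconds between calls.

    `fetch` is any callable that performs one API request (hypothetical here);
    `sleep` is injectable so the pause can be stubbed out in tests.
    """
    results = []
    for i, region in enumerate(regions):
        if i > 0:
            sleep(delay)  # pause between requests, not before the first one
        results.append(fetch(*region))
    return results

# Offline demonstration with a stand-in fetch function (no network access)
def fake_fetch(genome, chrom, start, end):
    return f"{genome}:{chrom}:{start}-{end}"

demo = throttled_map(fake_fetch,
                     [("hg38", "chr1", 0, 10), ("hg38", "chr2", 5, 15)],
                     delay=0.0)
print(demo)
```

In real use, pass a request helper that takes `(genome, chrom, start, end)` as `fetch` and keep the default `delay=0.5`.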

## Quick Start

```python
import requests

BASE = "https://api.genome.ucsc.edu"

def get_sequence(genome, chrom, start, end):
    """Fetch DNA sequence for a genomic region (0-based, half-open)."""
    r = requests.get(f"{BASE}/getData/sequence",
                     params={"genome": genome, "chrom": chrom,
                             "start": start, "end": end})
    r.raise_for_status()
    return r.json()["dna"]

# Fetch 1 kb at the low-coordinate end of the BRCA1 gene span on hg38
# (BRCA1 is on the minus strand, so this is its 3' end rather than the TSS)
seq = get_sequence("hg38", "chr17", 43044294, 43045294)
print(f"Length: {len(seq)} bp")
print(f"Sequence: {seq[:60]}...")
# Length: 1000 bp
# Sequence: <first 60 bases of the region>...
```

## Core API

### Query 1: Sequence Retrieval

Fetch the reference DNA sequence for any genomic region using the `getData/sequence` endpoint. Coordinates are 0-based, half-open (BED format).

```python
import requests

BASE = "https://api.genome.ucsc.edu"

def get_sequence(genome, chrom, start, end):
    """Return DNA sequence string for the given region."""
    r = requests.get(f"{BASE}/getData/sequence",
                     params={"genome": genome, "chrom": chrom,
                             "start": start, "end": end})
    r.raise_for_status()
    data = r.json()
    return data["dna"]

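# 100 bp window within the TP53 gene (hg38); start/end are 0-based, half-open
seq = get_sequence("hg38", "chr17", 7676520, 7676620)
# The equivalent 1-based browser position starts at 7,676,521
print(f"Region: chr17:7,676,521-7,676,620 ({len(seq)} bp)")
print(f"Sequence: {seq}")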
```
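Because the endpoint takes 0-based, half-open coordinates while browser position strings such as `chr17:7,676,521-7,676,620` are 1-based and inclusive, a pair of converters helps avoid off-by-one errors. The sketch below is illustrative; `parse_position` and `format_position` are hypothetical helper names, not part of the UCSC API.

```python
import re

def parse_position(position):
    """Convert a 1-based, inclusive 'chrom:start-end' string to 0-based half-open."""
    m = re.fullmatch(r"([^:\s]+):([\d,]+)-([\d,]+)", position.strip())
    if m is None:
        raise ValueError(f"Unrecognized position string: {position!r}")
    chrom = m.group(1)
    start_1based = int(m.group(2).replace(",", ""))
    end_1based = int(m.group(3).replace(",", ""))
    # Only the start shifts: a 1-based inclusive start N becomes 0-based N-1,
    # while the inclusive end equals the half-open (exclusive) end.
    return chrom, start_1based - 1, end_1based

def format_position(chrom, start, end):
    """Convert 0-based half-open coordinates to a 1-based, inclusive position string."""
    return f"{chrom}:{start + 1:,}-{end:,}"

print(parse_position("chr17:7,676,521-7,676,620"))
print(format_position("chr17", 7676520, 7676620))
```

The region length is `end - start` in the 0-based form and `end - start + 1` in the 1-based form.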

```python
# Reverse-complement for minus-strand genes
def revcomp(seq):
    comp = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(comp)[::-1]

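# First 100 bp of the BRCA2 gene span (hg38). BRCA2 itself is on the plus
# strand; the reverse complement is shown only to demonstrate the helper.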
seq_fwd = get_sequence("hg38", "chr13", 32315086, 32315186)
seq_rc  = revcomp(seq_fwd)
print(f"Forward: {seq_fwd[:30]}...")
print(f"RevComp: {seq_rc[:30]}...")
```
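If a sequence may contain `N` or other IUPAC ambiguity codes, the translation table can be extended so those characters are complemented rather than passed through unchanged. This is a sketch under that assumption; `revcomp_iupac` and `oriented` are hypothetical names introduced for illustration.

```python
# Complement table covering the IUPAC nucleotide codes in both cases
_IUPAC = str.maketrans("ACGTRYSWKMBDHVNacgtryswkmbdhvn",
                       "TGCAYRSWMKVHDBNtgcayrswmkvhdbn")

def revcomp_iupac(seq):
    """Reverse-complement a sequence, preserving case and ambiguity codes."""
    return seq.translate(_IUPAC)[::-1]

def oriented(seq, strand):
    """Return seq as read along the given strand ('+' or '-')."""
    if strand == "+":
        return seq
    if strand == "-":
        return revcomp_iupac(seq)
    raise ValueError(f"strand must be '+' or '-', got {strand!r}")

print(revcomp_iupac("AAGRn"))
print(oriented("ACCT", "-"))
```

Taking the strand from the gene annotation and passing it to `oriented` keeps the orientation decision in one place.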

### Query 2: Track Data Query

Retrieve annotation data (BED records) from any UCSC track for a genomic region.

```python
import requests

BASE = "https://api.genome.ucsc.edu"

def get_track_data(genome, track, chrom, start, end):
    """Fetch annotation records from a UCSC track for a region."""
    r = requests.get(f"{BASE}/getData/track",
                     params={"genome": genome, "track": track,
                             "chrom": chrom, "start": start, "end": end})
    r.raise_for_status()
    data = r.json()
    # Track data is under the key matching the track name
    return data.get(track, data.get("data", []))

# Fetch RepeatMasker annotations in the MYC locus (hg38)
repeats = get_track_data("hg38", "rmsk", "chr8", 127_735_434, 127_742_951)
print(f"Repeat elements in MYC locus: {len(repeats)}")
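for rec in repeats[:3]:
    # rmsk records use genoStart/genoEnd; most other tracks use chromStart/chromEnd
    start = rec.get("genoStart", rec.get("chromStart"))
    end = rec.get("genoEnd", rec.get("chromEnd"))
    print(f"  {rec.get('repName', rec.get('name'))} | {start}-{end}")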
```

```python
# Fetch CpG islands in a 6 kb window at the low-coordinate end of BRCA1 (hg38)
cpg_islands = get_track_data("hg38", "cpgIslandExt", "chr17", 43_044_000, 43_050_000)
print(f"CpG islands found: {len(cpg_islands)}")
for island in cpg_islands:
    print(f"  {island['name']}: {island['chromStart']}-{island['chromEnd']}, "
          f"obsExp={island.get('obsExp', 'n/a')}")
```
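Track records can be flattened into BED lines for downstream tools. The sketch below assumes each record is a dict and that coordinate field names differ between tracks. The `genoName`/`genoStart`/`genoEnd`/`repName` fallbacks are an assumption based on the RepeatMasker table layout, and `to_bed_lines` is a hypothetical helper. The demonstration records are made up, not API output.

```python
def to_bed_lines(records, chrom=None):
    """Format track records as tab-separated BED4 lines.

    Field names vary by track; the fallbacks below are assumptions,
    not a guaranteed schema.
    """
    lines = []
    for rec in records:
        c = rec.get("chrom", rec.get("genoName", chrom))
        start = rec.get("chromStart", rec.get("genoStart"))
        end = rec.get("chromEnd", rec.get("genoEnd"))
        name = rec.get("name", rec.get("repName", "."))
        if c is None or start is None or end is None:
            continue  # skip records without usable coordinates
        lines.append(f"{c}\t{start}\t{end}\t{name}")
    return lines

# Made-up records illustrating the two field-name conventions
demo_records = [
    {"chrom": "chr17", "chromStart": 0, "chromEnd": 250, "name": "CpG: 20"},
    {"genoName": "chr8", "genoStart": 100, "genoEnd": 400, "repName": "AluY"},
    {"name": "no_coordinates"},
]
bed = to_bed_lines(demo_records)
print("\n".join(bed))
```

Writing `"\n".join(bed)` to a file gives a BED4 file; the explicit `is None` checks keep records that start at position 0.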

### Query 3: Track List

List all available annotation tracks for a genome assembly to discover what data is available.

```python
import requests

BASE = "https://api.genome.ucsc.edu"

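def list_tracks(genome):
    """Return a dict of {track_name: track_metadata} for a genome assembly."""
    r = requests.get(f"{BASE}/list/tracks", params={"genome": genome})
    r.raise_for_status()
    data = r.json()
    # The track dictionary is keyed by the assembly name (e.g. "hg38");
    # fall back to "tracks" in case the response layout differs
    return data.get(genome, data.get("tracks", {}))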

tracks = list_tracks("hg38")
print(f"Total tracks in hg38: {len(tracks)}")

# Find conservation-related tracks
conserv = {k: v for k, v in tracks.items() if "conserv" in k.lower() or "phylop" in k.lower()}
for name, meta in list(conserv.items())[:5]:
    print(f"  {name}: {meta.get('shortLabel', '')}")
```
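A top-level filter on track names can miss subtracks grouped under a composite parent. The sketch below searches names and labels recursively, assuming subtracks appear as nested dicts inside the parent's metadata. That layout is an assumption to verify against a real response. `find_tracks` is a hypothetical helper and `demo_tracks` is made-up data.

```python
def find_tracks(tracks, keyword, _path=()):
    """Yield (path, shortLabel) for tracks whose name or labels contain keyword.

    Assumes nested dict values are subtracks; other values are plain metadata.
    """
    keyword = keyword.lower()
    for name, meta in tracks.items():
        if not isinstance(meta, dict):
            continue  # strings and numbers are metadata fields, not tracks
        path = _path + (name,)
        text = " ".join([name,
                         str(meta.get("shortLabel", "")),
                         str(meta.get("longLabel", ""))]).lower()
        if keyword in text:
            yield path, meta.get("shortLabel", "")
        # Descend into possible subtracks of this track
        yield from find_tracks(meta, keyword, path)

# Made-up structure for illustration only
demo_tracks = {
    "cons100way": {
        "shortLabel": "Conservation",
        "phyloP100way": {"shortLabel": "100 Vert. Cons", "type": "bigWig"},
    },
    "rmsk": {"shortLabel": "RepeatMasker"},
}
hits = list(find_tracks(demo_tracks, "cons"))
for path, label in hits:
    print("/".join(path), "|", label)
```

The same call should work on the dict returned by a track-listing helper, provided the nesting assumption holds for that assembly.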

### Query 4: Chromosome Sizes

Get the length of every chromosome (or scaffold) for a genome assembly.

```python
import requests

BASE = "https://api.genome.ucsc.edu"

def get_chrom_sizes(genome):
    """Return {chrom: size_in_bp} for a g