
ensembl-database

The ensembl-database skill provides programmatic access to the Ensembl REST API for retrieving gene and transcript annotations, sequences, variant consequences, and regulatory features across 300+ species without requiring authentication. Use this skill to look up gene identifiers and convert between namespace formats (HGNC, RefSeq, UniProt), fetch genomic or protein sequences, annotate variants with predicted functional impact, query regulatory elements, or perform comparative genomics across species.

Repository: SciAgent-Skills

Install in Claude Code

git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/ensembl-database && cp -r /tmp/ensembl-database/skills/genomics-bioinformatics/databases/ensembl-database ~/.claude/skills/ensembl-database

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Ensembl Genome Database

## Overview

Ensembl is a comprehensive genome annotation database covering 300+ vertebrate and non-vertebrate species. The Ensembl REST API provides programmatic access to gene models, transcript/protein sequences, variant annotations, cross-references, regulatory features, and comparative genomics without requiring any login or API key.

## When to Use

- Retrieving official gene and transcript annotations (stable IDs, biotype, genomic coordinates) for human or model organism genes
- Converting between gene identifier namespaces (HGNC symbol ↔ Ensembl ID ↔ RefSeq ↔ UniProt)
- Fetching genomic or cDNA/CDS/protein sequences for a gene or transcript
- Looking up variant consequences and functional impact (VEP) for a list of SNPs
- Querying regulatory features (promoters, enhancers, CTCF sites) in a genomic region
- Performing comparative genomics queries (orthologs, paralogs, gene trees) across species
- For local offline access to large genomic annotations, use `pyensembl` instead
- For pathway and metabolic annotations, use `kegg-database` or `reactome-database` instead

## Prerequisites

- **Python packages**: `requests`
- **Data requirements**: gene symbols, Ensembl stable IDs (ENSG…/ENST…/ENSP…), or genomic coordinates
- **Environment**: internet connection required; no API key needed
- **Rate limits**: max ~15 requests/second; use `expand=1` and batch endpoints to minimize calls

```bash
pip install requests
```
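
Given the rate limit above, long-running scripts should expect an occasional HTTP 429 response. Ensembl's REST service sends a `Retry-After` header with the number of seconds to wait. The following is a minimal sketch of a retry wrapper, not an official client: `call_with_retry`, `send`, and `max_retries` are hypothetical names introduced here, and the one-second fallback wait is an assumption.

```python
import time


def call_with_retry(send, max_retries=3, sleep=time.sleep):
    """Call send() until it returns a non-429 response or retries run out.

    `send` is any zero-argument callable returning an object with
    `.status_code` and `.headers` (for example a requests.Response).
    """
    for attempt in range(max_retries + 1):
        response = send()
        if response.status_code != 429:
            return response
        if attempt == max_retries:
            break
        # Wait for the number of seconds given in Retry-After;
        # fall back to 1 second (an assumption) if the header is absent.
        sleep(float(response.headers.get("Retry-After", 1)))
    # Retries exhausted: hand the last 429 response back to the caller.
    return response
```

With `requests`, a call could look like `call_with_retry(lambda: requests.get(url, headers=HEADERS, timeout=30))`. Passing the request as a callable keeps the wrapper independent of any particular HTTP library.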

## Quick Start

```python
import requests

BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}

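def ensembl_get(endpoint, params=None):
    """GET an Ensembl REST endpoint and return the decoded JSON body."""
    r = requests.get(f"{BASE}{endpoint}", headers=HEADERS, params=params, timeout=30)
    r.raise_for_status()  # fail early on HTTP errors instead of parsing an error body
    return r.json()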

# Look up human BRCA1
gene = ensembl_get("/lookup/symbol/homo_sapiens/BRCA1", params={"expand": 1})
print(f"ID: {gene['id']}, Chr: {gene['seq_region_name']}:{gene['start']}-{gene['end']}")
print(f"Transcripts: {len(gene.get('Transcript', []))}")
```

## Core API

### Query 1: Gene Lookup by Symbol or Stable ID

Retrieve gene metadata from a gene symbol or Ensembl stable ID.

```python
import requests

BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}

# By gene symbol
r = requests.get(
    f"{BASE}/lookup/symbol/homo_sapiens/TP53",
    headers=HEADERS,
    params={"expand": 1}
)
r.raise_for_status()  # an unknown symbol returns an HTTP error, not a gene record
gene = r.json()
print(f"Ensembl ID : {gene['id']}")
print(f"Location   : {gene['seq_region_name']}:{gene['start']}-{gene['end']} ({gene['strand']})")
print(f"Biotype    : {gene['biotype']}")
print(f"Transcripts: {len(gene.get('Transcript', []))}")
```

```python
# By stable ID (works for genes, transcripts, proteins)
r = requests.get(
    f"{BASE}/lookup/id/ENSG00000141510",
    headers=HEADERS,
    params={"expand": 0}
)
obj = r.json()
print(f"Symbol: {obj.get('display_name')}, Species: {obj.get('species')}")
```
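
With `expand=1`, the gene record carries a `Transcript` list, which is often reduced to one canonical transcript plus a biotype breakdown. The sketch below works on an already-decoded record and makes no network call. `summarize_transcripts` is a hypothetical helper, and it assumes each transcript entry exposes `id`, `biotype`, and an `is_canonical` flag; check a live response before relying on those field names.

```python
def summarize_transcripts(gene):
    """Return (canonical_id, biotype_counts) from an expanded lookup record.

    Assumes each entry in gene["Transcript"] has "id", "biotype" and a
    truthy "is_canonical" flag on the canonical transcript.
    """
    transcripts = gene.get("Transcript", [])
    # First transcript flagged as canonical, or None if no entry is flagged.
    canonical = next(
        (t["id"] for t in transcripts if t.get("is_canonical")), None
    )
    counts = {}
    for t in transcripts:
        biotype = t.get("biotype", "unknown")
        counts[biotype] = counts.get(biotype, 0) + 1
    return canonical, counts
```

Called as `summarize_transcripts(gene)` on the result of the symbol lookup above, it returns the canonical transcript ID (or `None`) and a dictionary of transcript counts per biotype.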

### Query 2: Batch Lookup

Retrieve information for multiple IDs in one call (POST endpoint).

```python
import requests, json

BASE = "https://rest.ensembl.org"
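# POST endpoints use Content-Type for the request body and Accept for the response format
HEADERS = {"Content-Type": "application/json", "Accept": "application/json"}

# Batch lookup by symbols
symbols = ["BRCA1", "BRCA2", "TP53", "EGFR", "MYC"]
r = requests.post(
    f"{BASE}/lookup/symbol/homo_sapiens",
    headers=HEADERS,
    data=json.dumps({"symbols": symbols})
)
r.raise_for_status()
results = r.json()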
for sym, data in results.items():
    if data:
        print(f"{sym}: {data['id']} ({data['seq_region_name']}:{data['start']}-{data['end']})")
```
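
Batch endpoints cap the number of items per POST, so a long symbol list has to be split into several requests. The sketch below only builds the request bodies. `symbol_payloads` is a hypothetical helper, and the default of 1000 items per request is an assumption based on the limit documented for the lookup endpoints; confirm it against the endpoint documentation.

```python
import json


def symbol_payloads(symbols, batch_size=1000):
    """Yield JSON request bodies, each holding at most batch_size symbols."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Drop duplicate symbols while keeping the original order.
    unique = list(dict.fromkeys(symbols))
    for start in range(0, len(unique), batch_size):
        yield json.dumps({"symbols": unique[start:start + batch_size]})
```

Each yielded string can be passed as `data=` to the `requests.post` call shown above, and the per-batch result dictionaries merged with `dict.update`.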

### Query 3: Sequence Retrieval

Fetch genomic, cDNA, CDS, or protein sequences.

```python
import requests

BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "text/plain"}

# Protein sequence for canonical transcript
r = requests.get(
    f"{BASE}/sequence/id/ENST00000269305",
    headers=HEADERS,
    params={"type": "protein"}
)
seq = r.text
print(f"Protein sequence ({len(seq)} aa): {seq[:60]}...")
```

```python
# Genomic region sequence
HEADERS_JSON = {"Content-Type": "application/json"}
r = requests.get(
    f"{BASE}/sequence/region/human/17:43044295..43125364",
    headers=HEADERS_JSON,
    params={"coord_system_version": "GRCh38"}
)
result = r.json()
print(f"Retrieved {len(result['seq'])} bp of genomic sequence")
```
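
The region endpoint takes coordinates as `chromosome:start..end`, optionally followed by `:strand`. When regions come from a table rather than a hand-typed string, a small formatter avoids malformed URLs. This is a sketch; `region_string` is a hypothetical helper, and it assumes 1-based inclusive coordinates.

```python
def region_string(chrom, start, end, strand=None):
    """Format coordinates as 'chrom:start..end' or 'chrom:start..end:strand'."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    region = f"{chrom}:{start}..{end}"
    if strand is not None:
        if strand not in (1, -1):
            raise ValueError("strand must be 1 or -1")
        region += f":{strand}"
    return region
```

For example, `region_string(17, 43044295, 43125364)` produces the region used in the request above, ready to append to `/sequence/region/human/`.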

### Query 4: Cross-References (ID Mapping)

Map Ensembl IDs to external database identifiers.

```python
from collections import defaultdict

import requests

BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}

# All xrefs for a gene
r = requests.get(
    f"{BASE}/xrefs/id/ENSG00000141510",
    headers=HEADERS
)
xrefs = r.json()

# Group by database
by_db = defaultdict(list)
for x in xrefs:
    by_db[x["dbname"]].append(x["primary_id"])

for db in ["HGNC", "RefSeq_gene_name", "Uniprot_gn", "MIM_gene"]:
    if db in by_db:
        print(f"{db}: {by_db[db]}")
```

### Query 5: Variant Consequence Annotation (VEP)

Predict the functional consequences of variants via the REST VEP endpoints.

```python
import requests, json

BASE = "https://rest.ensembl.org"
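# POST endpoints use Content-Type for the request body and Accept for the response format
HEADERS = {"Content-Type": "application/json", "Accept": "application/json"}

# Annotate a list of HGVS notations.
# NOTE: rest.ensembl.org interprets genomic coordinates on GRCh38. The second
# position below falls inside BRCA2 only on GRCh37, so confirm the assembly
# before use (GRCh37 queries go to grch37.rest.ensembl.org).
variants = ["17:g.43094692C>T", "13:g.32929387C>T"]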
r = requests.post(
    f"{BASE}/vep/human/hgvs",
    headers=HEADERS,
    data=json.dumps({"hgvs_notations": variants})
)
for v in r.json():
    print(f"\nVariant: {v.get('input')}")
    for tc in v.get("transcript_consequences", [])[:2]:
        print(f"  Gene: {tc.get('gene_symbol')}, Impact: {tc.get('impact')}, Consequence: {tc.get('consequence_terms')}")
```

```python
# Annotate by rsID
r = requests.get(
    f"{BASE}/vep/human/id/rs699",
    headers=HEADERS
)
v = r.json()[0]
print(f"rsID rs699 in gene: {v['transcript_consequences'][0]['gene_symbol']}")
print(f"Consequence: {v['transcript_consequences'][0]['consequence_terms']}")
```
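
A VEP record usually lists one consequence per overlapping transcript, so a per-gene summary is often easier to read than the first two entries. The sketch below keeps the highest-ranked `impact` for each gene in a single decoded record. `worst_impact_per_gene` and `IMPACT_RANK` are hypothetical names; the ranking HIGH > MODERATE > LOW > MODIFIER follows VEP's impact categories.

```python
# Larger number = more severe, following VEP's four impact categories.
IMPACT_RANK = {"HIGH": 3, "MODERATE": 2, "LOW": 1, "MODIFIER": 0}


def worst_impact_per_gene(vep_record):
    """Map each gene symbol in one VEP record to its highest-ranked impact."""
    worst = {}
    for tc in vep_record.get("transcript_consequences", []):
        gene = tc.get("gene_symbol")
        impact = tc.get("impact")
        # Skip entries without a gene symbol or with an unrecognized impact.
        if gene is None or impact not in IMPACT_RANK:
            continue
        if gene not in worst or IMPACT_RANK[impact] > IMPACT_RANK[worst[gene]]:
            worst[gene] = impact
    return worst
```

Applied to each element of `r.json()` from the HGVS request above, this gives one line per gene, for example `{"BRCA1": "MODERATE"}` for a missense variant with no higher-impact transcript.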

### Query 6: Regulatory Features

Query regulatory build features in a genomic region.

```python
import requests

BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}

# Regulatory fea