ena-database

The ena-database skill provides programmatic access to the European Nucleotide Archive's REST APIs for searching and retrieving nucleotide sequences, raw sequencing reads, genome assemblies, and associated metadata without authentication. Use this skill to download FASTQ/BAM files, fetch sequences by accession in FASTA or EMBL format, search studies by organism or keyword, retrieve assembly statistics, explore taxonomy, and cross-reference records with external databases like UniProt and PDB.

Repository: SciAgent-Skills

Install in Claude Code

git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/ena-database && cp -r /tmp/ena-database/skills/genomics-bioinformatics/databases/ena-database ~/.claude/skills/ena-database

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# ENA Database — European Nucleotide Archive Programmatic Access

## Overview

The European Nucleotide Archive (ENA) is EMBL-EBI's comprehensive nucleotide sequence database, encompassing raw sequencing reads, genome assemblies, annotated sequences, and associated metadata. As a member of the International Nucleotide Sequence Database Collaboration (INSDC), it exchanges data with GenBank and DDBJ. All access is via REST APIs, and no authentication is required.

## When to Use

- Searching for sequencing studies, samples, or experiments by organism, project, or keyword
- Downloading raw FASTQ/BAM files for reanalysis of public sequencing datasets
- Retrieving genome assemblies with quality statistics (N50, contig count, genome size)
- Fetching nucleotide sequences in FASTA or EMBL flat-file format by accession
- Exploring taxonomic lineage and finding organisms by partial name
- Cross-referencing ENA records with external databases (ArrayExpress, UniProt, PDB)
- Building bulk download lists for large-scale sequencing projects
- Not for **multi-database Python queries** (ENA + UniProt + KEGG): prefer `bioservices` instead
- Not for **NCBI-specific queries** (PubMed literature, GenBank records): use `pubmed-database` or Biopython Entrez

## Prerequisites

```bash
pip install requests
```

**API constraints**:
- **Rate limit**: 50 requests per second across all ENA APIs
- **No authentication** required
- **Large result sets**: use pagination (`limit` + `offset`) or streaming (`limit=0` for TSV download)
- Portal API base: `https://www.ebi.ac.uk/ena/portal/api`
- Browser API base: `https://www.ebi.ac.uk/ena/browser/api`
- Taxonomy API base: `https://www.ebi.ac.uk/ena/taxonomy/rest`
- Cross-ref API base: `https://www.ebi.ac.uk/ena/xref/rest`
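
The pagination constraint above can be sketched as a generator that walks a Portal API search with `limit` and `offset`. This is a minimal sketch, not an ENA-provided client: the name `iter_search_pages`, the `fetch` callback, and the default page size of 1000 are assumptions made for illustration. `fetch` is any callable that takes a params dict and returns a list of records.

```python
import time
from typing import Callable, Dict, Iterator, List


def iter_search_pages(
    fetch: Callable[[Dict], List[Dict]],
    params: Dict,
    page_size: int = 1000,       # hypothetical default; tune to your query
    min_interval: float = 0.02,  # 1/50 s, matching the 50 req/s limit above
) -> Iterator[Dict]:
    """Yield records page by page until a short or empty page comes back."""
    offset = 0
    while True:
        # Each request asks for one window of the result set
        page = fetch({**params, "limit": page_size, "offset": offset})
        yield from page
        if len(page) < page_size:
            break  # a short page means the result set is exhausted
        offset += page_size
        time.sleep(min_interval)  # stay under the documented rate limit
```

With the `ena_query` helper from the Quick Start, `fetch` could be `lambda p: ena_query("search", params={**p, "format": "json"}).json()`. Injecting the callback keeps the paging logic independent of the HTTP layer, which also makes it easy to test without network access.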

## Quick Start

```python
import requests
import time

BASE_PORTAL = "https://www.ebi.ac.uk/ena/portal/api"
BASE_BROWSER = "https://www.ebi.ac.uk/ena/browser/api"
BASE_TAXONOMY = "https://www.ebi.ac.uk/ena/taxonomy/rest"
BASE_XREF = "https://www.ebi.ac.uk/ena/xref/rest"

def ena_query(endpoint, params=None, base=BASE_PORTAL):
    """Reusable ENA API caller with rate-limit compliance."""
    # timeout makes a stalled connection raise instead of hanging indefinitely
    resp = requests.get(f"{base}/{endpoint}", params=params, timeout=30)
    resp.raise_for_status()
    time.sleep(0.02)  # 50 req/sec limit
    return resp

# Search for human studies (tax_tree(9606) matches Homo sapiens and its subtaxa)
resp = ena_query("search", params={
    "result": "study",
    "query": 'tax_tree(9606)',   # `library_strategy` is a `read_run`/`read_experiment` field, not a `study` field
    "fields": "study_accession,study_title",
    "format": "json",
    "limit": 3,
})
studies = resp.json()
for s in studies:
    print(f"{s['study_accession']}: {s['study_title'][:60]}")
# PRJEB12345: Transcriptome analysis of human liver tissue...
```

## Core API

### Module 1: Portal API Search

The Portal API provides advanced metadata search across all ENA data types with boolean query syntax, field selection, and pagination.

```python
# Search read runs for a specific study
resp = ena_query("search", params={
    "result": "read_run",
    "query": 'study_accession="PRJEB1787"',
    "fields": "run_accession,sample_accession,instrument_model,read_count,base_count",
    "format": "json",
    "limit": 5,
})
runs = resp.json()
for r in runs:
    print(f"{r['run_accession']} — {r.get('instrument_model', 'N/A')}, "
          f"{int(r.get('read_count') or 0):,} reads")  # `or 0` also covers empty strings
# ERR123456 — Illumina HiSeq 2000, 45,231,890 reads

# Count total results without fetching data
count_resp = ena_query("count", params={
    "result": "read_run",
    "query": 'study_accession="PRJEB1787"',
})
print(f"Total runs: {count_resp.text.strip()}")
# Total runs: 142
```
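
The boolean query syntax mentioned above combines individual filters with `AND` or `OR`. The following is a minimal sketch of assembling such a string; `eq` and `build_query` are hypothetical helper names, not part of any ENA library, and the filters used are the ones that already appear in this document.

```python
def eq(field: str, value: str) -> str:
    """Render an equality filter in the form field="value"."""
    return f'{field}="{value}"'


def build_query(*clauses: str, op: str = "AND") -> str:
    """Join non-empty query clauses with a boolean operator."""
    if op not in ("AND", "OR"):
        raise ValueError(f"unsupported operator: {op}")
    # Skip empty clauses so optional filters can be passed as ""
    parts = [c.strip() for c in clauses if c and c.strip()]
    return f" {op} ".join(parts)


query = build_query("tax_tree(9606)", eq("library_strategy", "RNA-Seq"))
print(query)
# tax_tree(9606) AND library_strategy="RNA-Seq"
```

The resulting string would go in the `query` parameter of a `read_run` search, since `library_strategy` is not a `study` field.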

### Module 2: Browser API Retrieval

Fetch individual records by accession in multiple formats: XML, FASTA, EMBL flat-file, or plain text.

```python
# Retrieve XML metadata for a study
resp = ena_query("xml/PRJEB1787", base=BASE_BROWSER)
print(resp.text[:300])
# <?xml version="1.0" encoding="UTF-8"?><PROJECT_SET>...

# Retrieve the FASTA sequence for a record by versioned accession
resp = ena_query("fasta/M10051.1", base=BASE_BROWSER)
print(resp.text[:200])
# >ENA|M10051|M10051.1 Human insulin mRNA, complete cds.
# AGCCCTCCAGGACAGGCTGCAT...

# Retrieve EMBL flat-file format
resp = ena_query("embl/M10051.1", base=BASE_BROWSER)
print(resp.text[:300])
# ID   M10051; SV 1; linear; mRNA; STD; HUM; 786 BP.
# ...
```
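
The FASTA text returned by the Browser API can be split into records with standard-library Python. This is a minimal sketch rather than an ENA-provided parser: `parse_fasta` is a hypothetical helper, and the sample string is made up for illustration.

```python
from typing import Dict, List, Optional


def parse_fasta(text: str) -> Dict[str, str]:
    """Map each FASTA header (without '>') to its concatenated sequence."""
    records: Dict[str, str] = {}
    header: Optional[str] = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                records[header] = "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


# Made-up sample in the same header layout as the output shown above
sample = ">ENA|DEMO1|DEMO1.1 made-up record\nACGT\nTTGA\n>ENA|DEMO2|DEMO2.1 another\nGGCC\n"
for name, seq in parse_fasta(sample).items():
    print(name.split()[0], len(seq))
# ENA|DEMO1|DEMO1.1 8
# ENA|DEMO2|DEMO2.1 4
```

For real workloads, an established parser such as Biopython's `SeqIO` is likely the safer choice; the sketch only shows the structure of the response.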

### Module 3: File Reports and Downloads

Get download URLs for FASTQ, submitted, and analysis files. File reports return FTP and Aspera paths.

```python
# Get FASTQ file URLs for specific runs
resp = ena_query("filereport", params={
    "accession": "ERR000589",
    "result": "read_run",
    "fields": "run_accession,fastq_ftp,fastq_bytes,fastq_md5",
    "format": "json",
})
files = resp.json()
for f in files:
    ftp_urls = f.get("fastq_ftp", "").split(";")
    sizes = f.get("fastq_bytes", "").split(";")
    for url, size in zip(ftp_urls, sizes):
        if url:
            print(f"ftp://{url}  ({int(size)/1e6:.1f} MB)")
# ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR000/ERR000589/ERR000589_1.fastq.gz  (234.5 MB)
# ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR000/ERR000589/ERR000589_2.fastq.gz  (241.2 MB)
```
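
Because the file report carries `fastq_md5` alongside `fastq_ftp`, a finished download can be checked against the published checksum. The sketch below assumes the three fields are semicolon-separated in matching order, as in the example above; `pair_files` and `md5_matches` are hypothetical helper names.

```python
import hashlib
from typing import Dict, List, Tuple


def pair_files(record: Dict[str, str]) -> List[Tuple[str, int, str]]:
    """Zip fastq_ftp, fastq_bytes and fastq_md5 into (url, size, md5) tuples."""
    urls = record.get("fastq_ftp", "").split(";")
    sizes = record.get("fastq_bytes", "").split(";")
    md5s = record.get("fastq_md5", "").split(";")
    # Entries with an empty URL (runs without FASTQ files) are skipped
    return [(u, int(s), m) for u, s, m in zip(urls, sizes, md5s) if u]


def md5_matches(path: str, expected: str, chunk_size: int = 1 << 20) -> bool:
    """Hash a local file in chunks and compare it with the reported MD5."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected.lower()
```

After downloading each file from `pair_files(files[0])`, calling `md5_matches(local_path, md5)` flags truncated or corrupted transfers before any reanalysis starts. Reading in chunks keeps memory use flat for multi-gigabyte FASTQ files.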

### Module 4: Taxonomy Queries

Look up organisms by taxonomy ID, scientific name, or partial name match.

```python
# Lookup by taxonomy ID
resp = ena_query("tax-id/9606", base=BASE_TAXONOMY)
tax = resp.json()
print(f"{tax['scientificName']} (taxId: {tax['taxId']}, rank: {tax['rank']})")
# Homo sapiens (taxId: 9606, rank: species)
print(f"Lineage: {tax['lineage'][:80]}...")

# Search by scientific name — endpoint returns a list (one entry per matching taxon)
resp = ena_query("scientific-name/Arabidopsis thaliana", base=BASE_TAXONOMY)
matches = resp.json()
result = matches[0] if isinstance(matches, list) else matches
print(f"Tax ID: {result['taxId']}, Common: {result.get('commonName', 'N/A')}")
# Tax ID: 3702, Common: thale cress

# Suggest organisms by partial name
resp = ena_query("suggest-for-search/salmo", base=BASE_TAXONOMY)
suggestions = resp.json()
for s in suggestions[:3]: