Skill · 199 repo stars · updated 16d ago

ena-database

ENA REST API for sequences, reads, assemblies, and annotations. Portal API search, Browser API retrieval (XML/FASTA/EMBL), file reports for FASTQ/BAM URLs, taxonomy, cross-refs. For multi-DB Python use bioservices; for NCBI-only use pubmed-database or Biopython Entrez.

Install in Claude Code
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/ena-database && cp -r /tmp/ena-database/skills/genomics-bioinformatics/databases/ena-database ~/.claude/skills/ena-database
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# ENA Database — European Nucleotide Archive Programmatic Access

## Overview

The European Nucleotide Archive (ENA) is EMBL-EBI's nucleotide sequence database, covering raw sequencing reads, genome assemblies, annotated sequences, and associated metadata. As a member of the International Nucleotide Sequence Database Collaboration (INSDC), it exchanges data with GenBank and DDBJ. All access described here goes through REST APIs that require no authentication.

## When to Use

- Searching for sequencing studies, samples, or experiments by organism, project, or keyword
- Downloading raw FASTQ/BAM files for reanalysis of public sequencing datasets
- Retrieving genome assemblies with quality statistics (N50, contig count, genome size)
- Fetching nucleotide sequences in FASTA or EMBL flat-file format by accession
- Exploring taxonomic lineage and finding organisms by partial name
- Cross-referencing ENA records with external databases (ArrayExpress, UniProt, PDB)
- Building bulk download lists for large-scale sequencing projects
- For **multi-database Python queries** (ENA + UniProt + KEGG), prefer `bioservices` instead
- For **NCBI-specific queries** (PubMed literature, GenBank records), use `pubmed-database` or Biopython Entrez

## Prerequisites

```bash
pip install requests
```

**API constraints**:
- **Rate limit**: 50 requests per second across all ENA APIs
- **No authentication** required
- **Large result sets**: page through results with `limit` + `offset`, or set `limit=0` to return every matching record in one response (suited to streaming TSV downloads)
- Portal API base: `https://www.ebi.ac.uk/ena/portal/api`
- Browser API base: `https://www.ebi.ac.uk/ena/browser/api`
- Taxonomy API base: `https://www.ebi.ac.uk/ena/taxonomy/rest`
- Cross-ref API base: `https://www.ebi.ac.uk/ena/xref/rest`
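
The `limit` + `offset` pagination mentioned above can be sketched as a small generator. This is a minimal illustration, not part of the ENA tooling: `paginate`, `fetch_page`, and the in-memory `fake_fetch` stand-in are hypothetical names, and the stop condition (a page shorter than `page_size`) is an assumption about how the end of a result set shows up.

```python
from typing import Callable, Iterator, List


def paginate(fetch_page: Callable[[int, int], List[dict]],
             page_size: int = 1000) -> Iterator[dict]:
    """Yield records page by page until a short or empty page signals the end."""
    offset = 0
    while True:
        page = fetch_page(page_size, offset)
        yield from page
        if len(page) < page_size:
            break
        offset += page_size


# Offline stand-in for the API: 25 made-up records served in slices.
FAKE_RECORDS = [{"run_accession": f"RUN{i:03d}"} for i in range(25)]


def fake_fetch(limit: int, offset: int) -> List[dict]:
    return FAKE_RECORDS[offset:offset + limit]


records = list(paginate(fake_fetch, page_size=10))
print(len(records))  # 25
```

Against the real Portal API, `fetch_page` would be a function that sends `limit` and `offset` as query parameters to the `search` endpoint and returns the decoded JSON list.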

## Quick Start

```python
import requests
import time

BASE_PORTAL = "https://www.ebi.ac.uk/ena/portal/api"
BASE_BROWSER = "https://www.ebi.ac.uk/ena/browser/api"
BASE_TAXONOMY = "https://www.ebi.ac.uk/ena/taxonomy/rest"
BASE_XREF = "https://www.ebi.ac.uk/ena/xref/rest"

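def ena_query(endpoint, params=None, base=BASE_PORTAL, timeout=30):
    """Reusable ENA API caller with rate-limit compliance."""
    # A timeout keeps a stalled connection from blocking the script indefinitely
    resp = requests.get(f"{base}/{endpoint}", params=params, timeout=timeout)
    resp.raise_for_status()
    time.sleep(0.02)  # pause 20 ms so sequential calls stay under 50 req/sec
    return resp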

# Search for human studies (tax_tree(9606) matches the taxon and its descendants)
resp = ena_query("search", params={
    "result": "study",
    "query": "tax_tree(9606)",   # no RNA-seq filter here: `library_strategy` is a `read_run`/`read_experiment` field, not a `study` field
    "fields": "study_accession,study_title",
    "format": "json",
    "limit": 3,
})
studies = resp.json()
for s in studies:
    print(f"{s['study_accession']}: {s['study_title'][:60]}")
# PRJEB12345: Transcriptome analysis of human liver tissue...
```
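
The fixed `time.sleep(0.02)` in `ena_query` pauses after every call, even when the request itself already took longer than 20 ms. One possible alternative is a throttle that sleeps only for the remainder of the interval. The sketch below is an illustration under that assumption; `Throttle` is a hypothetical helper, not something ENA or `requests` provides.

```python
import time


class Throttle:
    """Enforce a minimum interval between calls (hypothetical helper)."""

    def __init__(self, max_per_second: float,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous call was released

    def wait(self) -> float:
        """Sleep just long enough to respect the interval; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay


throttle = Throttle(max_per_second=50)
first = throttle.wait()    # no previous call, so no sleep
second = throttle.wait()   # sleeps for what is left of the 20 ms window
```

Calling `throttle.wait()` immediately before each request would replace the unconditional sleep while keeping sequential calls under the stated 50 requests per second.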

## Core API

### Module 1: Portal API Search

The Portal API provides advanced metadata search across all ENA data types with boolean query syntax, field selection, and pagination.

```python
# Search read runs for a specific study
resp = ena_query("search", params={
    "result": "read_run",
    "query": 'study_accession="PRJEB1787"',
    "fields": "run_accession,sample_accession,instrument_model,read_count,base_count",
    "format": "json",
    "limit": 5,
})
runs = resp.json()
for r in runs:
    print(f"{r['run_accession']} — {r.get('instrument_model', 'N/A')}, "
          f"{int(r.get('read_count') or 0):,} reads")  # `or 0` also covers empty-string values
# ERR123456 — Illumina HiSeq 2000, 45,231,890 reads

# Count total results without fetching data
count_resp = ena_query("count", params={
    "result": "read_run",
    "query": 'study_accession="PRJEB1787"',
})
print(f"Total runs: {count_resp.text.strip()}")
# Total runs: 142
```
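
Boolean queries like the ones above are plain strings, so composing them programmatically can help avoid quoting mistakes. The helpers below (`field_eq`, `all_of`, `any_of`) are hypothetical names for illustration, and the `library_strategy` values are example values rather than a checked vocabulary.

```python
def field_eq(field: str, value: str) -> str:
    """Render field="value"; reject values that would break the quoting."""
    if '"' in value:
        raise ValueError("double quotes are not supported in this sketch")
    return f'{field}="{value}"'


def any_of(*clauses: str) -> str:
    """Join clauses with OR."""
    return " OR ".join(clauses)


def all_of(*clauses: str) -> str:
    """Join clauses with AND, parenthesizing any clause that contains an OR."""
    return " AND ".join(f"({c})" if " OR " in c else c for c in clauses)


query = all_of(
    "tax_tree(9606)",
    any_of(
        field_eq("library_strategy", "RNA-Seq"),
        field_eq("library_strategy", "miRNA-Seq"),
    ),
)
print(query)
# tax_tree(9606) AND (library_strategy="RNA-Seq" OR library_strategy="miRNA-Seq")
```

Because `library_strategy` is a `read_run`/`read_experiment` field, a query like this would be sent with `"result": "read_run"` rather than `"result": "study"`.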

### Module 2: Browser API Retrieval

Fetch individual records by accession in multiple formats: XML, FASTA, EMBL flat-file, or plain text.

```python
# Retrieve XML metadata for a study
resp = ena_query("xml/PRJEB1787", base=BASE_BROWSER)
print(resp.text[:300])
# <?xml version="1.0" encoding="UTF-8"?><PROJECT_SET>...

# Retrieve the FASTA sequence for a versioned sequence accession
resp = ena_query("fasta/M10051.1", base=BASE_BROWSER)
print(resp.text[:200])
# >ENA|M10051|M10051.1 Human insulin mRNA, complete cds.
# AGCCCTCCAGGACAGGCTGCAT...

# Retrieve EMBL flat-file format
resp = ena_query("embl/M10051.1", base=BASE_BROWSER)
print(resp.text[:300])
# ID   M10051; SV 1; linear; mRNA; STD; HUM; 786 BP.
# ...
```
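
FASTA text returned by the Browser API usually needs to be split into headers and sequences before further use. The parser below is a minimal stdlib sketch; `parse_fasta` is a hypothetical helper, and the sample records are made up to mirror the `>ENA|accession|accession.version description` header shape shown above.

```python
from typing import Iterator, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        yield header, "".join(chunks)


# Made-up sample text, not a real ENA record.
sample = (
    ">ENA|X00001|X00001.1 example record\n"
    "ACGTAC\n"
    "GGTA\n"
    ">ENA|X00002|X00002.1 second record\n"
    "TTGA\n"
)
for header, seq in parse_fasta(sample):
    print(header.split()[0], len(seq))
# ENA|X00001|X00001.1 10
# ENA|X00002|X00002.1 4
```

In practice the input would be `resp.text` from a `fasta/<accession>` request. For anything beyond simple inspection, an established parser such as Biopython's `SeqIO` is likely the better choice.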

### Module 3: File Reports and Downloads

Get download URLs for FASTQ, submitted, and analysis files. File reports return FTP and Aspera paths.

```python
# Get FASTQ file URLs for a single run
resp = ena_query("filereport", params={
    "accession": "ERR000589",
    "result": "read_run",
    "fields": "run_accession,fastq_ftp,fastq_bytes,fastq_md5",
    "format": "json",
})
files = resp.json()
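for f in files:
    # fastq_ftp and fastq_bytes are semicolon-separated and positionally aligned
    ftp_urls = f.get("fastq_ftp", "").split(";")
    sizes = f.get("fastq_bytes", "").split(";")
    for url, size in zip(ftp_urls, sizes):
        if url and size:  # skip empty entries instead of failing on int("")
            print(f"ftp://{url}  ({int(size)/1e6:.1f} MB)")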
# ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR000/ERR000589/ERR000589_1.fastq.gz  (234.5 MB)
# ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR000/ERR000589/ERR000589_2.fastq.gz  (241.2 MB)
```
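
Since the report also carries `fastq_md5`, downloaded files can be checked against it. The sketch below pairs each URL with its size and checksum and verifies a local file. `split_file_report` and `md5_matches` are hypothetical helpers, the sample record is made up, and the assumption is that the three semicolon-separated fields are positionally aligned.

```python
import hashlib
from itertools import zip_longest


def split_file_report(record: dict) -> list:
    """Pair each FASTQ URL with its size and MD5 from one file-report record."""
    urls = record.get("fastq_ftp", "").split(";")
    sizes = record.get("fastq_bytes", "").split(";")
    md5s = record.get("fastq_md5", "").split(";")
    files = []
    for url, size, md5 in zip_longest(urls, sizes, md5s, fillvalue=""):
        if not url:
            continue
        files.append({
            "url": f"ftp://{url}",
            "bytes": int(size) if size else None,
            "md5": md5 or None,
        })
    return files


def md5_matches(path: str, expected: str, chunk_size: int = 1 << 20) -> bool:
    """Compare a local file's MD5 to the expected hex digest, reading in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected


# Made-up record for illustration only.
record = {
    "run_accession": "ERR000000",
    "fastq_ftp": "ftp.example.org/a_1.fastq.gz;ftp.example.org/a_2.fastq.gz",
    "fastq_bytes": "100;200",
    "fastq_md5": "aaa;bbb",
}
files = split_file_report(record)
print(files[0])
# {'url': 'ftp://ftp.example.org/a_1.fastq.gz', 'bytes': 100, 'md5': 'aaa'}
```

After downloading a file, `md5_matches(local_path, entry["md5"])` would confirm that the transfer was not truncated or corrupted.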

### Module 4: Taxonomy Queries

Look up organisms by taxonomy ID, scientific name, or partial name match.

```python
# Lookup by taxonomy ID
resp = ena_query("tax-id/9606", base=BASE_TAXONOMY)
tax = resp.json()
print(f"{tax['scientificName']} (taxId: {tax['taxId']}, rank: {tax['rank']})")
# Homo sapiens (taxId: 9606, rank: species)
print(f"Lineage: {tax['lineage'][:80]}...")

# Search by scientific name — endpoint returns a list (one entry per matching taxon)
resp = ena_query("scientific-name/Arabidopsis thaliana", base=BASE_TAXONOMY)
matches = resp.json()
result = matches[0] if isinstance(matches, list) else matches
print(f"Tax ID: {result['taxId']}, Common: {result.get('commonName', 'N/A')}")
# Tax ID: 3702, Common: thale cress

# Suggest organisms by partial name
resp = ena_query("suggest-for-search/salmo", base=BASE_TAXONOMY)
suggestions = resp.json()
for s in suggestions[:3]:
    print(s)  # the source is truncated here; printing the raw record is a placeholder body
```