Skip to main content
ClaudeWave
Skill199 repo starsupdated 16d ago

dbsnp-database

Query NCBI dbSNP for SNP records by rsID, gene, or region via E-utilities and Variation Services REST API. Retrieve alleles, MAF, variant class (SNV/indel/MNV), clinical links, cross-DB IDs (ClinVar, dbVar, 1000G). Free; 3 req/sec (10 with key). For clinical pathogenicity use clinvar-database; for population frequencies use gnomad-database.

Install in Claude Code
Copy
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/dbsnp-database && cp -r /tmp/dbsnp-database/skills/genomics-bioinformatics/databases/dbsnp-database ~/.claude/skills/dbsnp-database
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# dbSNP Database

## Overview

NCBI dbSNP is the primary public repository for short human genetic variants, cataloguing over 1 billion SNPs, indels, and MNVs with allele frequencies, functional annotations, and cross-references to ClinVar, gnomAD, and 1000 Genomes. Variants are identified by stable rsIDs (reference SNP cluster IDs). Access is free via two APIs: the legacy NCBI E-utilities and the newer NCBI Variation Services REST API, which returns structured JSON.

## When to Use

- Looking up allele frequencies and variant class for a known rsID
- Searching all dbSNP variants in a gene or chromosomal region by name or coordinates
- Resolving rsIDs to genomic coordinates (GRCh38/GRCh37) and HGVS notation
- Checking whether a variant of interest has clinical significance links to ClinVar entries
- Batch-fetching hundreds of rsIDs efficiently using epost+efetch history server
- Cross-referencing a list of variant positions to dbSNP rsIDs for downstream annotation
- For clinical pathogenicity classifications use `clinvar-database`; dbSNP provides IDs and frequency but not curated clinical significance
- For population frequency stratified by ancestry use `gnomad-database`; dbSNP MAF is a single aggregate frequency

## Prerequisites

- **Python packages**: `requests`, `pandas`, `matplotlib`, `xml.etree.ElementTree` (stdlib)
- **Data requirements**: rsIDs (`rs80357906`), gene symbols, or chromosomal coordinates
- **Environment**: internet connection; NCBI Entrez email required for E-utilities (set `email` parameter)
- **Rate limits**: 3 requests/second without API key; 10 requests/second with free NCBI API key. Register at https://www.ncbi.nlm.nih.gov/account/ — add `&api_key=YOUR_KEY` to all requests

```bash
pip install requests pandas matplotlib
# xml.etree.ElementTree is part of Python stdlib — no additional install needed
```

## Quick Start

```python
import requests
import json

EMAIL = "your@email.com"      # required by NCBI policy
BASE_EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
BASE_VARIATION = "https://api.ncbi.nlm.nih.gov/variation/v0"

def fetch_snp_by_rsid(rsid: str) -> dict:
    """Fetch a dbSNP record by rsID using the NCBI Variation Services API (structured JSON)."""
    rs_num = str(rsid).lstrip("rs")
    r = requests.get(f"{BASE_VARIATION}/refsnp/{rs_num}", timeout=15)
    r.raise_for_status()
    return r.json()

record = fetch_snp_by_rsid("rs1800497")  # DRD2 Taq1A
print(f"rsID: rs{record['refsnp_id']}")
print(f"Variant type: {record['primary_snapshot_data'].get('variant_type')}")
# Top-level keys: citations, create_date, dbsnp1_merges, last_update_build_id,
# last_update_date, lost_obs_movements, mane_select_ids, present_obs_movements,
# primary_snapshot_data, refsnp_id. (No top-level `organism` field.)
# rsID: rs1800497
# Variant type: snv
```

## Core API

### Query 1: rsID Lookup via E-utilities

Fetch the full SNP record for a single rsID using efetch with `db=snp`. Returns an XML document with alleles, placements, and frequency data.

```python
import requests
import xml.etree.ElementTree as ET

EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def efetch_snp_xml(rsid: str) -> ET.Element:
    """Fetch dbSNP XML record for a single rsID via the docsum rettype.
    Note: rettype="xml" returns a namespaced ExchangeSet; rettype="docsum"
    returns the simpler eSummaryResult/DocumentSummary tree without namespaces."""
    rs_num = str(rsid).lstrip("rs")
    r = requests.get(f"{BASE}/efetch.fcgi",
                     params={"db": "snp", "id": rs_num,
                             "rettype": "docsum", "retmode": "xml",
                             "email": EMAIL},
                     timeout=20)
    r.raise_for_status()
    return ET.fromstring(r.text)

root = efetch_snp_xml("rs80357906")

# Parse the DocumentSummary record (MAF/MAFALLELE were removed in 2024;
# GLOBAL_MAFS is a sub-tree — use ESummary JSON below for easier access)
for docsum in root.iter("DocumentSummary"):
    rs_id = docsum.get("uid")
    snp_class = docsum.findtext("SNP_CLASS", "Unknown")
    chr_pos = docsum.findtext("CHRPOS", "N/A")
    clin_sig = docsum.findtext("CLINICAL_SIGNIFICANCE", "N/A")
    print(f"rs{rs_id} | Class: {snp_class} | Position: {chr_pos}")
    print(f"  ClinSig: {clin_sig}")
# rs80357906 | Class: delins | Position: 17:43057062
#   ClinSig: pathogenic,risk-factor,uncertain-significance
```

```python
# Fetch using ESummary for structured JSON (preferred for batch)
def esummary_snp(rsid: str) -> dict:
    rs_num = str(rsid).lstrip("rs")
    r = requests.get(f"{BASE}/esummary.fcgi",
                     params={"db": "snp", "id": rs_num,
                             "retmode": "json", "email": EMAIL},
                     timeout=15)
    r.raise_for_status()
    result = r.json()["result"]
    return result.get(rs_num, {})

rec = esummary_snp("rs80357906")
print(f"rs{rec.get('snp_id')}:")
print(f"  Class       : {rec.get('snp_class')}")            # e.g., 'delins'
# `maf`/`mafallele` were removed from ESummary in 2024 — use `global_mafs`
# (list of {study, freq}) and pick a study (e.g., 'GnomAD_genomes') or the
# global aggregate ('TOPMED'/'1000Genomes').
for m in rec.get('global_mafs', [])[:4]:
    print(f"  MAF[{m['study']:18s}]: {m['freq']}")
print(f"  ChrPos      : {rec.get('chrpos')}")               # 17:43057062
print(f"  ClinSig     : {rec.get('clinical_significance')}")
print(f"  FxnClass    : {rec.get('fxn_class')}")
```

### Query 2: Gene Variant Search

Search dbSNP for all variants in a gene using esearch. Returns a list of rsIDs matching the gene.

```python
import requests

EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_snp(query: str, retmax: int = 100) -> tuple[list, int]:
    """Search dbSNP using a query string. Returns (id_list, total_count)."""
    r = requests.get(f"{BASE}/esearch.fcgi",
                     params={"db": "snp", "term": query,
sciagent-skill-creatorSkill

|

opentrons-integrationSkill

Opentrons Protocol API v2 for OT-2/Flex: Python protocols for pipetting, serial dilutions, PCR, plate replication; control thermocycler, heater-shaker, magnetic, temperature modules. Use pylabrobot for multi-vendor.

plotly-interactive-visualizationSkill

Interactive visualization with Plotly. 40+ chart types (scatter, line, heatmap, 3D, geographic) with hover, zoom, pan. Two APIs: Plotly Express (DataFrame) and Graph Objects (fine control). For static publication figures use matplotlib; for statistical grammar use seaborn.

seaborn-statistical-visualizationSkill

Statistical visualization on matplotlib + pandas. Distributions (histplot, kdeplot, violin, box), relational (scatter, line), categorical, regression, correlation heatmaps. Auto aggregation/CIs. Use plotly for interactive; matplotlib for low-level.

single-cell-annotationSkill

Best practices for single-cell RNA-seq cell type annotation including marker-based, reference-based, and automated classification approaches.

pymc-bayesian-modelingSkill

Bayesian modeling with PyMC 5: priors, likelihood, NUTS/ADVI sampling, diagnostics (R-hat, ESS), LOO/WAIC comparison, prediction. Hierarchical, logistic, GP variants; predictive checks.

scikit-survival-analysisSkill

Time-to-event modeling with scikit-survival: Cox PH (elastic net), Random Survival Forests, Boosting, SVMs for censored data. C-index, Brier, time-dependent AUC; Kaplan-Meier, Nelson-Aalen, competing risks. Pipeline/GridSearchCV compatible. Use statsmodels for frequentist, pymc for Bayesian, lifelines for parametric.

statistical-analysisSkill

>-