dbsnp-database
Query NCBI dbSNP for SNP records by rsID, gene, or region via E-utilities and Variation Services REST API. Retrieve alleles, MAF, variant class (SNV/indel/MNV), clinical links, cross-DB IDs (ClinVar, dbVar, 1000G). Free; 3 req/sec (10 with key). For clinical pathogenicity use clinvar-database; for population frequencies use gnomad-database.
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/dbsnp-database && cp -r /tmp/dbsnp-database/skills/genomics-bioinformatics/databases/dbsnp-database ~/.claude/skills/dbsnp-database
SKILL.md
# dbSNP Database
## Overview
NCBI dbSNP is the primary public repository for short human genetic variants, cataloguing over 1 billion SNPs, indels, and MNVs with allele frequencies, functional annotations, and cross-references to ClinVar, gnomAD, and 1000 Genomes. Variants are identified by stable rsIDs (reference SNP cluster IDs). Access is free via two APIs: the legacy NCBI E-utilities and the newer NCBI Variation Services REST API, which returns structured JSON.
## When to Use
- Looking up allele frequencies and variant class for a known rsID
- Searching all dbSNP variants in a gene or chromosomal region by name or coordinates
- Resolving rsIDs to genomic coordinates (GRCh38/GRCh37) and HGVS notation
- Checking whether a variant of interest has clinical significance links to ClinVar entries
- Batch-fetching hundreds of rsIDs efficiently using epost+efetch history server
- Cross-referencing a list of variant positions to dbSNP rsIDs for downstream annotation
- For clinical pathogenicity classifications use `clinvar-database`; dbSNP provides IDs and frequency but not curated clinical significance
- For population frequency stratified by ancestry use `gnomad-database`; dbSNP MAF is a single aggregate frequency
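The batch-fetching case in the list above relies on the Entrez history server: `epost` uploads the ID list once and returns a `QueryKey`/`WebEnv` pair, and later `esummary` or `efetch` calls reference that pair instead of repeating the IDs. The following is a minimal sketch under stated assumptions, not a definitive implementation. The helper names (`strip_rs`, `parse_epost`, `batch_esummary`), the chunk size, and the pause are hypothetical, and the email is a placeholder. It uses only the standard library so it runs without extra installs; the same parameters work with `requests`.

```python
import json
import time
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
EMAIL = "your@email.com"  # placeholder; replace with your own address


def strip_rs(rsid) -> str:
    """Return the numeric part of an rsID ('rs123' -> '123')."""
    s = str(rsid).strip()
    return s[2:] if s.lower().startswith("rs") else s


def parse_epost(xml_text: str) -> tuple[str, str]:
    """Extract (query_key, webenv) from an ePost XML response."""
    root = ET.fromstring(xml_text)
    query_key = root.findtext("QueryKey")
    webenv = root.findtext("WebEnv")
    if not query_key or not webenv:
        raise ValueError("ePost response lacks QueryKey/WebEnv")
    return query_key, webenv


def _post(endpoint: str, params: dict) -> str:
    """POST form-encoded parameters to an E-utilities endpoint; return the body."""
    body = urlencode(params).encode("ascii")
    with urlopen(f"{BASE}/{endpoint}", data=body, timeout=30) as resp:
        return resp.read().decode("utf-8")


def batch_esummary(rsids, chunk: int = 200, pause: float = 0.34) -> dict:
    """Upload rsIDs once with ePost, then page through ESummary JSON."""
    ids = [strip_rs(r) for r in rsids]
    query_key, webenv = parse_epost(
        _post("epost.fcgi", {"db": "snp", "id": ",".join(ids), "email": EMAIL}))
    records = {}
    for start in range(0, len(ids), chunk):
        time.sleep(pause)  # assumed pause to stay under 3 requests/second
        text = _post("esummary.fcgi",
                     {"db": "snp", "query_key": query_key, "WebEnv": webenv,
                      "retstart": start, "retmax": chunk,
                      "retmode": "json", "email": EMAIL})
        result = json.loads(text).get("result", {})
        for uid in result.get("uids", []):
            records[uid] = result[uid]
    return records
```

Calling `batch_esummary(["rs1800497", "rs80357906"])` would return a dict keyed by numeric UID. Nothing in the block contacts NCBI until that function is called.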
## Prerequisites
- **Python packages**: `requests`, `pandas`, `matplotlib`, `xml.etree.ElementTree` (stdlib)
- **Data requirements**: rsIDs (`rs80357906`), gene symbols, or chromosomal coordinates
- **Environment**: internet connection; NCBI Entrez email required for E-utilities (set `email` parameter)
- **Rate limits**: 3 requests/second without API key; 10 requests/second with free NCBI API key. Register at https://www.ncbi.nlm.nih.gov/account/ — add `&api_key=YOUR_KEY` to all requests
```bash
pip install requests pandas matplotlib
# xml.etree.ElementTree is part of Python stdlib — no additional install needed
```
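The rate limits listed above can be enforced on the client side with a small throttle that is called before each request. This is a minimal sketch assuming a single-threaded script; the `Throttle` class and `make_throttle` helper are hypothetical and not part of any NCBI client library.

```python
import time
from typing import Optional


class Throttle:
    """Block so that successive calls are at least 1/rate seconds apart."""

    def __init__(self, rate_per_sec: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate_per_sec
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay


def make_throttle(api_key: Optional[str] = None) -> Throttle:
    """Use 10 requests/second with an API key, 3 without (limits stated above)."""
    return Throttle(10 if api_key else 3)
```

A typical pattern is to create one throttle per script (`throttle = make_throttle(api_key)`) and call `throttle.wait()` immediately before every `requests.get(...)`.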
## Quick Start
```python
import requests
import json

EMAIL = "your@email.com"  # required by NCBI policy
BASE_EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
BASE_VARIATION = "https://api.ncbi.nlm.nih.gov/variation/v0"


def fetch_snp_by_rsid(rsid: str) -> dict:
    """Fetch a dbSNP record by rsID using the NCBI Variation Services API (structured JSON)."""
    rs_num = str(rsid).lstrip("rs")
    r = requests.get(f"{BASE_VARIATION}/refsnp/{rs_num}", timeout=15)
    r.raise_for_status()
    return r.json()


record = fetch_snp_by_rsid("rs1800497")  # DRD2 Taq1A
print(f"rsID: rs{record['refsnp_id']}")
print(f"Variant type: {record['primary_snapshot_data'].get('variant_type')}")
# Top-level keys: citations, create_date, dbsnp1_merges, last_update_build_id,
# last_update_date, lost_obs_movements, mane_select_ids, present_obs_movements,
# primary_snapshot_data, refsnp_id. (No top-level `organism` field.)

# rsID: rs1800497
# Variant type: snv
```
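The record returned by `fetch_snp_by_rsid` also carries the alleles themselves. The sketch below pulls them from the preferred top-level placement, assuming the Variation Services layout `primary_snapshot_data.placements_with_allele[].alleles[].allele.spdi`, where `position` is 0-based and the reference allele has identical deleted and inserted sequences. The function name `primary_alleles` and the output keys are hypothetical; treat the field paths as assumptions to check against a live record.

```python
def primary_alleles(record: dict) -> list[dict]:
    """List SPDI alleles found on placements flagged `is_ptlp`."""
    snapshot = record.get("primary_snapshot_data", {})
    alleles = []
    for placement in snapshot.get("placements_with_allele", []):
        if not placement.get("is_ptlp"):
            continue  # skip transcript, protein, and alternate placements
        for entry in placement.get("alleles", []):
            spdi = entry.get("allele", {}).get("spdi")
            if spdi is None:
                continue  # some allele entries may not carry an SPDI
            alleles.append({
                "seq_id": spdi["seq_id"],
                "position_0based": spdi["position"],
                "ref": spdi["deleted_sequence"],
                "alt": spdi["inserted_sequence"],
                "hgvs": entry.get("hgvs"),
                # Assumption: the reference allele is encoded as ref == alt
                "is_reference": spdi["deleted_sequence"] == spdi["inserted_sequence"],
            })
    return alleles
```

With a fetched record, `[a for a in primary_alleles(record) if not a["is_reference"]]` would list only the alternate alleles.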
## Core API
### Query 1: rsID Lookup via E-utilities
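Fetch the SNP record for a single rsID using efetch with `db=snp`. The example requests `rettype=docsum`, which returns a flat DocumentSummary XML (variant class, position, clinical significance, frequency sub-tree) rather than the full namespaced ExchangeSet produced by `rettype=xml`.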
```python
import requests
import xml.etree.ElementTree as ET

EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_snp_xml(rsid: str) -> ET.Element:
    """Fetch dbSNP XML record for a single rsID via the docsum rettype.

    Note: rettype="xml" returns a namespaced ExchangeSet; rettype="docsum"
    returns the simpler eSummaryResult/DocumentSummary tree without namespaces.
    """
    rs_num = str(rsid).lstrip("rs")
    r = requests.get(f"{BASE}/efetch.fcgi",
                     params={"db": "snp", "id": rs_num,
                             "rettype": "docsum", "retmode": "xml",
                             "email": EMAIL},
                     timeout=20)
    r.raise_for_status()
    return ET.fromstring(r.text)


root = efetch_snp_xml("rs80357906")

# Parse the DocumentSummary record (MAF/MAFALLELE were removed in 2024;
# GLOBAL_MAFS is a sub-tree; use ESummary JSON below for easier access)
for docsum in root.iter("DocumentSummary"):
    rs_id = docsum.get("uid")
    snp_class = docsum.findtext("SNP_CLASS", "Unknown")
    chr_pos = docsum.findtext("CHRPOS", "N/A")
    clin_sig = docsum.findtext("CLINICAL_SIGNIFICANCE", "N/A")
    print(f"rs{rs_id} | Class: {snp_class} | Position: {chr_pos}")
    print(f" ClinSig: {clin_sig}")
# rs80357906 | Class: delins | Position: 17:43057062
# ClinSig: pathogenic,risk-factor,uncertain-significance
```
```python
# Fetch using ESummary for structured JSON (preferred for batch).
# Reuses `requests`, `BASE`, and `EMAIL` from the block above.
def esummary_snp(rsid: str) -> dict:
    rs_num = str(rsid).lstrip("rs")
    r = requests.get(f"{BASE}/esummary.fcgi",
                     params={"db": "snp", "id": rs_num,
                             "retmode": "json", "email": EMAIL},
                     timeout=15)
    r.raise_for_status()
    result = r.json()["result"]
    return result.get(rs_num, {})


rec = esummary_snp("rs80357906")
print(f"rs{rec.get('snp_id')}:")
print(f" Class : {rec.get('snp_class')}")  # e.g., 'delins'
# `maf`/`mafallele` were removed from ESummary in 2024; use `global_mafs`
# (list of {study, freq}) and pick a study (e.g., 'GnomAD_genomes') or the
# global aggregate ('TOPMED'/'1000Genomes').
for m in rec.get('global_mafs', [])[:4]:
    print(f" MAF[{m['study']:18s}]: {m['freq']}")
print(f" ChrPos : {rec.get('chrpos')}")  # 17:43057062
print(f" ClinSig : {rec.get('clinical_significance')}")
print(f" FxnClass : {rec.get('fxn_class')}")
```
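Because `global_mafs` is a list of per-study entries rather than a single number, downstream code usually needs to choose one study and turn its `freq` string into numeric values. The sketch below does that, assuming each `freq` has the form `ALLELE=FREQUENCY/COUNT` (for example `A=0.25/1250`, an illustrative value rather than real data) and that `chrpos` has the form `CHROM:POSITION`. The helper names and the preference order are hypothetical.

```python
def parse_maf(freq: str) -> tuple[str, float, int]:
    """Split 'A=0.25/1250' into (allele, frequency, count)."""
    allele, rest = freq.split("=", 1)
    value, _, count = rest.partition("/")
    return allele, float(value), int(count) if count else 0


def pick_maf(rec: dict, preferred=("GnomAD_genomes", "TOPMED", "1000Genomes")):
    """Return (study, allele, frequency, count) for the first preferred study found."""
    by_study = {m["study"]: m["freq"] for m in rec.get("global_mafs", [])}
    for study in preferred:
        if study in by_study:
            return (study, *parse_maf(by_study[study]))
    return None  # no preferred study reported for this variant


def parse_chrpos(chrpos: str) -> tuple[str, int]:
    """Split '17:43057062' into ('17', 43057062)."""
    chrom, pos = chrpos.split(":", 1)
    return chrom, int(pos)
```

With the ESummary record from above, `pick_maf(rec)` and `parse_chrpos(rec["chrpos"])` give values that can go straight into a pandas DataFrame.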
### Query 2: Gene Variant Search
Search dbSNP for all variants in a gene using esearch. Returns a list of matching rs numbers (the numeric UIDs, without the `rs` prefix) together with the total hit count.
```python
import requests
EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
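

def esearch_snp(query: str, retmax: int = 100) -> tuple[list, int]:
    """Search dbSNP using a query string. Returns (id_list, total_count)."""
    r = requests.get(f"{BASE}/esearch.fcgi",
                     params={"db": "snp", "term": query,
                             # Assumed from here on: standard ESearch JSON options
                             "retmax": retmax, "retmode": "json",
                             "email": EMAIL},
                     timeout=20)
    r.raise_for_status()
    result = r.json()["esearchresult"]
    return result.get("idlist", []), int(result.get("count", 0))
```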