ClaudeWave
Skill · 199 repo stars · updated 16d ago

gnomad-database

gnomAD v4 population variant frequencies via GraphQL API. Allele counts and frequencies stratified by ancestry (AFR, AMR, EAS, NFE, SAS, FIN, ASJ, MID), gene-level constraint (pLI, LOEUF, missense z), and coverage. Identify rare or constrained variants. For clinical pathogenicity use clinvar-database; for GWAS use gwas-database.

Install in Claude Code
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/gnomad-database && cp -r /tmp/gnomad-database/skills/genomics-bioinformatics/databases/gnomad-database ~/.claude/skills/gnomad-database
Then open a new Claude Code session; the skill loads automatically.

SKILL.md

# gnomAD Database

## Overview

The Genome Aggregation Database (gnomAD) is a resource of aggregated exome and genome sequencing data. The v4 release covers more than 800,000 individuals (about 730,000 exomes and 76,000 genomes). It provides population variant frequencies stratified by 9 ancestry groups, gene-level constraint scores (pLI, LOEUF), and read coverage information. Access is free via a GraphQL API at `https://gnomad.broadinstitute.org/api`; no authentication is required and there is no official SDK.

## When to Use

- Checking whether a candidate variant is rare enough to be clinically relevant (AF < 0.1% in all populations)
- Retrieving allele frequencies stratified by ancestry group (AFR, AMR, EAS, NFE, SAS, FIN, ASJ, MID) for a variant
- Identifying all rare loss-of-function variants in a gene for burden testing or candidate prioritization
- Getting gene constraint metrics (pLI, LOEUF) to assess tolerance to loss-of-function variants
- Checking read depth coverage for a region to evaluate if low variant frequency reflects low sequencing coverage
- Filtering a VCF by population frequency — query gnomAD AF to discard common variants before clinical interpretation
- For clinical pathogenicity classifications use `clinvar-database`; gnomAD provides frequency evidence but does not classify pathogenicity
- For GWAS associations at the study level use `gwas-database`; gnomAD is for population frequency lookups
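
The "AF < 0.1% in all populations" check in the first bullet can be sketched as below. This is a minimal illustration, not part of the skill: it assumes per-population records shaped like the `populations { id ac an }` selection used later in this document, the helper names `population_afs` and `is_rare_everywhere` are hypothetical, and the counts in `example` are made up.

```python
def population_afs(populations):
    """Return {population id: allele frequency} computed as ac / an."""
    afs = {}
    for pop in populations:
        an = pop.get("an") or 0
        if an > 0:  # skip groups with no called alleles (AF undefined)
            afs[pop["id"]] = pop["ac"] / an
    return afs


def is_rare_everywhere(populations, threshold=0.001):
    """True if every population with data has AF below `threshold`.

    Returns False when no population has usable counts: with no data we
    conservatively decline to call the variant rare (an assumption).
    """
    afs = population_afs(populations)
    if not afs:
        return False
    return max(afs.values()) < threshold


# Illustrative (made-up) counts, not real gnomAD data
example = [
    {"id": "afr", "ac": 3, "an": 40000},
    {"id": "nfe", "ac": 10, "an": 100000},
    {"id": "eas", "ac": 0, "an": 0},
]
print(population_afs(example))
print(is_rare_everywhere(example))
```

Taking the maximum across groups, rather than the global AF, avoids passing a variant that is rare overall but common in one ancestry group.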

## Prerequisites

- **Python packages**: `requests`, `pandas`, `matplotlib`
- **Data requirements**: gene symbols (e.g., `BRCA1`), variant IDs (`1-69511-A-G` format, or rsIDs)
- **Environment**: internet connection; no API key required
- **Rate limits**: no official published limits; use `time.sleep(0.5)` between requests for polite access; avoid bursts over 10 requests/second

```bash
pip install requests pandas matplotlib
```
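
The polite-access advice above can be enforced with a small throttle rather than scattered `time.sleep` calls. The sketch below is one possible approach; the `Throttle` class is a hypothetical helper, not part of gnomAD or `requests`, and the 0.5 s default simply mirrors the suggestion in the rate-limit bullet.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval=0.5):
        self.min_interval = min_interval
        self._last = None  # monotonic timestamp of the previous call

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)  # sleep only for the time still owed
        self._last = time.monotonic()


throttle = Throttle(min_interval=0.5)
# Call throttle.wait() immediately before each API request.
```

Unlike a fixed sleep after every request, this only pauses for whatever part of the interval has not already elapsed.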

## Quick Start

```python
import requests
import time

GNOMAD_API = "https://gnomad.broadinstitute.org/api"

def gnomad_query(query: str, variables: dict = None) -> dict:
    """Execute a gnomAD GraphQL query and return the data payload."""
    payload = {"query": query, "variables": variables or {}}
    r = requests.post(GNOMAD_API, json=payload, timeout=30)
    r.raise_for_status()
    result = r.json()
    if "errors" in result:
        raise ValueError(f"GraphQL errors: {result['errors']}")
    return result["data"]

# Quick check: get pLI / LOEUF for BRCA1
# GnomadConstraint fields are FLAT (no nested `lof { oe_ci { upper } }` type).
# `pli` is the current field; `pLI` is preserved as a deprecated alias.
query = """
query GeneConstraint($gene_symbol: String!, $reference_genome: ReferenceGenomeId!) {
  gene(gene_symbol: $gene_symbol, reference_genome: $reference_genome) {
    gnomad_constraint { pli oe_lof_upper }
  }
}
"""
data = gnomad_query(query, {"gene_symbol": "BRCA1", "reference_genome": "GRCh38"})
constraint = data["gene"]["gnomad_constraint"]
print(f"BRCA1 pLI:   {constraint['pli']:.3e}")        # ~5.5e-38 (near 0: not LoF-intolerant by pLI)
print(f"BRCA1 LOEUF: {constraint['oe_lof_upper']:.3f}") # 0.928
```
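
To turn the two numbers into a quick reading, a small helper like the one below may be useful. It is only a sketch: `summarize_constraint` is a hypothetical name, and the cutoffs (pLI >= 0.9, LOEUF < 0.6) are commonly used conventions assumed here, not values taken from this document. Adjust them to your own analysis.

```python
def summarize_constraint(pli, loeuf, pli_cutoff=0.9, loeuf_cutoff=0.6):
    """Flag a gene as LoF-constrained under assumed, adjustable cutoffs."""
    by_pli = pli is not None and pli >= pli_cutoff          # high pLI = intolerant
    by_loeuf = loeuf is not None and loeuf < loeuf_cutoff   # low LOEUF = intolerant
    return {
        "constrained_by_pli": by_pli,
        "constrained_by_loeuf": by_loeuf,
        "constrained": by_pli or by_loeuf,
    }


# Values printed for BRCA1 in the Quick Start above
print(summarize_constraint(5.5e-38, 0.928))
```

With the BRCA1 values shown above, neither metric crosses its cutoff, which is consistent with a pLI near zero indicating tolerance to heterozygous loss of function.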

## Core API

### Query 1: Gene Variant Query

Fetch all variants in a gene with population allele frequencies. Returns a list of variants with their genome-level frequencies.

```python
import requests, time

GNOMAD_API = "https://gnomad.broadinstitute.org/api"

def gnomad_query(query, variables=None):
    r = requests.post(GNOMAD_API, json={"query": query, "variables": variables or {}}, timeout=30)
    r.raise_for_status()
    result = r.json()
    if "errors" in result:
        raise ValueError(f"GraphQL errors: {result['errors']}")
    return result["data"]

GENE_VARIANTS_QUERY = """
query GeneVariants($gene_symbol: String!, $reference_genome: ReferenceGenomeId!, $dataset: DatasetId!) {
  gene(gene_symbol: $gene_symbol, reference_genome: $reference_genome) {
    gene_id
    symbol
    variants(dataset: $dataset) {
      variant_id
      rsids
      chrom
      pos
      ref
      alt
      consequence
      lof
      genome {
        an
        ac
        af
        faf95 { popmax popmax_population }
      }
    }
  }
}
"""

data = gnomad_query(GENE_VARIANTS_QUERY, {
    "gene_symbol": "PCSK9",
    "reference_genome": "GRCh38",
    "dataset": "gnomad_r4"
})
variants = data["gene"]["variants"]
print(f"Gene: {data['gene']['symbol']} ({data['gene']['gene_id']})")
print(f"Total variants: {len(variants)}")
# Filter to rare variants (AF < 0.001)
rare = [v for v in variants if v["genome"] and v["genome"]["af"] is not None and v["genome"]["af"] < 0.001]
print(f"Rare variants (AF < 0.1%): {len(rare)}")
for v in rare[:3]:
    print(f"  {v['variant_id']} | {v['consequence']} | AF={v['genome']['af']:.2e}")
```
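
For burden testing, the same variant list can be narrowed to rare loss-of-function calls. The sketch below works on records shaped like the query result above. It assumes the `lof` field carries LOFTEE-style labels such as `"HC"` (high confidence); the helper name `rare_lof_variants` and the example records are hypothetical.

```python
def rare_lof_variants(variants, max_af=0.001, lof_values=("HC",)):
    """Return variants with an accepted LoF label and genome AF below max_af."""
    selected = []
    for v in variants:
        genome = v.get("genome") or {}   # genome block may be null
        af = genome.get("af")
        if af is None:
            continue                     # no genome frequency available
        if v.get("lof") in lof_values and af < max_af:
            selected.append(v)
    # Rarest first, which is often convenient for manual review
    return sorted(selected, key=lambda v: v["genome"]["af"])


# Made-up records for illustration only
example_variants = [
    {"variant_id": "1-1000-A-T", "lof": "HC", "genome": {"af": 2e-5}},
    {"variant_id": "1-2000-G-C", "lof": None, "genome": {"af": 1e-5}},
    {"variant_id": "1-3000-C-T", "lof": "HC", "genome": {"af": 0.02}},
    {"variant_id": "1-4000-T-G", "lof": "HC", "genome": None},
    {"variant_id": "1-5000-G-A", "lof": "HC", "genome": {"af": 5e-6}},
]
for v in rare_lof_variants(example_variants):
    print(v["variant_id"], v["genome"]["af"])
```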

### Query 2: Variant Lookup

Fetch detailed information for a single variant by its gnomAD variant ID (CHROM-POS-REF-ALT format) or search by rsID.
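
Before sending a lookup, it can help to validate the CHROM-POS-REF-ALT string locally so that malformed IDs or rsIDs are caught early. The parser below is a minimal sketch: `parse_variant_id` is a hypothetical helper, and the accepted pattern (optional `chr` prefix, chromosomes 1-22/X/Y/M/MT, A/C/G/T alleles) is an assumption about typical input rather than the API's own validation rules.

```python
import re

# Assumed pattern for IDs such as "1-55039974-G-T"
_VARIANT_ID = re.compile(
    r"^(?:chr)?([0-9]{1,2}|X|Y|M|MT)-([1-9][0-9]*)-([ACGT]+)-([ACGT]+)$",
    re.IGNORECASE,
)


def parse_variant_id(variant_id):
    """Split a CHROM-POS-REF-ALT string into its parts, or raise ValueError."""
    match = _VARIANT_ID.match(variant_id.strip())
    if match is None:
        raise ValueError(f"Not a CHROM-POS-REF-ALT variant ID: {variant_id!r}")
    chrom, pos, ref, alt = match.groups()
    return {"chrom": chrom.upper(), "pos": int(pos), "ref": ref.upper(), "alt": alt.upper()}


print(parse_variant_id("1-55039974-G-T"))
```

An rsID such as `rs12345` fails this check, which signals that it needs a separate search step instead of a direct variant lookup.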

```python
VARIANT_QUERY = """
query VariantDetails($variantId: String!, $dataset: DatasetId!) {
  variant(variantId: $variantId, dataset: $dataset) {
    variant_id
    rsids
    chrom
    pos
    ref
    alt
    transcript_consequences {
      gene_symbol
      transcript_id
      is_canonical
      major_consequence
      lof
      lof_filter
      lof_flags
    }
    genome {
      an
      ac
      af
      faf95 { popmax popmax_population }
      populations { id ac an homozygote_count }
    }
  }
}
"""

# Query.variant() arg is `variantId` (camelCase). The top-level deprecated
# `consequence`/`lof`/`lof_filter`/`lof_flags` fields on VariantDetails were
# removed — read them from `transcript_consequences` (plural list; pick the
# canonical transcript with is_canonical=True).
data = gnomad_query(VARIANT_QUERY, {
    "variantId": "1-55039974-G-T",    # PCSK9 p.Tyr142Ter (LoF)
    "dataset": "gnomad_r4"
})
v = data["variant"]
canon = next((t for t in (v.get("transcript_consequences") or []) if t.get("is_canonical")),
             (v.get("transcript_consequences") or [{}])[0])
print(f"Variant     : {v['variant_id']}")
print(f"rsIDs       : {v['rsids']}")
print(f"Gene