jaspar-database

JASPAR 2024 TF binding profiles via REST API and pyJASPAR. Retrieve PFMs/PWMs by TF name, JASPAR ID, species, or structural class. Scan DNA for TFBS; browse by taxon (human, mouse) or TF family (bHLH, zinc finger). Use for motif enrichment input, TFBS scanning, and regulatory sequence analysis. For ChIP-seq peak motif discovery use homer-motif-analysis; for regulatory variant scoring use regulomedb-database.

Install in Claude Code
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/jaspar-database && cp -r /tmp/jaspar-database/skills/genomics-bioinformatics/databases/jaspar-database ~/.claude/skills/jaspar-database
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# JASPAR Database

## Overview

JASPAR is a curated, open-access database of transcription factor (TF) binding profiles represented as position frequency matrices (PFMs). The 2024 release contains 1,209 profiles in the CORE vertebrate collection, covering 783 TFs with experimentally validated binding data from SELEX, ChIP-seq, and PBM experiments. Access is free through the JASPAR REST API at `https://jaspar.elixir.no/api/v1/` (no authentication required) and through the `pyJASPAR` Python library for matrix retrieval and manipulation.

## When to Use

- Looking up the PWM or PFM for a specific TF by name (e.g., CTCF, SP1, GATA1) to use as motif input for a scanning tool
- Retrieving all JASPAR profiles for a species (e.g., Homo sapiens, Mus musculus) to build a motif library for enrichment analysis
- Scanning a DNA promoter sequence for predicted TF binding sites using a known PWM
- Finding all TFs of a given structural class (bHLH, zinc finger, homeodomain) to build a TF family binding profile set
- Getting metadata for a JASPAR matrix: number of binding sites, information content, GC content, experiment type
- Downloading complete JASPAR collection sets (CORE, UNVALIDATED, CNE) in JASPAR or MEME format for batch analysis
- Use `homer-motif-analysis` instead when you need de novo motif discovery from ChIP-seq peaks; JASPAR is for retrieving known matrices
- For regulatory element annotations tied to a genomic region use `encode-database` or `regulomedb-database`

## Prerequisites

- **Python packages**: `requests`, `pandas`, `matplotlib`, `numpy`
- **Optional**: `pyJASPAR` (Python library that serves JASPAR profiles as Biopython motif objects from a bundled SQLite database, so it also works offline)
- **Data requirements**: TF gene symbols, JASPAR matrix IDs (e.g., `MA0139.1`), or DNA sequences (string or FASTA)
- **Environment**: internet connection; no API key required
- **Rate limits**: no official published limits; use `time.sleep(0.5)` between batch requests

```bash
pip install requests pandas matplotlib numpy
pip install pyJASPAR   # optional; pulls in biopython
```
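The pacing advice above can be wrapped in a small helper. This is a minimal sketch: `fetch_all`, its `fetch` argument, and `fake_fetch` are hypothetical names, not part of JASPAR or `requests`. Any callable that takes one ID works, so the pacing and error handling can be tried offline before pointing it at the API.

```python
import time

def fetch_all(ids, fetch, delay=0.5):
    """Call fetch(id) for each ID, pausing `delay` seconds between calls.

    `fetch` is any callable supplied by the caller (hypothetical here).
    Failures are collected instead of aborting the whole batch.
    """
    results, errors = {}, {}
    for i, item_id in enumerate(ids):
        if i > 0:
            time.sleep(delay)      # pause between requests, not before the first
        try:
            results[item_id] = fetch(item_id)
        except Exception as exc:   # keep going; report failures at the end
            errors[item_id] = exc
    return results, errors

# Offline demonstration with a stand-in fetcher (no network access).
def fake_fetch(matrix_id):
    if not matrix_id.startswith("MA"):
        raise ValueError(f"unexpected ID format: {matrix_id}")
    return {"matrix_id": matrix_id}

ok, failed = fetch_all(["MA0139.1", "bad-id"], fake_fetch, delay=0)
print(sorted(ok), sorted(failed))
```

Against the real API, `fetch` would be a function that issues one `requests.get` call per matrix ID and returns the parsed JSON.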

## Quick Start

```python
import requests

JASPAR_API = "https://jaspar.elixir.no/api/v1"

# Search for CTCF profile in the CORE vertebrate collection
r = requests.get(f"{JASPAR_API}/matrix/", params={
    "search": "CTCF",
    "collection": "CORE",
    "tax_group": "vertebrates",
    "format": "json"
}, timeout=15)
r.raise_for_status()
results = r.json()
print(f"Profiles found: {results['count']}")
for m in results["results"][:3]:
    print(f"  {m['matrix_id']}  {m['name']}  sites={m['sites']}  type={m['type']}")
# Profiles found: 2
#   MA0139.1  CTCF  sites=190  type=ChIP-seq
#   MA1929.1  CTCF  sites=2135  type=ChIP-seq
```

## Core API

### Query 1: Matrix Search

Search for TF profiles by TF name, species, collection, or taxonomic group. Returns a paginated list of matching profile records.

```python
import requests, time

JASPAR_API = "https://jaspar.elixir.no/api/v1"

def jaspar_search(search=None, collection="CORE", tax_id=None, tax_group=None,
                  tf_class=None, tf_family=None, page_size=50):
    """Search JASPAR matrices. Returns list of result dicts."""
    params = {"format": "json", "page_size": page_size}
    if search:      params["search"]     = search
    if collection:  params["collection"] = collection
    if tax_id:      params["tax_id"]     = tax_id
    if tax_group:   params["tax_group"]  = tax_group
    if tf_class:    params["tf_class"]   = tf_class
    if tf_family:   params["tf_family"]  = tf_family

    all_results = []
    url = f"{JASPAR_API}/matrix/"
    while url:
        r = requests.get(url, params=params, timeout=15)
        r.raise_for_status()
        data = r.json()
        all_results.extend(data["results"])
        url = data.get("next")   # follow pagination
        params = None            # the "next" URL already carries the query string
        time.sleep(0.3)
    return all_results

# Example: all CORE vertebrate profiles for GATA family
gata_profiles = jaspar_search(search="GATA", collection="CORE", tax_group="vertebrates")
print(f"GATA profiles: {len(gata_profiles)}")
for m in gata_profiles[:4]:
    print(f"  {m['matrix_id']}  {m['name']:12s}  {m.get('tf_class','')}  sites={m['sites']}")
```
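JASPAR matrix IDs combine a stable base ID with a version suffix (`MA0139.1` is version 1 of `MA0139`). If a search returns several versions of the same profile, a motif library usually needs only the newest one. The sketch below shows one way to filter for that; `latest_versions` is a hypothetical helper, and the sample records are made up to mimic the shape of the search results.

```python
def latest_versions(records):
    """Keep one record per base ID (the part before the dot); highest version wins."""
    best = {}
    for rec in records:
        base, _, version = rec["matrix_id"].partition(".")
        version = int(version)
        if base not in best or version > best[base][0]:
            best[base] = (version, rec)
    return [rec for _, rec in best.values()]

# Hypothetical records shaped like the search results (IDs and names are made up).
sample = [
    {"matrix_id": "MA0001.1", "name": "TF_A"},
    {"matrix_id": "MA0001.3", "name": "TF_A"},
    {"matrix_id": "MA0002.2", "name": "TF_B"},
]
kept = latest_versions(sample)
print([r["matrix_id"] for r in kept])
# ['MA0001.3', 'MA0002.2']
```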

### Query 2: Matrix Retrieval

Fetch the full profile record for a specific matrix ID, including the raw PFM counts, metadata, and TF annotations.

```python
import requests

JASPAR_API = "https://jaspar.elixir.no/api/v1"

def get_matrix(matrix_id):
    """Return full matrix record for a JASPAR ID (e.g. 'MA0139.1')."""
    r = requests.get(f"{JASPAR_API}/matrix/{matrix_id}/", params={"format": "json"}, timeout=15)
    r.raise_for_status()
    return r.json()

m = get_matrix("MA0139.1")   # CTCF
print(f"ID: {m['matrix_id']}  Name: {m['name']}")
print(f"Collection: {m['collection']}  Type: {m['type']}")
print(f"Species: {[s['name'] for s in m.get('species', [])]}")
print(f"UniProt: {m.get('uniprot_ids', [])}")
print(f"Sites used to build the matrix: {m['sites']}")
print(f"TF class: {m.get('class_name', 'n/a')}  Family: {m.get('family_name', 'n/a')}")

# PFM structure: dict mapping position (as str) -> {A, C, G, T: count}
pfm = m["pfm"]
n_positions = len(pfm)
print(f"\nPFM length: {n_positions} positions")
print(f"Position 0: {pfm['0']}")   # {A: x, C: y, G: z, T: w}
# Position 0: {'A': 87, 'C': 12, 'G': 22, 'T': 69}
```
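Per-position information content, one of the metadata items listed under When to Use, can be derived directly from the counts. The sketch below assumes the position-keyed PFM layout shown above, a uniform background, and no small-sample correction. The function names and the toy counts are illustrative and not taken from JASPAR.

```python
import math

def column_information(counts):
    """Information content (bits) of one PFM column: 2 + sum(p * log2(p))."""
    total = sum(counts.values())
    ic = 2.0                       # maximum for a 4-letter alphabet
    for n in counts.values():
        if n > 0:                  # treat 0 * log2(0) as 0
            p = n / total
            ic += p * math.log2(p)
    return ic

def pfm_information(pfm):
    """Per-position information content for a position-keyed PFM dict."""
    return [column_information(pfm[str(i)]) for i in range(len(pfm))]

# Hypothetical two-column PFM (made-up counts).
toy_pfm = {
    "0": {"A": 10, "C": 0, "G": 0, "T": 0},   # fully conserved position
    "1": {"A": 5, "C": 5, "G": 5, "T": 5},    # uninformative position
}
print(pfm_information(toy_pfm))
# [2.0, 0.0]
```

Summing the list gives the total information content of the motif, which is a quick way to compare how specific two profiles are.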

### Query 3: PWM Computation from PFM

Convert a raw PFM (count matrix) to a position weight matrix (PWM) using log-odds scoring. The PWM is used for binding site scanning.
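To make the scanning step concrete, here is a minimal pure-Python sketch of log-odds scoring and a forward-strand sliding-window scan. It is an illustration under stated assumptions, not the scanning method of any particular tool: `column_log_odds` and `scan` are hypothetical names, the pseudocount is spread over the bases in proportion to a uniform background, the toy counts are made up, and the sequence must contain only upper-case A, C, G, and T.

```python
import math

BASES = "ACGT"

def column_log_odds(counts, pseudocount=0.8, bg=0.25):
    """log2 odds for one PFM column, pseudocount split by background frequency."""
    total = sum(counts[b] for b in BASES) + pseudocount
    return {b: math.log2((counts[b] + pseudocount * bg) / total / bg) for b in BASES}

def scan(sequence, pwm_columns, threshold):
    """Slide the PWM along the forward strand; return (start, score) hits."""
    width = len(pwm_columns)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # The window score is the sum of the per-position log-odds values.
        score = sum(col[base] for col, base in zip(pwm_columns, window))
        if score >= threshold:
            hits.append((start, round(score, 2)))
    return hits

# Hypothetical 3-column PFM that strongly prefers "TGA" (made-up counts).
toy_pfm = [
    {"A": 0, "C": 0, "G": 0, "T": 20},
    {"A": 0, "C": 0, "G": 20, "T": 0},
    {"A": 20, "C": 0, "G": 0, "T": 0},
]
pwm = [column_log_odds(col) for col in toy_pfm]
hits = scan("CCTGACC", pwm, threshold=0.0)
print(hits)
```

With these toy counts the only window scoring above zero starts at index 2, where the sequence reads `TGA`. A real scan would also score the reverse complement and pick the threshold relative to the maximum achievable score.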

```python
import requests, numpy as np

JASPAR_API = "https://jaspar.elixir.no/api/v1"

def pfm_to_pwm(pfm_dict, pseudocount=0.8, background=None):
    """
    Convert JASPAR PFM dict to PWM (log2 odds).
    pfm_dict: dict of str(position) -> {A, C, G, T: float}
    Returns: numpy array shape (4, L), rows = [A, C, G, T]
    """
    if background is None:
        background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    bases = ["A", "C", "G", "T"]
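    L = len(pfm_dict)
    # Assumed completion, following the docstring: the pseudocount is spread
    # over the bases in proportion to the background frequencies.
    counts = np.array([[float(pfm_dict[str(i)][b]) for i in range(L)] for b in bases])
    bg = np.array([background[b] for b in bases])[:, None]   # shape (4, 1)
    probs = (counts + pseudocount * bg) / (counts.sum(axis=0) + pseudocount)
    return np.log2(probs / bg)                               # shape (4, L)
```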