homer-motif-analysis

De novo and known TF motif enrichment in ChIP-seq/ATAC-seq peaks via HOMER. findMotifsGenome.pl finds over-represented patterns vs background; annotatePeaks.pl assigns context (TSS distance, gene, repeat). Use after MACS3 to identify enriched TFs, annotate peaks with nearest genes, and validate ChIP-seq via the target motif.

Install in Claude Code
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/homer-motif-analysis && cp -r /tmp/homer-motif-analysis/skills/genomics-bioinformatics/homer-motif-analysis ~/.claude/skills/homer-motif-analysis
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# HOMER — Motif Analysis and Peak Annotation

## Overview

HOMER (Hypergeometric Optimization of Motif EnRichment) is a suite of Perl/C++ tools for analyzing genomic regulatory elements. Its two primary commands are `findMotifsGenome.pl`, which performs de novo motif discovery and known motif enrichment against JASPAR/HOMER databases, and `annotatePeaks.pl`, which maps each peak to the nearest gene, distance to TSS, and genomic feature class (promoter, intron, intergenic, repeat). HOMER takes BED-format peak files from MACS3 or similar peak callers and a reference genome assembly as input, and outputs HTML/text reports ranking enriched motifs by p-value and fold enrichment over a matched background.

## When to Use

- Identifying which transcription factors are likely bound in a ChIP-seq peak set by testing for enrichment of known motifs from JASPAR or the HOMER motif library
- Discovering novel sequence motifs de novo in open chromatin regions from ATAC-seq without prior knowledge of the binding TF
- Comparing motif landscapes between two conditions (e.g., treated vs. untreated peak sets) by running `findMotifsGenome.pl` with one set as the target and the other as the background (`-bg`)
- Annotating genomic peaks with nearest genes and distance to TSS for downstream functional analysis or integration with DESeq2 results
- Validating ChIP-seq experiment quality: a successful pull-down should show the target TF's canonical motif as the top hit
- Use `macs3-peak-calling` first to generate the peak BED files that serve as input to HOMER
- Use `jaspar-database` to cross-reference HOMER-discovered motifs with JASPAR IDs and additional TF metadata
- Use `MEME-CHIP` (web or local) when you need a more probabilistic ZOOPS/TCM model or the MEME Suite ecosystem
- Use `AME` (part of MEME Suite) as a faster alternative for known motif scanning without de novo discovery
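
The two-condition comparison mentioned above can be sketched as a small dry-run helper that prints the command for review instead of running it. The function name `homer_diff_cmd`, the file names, and the output directory are hypothetical; `-bg` is the `findMotifsGenome.pl` option for supplying an explicit background peak set.

```bash
# Sketch: build (but do not run) a HOMER command that tests one peak set
# against another used as background. All file names are hypothetical.
homer_diff_cmd() {
    # $1: target BED, $2: background BED, $3: genome, $4: output directory
    printf 'findMotifsGenome.pl %s %s %s -size 200 -mask -bg %s\n' \
        "$1" "$3" "$4" "$2"
}

# Print the command for review; pipe the output to `bash` to execute it
homer_diff_cmd peaks/treated.bed peaks/untreated.bed hg38 motif_treated_vs_untreated/
```

Swapping the first two arguments gives the reverse comparison (motifs enriched in untreated relative to treated peaks).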

## Prerequisites

- **Software**: HOMER (Perl + compiled binaries), conda or manual install
- **Genomes**: after installing HOMER, download each genome's sequence and annotation with `configureHomer.pl -install <genome>` (e.g., `hg38`)
- **Input**: BED file of peaks (at minimum: chr, start, end columns); ideally summit-centered peaks from MACS3
- **Python packages** (for parsing/visualization): `pandas`, `matplotlib`, `seaborn`

> **Check before installing**: The tool may already be available in the current environment (e.g., inside a `pixi` / `conda` env). Run `command -v findMotifsGenome.pl` first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via `pixi run findMotifsGenome.pl` rather than bare `findMotifsGenome.pl`.

```bash
# Install HOMER via conda (recommended; handles Perl dependencies)
conda install -c conda-forge -c bioconda homer

# Verify installation
findMotifsGenome.pl 2>&1 | head -3
# Usage: findMotifsGenome.pl <peak/BED file> <genome> <output directory> [options]

annotatePeaks.pl 2>&1 | head -3
# Usage: annotatePeaks.pl <peak/BED file> <genome> [options]

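# Install reference genomes (sequence + annotation; ~3–10 GB each)
# configureHomer.pl lives in HOMER's install directory; this path assumes a bioconda install
perl "$CONDA_PREFIX"/share/homer*/configureHomer.pl -install hg38
perl "$CONDA_PREFIX"/share/homer*/configureHomer.pl -install mm10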

# Install Python parsing dependencies
pip install pandas matplotlib seaborn
```
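
The "check before installing" note above can be expressed as a small guard. This is a minimal sketch; the helper name `have_tool` is hypothetical, and it relies only on the `command -v` check described in the note.

```bash
# Sketch: report whether a tool is already on PATH before installing anything
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool findMotifsGenome.pl; then
    echo "HOMER found at: $(command -v findMotifsGenome.pl)"
else
    echo "HOMER not found; run the conda install command above"
fi
```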

## Quick Start

```bash
# Run de novo + known motif enrichment on TF ChIP-seq peaks (hg38, 200 bp window)
findMotifsGenome.pl peaks/tf_chip_summits.bed hg38 motif_output/ \
    -size 200 -mask -p 4

# Annotate peaks with nearest genes and genomic features
annotatePeaks.pl peaks/tf_chip_peaks.narrowPeak hg38 > annotated_peaks.txt

echo "Top known motif:"
head -2 motif_output/knownResults.txt | tail -1 | cut -f1-4

echo "Annotated peaks: $(wc -l < annotated_peaks.txt) lines"
```
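
To look beyond the single top hit, the known-motif table can be filtered by q-value. The sketch below assumes the usual tab-separated `knownResults.txt` layout (motif name in column 1, p-value in column 3, Benjamini q-value in column 5, percent of target and background sequences in columns 7 and 9); check the header of your own file before relying on these positions. The function name `top_known_motifs` is hypothetical, and the toy table contains made-up motifs and values purely for illustration.

```bash
# Toy stand-in for motif_output/knownResults.txt (made-up motifs and values)
toy=$(mktemp)
tr '|' '\t' > "$toy" <<'EOF'
Motif Name|Consensus|P-value|Log P-value|q-value (Benjamini)|# of Target Sequences with Motif(of 1000)|% of Target Sequences with Motif|# of Background Sequences with Motif(of 5000)|% of Background Sequences with Motif
TF_A(toy)/Example/Homer|ACGTACGT|1e-50|-1.151e+02|0.0000|600.0|60.00%|100.0|2.00%
TF_B(toy)/Example/Homer|TTGGCCAA|1e-10|-2.303e+01|0.0010|300.0|30.00%|500.0|10.00%
TF_C(toy)/Example/Homer|GGGGCCCC|1e-1|-2.303e+00|0.6000|110.0|11.00%|500.0|10.00%
EOF

top_known_motifs() {
    # $1: knownResults.txt path, $2: q-value cutoff, $3: maximum rows to print
    awk -F'\t' -v q="$2" -v n="$3" '
        NR == 1 { next }                      # skip the header row
        ($5 + 0) <= (q + 0) && shown < n {
            split($1, name, "/")              # keep the part before the first "/"
            printf "%s\t%s\t%s\t%s\n", name[1], $3, $7, $9
            shown++
        }' "$1"
}

# Columns printed: motif, p-value, % of targets, % of background
top_known_motifs "$toy" 0.05 10
```

On real output, replace `"$toy"` with `motif_output/knownResults.txt`. Rows are printed in file order, which HOMER already sorts by significance.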

## Workflow

### Step 1: Installation and Genome Setup

Install HOMER and download the reference genome sequence required for motif analysis.

```bash
# Create and activate a conda environment (or reuse an existing one)
conda create -n homer_env -c conda-forge -c bioconda homer python=3.10 -y
conda activate homer_env

# Locate HOMER's install directory (path assumes a bioconda install; adjust for manual installs)
HOMER_DIR=$(ls -d "$CONDA_PREFIX"/share/homer* | head -1)

# List available genomes and packages
perl "$HOMER_DIR/configureHomer.pl" -list

# Install human (hg38) and mouse (mm10) genomes
# Downloads masked genome sequence and annotation files (several GB each)
perl "$HOMER_DIR/configureHomer.pl" -install hg38
perl "$HOMER_DIR/configureHomer.pl" -install mm10

# Verify genome is installed
ls "$HOMER_DIR/data/genomes/hg38/"
# genome.fa  chrom.sizes  ...

# Check HOMER motif database
ls "$HOMER_DIR/data/knownTFs/"
# known.motifs  all.motifs  vertebrates/  ...
```

### Step 2: Prepare Input Peak File

Prepare a summit-centered BED file from MACS3 output for optimal motif resolution.

```bash
# Option A: Start from the MACS3 summit file (1 bp summit positions)
# and expand each summit to ±100 bp (200 bp total)
awk 'BEGIN{OFS="\t"} {
    start = ($2 - 100 < 0) ? 0 : $2 - 100;
    print $1, start, $2 + 100, $4, $5
}' peaks/tf_chip_summits.bed > peaks/tf_chip_200bp.bed

echo "Summit-centered peaks: $(wc -l < peaks/tf_chip_200bp.bed)"
# Summit-centered peaks: 12453

# Option B: Use narrowPeak file directly (HOMER accepts multi-column BED)
# HOMER uses columns 1-3 (chr, start, end) and centers internally with -size
cp peaks/tf_chip_peaks.narrowPeak peaks/input_peaks.bed

# Option C: Prepare a custom background region file
# HOMER auto-generates a GC-matched background if none is provided; an explicit
# background (passed with -bg) is recommended when comparing two peak sets.
# Use the control peak set, or random genomic regions as below. Note that
# bedtools shuffle places regions randomly and does not match GC content itself.
# The chrom.sizes path assumes a bioconda HOMER install; any "chrom<TAB>size" file works.
bedtools shuffle -i peaks/tf_chip_peaks.narrowPeak \
    -g "$CONDA_PREFIX"/share/homer*/data/genomes/hg38/chrom.sizes \
    -excl peaks/tf_chip_peaks.narrowPeak > peaks/background_regions.bed

echo "Background regions: $(wc -l < peaks/background_regions.bed)"
# Background regions: 12453
```
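
One edge case in Option A is worth checking: summits closer than 100 bp to a chromosome start are clipped at 0, so those windows come out shorter than 200 bp. The sketch below wraps the same awk logic in a hypothetical helper, `expand_summits`, and runs it on three made-up summits to show the resulting widths. It does not clip at chromosome ends, which would need a chromosome sizes file.

```bash
# Toy MACS3-style summits (chr, summit, summit+1, name, score); values are made up
summits=$(mktemp)
tr '|' '\t' > "$summits" <<'EOF'
chr1|1000|1001|peak_1|50.2
chr1|40|41|peak_2|12.7
chr2|5100|5101|peak_3|33.1
EOF

expand_summits() {
    # $1: summits BED, $2: half-window size in bp
    awk -v h="$2" 'BEGIN { FS = OFS = "\t" } {
        start = ($2 - h < 0) ? 0 : $2 - h     # clip at the chromosome start
        print $1, start, $2 + h, $4, $5
    }' "$1"
}

# Report the width of each expanded window
expand_summits "$summits" 100 | awk -F'\t' '{ print $4, $3 - $2 }'
# peak_1 200
# peak_2 140   (clipped at the chromosome start)
# peak_3 200
```

Because `findMotifsGenome.pl` re-centers regions when `-size 200` is given, a clipped window is analyzed around its new midpoint rather than the original summit.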

### Step 3: De Novo Motif Discovery

Run `findMotifsGenome.pl` for de novo motif discovery and known motif enrichment simultaneously.

```bash
mkdir -p motif_output/

# Full run: de novo + known motif enrichment
# -size 200: use 200 bp window centered on peak midpoint
# -mask: mask repetitive elements (recommended for clean motifs)
# -
```