Skill · 284 repo stars · updated 4d ago

homer-motif-analysis

HOMER performs de novo and known transcription factor motif enrichment analysis on ChIP-seq and ATAC-seq peaks using findMotifsGenome.pl, and annotates peaks with nearest genes and genomic context using annotatePeaks.pl. Use this skill after MACS3 peak calling to identify enriched transcription factors, validate experiment quality by confirming target motif presence, and assign functional annotations to peaks for downstream analysis integration.

Repository: SciAgent-Skills

Install in Claude Code

git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/homer-motif-analysis && cp -r /tmp/homer-motif-analysis/skills/genomics-bioinformatics/homer-motif-analysis ~/.claude/skills/homer-motif-analysis

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# HOMER — Motif Analysis and Peak Annotation

## Overview

HOMER (Hypergeometric Optimization of Motif EnRichment) is a suite of Perl/C++ tools for analyzing genomic regulatory elements. Its two primary commands are `findMotifsGenome.pl`, which performs de novo motif discovery and known motif enrichment against JASPAR/HOMER databases, and `annotatePeaks.pl`, which maps each peak to the nearest gene, distance to TSS, and genomic feature class (promoter, intron, intergenic, repeat). HOMER takes BED-format peak files from MACS3 or similar peak callers and a reference genome assembly as input, and outputs HTML/text reports ranking enriched motifs by p-value and fold enrichment over a matched background.
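
To make "ranking enriched motifs by p-value and fold enrichment" concrete, the arithmetic can be sketched with the Python standard library. This is not HOMER's implementation: the function names and the counts are invented for illustration, and HOMER additionally reweights background sequences to correct for GC bias. HOMER's usage text for `findMotifsGenome.pl` lists binomial scoring as the default, with `-h` selecting the hypergeometric, so treat this as an illustration of the idea rather than a reproduction of HOMER's numbers.

```python
from math import comb


def fold_enrichment(target_hits, target_total, bg_hits, bg_total):
    """Motif frequency in target sequences divided by frequency in background."""
    target_frac = target_hits / target_total
    bg_frac = bg_hits / bg_total  # raises ZeroDivisionError if bg_total == 0
    return target_frac / bg_frac  # raises ZeroDivisionError if bg_hits == 0


def hypergeom_enrichment_p(target_hits, target_total, bg_hits, bg_total):
    """Upper-tail hypergeometric probability P(X >= target_hits).

    Models the target set as target_total sequences drawn without
    replacement from the pooled target + background sequences.
    """
    pool = target_total + bg_total
    motif_total = target_hits + bg_hits
    max_hits = min(target_total, motif_total)
    tail = sum(
        comb(motif_total, k) * comb(pool - motif_total, target_total - k)
        for k in range(target_hits, max_hits + 1)
    )
    return tail / comb(pool, target_total)


# Hypothetical counts: motif found in 4 of 5 targets and 1 of 5 background regions
print(fold_enrichment(4, 5, 1, 5))
print(hypergeom_enrichment_p(4, 5, 1, 5))
```

With real peak sets the counts are in the thousands, which is why HOMER reports log p-values alongside the raw values.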

## When to Use

- Identifying which transcription factors are bound in a ChIP-seq peak set by testing for enrichment of known motifs from JASPAR or the HOMER motif library
- Discovering novel sequence motifs de novo in open chromatin regions from ATAC-seq without prior knowledge of the binding TF
- Comparing motif landscapes between two conditions (e.g., treated vs. untreated peak sets) by running HOMER with one set as target and the other as background
- Annotating genomic peaks with nearest genes and distance to TSS for downstream functional analysis or integration with DESeq2 results
- Validating ChIP-seq experiment quality: a successful pull-down should show the target TF's canonical motif as the top hit
- Use `macs3-peak-calling` first to generate the peak BED files that serve as input to HOMER
- Use `jaspar-database` to cross-reference HOMER-discovered motifs with JASPAR IDs and additional TF metadata
- Use `MEME-CHIP` (web or local) when you need a more probabilistic ZOOPS/TCM model or the MEME Suite ecosystem
- Use `AME` (part of MEME Suite) as a faster alternative for known motif scanning without de novo discovery
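
For the experiment-validation use case in the list above, the "target motif is the top hit" check can be scripted. The sketch below is a minimal example under stated assumptions: the known-motif results file is tab-separated, has one header row, lists motifs best-first, and carries the motif name in the first column. The helper name `target_motif_rank` and the sample rows are made up for illustration, and the prefix match is deliberately naive (a query of `CTCF` would also match a motif named `CTCFL`).

```python
import csv
import io


def target_motif_rank(handle, tf_name, top_n=5):
    """Return the 1-based rank of the first motif whose name starts with
    tf_name among the first top_n rows, or None if it is not found."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row
    for rank, row in enumerate(reader, start=1):
        if rank > top_n:
            break
        if row and row[0].upper().startswith(tf_name.upper()):
            return rank
    return None


# Synthetic rows for illustration only (not real HOMER output)
SAMPLE = (
    "Motif Name\tConsensus\tP-value\n"
    "CTCF(Zf)/ExampleSource/Homer\tNNNNNNNN\t1e-500\n"
    "BORIS(Zf)/ExampleSource/Homer\tNNNNNNNN\t1e-300\n"
    "AP-1(bZIP)/ExampleSource/Homer\tNNNNNNNN\t1e-10\n"
)

print(target_motif_rank(io.StringIO(SAMPLE), "CTCF"))
```

On real output, pass an open file handle instead, for example `open("motif_output/knownResults.txt")`.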

## Prerequisites

- **Software**: HOMER (Perl + compiled binaries), conda or manual install
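- **Genomes**: the genome sequence must be downloaded separately after HOMER is installed. This skill uses `installGenome.pl`; stock HOMER distributions document `perl configureHomer.pl -install <genome>` for the same purpose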
- **Input**: BED file of peaks (at minimum: chr, start, end columns); ideally summit-centered peaks from MACS3
- **Python packages** (for parsing/visualization): `pandas`, `matplotlib`, `seaborn`

> **Check before installing**: The tool may already be available in the current environment (e.g., inside a `pixi` / `conda` env). Run `command -v findMotifsGenome.pl` first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via `pixi run findMotifsGenome.pl` rather than bare `findMotifsGenome.pl`.

```bash
# Install HOMER via conda (recommended — handles Perl dependencies)
conda install -c bioconda homer

# Verify installation
findMotifsGenome.pl 2>&1 | head -3
# Usage: findMotifsGenome.pl <peak/BED file> <genome> <output directory> [options]

annotatePeaks.pl 2>&1 | head -3
# Usage: annotatePeaks.pl <peak/BED file> <genome> [options]

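# Install reference genomes (genome sequence + annotation files; ~3–10 GB each)
# If installGenome.pl is not on PATH, HOMER's documented route is
# perl <homer_dir>/configureHomer.pl -install hg38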
installGenome.pl hg38
installGenome.pl mm10

# Install Python parsing dependencies
pip install pandas matplotlib seaborn
```
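
The same availability check can be scripted from Python, for example at the top of a parsing notebook. This is a minimal sketch using the standard library's `shutil.which`; the helper names `locate_tools` and `missing_tools` are illustrative and not part of HOMER.

```python
import shutil

# The two HOMER commands this skill relies on
HOMER_TOOLS = ("findMotifsGenome.pl", "annotatePeaks.pl")


def locate_tools(names=HOMER_TOOLS):
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names=HOMER_TOOLS):
    """List the executables that could not be found on PATH."""
    return [name for name, path in locate_tools(names).items() if path is None]


missing = missing_tools()
if missing:
    print("Not on PATH:", ", ".join(missing))
else:
    print("HOMER tools found:", locate_tools())
```

Inside a pixi project the tools may only be visible within the project environment, so run the check through `pixi run python` in that case.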

## Quick Start

```bash
# Run de novo + known motif enrichment on TF ChIP-seq peaks (hg38, 200 bp window)
findMotifsGenome.pl peaks/tf_chip_summits.bed hg38 motif_output/ \
    -size 200 -mask -p 4

# Annotate peaks with nearest genes and genomic features
annotatePeaks.pl peaks/tf_chip_peaks.narrowPeak hg38 > annotated_peaks.txt

echo "Top known motif:"
head -2 motif_output/knownResults.txt | tail -1 | cut -f1-4

echo "Annotated peaks: $(wc -l < annotated_peaks.txt) lines"
```
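
Note that the `wc -l` count above includes the header line of `annotated_peaks.txt`. For a slightly richer summary, the annotation table can be tallied by genomic feature class. The sketch below assumes a tab-separated table with an `Annotation` column whose values look like `promoter-TSS (...)`, `intron (...)`, or `Intergenic`; the helper name `feature_class_counts` and the reduced sample table are invented for illustration, and real output carries more columns.

```python
import csv
import io
from collections import Counter


def feature_class_counts(handle):
    """Count peaks per feature class, taking the class to be the text
    before the first ' (' in the Annotation column."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()
    for row in reader:
        annotation = (row.get("Annotation") or "").strip()
        if not annotation:
            counts["unannotated"] += 1
            continue
        counts[annotation.split(" (", 1)[0]] += 1
    return counts


# Synthetic, reduced table for illustration only (not real HOMER output)
SAMPLE = (
    "PeakID (cmd=example)\tChr\tStart\tEnd\tAnnotation\tDistance to TSS\tGene Name\n"
    "peak_1\tchr1\t1000\t1200\tpromoter-TSS (EXAMPLE_1)\t-50\tGENE_A\n"
    "peak_2\tchr1\t50000\t50200\tintron (EXAMPLE_2, intron 1 of 4)\t12000\tGENE_B\n"
    "peak_3\tchr2\t70000\t70200\tIntergenic\t-85000\tGENE_C\n"
    "peak_4\tchr2\t90000\t90200\tpromoter-TSS (EXAMPLE_3)\t20\tGENE_D\n"
)

for feature, n in feature_class_counts(io.StringIO(SAMPLE)).most_common():
    print(f"{feature}\t{n}")
```

On real output, pass `open("annotated_peaks.txt")` instead of the in-memory sample.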

## Workflow

### Step 1: Installation and Genome Setup

Install HOMER and download the reference genome sequence required for motif analysis.

```bash
# Create and activate a dedicated conda environment (or reuse an existing one)
conda create -n homer_env -c conda-forge -c bioconda homer python=3.10 -y
conda activate homer_env

# List available genomes
installGenome.pl list

# Install human (hg38) and mouse (mm10) genomes
# Downloads masked genome sequence and annotation files
installGenome.pl hg38
# Output: Installing hg38... Done. (3-5 min, ~3 GB)

installGenome.pl mm10
# Output: Installing mm10... Done. (3-5 min, ~2.5 GB)

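# Verify genome is installed
# (the data directory depends on the install; conda builds typically keep
#  HOMER data under $CONDA_PREFIX/share/homer*/data/, so adjust the path)
ls ~/.homer/data/genomes/hg38/
# genome.fa  chrom.sizes  ...

# Check HOMER motif database (exact listing varies by HOMER version)
ls ~/.homer/data/knownTFs/
# vertebrates.motifs  jaspar.motifs  ...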
```

### Step 2: Prepare Input Peak File

Prepare a summit-centered BED file from MACS3 output for optimal motif resolution.

```bash
# Option A: Start from the MACS3 summit file (1 bp summit positions)
# and expand each summit to ±100 bp (200 bp total), clamping the start at 0
awk 'BEGIN{OFS="\t"} {
    start = ($2 - 100 < 0) ? 0 : $2 - 100;
    print $1, start, $2 + 100, $4, $5
}' peaks/tf_chip_summits.bed > peaks/tf_chip_200bp.bed

echo "Summit-centered peaks: $(wc -l < peaks/tf_chip_200bp.bed)"
# Summit-centered peaks: 12453

# Option B: Use narrowPeak file directly (HOMER accepts multi-column BED)
# HOMER uses columns 1-3 (chr, start, end) and centers internally with -size
cp peaks/tf_chip_peaks.narrowPeak peaks/input_peaks.bed

# Option C: Prepare a custom background region file
# HOMER auto-generates a GC-matched background if none is provided; an explicit
# background (passed with -bg) is recommended when comparing two peak sets.
# Use the control peak set, or random genomic regions as below. Note that
# bedtools shuffle preserves region lengths but does not match GC content.
bedtools shuffle -i peaks/tf_chip_peaks.narrowPeak \
    -g ~/.homer/data/genomes/hg38/chrom.sizes \
    -excl peaks/tf_chip_peaks.narrowPeak > peaks/background_regions.bed

echo "Background regions: $(wc -l < peaks/background_regions.bed)"
# Background regions: 12453
```
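
The summit expansion in Option A can also be done in Python when the peak file is already being handled there. The sketch below mirrors the `awk` logic (±`flank` around column 2, start clamped at 0) and adds an optional clamp of the end coordinate to the chromosome length, which the `awk` version does not do. The function names and sample lines are illustrative, and the input is assumed to be a tab-separated summit BED with name and score in columns 4 and 5.

```python
def expand_summit(fields, flank=100, chrom_sizes=None):
    """Expand one summit record to a window of +/- flank bp."""
    chrom, summit = fields[0], int(fields[1])
    new_start = max(0, summit - flank)  # never go below coordinate 0
    new_end = summit + flank
    if chrom_sizes and chrom in chrom_sizes:
        new_end = min(new_end, chrom_sizes[chrom])  # stay within the chromosome
    return [chrom, str(new_start), str(new_end)] + list(fields[3:5])


def expand_summits(lines, flank=100, chrom_sizes=None):
    """Expand every data line, skipping blanks and track/browser/comment lines."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith(("#", "track", "browser")):
            continue
        out.append("\t".join(expand_summit(line.split("\t"), flank, chrom_sizes)))
    return out


# Synthetic summit lines for illustration only
SAMPLE_LINES = [
    "chr1\t50\t51\tpeak_1\t12.5\n",
    "chr1\t1000\t1001\tpeak_2\t30.1\n",
]

for record in expand_summits(SAMPLE_LINES):
    print(record)
```

To write a file equivalent to `peaks/tf_chip_200bp.bed`, read the summit file line by line, pass the lines to `expand_summits`, and join the result with newlines.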

### Step 3: De Novo Motif Discovery

Run `findMotifsGenome.pl` for de novo motif discovery and known motif enrichment simultaneously.

```bash
mkdir -p motif_output/

# Full run: de novo + known motif enrichment
# -size 200: use 200 bp window centered on peak midpoint
# -mask: mask repetitive elements (recommended for clean motifs)
# -