deeptools-ngs-analysis
NGS CLI for ChIP/RNA/ATAC-seq. BAM→bigWig with RPGC/CPM/RPKM, sample correlation/PCA, heatmaps/profiles around features, fingerprints. For alignment use STAR/BWA; for peak calling use MACS2.
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/deeptools-ngs-analysis && cp -r /tmp/deeptools-ngs-analysis/skills/genomics-bioinformatics/interval-ops/deeptools-ngs-analysis ~/.claude/skills/deeptools-ngs-analysis
# deepTools — NGS Data Analysis Toolkit
## Overview
deepTools is a command-line toolkit for processing and visualizing high-throughput sequencing data. It converts BAM alignments to normalized coverage tracks (bigWig), performs quality control (correlation, PCA, fingerprint), and generates publication-quality heatmaps and profile plots around genomic features. Supports ChIP-seq, RNA-seq, ATAC-seq, and MNase-seq.
## When to Use
- Converting BAM files to normalized bigWig coverage tracks
- Comparing ChIP-seq treatment vs input control (log2 ratio tracks)
- Assessing sample quality: replicate correlation, PCA, coverage depth
- Evaluating ChIP enrichment strength (fingerprint plots)
- Creating heatmaps and profile plots around TSS, peaks, or other genomic regions
- Analyzing ATAC-seq data with Tn5 offset correction
- Generating strand-specific RNA-seq coverage tracks
- For **read alignment**, use STAR, BWA, or bowtie2 instead
- For **peak calling**, use MACS2 or HOMER instead
- For **BAM/VCF file manipulation**, use pysam instead
## Prerequisites
```bash
pip install deeptools
# Verify installation
bamCoverage --version
```
**Input requirements**: BAM files must be coordinate-sorted and indexed (a `.bai` file next to each BAM). Create the index with `samtools index input.bam`. Genomic regions (genes, peaks) are supplied as BED files with at least three columns.
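
The index requirement is easy to check before starting a long run. The following is a minimal sketch, not part of deepTools: `require_bam_index` is a hypothetical helper name, and it assumes the index is stored next to the BAM as either `sample.bam.bai` or `sample.bai`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not a deepTools command): succeed only if a BAM index
# sits next to the BAM file, as either sample.bam.bai or sample.bai.
require_bam_index() {
    local bam="$1"
    if [ -f "${bam}.bai" ] || [ -f "${bam%.bam}.bai" ]; then
        return 0
    fi
    echo "missing index for ${bam}; run: samtools index ${bam}" >&2
    return 1
}

# Report every BAM in the current directory that still needs an index.
for bam in *.bam; do
    [ -e "$bam" ] || continue          # the glob matched nothing
    require_bam_index "$bam" || true   # keep going after a missing index
done
```

The helper only tests for the file's presence; it does not verify that the index is newer than the BAM or that the BAM is actually sorted.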
## Quick Start
```bash
# Convert BAM to normalized bigWig
bamCoverage --bam sample.bam --outFileName sample.bw \
--normalizeUsing RPGC --effectiveGenomeSize 2913022398 \
--binSize 10 --numberOfProcessors 8
# Create heatmap around TSS
computeMatrix reference-point -S sample.bw -R genes.bed \
-b 3000 -a 3000 --referencePoint TSS -o matrix.gz
plotHeatmap -m matrix.gz -o heatmap.png --colorMap RdBu
```
## Core API
### 1. BAM to Coverage Conversion
Convert BAM alignments to normalized coverage tracks (bigWig or bedGraph).
```bash
# Basic conversion with RPGC normalization
bamCoverage --bam input.bam --outFileName output.bw \
--normalizeUsing RPGC --effectiveGenomeSize 2913022398 \
--binSize 10 --numberOfProcessors 8 \
--extendReads 200 --ignoreDuplicates
# CPM normalization (simpler, no genome size needed)
bamCoverage --bam input.bam --outFileName output.bw \
--normalizeUsing CPM --binSize 10 -p 8
# RNA-seq: strand-specific coverage
bamCoverage --bam rnaseq.bam --outFileName forward.bw \
--filterRNAstrand forward --normalizeUsing CPM -p 8
# IMPORTANT: never use --extendReads for RNA-seq; extended reads would add coverage across splice junctions
```
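With several samples, the same conversion is usually applied to every BAM. The sketch below prints one `bamCoverage` command per input instead of running it, so the batch can be reviewed first. `bamcoverage_cmd` and the sample file names are hypothetical; the flags are the ones used in the CPM example above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print (do not run) a CPM-normalized bamCoverage command.
# Assumes file names without spaces, since the output is a plain command line.
bamcoverage_cmd() {
    local bam="$1" threads="${2:-8}"
    local out="${bam%.bam}.bw"   # sample.bam -> sample.bw
    printf 'bamCoverage --bam %s --outFileName %s --normalizeUsing CPM --binSize 10 -p %s\n' \
        "$bam" "$out" "$threads"
}

# Placeholder sample names; replace with real BAM files.
for bam in chip_rep1.bam chip_rep2.bam input.bam; do
    bamcoverage_cmd "$bam"
done
```

Once the printed commands look right, the loop output can be piped to `bash` to execute them.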
### 2. Sample Comparison
Compare treatment vs control or generate ratio tracks.
```bash
# Log2 ratio: treatment / control
bamCompare -b1 treatment.bam -b2 control.bam -o log2ratio.bw \
--operation log2 --scaleFactorsMethod readCount \
--extendReads 200 -p 8
# Subtract control from treatment
bamCompare -b1 treatment.bam -b2 control.bam -o subtract.bw \
--operation subtract --scaleFactorsMethod readCount
```
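To make the `readCount` scaling concrete, here is a small arithmetic sketch. It rests on two assumptions that should be checked against the installed deepTools version: that `readCount` scales each sample by `min(n1, n2) / n_i` (the deeper library is scaled down to the shallower one), and that a pseudocount of 1 is added to both scaled values before the log2 ratio is taken. The function names are hypothetical.

```bash
#!/usr/bin/env bash
# Assumption: readCount scaling multiplies sample i by min(n1, n2) / n_i.
readcount_scale_factors() {
    # args: mapped reads in treatment, mapped reads in control
    awk -v n1="$1" -v n2="$2" 'BEGIN {
        m = (n1 < n2) ? n1 : n2
        printf "%.4f %.4f\n", m / n1, m / n2
    }'
}

# Assumption: log2((treatment * ft + 1) / (control * fc + 1)) for one bin.
log2_ratio_bin() {
    # args: treatment count, control count, treatment factor, control factor
    awk -v t="$1" -v c="$2" -v ft="$3" -v fc="$4" 'BEGIN {
        printf "%.4f\n", log((t * ft + 1) / (c * fc + 1)) / log(2)
    }'
}

readcount_scale_factors 40000000 20000000   # prints: 0.5000 1.0000
log2_ratio_bin 30 7 0.5 1.0                 # (15 + 1) / (7 + 1) = 2, prints: 1.0000
```

In this example the treatment library is twice as deep as the control, so its counts are halved before the ratio is computed.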
### 3. Quality Control
Assess sample quality, replicate concordance, and enrichment strength.
```bash
# Sample correlation heatmap
multiBamSummary bins --bamfiles rep1.bam rep2.bam rep3.bam \
-o counts.npz --binSize 10000 -p 8
plotCorrelation -in counts.npz --corMethod pearson \
--whatToPlot heatmap -o correlation.png
# Good: replicates cluster with r > 0.9
# PCA of samples
plotPCA -in counts.npz -o pca.png --plotTitle "Sample PCA"
# ChIP enrichment fingerprint
plotFingerprint -b input.bam chip.bam -o fingerprint.png \
--extendReads 200 --ignoreDuplicates
# Good ChIP: curve stays low, then rises steeply at the far right; a curve near the diagonal = poor enrichment
# Coverage depth assessment
plotCoverage -b sample.bam -o coverage.png --ignoreDuplicates -p 8
# Fragment size distribution (paired-end)
bamPEFragmentSize -b sample.bam -o fragsize.png
```
### 4. Heatmaps and Profile Plots
Visualize signal around genomic features (TSS, peaks, gene bodies).
```bash
# Reference-point mode: signal around TSS
computeMatrix reference-point -S chip.bw -R genes.bed \
-b 3000 -a 3000 --referencePoint TSS -o matrix.gz -p 8
# Scale-regions mode: signal across gene bodies
computeMatrix scale-regions -S chip.bw -R genes.bed \
-b 1000 -a 1000 --regionBodyLength 5000 -o matrix.gz -p 8
# Generate heatmap
plotHeatmap -m matrix.gz -o heatmap.png \
--colorMap RdBu --kmeans 3 --sortUsing mean
# Generate profile plot
plotProfile -m matrix.gz -o profile.png \
--plotType lines --colors blue red
# Multiple signal files: compare marks
computeMatrix reference-point -S h3k4me3.bw h3k27me3.bw -R genes.bed \
-b 3000 -a 3000 --referencePoint TSS -o multi_matrix.gz
plotHeatmap -m multi_matrix.gz -o multi_heatmap.png
```
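For intuition about what `reference-point` mode with `--referencePoint TSS`, `-b`, and `-a` covers, the sketch below derives approximate TSS windows from a BED6 file with `awk`. It is an illustration, not how `computeMatrix` is implemented: `tss_windows` is a hypothetical helper, it assumes a strand in column 6, and it treats the region start as the TSS on the `+` strand and the region end as the TSS on the `-` strand.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the window around each TSS as BED6.
# stdin: BED6 (chrom, start, end, name, score, strand)
# args:  bp upstream of the TSS, bp downstream of the TSS
tss_windows() {
    awk -v b="$1" -v a="$2" 'BEGIN { FS = OFS = "\t" }
    {
        if ($6 == "-") { tss = $3; lo = tss - a; hi = tss + b }  # upstream lies at higher coordinates
        else           { tss = $2; lo = tss - b; hi = tss + a }
        if (lo < 0) lo = 0                                        # clamp at the chromosome start
        print $1, lo, hi, $4, $5, $6
    }'
}

# Two made-up genes, 3 kb upstream and 1 kb downstream of each TSS.
printf 'chr1\t10000\t20000\tgeneA\t0\t+\nchr1\t50000\t60000\tgeneB\t0\t-\n' \
    | tss_windows 3000 1000
# chr1  7000   11000  geneA  0  +
# chr1  59000  63000  geneB  0  -
```

Checking a few windows this way helps catch BED files that lack strand information, in which case every region is treated as if it were on the forward strand.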
### 5. Read Filtering and Processing
Filter reads before analysis or correct for assay-specific biases.
```bash
# Filter by mapping quality and fragment size
alignmentSieve --bam input.bam --outFile filtered.bam \
--minMappingQuality 10 --minFragmentLength 150 \
--maxFragmentLength 700
# ATAC-seq: apply Tn5 offset correction (+4/-5 bp shift)
alignmentSieve --bam atac.bam --outFile shifted.bam --ATACshift
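# Shifted reads may no longer be coordinate-sorted; sort, then index:
# samtools sort -o shifted.sorted.bam shifted.bam && samtools index shifted.sorted.bam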
# GC bias correction (only if significant bias detected)
computeGCBias -b input.bam --effectiveGenomeSize 2913022398 \
-g genome.2bit --GCbiasFrequenciesFile gc_freq.txt -p 8
correctGCBias -b input.bam --effectiveGenomeSize 2913022398 \
-g genome.2bit --GCbiasFrequenciesFile gc_freq.txt -o corrected.bam
```
### 6. Enrichment Analysis
Quantify signal enrichment at specific regions.
```bash
# Signal enrichment at peak regions
plotEnrichment -b chip.bam input.bam --BED peaks.bed \
-o enrichment.png --ignoreDuplicates -p 8
```
## Key Concepts
### Normalization Methods
| Method | Formula | When to Use | Requires |
|--------|---------|-------------|----------|
| **RPGC** | 1× genome coverage | ChIP-seq, ATAC-seq | `--effectiveGenomeSize` |
| **CPM** | Counts per million | Any assay, quick comparison | Nothing |
| **RPKM** | Per kb per million | RNA-seq gene-level | Nothing |
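
The three methods in the table differ only in the denominator applied to each bin. The sketch below works through the arithmetic for a single bin, following the per-bin definitions as I understand them from the deepTools documentation: CPM divides by mapped reads in millions, RPKM additionally divides by the bin length in kb, and RPGC divides by `(mapped reads × fragment length) / effective genome size`. `normalize_bin` and the example numbers are illustrative only.

```bash
#!/usr/bin/env bash
# Hypothetical helper: normalize the read count of one bin.
# args: method, reads in bin, total mapped reads, bin size (bp),
#       fragment length (bp), effective genome size (bp)
normalize_bin() {
    awk -v method="$1" -v reads="$2" -v mapped="$3" -v bin="$4" \
        -v frag="$5" -v egs="$6" 'BEGIN {
        if (method == "CPM")       v = reads / (mapped / 1e6)
        else if (method == "RPKM") v = reads / ((mapped / 1e6) * (bin / 1000))
        else if (method == "RPGC") v = reads / ((mapped * frag) / egs)
        else { print "unknown method: " method > "/dev/stderr"; exit 1 }
        printf "%.3f\n", v
    }'
}

# Made-up bin: 50 reads, 20 M mapped reads, 10 bp bins, 200 bp fragments.
normalize_bin CPM  50 20000000 10 200 2913022398   # prints: 2.500
normalize_bin RPKM 50 20000000 10 200 2913022398   # prints: 250.000
normalize_bin RPGC 50 20000000 10 200 2913022398   # prints: 36.413
```

The absolute values are not comparable across methods; only tracks normalized the same way should be plotted on a shared scale.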