bwa-mem2-dna-aligner
Fast short-read DNA aligner for WGS/WES/ChIP-seq. 2× faster BWA-MEM successor; outputs SAM/BAM with read group headers for GATK. Primary plus supplementary records for chimeric reads. Use STAR for RNA-seq splice-aware alignment; Bowtie2 is a comparable alternative.
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/bwa-mem2-dna-aligner && cp -r /tmp/bwa-mem2-dna-aligner/skills/genomics-bioinformatics/alignment/bwa-mem2-dna-aligner ~/.claude/skills/bwa-mem2-dna-alignerSKILL.md
# BWA-MEM2 — DNA Short-Read Aligner
## Overview
BWA-MEM2 aligns short DNA reads (Illumina, 50–250 bp) to a reference genome using the BWT-FM index. It is the standard aligner for whole-genome sequencing (WGS), whole-exome sequencing (WES), ChIP-seq, and ATAC-seq DNA alignment. BWA-MEM2 is 2× faster than the original BWA-MEM while producing identical results. It outputs SAM format with proper read group (`@RG`) headers required by GATK HaplotypeCaller and Picard tools. For paired-end reads, it marks proper pairs and resolves chimeric/split reads into supplementary alignments.
## When to Use
- Aligning WGS or WES Illumina reads to a reference genome for variant calling (SNP, indel, SV)
- ChIP-seq or ATAC-seq DNA alignment to produce BAM files for peak calling with MACS3
- Producing GATK-compatible BAM files with `@RG` read group tags
- Aligning reads ≥ 50 bp; for shorter reads (< 50 bp), BWA-backtrack may be more appropriate
- Re-aligning legacy FASTQ files to an updated reference genome assembly
- Use **STAR** instead for RNA-seq reads that span splice junctions
- Use **Bowtie2** as an alternative for local alignment or when index size must be minimized
## Prerequisites
- **Software**: bwa-mem2 (conda or pre-compiled binary), samtools
- **Reference**: genome FASTA (e.g., GRCh38, hg19, mm10)
- **RAM**: ~28 GB for human genome index; 6–8 GB for mouse
> **Check before installing**: The tool may already be available in the current environment (e.g., inside a `pixi` / `conda` env). Run `command -v bwa-mem2` first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via `pixi run bwa-mem2` rather than bare `bwa-mem2`.
```bash
# Install with conda (recommended)
conda install -c bioconda bwa-mem2 samtools
# Or download pre-compiled binary
wget https://github.com/bwa-mem2/bwa-mem2/releases/download/v2.2.1/bwa-mem2-2.2.1_x64-linux.tar.bz2
tar -jxf bwa-mem2-2.2.1_x64-linux.tar.bz2
export PATH="$PWD/bwa-mem2-2.2.1_x64-linux:$PATH"
# Verify
bwa-mem2 version
# 2.2.1
```
## Quick Start
```bash
# 1. Build genome index (~30 min, run once)
bwa-mem2 index GRCh38.fa
# 2. Align paired-end reads and sort
bwa-mem2 mem -t 16 -R "@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA" \
GRCh38.fa sample1_R1.fastq.gz sample1_R2.fastq.gz \
| samtools sort -@ 8 -o sample1.sorted.bam
# 3. Index the BAM
samtools index sample1.sorted.bam
echo "Aligned reads: $(samtools view -c -F 4 sample1.sorted.bam)"
```
## Workflow
### Step 1: Download Reference Genome
Obtain the reference genome FASTA file matching the target assembly.
```bash
# Download GRCh38 primary assembly (human)
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
gunzip GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
mv GCA_000001405.15_GRCh38_no_alt_analysis_set.fna GRCh38.fa
# Or use ENSEMBL/GENCODE
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/GRCh38.primary_assembly.genome.fa.gz
gunzip GRCh38.primary_assembly.genome.fa.gz
echo "Reference size: $(du -sh GRCh38.fa)"
```
### Step 2: Build BWA-MEM2 Index
Index the reference genome — required once per genome, takes ~25-35 min for human.
```bash
# Build index (~28 GB RAM required for human genome)
bwa-mem2 index GRCh38.fa
# This creates: GRCh38.fa.0123, GRCh38.fa.amb, GRCh38.fa.ann,
# GRCh38.fa.bwt.2bit.64, GRCh38.fa.pac
echo "Index files: $(ls GRCh38.fa.* | wc -l) created"
ls -lh GRCh38.fa.*
```
### Step 3: Align Paired-End Reads
Align FASTQ reads with a read group header required for GATK compatibility.
```bash
# Align with read group (required for GATK)
# @RG fields: ID (run ID), SM (sample name), PL (platform), LB (library), PU (flowcell)
bwa-mem2 mem \
-t 16 \
-R "@RG\tID:sample1_run1\tSM:sample1\tPL:ILLUMINA\tLB:lib1\tPU:flowcell1" \
GRCh38.fa \
sample1_R1.fastq.gz \
sample1_R2.fastq.gz \
| samtools sort -@ 8 -m 2G -o sample1.sorted.bam
samtools index sample1.sorted.bam
echo "Alignment complete."
echo "Total reads: $(samtools view -c sample1.sorted.bam)"
echo "Mapped reads: $(samtools view -c -F 4 sample1.sorted.bam)"
```
### Step 4: Mark PCR Duplicates
Remove or mark optical and PCR duplicates before variant calling.
```bash
# Option A: samtools markdup (fast)
samtools fixmate -m sample1.sorted.bam sample1.fixmate.bam
samtools sort -@ 8 -o sample1.fixmate.sorted.bam sample1.fixmate.bam
samtools markdup -@ 8 sample1.fixmate.sorted.bam sample1.markdup.bam
samtools index sample1.markdup.bam
echo "Duplication rate:"
samtools flagstat sample1.markdup.bam | grep "duplicate"
# Option B: Picard MarkDuplicates (GATK best practices)
picard MarkDuplicates \
INPUT=sample1.sorted.bam \
OUTPUT=sample1.markdup.bam \
METRICS_FILE=sample1.dupmetrics.txt \
REMOVE_DUPLICATES=false \
CREATE_INDEX=true
cat sample1.dupmetrics.txt | grep -A2 "ESTIMATED"
```
### Step 5: Assess Alignment Quality
Generate alignment statistics and check key quality metrics.
```bash
# Full alignment statistics
samtools flagstat sample1.markdup.bam > sample1.flagstat.txt
cat sample1.flagstat.txt
# Coverage statistics
samtools coverage sample1.markdup.bam | head -30
# Parse key metrics with Python
python3 - << 'EOF'
from pathlib import Path
flagstat = Path("sample1.flagstat.txt").read_text()
for line in flagstat.splitlines():
if any(kw in line for kw in ["total", "mapped", "properly paired", "duplicate"]):
print(line)
EOF
```
### Step 6: Complete WGS/WES Pipeline → Variant Calling
Pipe BWA-MEM2 output directly into GATK HaplotypeCaller.
```bash
#!/bin/bash
# Complete WGS alignment → variant calling pipeline
GENOME="GRCh38.fa"
SAMPLE="sample1"
R1="data/${SAMPLE}_R1.fastq.gz"
R2="data/${SAMPLE}_R2.fastq.gz"
THREADS=16
OUTDIR="results/${SAMPLE}"
mkdir -p "$OUTDIR"
# Step 1: Align + sort
bwa-mem2 mem -t $THREADS \
-R "@RG\tID:${SAMPL|
Opentrons Protocol API v2 for OT-2/Flex: Python protocols for pipetting, serial dilutions, PCR, plate replication; control thermocycler, heater-shaker, magnetic, temperature modules. Use pylabrobot for multi-vendor.
Interactive visualization with Plotly. 40+ chart types (scatter, line, heatmap, 3D, geographic) with hover, zoom, pan. Two APIs: Plotly Express (DataFrame) and Graph Objects (fine control). For static publication figures use matplotlib; for statistical grammar use seaborn.
Statistical visualization on matplotlib + pandas. Distributions (histplot, kdeplot, violin, box), relational (scatter, line), categorical, regression, correlation heatmaps. Auto aggregation/CIs. Use plotly for interactive; matplotlib for low-level.
Best practices for single-cell RNA-seq cell type annotation including marker-based, reference-based, and automated classification approaches.
Bayesian modeling with PyMC 5: priors, likelihood, NUTS/ADVI sampling, diagnostics (R-hat, ESS), LOO/WAIC comparison, prediction. Hierarchical, logistic, GP variants; predictive checks.
Time-to-event modeling with scikit-survival: Cox PH (elastic net), Random Survival Forests, Boosting, SVMs for censored data. C-index, Brier, time-dependent AUC; Kaplan-Meier, Nelson-Aalen, competing risks. Pipeline/GridSearchCV compatible. Use statsmodels for frequentist, pymc for Bayesian, lifelines for parametric.
>-