samtools-bam-processing
CLI toolkit for SAM/BAM/CRAM: sort, index, convert, filter, QC alignments. Core commands: view, sort, index, flagstat, stats, depth, markdup, merge. Required between alignment and variant/peak calling. Use pysam for Python-native BAM access; deeptools for normalized coverage tracks.
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/samtools-bam-processing && cp -r /tmp/samtools-bam-processing/skills/genomics-bioinformatics/alignment/samtools-bam-processing ~/.claude/skills/samtools-bam-processingSKILL.md
# samtools — SAM/BAM/CRAM Alignment Toolkit ## Overview samtools is the standard command-line toolkit for processing sequence alignment files in SAM, BAM, and CRAM formats. It handles the complete alignment file lifecycle: format conversion, coordinate sorting, index creation, quality control statistics, read filtering, duplicate marking, and multi-file merging. samtools is a near-universal component of NGS pipelines between alignment (STAR, BWA) and downstream analysis (variant calling, peak calling, coverage). ## When to Use - Sorting BAM files by coordinate after alignment (required before indexing) - Indexing sorted BAM files for random access and region queries - Converting between SAM, BAM, and CRAM formats to save storage - Generating alignment QC metrics: mapping rates, insert sizes, per-chromosome stats - Filtering reads by mapping quality, FLAG bits, or genomic regions - Marking or removing PCR duplicates before variant calling - Merging multiple BAM files from different lanes or samples - Calculating per-base depth or coverage breadth for target regions - Use `pysam` instead for Python-native BAM manipulation in custom scripts - Use `deeptools bamCoverage` instead when you need normalized bigWig coverage tracks - Use `mosdepth` instead for whole-genome per-base depth (faster, parallelized) ## Prerequisites - **Installation**: samtools 1.17+ recommended - **Input requirements**: SAM/BAM/CRAM files; CRAM requires FASTA reference - **Companion tools**: `samtools faidx` for FASTA indexing; `samtools sort` before `samtools index` > **Check before installing**: The tool may already be available in the current environment (e.g., inside a `pixi` / `conda` env). Run `command -v samtools` first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via `pixi run samtools` rather than bare `samtools`. ```bash # Bioconda (recommended) conda install -c bioconda samtools # Homebrew (macOS) brew install samtools # Verify samtools --version | head -1 ``` ## Quick Start ```bash # Typical post-alignment workflow: sort → index → QC samtools sort -@ 8 -o sorted.bam input.bam samtools index sorted.bam samtools flagstat sorted.bam ``` ## Core API ### Module 1: BAM/SAM I/O and Format Conversion Convert between SAM/BAM/CRAM formats and extract subsets. ```bash # SAM → BAM (saves ~75% disk space) samtools view -b -h input.sam -o output.bam # BAM → CRAM (saves additional 40-50%) samtools view -C -T reference.fa input.bam -o output.cram # Filter: mapping quality ≥20, exclude unmapped (-F 4) samtools view -q 20 -F 4 input.bam -o filtered.bam # Extract specific region (requires index) samtools view -h sorted.bam "chr1:1000000-2000000" -o region.bam # Count reads matching filter samtools view -c -F 4 input.bam # Output: 45231923 (number of mapped reads) ``` ```bash # Extract reads as FASTQ (for realignment or de novo assembly) samtools fastq -@ 4 -1 R1.fastq.gz -2 R2.fastq.gz -0 unpaired.fastq.gz input.bam # Extract reads as FASTA samtools fasta input.bam > reads.fasta # Filter by read group samtools view -r SAMPLE_001 multi_rg.bam -o sample001.bam ``` ### Module 2: Sorting and Indexing Organize BAM files for efficient random access. ```bash # Sort by coordinate (required before indexing) samtools sort -@ 8 -m 2G input.bam -o sorted.bam # Sort by read name (required for fixmate/markdup) samtools sort -n -@ 8 input.bam -o namesorted.bam # Index sorted BAM (creates sorted.bam.bai) samtools index sorted.bam # For chromosomes > 512 Mbp: use CSI index instead samtools index -c sorted.bam # Group reads by name (fast, for fixmate — no full sort needed) samtools collate -o collated.bam input.bam ``` ### Module 3: Quality Control and Statistics Generate alignment QC metrics and coverage reports. ```bash # Quick summary: total, mapped, paired, properly paired samtools flagstat sorted.bam # Example output: # 50000000 + 0 in total (QC-passed reads + QC-failed reads) # 48523111 + 0 mapped (97.05% : N/A) # 50000000 + 0 paired in sequencing # 48490234 + 0 properly paired (96.98% : N/A) # Per-chromosome mapped/unmapped read counts samtools idxstats sorted.bam # chr1 248956422 12345678 0 # chr2 242193529 11234567 0 # Comprehensive stats (insert sizes, GC content, base quality) samtools stats -r reference.fa sorted.bam > full_stats.txt grep "^SN" full_stats.txt | cut -f2,3 # Summary Numbers only # Coverage report (min/max/mean per region/chromosome) samtools coverage sorted.bam ``` ```bash # Per-base read depth for specific regions samtools depth -b target_regions.bed sorted.bam > depth.txt # Output: chr pos depth (e.g., chr1 1000 45) # Statistics split by read group samtools stats -S RG sorted.bam > per_rg_stats.txt ``` ### Module 4: Read Filtering and FLAG Operations Filter reads using SAM FLAG bits for specific subsets. ```bash # FLAG reference — common masks: # 1 = paired 4 = unmapped # 2 = proper pair 8 = mate unmapped # 16 = reverse strand 64 = R1 (first in pair) # 128 = R2 256= secondary alignment # 1024 = PCR duplicate 2048= supplementary # Extract properly paired, mapped reads (FLAG 2 set, 4 unset) samtools view -f 2 -F 4 sorted.bam -o proper_pairs.bam # Extract R1 reads only samtools view -f 64 sorted.bam -o R1.bam # Remove secondary and supplementary alignments samtools view -F 2304 sorted.bam -o primary.bam # Extract reads from BED file regions samtools view -L regions.bed -b sorted.bam -o regions.bam ``` ### Module 5: Duplicate Handling Mark or remove PCR duplicates before variant calling. ```bash # Full duplicate marking workflow (collate → fixmate → sort → markdup) samtools collate -@ 8 -o collated.bam input.bam samtools fixmate -m -@ 8 collated.bam fixmated.bam samtools sort -@ 8 -o sorted.bam fixmated.bam samtools markdup -@ 8 sorted.bam marked.bam samtools index marked.bam # Check duplication rate samtools flagstat marked.bam | grep "duplicates" # Output: 23
|
Opentrons Protocol API v2 for OT-2/Flex: Python protocols for pipetting, serial dilutions, PCR, plate replication; control thermocycler, heater-shaker, magnetic, temperature modules. Use pylabrobot for multi-vendor.
Interactive visualization with Plotly. 40+ chart types (scatter, line, heatmap, 3D, geographic) with hover, zoom, pan. Two APIs: Plotly Express (DataFrame) and Graph Objects (fine control). For static publication figures use matplotlib; for statistical grammar use seaborn.
Statistical visualization on matplotlib + pandas. Distributions (histplot, kdeplot, violin, box), relational (scatter, line), categorical, regression, correlation heatmaps. Auto aggregation/CIs. Use plotly for interactive; matplotlib for low-level.
Best practices for single-cell RNA-seq cell type annotation including marker-based, reference-based, and automated classification approaches.
Bayesian modeling with PyMC 5: priors, likelihood, NUTS/ADVI sampling, diagnostics (R-hat, ESS), LOO/WAIC comparison, prediction. Hierarchical, logistic, GP variants; predictive checks.
Time-to-event modeling with scikit-survival: Cox PH (elastic net), Random Survival Forests, Boosting, SVMs for censored data. C-index, Brier, time-dependent AUC; Kaplan-Meier, Nelson-Aalen, competing risks. Pipeline/GridSearchCV compatible. Use statsmodels for frequentist, pymc for Bayesian, lifelines for parametric.
>-