Skip to main content
ClaudeWave
Skill199 repo starsupdated 16d ago

samtools-bam-processing

CLI toolkit for SAM/BAM/CRAM: sort, index, convert, filter, QC alignments. Core commands: view, sort, index, flagstat, stats, depth, markdup, merge. Required between alignment and variant/peak calling. Use pysam for Python-native BAM access; deeptools for normalized coverage tracks.

Install in Claude Code
Copy
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/samtools-bam-processing && cp -r /tmp/samtools-bam-processing/skills/genomics-bioinformatics/alignment/samtools-bam-processing ~/.claude/skills/samtools-bam-processing
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# samtools — SAM/BAM/CRAM Alignment Toolkit

## Overview

samtools is the standard command-line toolkit for processing sequence alignment files in SAM, BAM, and CRAM formats. It handles the complete alignment file lifecycle: format conversion, coordinate sorting, index creation, quality control statistics, read filtering, duplicate marking, and multi-file merging. samtools is a near-universal component of NGS pipelines between alignment (STAR, BWA) and downstream analysis (variant calling, peak calling, coverage).

## When to Use

- Sorting BAM files by coordinate after alignment (required before indexing)
- Indexing sorted BAM files for random access and region queries
- Converting between SAM, BAM, and CRAM formats to save storage
- Generating alignment QC metrics: mapping rates, insert sizes, per-chromosome stats
- Filtering reads by mapping quality, FLAG bits, or genomic regions
- Marking or removing PCR duplicates before variant calling
- Merging multiple BAM files from different lanes or samples
- Calculating per-base depth or coverage breadth for target regions
- Use `pysam` instead for Python-native BAM manipulation in custom scripts
- Use `deeptools bamCoverage` instead when you need normalized bigWig coverage tracks
- Use `mosdepth` instead for whole-genome per-base depth (faster, parallelized)

## Prerequisites

- **Installation**: samtools 1.17+ recommended
- **Input requirements**: SAM/BAM/CRAM files; CRAM requires FASTA reference
- **Companion tools**: `samtools faidx` for FASTA indexing; `samtools sort` before `samtools index`

> **Check before installing**: The tool may already be available in the current environment (e.g., inside a `pixi` / `conda` env). Run `command -v samtools` first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via `pixi run samtools` rather than bare `samtools`.

```bash
# Bioconda (recommended)
conda install -c bioconda samtools

# Homebrew (macOS)
brew install samtools

# Verify
samtools --version | head -1
```

## Quick Start

```bash
# Typical post-alignment workflow: sort → index → QC
samtools sort -@ 8 -o sorted.bam input.bam
samtools index sorted.bam
samtools flagstat sorted.bam
```

## Core API

### Module 1: BAM/SAM I/O and Format Conversion

Convert between SAM/BAM/CRAM formats and extract subsets.

```bash
# SAM → BAM (saves ~75% disk space)
samtools view -b -h input.sam -o output.bam

# BAM → CRAM (saves additional 40-50%)
samtools view -C -T reference.fa input.bam -o output.cram

# Filter: mapping quality ≥20, exclude unmapped (-F 4)
samtools view -q 20 -F 4 input.bam -o filtered.bam

# Extract specific region (requires index)
samtools view -h sorted.bam "chr1:1000000-2000000" -o region.bam

# Count reads matching filter
samtools view -c -F 4 input.bam
# Output: 45231923 (number of mapped reads)
```

```bash
# Extract reads as FASTQ (for realignment or de novo assembly)
samtools fastq -@ 4 -1 R1.fastq.gz -2 R2.fastq.gz -0 unpaired.fastq.gz input.bam

# Extract reads as FASTA
samtools fasta input.bam > reads.fasta

# Filter by read group
samtools view -r SAMPLE_001 multi_rg.bam -o sample001.bam
```

### Module 2: Sorting and Indexing

Organize BAM files for efficient random access.

```bash
# Sort by coordinate (required before indexing)
samtools sort -@ 8 -m 2G input.bam -o sorted.bam

# Sort by read name (required for fixmate/markdup)
samtools sort -n -@ 8 input.bam -o namesorted.bam

# Index sorted BAM (creates sorted.bam.bai)
samtools index sorted.bam

# For chromosomes > 512 Mbp: use CSI index instead
samtools index -c sorted.bam

# Group reads by name (fast, for fixmate — no full sort needed)
samtools collate -o collated.bam input.bam
```

### Module 3: Quality Control and Statistics

Generate alignment QC metrics and coverage reports.

```bash
# Quick summary: total, mapped, paired, properly paired
samtools flagstat sorted.bam
# Example output:
# 50000000 + 0 in total (QC-passed reads + QC-failed reads)
# 48523111 + 0 mapped (97.05% : N/A)
# 50000000 + 0 paired in sequencing
# 48490234 + 0 properly paired (96.98% : N/A)

# Per-chromosome mapped/unmapped read counts
samtools idxstats sorted.bam
# chr1  248956422  12345678  0
# chr2  242193529  11234567  0

# Comprehensive stats (insert sizes, GC content, base quality)
samtools stats -r reference.fa sorted.bam > full_stats.txt
grep "^SN" full_stats.txt | cut -f2,3  # Summary Numbers only

# Coverage report (min/max/mean per region/chromosome)
samtools coverage sorted.bam
```

```bash
# Per-base read depth for specific regions
samtools depth -b target_regions.bed sorted.bam > depth.txt
# Output: chr  pos  depth (e.g., chr1  1000  45)

# Statistics split by read group
samtools stats -S RG sorted.bam > per_rg_stats.txt
```

### Module 4: Read Filtering and FLAG Operations

Filter reads using SAM FLAG bits for specific subsets.

```bash
# FLAG reference — common masks:
# 1    = paired         4  = unmapped
# 2    = proper pair    8  = mate unmapped
# 16   = reverse strand 64 = R1 (first in pair)
# 128  = R2             256= secondary alignment
# 1024 = PCR duplicate  2048= supplementary

# Extract properly paired, mapped reads (FLAG 2 set, 4 unset)
samtools view -f 2 -F 4 sorted.bam -o proper_pairs.bam

# Extract R1 reads only
samtools view -f 64 sorted.bam -o R1.bam

# Remove secondary and supplementary alignments
samtools view -F 2304 sorted.bam -o primary.bam

# Extract reads from BED file regions
samtools view -L regions.bed -b sorted.bam -o regions.bam
```

### Module 5: Duplicate Handling

Mark or remove PCR duplicates before variant calling.

```bash
# Full duplicate marking workflow (collate → fixmate → sort → markdup)
samtools collate -@ 8 -o collated.bam input.bam
samtools fixmate -m -@ 8 collated.bam fixmated.bam
samtools sort -@ 8 -o sorted.bam fixmated.bam
samtools markdup -@ 8 sorted.bam marked.bam
samtools index marked.bam

# Check duplication rate
samtools flagstat marked.bam | grep "duplicates"
# Output: 23
sciagent-skill-creatorSkill

|

opentrons-integrationSkill

Opentrons Protocol API v2 for OT-2/Flex: Python protocols for pipetting, serial dilutions, PCR, plate replication; control thermocycler, heater-shaker, magnetic, temperature modules. Use pylabrobot for multi-vendor.

plotly-interactive-visualizationSkill

Interactive visualization with Plotly. 40+ chart types (scatter, line, heatmap, 3D, geographic) with hover, zoom, pan. Two APIs: Plotly Express (DataFrame) and Graph Objects (fine control). For static publication figures use matplotlib; for statistical grammar use seaborn.

seaborn-statistical-visualizationSkill

Statistical visualization on matplotlib + pandas. Distributions (histplot, kdeplot, violin, box), relational (scatter, line), categorical, regression, correlation heatmaps. Auto aggregation/CIs. Use plotly for interactive; matplotlib for low-level.

single-cell-annotationSkill

Best practices for single-cell RNA-seq cell type annotation including marker-based, reference-based, and automated classification approaches.

pymc-bayesian-modelingSkill

Bayesian modeling with PyMC 5: priors, likelihood, NUTS/ADVI sampling, diagnostics (R-hat, ESS), LOO/WAIC comparison, prediction. Hierarchical, logistic, GP variants; predictive checks.

scikit-survival-analysisSkill

Time-to-event modeling with scikit-survival: Cox PH (elastic net), Random Survival Forests, Boosting, SVMs for censored data. C-index, Brier, time-dependent AUC; Kaplan-Meier, Nelson-Aalen, competing risks. Pipeline/GridSearchCV compatible. Use statsmodels for frequentist, pymc for Bayesian, lifelines for parametric.

statistical-analysisSkill

>-