bcftools-variant-manipulation
CLI for VCF/BCF: filter, merge, annotate, query, normalize, compute stats. Core post-variant-calling: quality filtering, multi-sample merging, rsID annotation, genotype extraction. Samtools companion in HTSlib. Use GATK for complex indel realignment during calling; use VCFtools for population genetics stats.
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/bcftools-variant-manipulation && cp -r /tmp/bcftools-variant-manipulation/skills/genomics-bioinformatics/variant/bcftools-variant-manipulation ~/.claude/skills/bcftools-variant-manipulation
# bcftools — VCF/BCF Variant Manipulation Toolkit
## Overview
bcftools is the standard command-line toolkit for processing VCF (Variant Call Format) and BCF (Binary Call Format) files in the HTSlib ecosystem. It covers the complete post-variant-calling workflow: format conversion, quality filtering, variant normalization, multi-sample merging, annotation with external databases, genotype extraction, and QC statistics. bcftools uses streaming by design — most commands read from stdin and write to stdout, making it ideal for memory-efficient pipelines on large cohorts.
## When to Use
- Filtering variants by quality (QUAL, DP, AF) after variant calling
- Merging VCF files from multiple samples into a joint call set
- Adding rsIDs or gene annotations to variant calls
- Extracting specific fields (genotypes, allele depths) as tabular output
- Normalizing indel representations and splitting multi-allelic records
- Calling variants from pileup output (mpileup + call)
- Computing per-sample and overall VCF QC statistics
- Use `GATK HaplotypeCaller` instead when calling variants with local realignment in human samples
- Use `VCFtools` instead for population genetics statistics (Fst, LD, Hardy-Weinberg)
- Use `picard` instead for duplicate marking and library metrics, which operate on BAM files rather than VCF/BCF
## Prerequisites
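- **Installation**: bcftools 1.17+ (built on HTSlib, released alongside samtools); the `-W`/`--write-index` option used in some examples was added after 1.17, so check `bcftools view --help` if it is rejected
- **Input requirements**: plain VCF, or bgzip-compressed and indexed VCF (`.vcf.gz` plus `.vcf.gz.tbi`) for region queries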
- **Companion tools**: `samtools` for BAM processing; `tabix` for VCF indexing
> **Check before installing**: The tool may already be available in the current environment (e.g., inside a `pixi` / `conda` env). Run `command -v bcftools` first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via `pixi run bcftools` rather than bare `bcftools`.
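
The check described in the note above can be scripted. The following is a minimal sketch, not a required setup step: the `BCFTOOLS` variable name is hypothetical, and treating a `pixi.toml` file in the current directory as the sign of a pixi project is an assumption.

```bash
# Decide how to invoke bcftools (BCFTOOLS is a hypothetical variable name).
# Assumption: a pixi.toml in the current directory marks a pixi project.
if [ -f pixi.toml ] && command -v pixi >/dev/null 2>&1; then
    BCFTOOLS="pixi run bcftools"
elif command -v bcftools >/dev/null 2>&1; then
    BCFTOOLS="bcftools"
else
    BCFTOOLS=""
    echo "bcftools not found; use one of the install commands below" >&2
fi
echo "bcftools command: ${BCFTOOLS:-<none>}"
```

In bash or sh, later commands can then be written as `$BCFTOOLS view ...`, leaving the variable unquoted so that `pixi run bcftools` splits into separate words.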
```bash
# Bioconda (recommended — installs HTSlib suite)
conda install -c conda-forge -c bioconda bcftools
# Homebrew (macOS)
brew install bcftools
# Verify
bcftools --version | head -1
# bcftools 1.20
# Index a VCF for region queries
bcftools index -t variants.vcf.gz # creates .tbi
bcftools index -c variants.vcf.gz # creates .csi (for chromosomes > 512 Mb)
```
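The choice between `-t` and `-c` can be derived from the contig lengths declared in the VCF header. The sketch below assumes the header carries `##contig=<ID=...,length=...>` lines and uses the TBI coordinate limit of 2^29 bp (536,870,912, the "512 Mb" mentioned above); `pick_index_flag` is a hypothetical helper name, not a bcftools command.

```bash
# pick_index_flag: read VCF header lines on stdin and print "-c" (CSI) if any
# contig is longer than 2^29 bp, otherwise "-t" (TBI).
pick_index_flag() {
    awk '
        /^##contig=/ {
            if (match($0, /length=[0-9]+/)) {
                # Skip the 7 characters of "length=" to get the number
                len = substr($0, RSTART + 7, RLENGTH - 7) + 0
                if (len > 536870912) big = 1
            }
        }
        END { print (big ? "-c" : "-t") }
    '
}

# Demo on a single header line (human chr1 length in GRCh38)
printf '##contig=<ID=chr1,length=248956422>\n' | pick_index_flag   # prints -t
```

With a real file, the header can be supplied with `bcftools view -h variants.vcf.gz | pick_index_flag` and the result passed to `bcftools index`. Headers without contig lengths fall through to `-t`, so this is only a convenience, not a guarantee.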
## Quick Start
```bash
# Typical post-calling workflow: normalize → filter → annotate → extract
bcftools norm -d any -f reference.fa variants.vcf.gz \
| bcftools filter -i 'QUAL>20 && DP>10' \
| bcftools annotate -a dbSNP.vcf.gz -c ID \
| bcftools view -O z -o final.vcf.gz
# Index the output
bcftools index -t final.vcf.gz
# Summary counts (records, SNPs, indels) for the final call set
bcftools stats final.vcf.gz | grep "^SN"
```
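To compare counts between pipeline stages, a single number per file is easier to work with than the full `SN` block. The sketch below pulls one value out of the tab-separated `SN` summary lines; `sn_value` is a hypothetical helper name, and the sample lines are made-up numbers standing in for real `bcftools stats` output.

```bash
# sn_value KEY: print the value of one "SN" summary line read from stdin.
# Assumes the tab-separated layout: SN <id> <key>: <value>
sn_value() {
    awk -F'\t' -v key="$1:" '$1 == "SN" && $3 == key { print $4 }'
}

# Made-up sample lines, standing in for: bcftools stats final.vcf.gz
printf 'SN\t0\tnumber of records:\t1200\nSN\t0\tnumber of SNPs:\t1100\n' \
    | sn_value "number of SNPs"   # prints 1100
```

Running `bcftools stats <file> | sn_value "number of records"` on the output of each stage (after normalization, after filtering, after annotation) shows how many records each step kept.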
## Core API
### Module 1: VCF/BCF I/O and Format Conversion
Convert between text VCF and binary BCF; compress and index for random access.
```bash
# VCF → compressed BCF (compact binary for storage; use -O u, uncompressed BCF, when piping between bcftools commands)
bcftools view -O b -o variants.bcf variants.vcf
# BCF → VCF (for human-readable output)
bcftools view -O v -o variants.vcf variants.bcf
# VCF → bgzipped + indexed (standard archive format)
bcftools view -O z --write-index=tbi -o variants.vcf.gz variants.vcf
# --write-index (short form -W) indexes the output after writing; the default index format is .csi, =tbi requests .tbi
```
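The `-O` letter has to agree with the extension of the output file. A small helper can keep the two in sync in scripts; this is a sketch under the assumption that outputs are named with the conventional `.vcf`, `.vcf.gz`, or `.bcf` suffixes, and `output_type` is a hypothetical name.

```bash
# output_type FILE: print the -O letter that matches FILE's extension
# (v = text VCF, z = bgzipped VCF, b = compressed BCF).
output_type() {
    case "$1" in
        *.vcf.gz) echo z ;;
        *.bcf)    echo b ;;
        *.vcf)    echo v ;;
        *)        echo "unrecognized extension: $1" >&2; return 1 ;;
    esac
}

output_type variants.bcf   # prints b
```

A conversion can then be written as `bcftools view -O "$(output_type "$out")" -o "$out" variants.vcf`, with `$out` holding the desired output path.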
```bash
# Extract specific samples
bcftools view -s sample1,sample2 -O z -o subset.vcf.gz variants.vcf.gz
# Exclude samples (prefix with ^)
bcftools view -s ^outlier_sample -O z -o cleaned.vcf.gz variants.vcf.gz
# Extract by region (fast; requires index)
bcftools view -r chr1:1000000-2000000 variants.vcf.gz -O v -o chr1_region.vcf
# Streaming pipeline: no intermediate files
bcftools mpileup -Ou -f reference.fa input.bam | bcftools call -m -Oz -o calls.vcf.gz
```
### Module 2: Variant Filtering
Apply quality thresholds and FILTER-column-based filters to retain high-confidence calls.
```bash
# Expression-based filter (include)
bcftools filter -i 'QUAL>20 && DP>10' variants.vcf.gz -O z -o filtered.vcf.gz
# Expression-based filter (exclude)
bcftools filter -e 'QUAL<10 || DP<5' variants.vcf.gz -O v -o filtered.vcf
# Soft filter: mark but keep (sets FILTER field to label)
bcftools filter -s LowQual -e 'QUAL<20' variants.vcf.gz -O z -o soft_filtered.vcf.gz
# Variants with QUAL<20 get FILTER=LowQual; passing sites with an empty FILTER are set to PASS
```
```bash
# Keep only PASS variants
bcftools view -f PASS variants.vcf.gz -O z -o pass_only.vcf.gz
# SNP-only output
bcftools view --types snps variants.vcf.gz -O z -o snps.vcf.gz
# Indel-only output
bcftools view --types indels variants.vcf.gz -O z -o indels.vcf.gz
# Filter by allele frequency and depth
bcftools filter -i 'AF>0.1 && DP>20 && MQ>40' variants.vcf.gz -O z -o confident.vcf.gz
# Flag SNPs within 3 bp of an indel (marked FILTER=SnpGap; chain `bcftools view -f PASS` to drop them)
bcftools filter --SnpGap 3 variants.vcf.gz -O z -o gapfiltered.vcf.gz
```
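When thresholds vary between projects, the filter expression can be built from shell variables. The sketch below is one way to do that; `MIN_QUAL`, `MIN_DP`, and `EXPR` are hypothetical variable names, and the bcftools call runs only if the tool and the example input file are actually present.

```bash
# Thresholds kept in variables so they can be changed in one place
MIN_QUAL=20
MIN_DP=10

# Double quotes let the shell expand the variables; bcftools receives the
# expanded string as a single argument.
EXPR="QUAL>${MIN_QUAL} && DP>${MIN_DP}"
echo "$EXPR"   # prints QUAL>20 && DP>10

# Apply the filter only when bcftools and the input file exist
if command -v bcftools >/dev/null 2>&1 && [ -f variants.vcf.gz ]; then
    bcftools filter -i "$EXPR" -O z -o filtered.vcf.gz variants.vcf.gz
fi
```

Passing `"$EXPR"` in double quotes matters: an unquoted expansion would be split on spaces and `&&` would reach bcftools as separate arguments.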
### Module 3: VCF Query and Extraction
Transform VCF content into tabular text for downstream analysis.
```bash
# Extract chrom, position, ref, alt, quality
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%QUAL\n' variants.vcf.gz > variants.txt
# With header row (-H adds #-prefixed column names)
bcftools query -H -f '%CHROM\t%POS\t%REF\t%ALT\t%QUAL\n' variants.vcf.gz > variants.tsv
# Per-sample genotypes and allele depths
bcftools query -f '[%SAMPLE\t%GT\t%AD\n]' variants.vcf.gz > genotypes.txt
# Output: sample1 0/1 25,18 (ref_depth,alt_depth)
```
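The `%AD` column holds comma-separated depths, so an allele fraction takes one more step. The sketch below post-processes the three-column output of the per-sample query above with awk; `alt_fraction` is a hypothetical helper name, and it reports the fraction for the first ALT allele only.

```bash
# alt_fraction: read "sample<TAB>GT<TAB>ref_depth,alt_depth[,...]" lines on
# stdin and append (first ALT depth) / (total depth), or NA when the depths
# are missing or sum to zero.
alt_fraction() {
    awk -F'\t' 'BEGIN { OFS = "\t" } {
        n = split($3, ad, ",")
        total = 0
        for (i = 1; i <= n; i++) total += ad[i]
        frac = (n >= 2 && total > 0) ? sprintf("%.3f", ad[2] / total) : "NA"
        print $1, $2, $3, frac
    }'
}

# Demo using the example line shown above (18 / 43 rounds to 0.419)
printf 'sample1\t0/1\t25,18\n' | alt_fraction
```

On real data the function can be appended to the query, for example `bcftools query -f '[%SAMPLE\t%GT\t%AD\n]' variants.vcf.gz | alt_fraction`.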
```bash
# Rare variants (AF < 1%)
bcftools query -i 'AF<0.01' -f '%CHROM\t%POS\t%REF\t%ALT\t%AF\n' \
variants.vcf.gz > rare_variants.txt
# Count variants per chromosome
bcftools query -f '%CHROM\n' variants.vcf.gz | sort | uniq -c | sort -rn
# Extract genotype matrix across all samples
bcftools query -H -f '%CHROM:%POS[\t%GT]\n' variants.vcf.gz > genotype_matrix.tsv
```
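Many downstream tools expect numeric dosages rather than genotype strings. The sketch below converts the GT columns of a genotype matrix like the one above into counts of non-reference alleles; `gt_to_dosage` is a hypothetical helper name, and treating every non-`0` allele as ALT (including alleles 2, 3, ... at multi-allelic sites) is an assumption of this sketch.

```bash
# gt_to_dosage: convert tab-separated GT columns (column 2 onward) to
# non-reference allele counts: 0/0 -> 0, 0/1 -> 1, 1/1 -> 2, missing -> NA.
# Header lines starting with "#" are passed through unchanged.
gt_to_dosage() {
    awk -F'\t' 'BEGIN { OFS = "\t" }
    /^#/ { print; next }
    {
        for (i = 2; i <= NF; i++) {
            if ($i == "") continue                 # tolerate a trailing tab
            if ($i ~ /\./) { $i = "NA"; continue } # any missing allele
            n = split($i, a, "[/|]")               # unphased or phased
            d = 0
            for (j = 1; j <= n; j++) if (a[j] != "0") d++
            $i = d
        }
        print
    }'
}

printf 'chr1:100\t0/0\t0/1\t1|1\t./.\n' | gt_to_dosage
```

The demo line yields `chr1:100`, `0`, `1`, `2`, `NA` as tab-separated columns. A real matrix can be piped through in the same way: `gt_to_dosage < genotype_matrix.tsv > dosage_matrix.tsv`.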
### Module 4: Multi-file Operations
Combine VCF files from multiple samples (merge) or chromosomes (concat).
```bash
# Merge: combine VCFs holding DIFFERENT samples into one multi-sample file (inputs must be bgzipped and indexed)
bcftools merge sample1.vcf.gz sample2.vcf.gz sample3.vcf.gz \
-O z -o cohort.vcf.gz
# Merge with auto-indexing and threading
bcftools merge -O b -W --threads 4 s|
```