Skill · 199 repo stars · updated 16d ago

bcftools-variant-manipulation

CLI for VCF/BCF: filter, merge, annotate, query, normalize, compute stats. Core post-variant-calling: quality filtering, multi-sample merging, rsID annotation, genotype extraction. Samtools companion in HTSlib. Use GATK for complex indel realignment during calling; use VCFtools for population genetics stats.

Install in Claude Code
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/bcftools-variant-manipulation && cp -r /tmp/bcftools-variant-manipulation/skills/genomics-bioinformatics/variant/bcftools-variant-manipulation ~/.claude/skills/bcftools-variant-manipulation
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# bcftools — VCF/BCF Variant Manipulation Toolkit

## Overview

bcftools is the standard command-line toolkit for processing VCF (Variant Call Format) and BCF (Binary Call Format) files in the HTSlib ecosystem. It covers the complete post-variant-calling workflow: format conversion, quality filtering, variant normalization, multi-sample merging, annotation with external databases, genotype extraction, and QC statistics. bcftools uses streaming by design — most commands read from stdin and write to stdout, making it ideal for memory-efficient pipelines on large cohorts.

## When to Use

- Filtering variants by quality (QUAL, DP, AF) after variant calling
- Merging VCF files from multiple samples into a joint call set
- Adding rsIDs or gene annotations to variant calls
- Extracting specific fields (genotypes, allele depths) as tabular output
- Normalizing indel representations and splitting multi-allelic records
- Calling variants from pileup output (mpileup + call)
- Computing per-sample and overall VCF QC statistics
- Use `GATK HaplotypeCaller` instead when calling variants with local realignment in human samples
- Use `VCFtools` instead for population genetics statistics (Fst, LD, Hardy-Weinberg)
- Use `picard` instead for duplicate marking and library metrics; bcftools covers the variant (VCF/BCF) side of the HTSlib pipeline

## Prerequisites

- **Installation**: bcftools 1.17+ (part of HTSlib suite with samtools)
- **Input requirements**: VCF, bgzipped VCF, or BCF; region queries (`-r`) need a bgzipped VCF or BCF with an index (`.vcf.gz` plus `.vcf.gz.tbi` or `.vcf.gz.csi`)
- **Companion tools**: `samtools` for BAM processing; `tabix` for VCF indexing

> **Check before installing**: The tool may already be available in the current environment (e.g., inside a `pixi` / `conda` env). Run `command -v bcftools` first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via `pixi run bcftools` rather than bare `bcftools`.

```bash
# Bioconda (recommended — installs HTSlib suite)
conda install -c bioconda bcftools

# Homebrew (macOS)
brew install bcftools

# Verify
bcftools --version | head -1
# bcftools 1.20

# Index a VCF for region queries
bcftools index -t variants.vcf.gz   # creates .tbi
bcftools index -c variants.vcf.gz   # creates .csi (for chromosomes > 512 Mb)
```
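
The environment check described in the note above the install commands can be sketched as a small shell helper. This is a minimal sketch, not part of bcftools: `bcftools_cmd` is a hypothetical function name, and recognizing a pixi project by a `pixi.toml` file in the current directory is an assumption.

```bash
# Hypothetical helper: print the command that should be used to run bcftools.
bcftools_cmd() {
  # Assumption: a pixi project is recognized by pixi.toml in the current directory.
  if [ -f pixi.toml ] && command -v pixi >/dev/null 2>&1; then
    echo "pixi run bcftools"
  elif command -v bcftools >/dev/null 2>&1; then
    echo "bcftools"
  else
    echo "bcftools not found; install it first" >&2
    return 1
  fi
}

# Report the result without aborting when bcftools is missing.
if cmd=$(bcftools_cmd); then
  echo "invoke as: $cmd"
fi
```

Checking for the pixi project first mirrors the note's advice to prefer `pixi run bcftools` over a bare `bcftools` inside such a project.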

## Quick Start

```bash
# Typical post-calling workflow: normalize → filter → annotate → extract
bcftools norm -d any -f reference.fa variants.vcf.gz \
  | bcftools filter -i 'QUAL>20 && DP>10' \
  | bcftools annotate -a dbSNP.vcf.gz -c ID \
  | bcftools view -O z -o final.vcf.gz

# Index the output
bcftools index -t final.vcf.gz

# Summary numbers (records, SNPs, indels, samples) for the final file
bcftools stats final.vcf.gz | grep "^SN"
```
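
To compare record counts between pipeline stages, rather than only summarizing the final file, one option is a small counter for text VCF on stdin. `count_records` is a hypothetical helper name, and the inline record is made-up demo data.

```bash
# Hypothetical helper: count data records (non-header lines) in text VCF on stdin.
count_records() {
  # grep -c exits with status 1 when the count is 0; mask that for `set -e` scripts
  grep -vc '^#' || true
}

# Demo on an inline one-record VCF (made-up data)
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\nchr1\t100\t.\tA\tG\t50\t.\t.\n' \
  | count_records
```

Against real files the helper can follow any stage, for example `bcftools view filtered.vcf.gz | count_records`; `bcftools view -H file.vcf.gz | wc -l` gives the same number without a helper.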

## Core API

### Module 1: VCF/BCF I/O and Format Conversion

Convert between text VCF and binary BCF; compress and index for random access.

```bash
# VCF → compressed BCF (compact on disk; use -O u, uncompressed BCF, when piping between bcftools commands)
bcftools view -O b -o variants.bcf variants.vcf

# BCF → VCF (for human-readable output)
bcftools view -O v -o variants.vcf variants.bcf

# VCF → bgzipped + indexed (standard archive format)
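bcftools view -O z -W -o variants.vcf.gz variants.vcf
# -W (--write-index) indexes the output after writing; the default index format is CSI (.csi).
# Use --write-index=tbi for a .tbi index. This option is not available in bcftools 1.17.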
```

```bash
# Extract specific samples
bcftools view -s sample1,sample2 -O z -o subset.vcf.gz variants.vcf.gz

# Exclude samples (prefix with ^)
bcftools view -s ^outlier_sample -O z -o cleaned.vcf.gz variants.vcf.gz

# Extract by region (fast; requires index)
bcftools view -r chr1:1000000-2000000 variants.vcf.gz -O v -o chr1_region.vcf

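# Streaming pipeline: no intermediate files
# (bcftools mpileup produces the genotype likelihoods; samtools mpileup no longer emits VCF/BCF)
# -v keeps variant sites only
bcftools mpileup -Ou -f reference.fa input.bam | bcftools call -mv -Oz -o calls.vcf.gz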
```
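
Because region extraction depends on an index, a script may want to verify one exists before calling `bcftools view -r`. The following is a minimal sketch under stated assumptions: `ensure_index` is a hypothetical helper, it only checks that an index file is present (not that it is newer than the data), and the demo uses empty placeholder files.

```bash
# Hypothetical helper: build an index only when neither .tbi nor .csi exists.
ensure_index() {
  vcf=$1
  if [ -f "$vcf.tbi" ] || [ -f "$vcf.csi" ]; then
    return 0
  fi
  # Default index format is CSI; works for bgzipped VCF and BCF
  bcftools index "$vcf"
}

# Demo with placeholder files: the index already exists, so bcftools is not run.
demo_dir=$(mktemp -d)
: > "$demo_dir/demo.vcf.gz"
: > "$demo_dir/demo.vcf.gz.csi"
ensure_index "$demo_dir/demo.vcf.gz" && echo "index present"
```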

### Module 2: Variant Filtering

Apply quality thresholds and FILTER-column filters to retain high-confidence calls.

```bash
# Expression-based filter (include)
bcftools filter -i 'QUAL>20 && DP>10' variants.vcf.gz -O z -o filtered.vcf.gz

# Expression-based filter (exclude)
bcftools filter -e 'QUAL<10 || DP<5' variants.vcf.gz -O v -o filtered.vcf

# Soft filter: mark but keep (sets FILTER field to label)
bcftools filter -s LowQual -e 'QUAL<20' variants.vcf.gz -O z -o soft_filtered.vcf.gz
# Variants with QUAL<20 get FILTER="LowQual"; others get FILTER=PASS
```

```bash
# Keep only PASS variants
bcftools view -f PASS variants.vcf.gz -O z -o pass_only.vcf.gz

# SNP-only output (-v is the short form of --types)
bcftools view --types snps variants.vcf.gz -O z -o snps.vcf.gz

# Indel-only output
bcftools view --types indels variants.vcf.gz -O z -o indels.vcf.gz

# Filter by allele frequency, depth, and mapping quality (the INFO tags AF, DP, and MQ must be present)
bcftools filter -i 'AF>0.1 && DP>20 && MQ>40' variants.vcf.gz -O z -o confident.vcf.gz

# Remove SNPs within 3 bp of indels
bcftools filter --SnpGap 3 variants.vcf.gz -O z -o gapfiltered.vcf.gz
```
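
The include (`-i`) and exclude (`-e`) expressions used above are not complements of each other, which is easy to see on a toy file. This sketch writes a three-record VCF with made-up values to a temporary directory and, only if `bcftools` is on the PATH, counts the records each filter keeps. The file name and all values are illustrative assumptions.

```bash
# Toy three-record VCF (made-up values) to compare include vs exclude filters.
tmpdir=$(mktemp -d)
toy="$tmpdir/toy.vcf"
{
  printf '##fileformat=VCFv4.2\n'
  printf '##contig=<ID=chr1,length=1000>\n'
  printf '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t100\t.\tA\tG\t50\t.\tDP=30\n'
  printf 'chr1\t200\t.\tC\tT\t15\t.\tDP=30\n'
  printf 'chr1\t300\t.\tG\tA\t50\t.\tDP=4\n'
} > "$toy"

if command -v bcftools >/dev/null 2>&1; then
  # -i keeps records where the expression is true: only chr1:100 qualifies
  bcftools filter -i 'QUAL>20 && DP>10' "$toy" | grep -vc '^#'
  # -e drops records where the expression is true: only chr1:300 (DP=4) is removed
  bcftools filter -e 'QUAL<10 || DP<5' "$toy" | grep -vc '^#'
else
  echo "bcftools not on PATH; skipping the filter step" >&2
fi
```

The record at chr1:200 (QUAL 15) fails the include thresholds but is not caught by the looser exclude thresholds, so the two commands keep different record sets.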

### Module 3: VCF Query and Extraction

Transform VCF content into tabular text for downstream analysis.

```bash
# Extract chrom, position, ref, alt, quality
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%QUAL\n' variants.vcf.gz > variants.txt

# With header row (-H adds #-prefixed column names)
bcftools query -H -f '%CHROM\t%POS\t%REF\t%ALT\t%QUAL\n' variants.vcf.gz > variants.tsv

# Per-sample genotypes and allele depths
bcftools query -f '[%SAMPLE\t%GT\t%AD\n]' variants.vcf.gz > genotypes.txt
# Output: sample1  0/1  25,18  (ref_depth,alt_depth)
```
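
The per-sample `%AD` column can be post-processed directly in the shell. As a minimal sketch, the hypothetical helper `alt_fraction` below reads `sample<TAB>GT<TAB>AD` lines (the format produced by the last query above) and replaces the depth pair with the fraction of reads that support non-reference alleles. The helper name and the three-decimal rounding are assumptions for illustration.

```bash
# Hypothetical helper: turn "sample<TAB>GT<TAB>AD" lines into "sample<TAB>GT<TAB>alt_fraction".
alt_fraction() {
  awk -F'\t' '{
    n = split($3, ad, ",")
    total = 0
    for (i = 1; i <= n; i++) total += ad[i]
    # Skip missing AD (".") and zero-depth sites to avoid dividing by zero
    if (n < 2 || total == 0) next
    printf "%s\t%s\t%.3f\n", $1, $2, (total - ad[1]) / total
  }'
}

# Demo on the example line shown above (25 ref reads, 18 alt reads)
printf 'sample1\t0/1\t25,18\n' | alt_fraction
# prints: sample1  0/1  0.419   (tab-separated; 18 / 43)
```

With real data the helper sits at the end of the pipe: `bcftools query -f '[%SAMPLE\t%GT\t%AD\n]' variants.vcf.gz | alt_fraction`.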

```bash
# Rare variants (AF < 1%)
bcftools query -i 'AF<0.01' -f '%CHROM\t%POS\t%REF\t%ALT\t%AF\n' \
    variants.vcf.gz > rare_variants.txt

# Count variants per chromosome
bcftools query -f '%CHROM\n' variants.vcf.gz | sort | uniq -c | sort -rn

# Extract genotype matrix across all samples
bcftools query -H -f '%CHROM:%POS[\t%GT]\n' variants.vcf.gz > genotype_matrix.tsv
```

### Module 4: Multi-file Operations

Combine VCF files from multiple samples (merge) or chromosomes (concat).

```bash
# Merge: combine VCFs holding DIFFERENT samples into one multi-sample file (inputs must be bgzipped and indexed)
bcftools merge sample1.vcf.gz sample2.vcf.gz sample3.vcf.gz \
    -O z -o cohort.vcf.gz

# Merge with auto-indexing and threading
bcftools merge -O b -W --threads 4 s