Skip to main content
ClaudeWave
Skill199 estrellas del repoactualizado 16d ago

pysam-genomic-files

Read/write SAM/BAM/CRAM, VCF/BCF, FASTA/FASTQ. Region queries, pileup, variant filtering, read groups. Python htslib wrapper exposing samtools/bcftools CLI. Use STAR/BWA for alignment; GATK/DeepVariant for variant calling.

Instalar en Claude Code
Copiar
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/pysam-genomic-files && cp -r /tmp/pysam-genomic-files/skills/genomics-bioinformatics/alignment/pysam-genomic-files ~/.claude/skills/pysam-genomic-files
Después abre una sesión nueva de Claude Code; el skill carga automáticamente.

SKILL.md

# Pysam — Genomic File Toolkit

## Overview

Pysam provides a Pythonic interface to htslib for reading, manipulating, and writing genomic data files. It handles SAM/BAM/CRAM alignments, VCF/BCF variants, and FASTA/FASTQ sequences with efficient region-based random access. Also exposes samtools and bcftools as callable Python functions.

## When to Use

- Reading and querying BAM/CRAM alignment files (region extraction, read filtering)
- Analyzing VCF/BCF variant files (genotype access, variant filtering, annotation)
- Extracting reference sequences from indexed FASTA files
- Calculating per-base coverage and pileup statistics
- Building custom bioinformatics pipelines that combine alignment + variant + sequence data
- Quality control of NGS data (mapping quality, flag filtering, coverage)
- For **alignment from FASTQ** (read mapping), use STAR, BWA, or minimap2 instead
- For **variant calling from BAM**, use GATK or DeepVariant instead

## Prerequisites

```bash
pip install pysam
```

**Note**: Requires htslib C library (bundled with pip install on most platforms). On some Linux systems, may need `libhts-dev` or equivalent. Index files (`.bai`, `.tbi`, `.fai`) required for random access — create with `pysam.index()`, `pysam.tabix_index()`, or `pysam.faidx()`.

## Quick Start

```python
import pysam

# Read BAM file, fetch reads in a region
with pysam.AlignmentFile("sample.bam", "rb") as bam:
    for read in bam.fetch("chr1", 1000, 2000):
        print(f"{read.query_name}: pos={read.reference_start}, mapq={read.mapping_quality}")
    print(f"Total reads in region: {bam.count('chr1', 1000, 2000)}")
```

## Core API

### 1. Alignment Files (SAM/BAM/CRAM)

Read, query, and write aligned sequencing reads.

```python
import pysam

# Open BAM (binary) or SAM (text) file
bam = pysam.AlignmentFile("sample.bam", "rb")  # rb=read BAM, r=read SAM, rc=read CRAM

# Fetch reads overlapping a region (requires .bai index)
for read in bam.fetch("chr1", 10000, 20000):
    print(f"Name: {read.query_name}")
    print(f"  Position: {read.reference_start}-{read.reference_end}")
    print(f"  MAPQ: {read.mapping_quality}")
    print(f"  CIGAR: {read.cigarstring}")
    print(f"  Sequence: {read.query_sequence[:30]}...")
    break

# Count reads in region (fast, no iteration needed)
n_reads = bam.count("chr1", 10000, 20000)
print(f"Reads in region: {n_reads}")

# Filter reads by quality and flags
for read in bam.fetch("chr1", 10000, 20000):
    if read.mapping_quality >= 30 and not read.is_unmapped and not read.is_duplicate:
        pass  # Process high-quality, mapped, non-duplicate reads

bam.close()
```

```python
# Write filtered reads to a new BAM file
with pysam.AlignmentFile("input.bam", "rb") as inbam:
    with pysam.AlignmentFile("filtered.bam", "wb", header=inbam.header) as outbam:
        for read in inbam.fetch("chr1", 10000, 20000):
            if read.mapping_quality >= 30:
                outbam.write(read)

# Index the output
pysam.index("filtered.bam")
print("Created filtered.bam + filtered.bam.bai")
```

### 2. Coverage and Pileup Analysis

Calculate per-base coverage statistics.

```python
import pysam
import numpy as np

bam = pysam.AlignmentFile("sample.bam", "rb")

# Pileup: per-base coverage with read-level detail
for pileup_col in bam.pileup("chr1", 10000, 10100, min_mapping_quality=30):
    bases = [p.alignment.query_sequence[p.query_position]
             for p in pileup_col.pileups if not p.is_del and p.query_position is not None]
    print(f"Pos {pileup_col.reference_pos}: depth={pileup_col.nsegments}, bases={''.join(bases[:5])}")

# Quick coverage count per region (faster than pileup)
coverage = bam.count_coverage("chr1", 10000, 10100, quality_threshold=20)
# Returns tuple of 4 arrays (A, C, G, T counts per position)
total_cov = np.array(coverage).sum(axis=0)
print(f"Mean coverage: {total_cov.mean():.1f}x")

bam.close()
```

### 3. Variant Files (VCF/BCF)

Read, query, and filter genetic variants.

```python
import pysam

# Open VCF/BCF file
vcf = pysam.VariantFile("variants.vcf.gz")

# Iterate all variants
for record in vcf.fetch("chr1", 10000, 50000):
    print(f"{record.chrom}:{record.pos} {record.ref}>{','.join(record.alts or [])}")
    print(f"  QUAL={record.qual}, FILTER={list(record.filter)}")
    print(f"  INFO: {dict(record.info)}")

    # Access genotypes per sample
    for sample in record.samples:
        gt = record.samples[sample]["GT"]
        print(f"  {sample}: GT={gt}")
    break

vcf.close()
```

```python
# Filter variants and write to new VCF
with pysam.VariantFile("variants.vcf.gz") as vcf_in:
    with pysam.VariantFile("filtered.vcf.gz", "wz", header=vcf_in.header) as vcf_out:
        for record in vcf_in:
            if record.qual and record.qual >= 30 and "PASS" in record.filter:
                vcf_out.write(record)

pysam.tabix_index("filtered.vcf.gz", preset="vcf")
print("Created filtered.vcf.gz + filtered.vcf.gz.tbi")
```

### 4. Sequence Files (FASTA/FASTQ)

Random access to reference sequences and sequential reading of raw reads.

```python
import pysam

# FASTA: random access (requires .fai index)
fasta = pysam.FastaFile("reference.fasta")
seq = fasta.fetch("chr1", 10000, 10050)
print(f"Sequence ({len(seq)} bp): {seq}")
print(f"Available contigs: {fasta.references[:5]}")
print(f"Contig lengths: {dict(zip(fasta.references[:3], fasta.lengths[:3]))}")
fasta.close()

# Create FASTA index if needed
# pysam.faidx("reference.fasta")
```

```python
# FASTQ: sequential reading
with pysam.FastxFile("reads.fastq.gz") as fq:
    for i, entry in enumerate(fq):
        print(f"Read {entry.name}: {len(entry.sequence)} bp, mean_qual={sum(entry.get_quality_array())/len(entry.sequence):.1f}")
        if i >= 2:
            break
```

### 5. Read Groups and Sample Information

Extract and filter reads by read group (essential for multi-sample BAM files).

```python
import pysam

bam = pysam.AlignmentFile("multisample.bam", "rb")

# Access read group information from BAM
sciagent-skill-creatorSkill

|

opentrons-integrationSkill

Opentrons Protocol API v2 for OT-2/Flex: Python protocols for pipetting, serial dilutions, PCR, plate replication; control thermocycler, heater-shaker, magnetic, temperature modules. Use pylabrobot for multi-vendor.

plotly-interactive-visualizationSkill

Interactive visualization with Plotly. 40+ chart types (scatter, line, heatmap, 3D, geographic) with hover, zoom, pan. Two APIs: Plotly Express (DataFrame) and Graph Objects (fine control). For static publication figures use matplotlib; for statistical grammar use seaborn.

seaborn-statistical-visualizationSkill

Statistical visualization on matplotlib + pandas. Distributions (histplot, kdeplot, violin, box), relational (scatter, line), categorical, regression, correlation heatmaps. Auto aggregation/CIs. Use plotly for interactive; matplotlib for low-level.

single-cell-annotationSkill

Best practices for single-cell RNA-seq cell type annotation including marker-based, reference-based, and automated classification approaches.

pymc-bayesian-modelingSkill

Bayesian modeling with PyMC 5: priors, likelihood, NUTS/ADVI sampling, diagnostics (R-hat, ESS), LOO/WAIC comparison, prediction. Hierarchical, logistic, GP variants; predictive checks.

scikit-survival-analysisSkill

Time-to-event modeling with scikit-survival: Cox PH (elastic net), Random Survival Forests, Boosting, SVMs for censored data. C-index, Brier, time-dependent AUC; Kaplan-Meier, Nelson-Aalen, competing risks. Pipeline/GridSearchCV compatible. Use statsmodels for frequentist, pymc for Bayesian, lifelines for parametric.

statistical-analysisSkill

>-