Skill284 repo starsupdated 4d ago

biopython-sequence-analysis

This skill provides Biopython modules for sequence-centric bioinformatics workflows including parsing FASTA, FASTQ, GenBank, and GFF files; querying NCBI databases via Entrez; performing local and remote BLAST searches; conducting pairwise and multiple-sequence alignments; and building phylogenetic trees. Use it for gene family studies, phylogenomics, comparative genomics, and NCBI data retrieval pipelines that require format conversion and sequence analysis.

View source Repository: SciAgent-Skills

Install in Claude Code

Copy

git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/biopython-sequence-analysis && cp -r /tmp/biopython-sequence-analysis/skills/genomics-bioinformatics/biopython-sequence-analysis ~/.claude/skills/biopython-sequence-analysis

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Biopython: Sequence Analysis Toolkit

## Overview

Biopython provides a comprehensive suite of modules for sequence-centric bioinformatics: reading and writing every major biological file format (FASTA, FASTQ, GenBank, GFF), querying NCBI databases programmatically, running BLAST searches and parsing results, aligning sequences pairwise or in multiple-sequence alignments, and building and visualizing phylogenetic trees. This skill focuses on analysis workflows — from NCBI data retrieval through alignment to phylogenetic inference.

For PCR primer design, restriction enzyme digestion, cloning simulation, protein structure analysis (Bio.PDB), and molecular weight/Tm calculations, see **biopython-molecular-biology**.

## When to Use

- Download a gene family from NCBI Nucleotide/Protein, align sequences, and construct a phylogenetic tree
- Parse GenBank or GFF3 annotation files and extract CDS sequences for a set of features
- Run a BLAST search against NCBI `nt` or `nr`, filter significant hits, and fetch their full sequences
- Compute pairwise sequence identities or score alignments with BLOSUM62/PAM250 matrices
- Index a large multi-FASTA or FASTQ file with `SeqIO.index()` for random-access retrieval without loading all sequences into RAM
- Convert between sequence formats (FASTA ↔ GenBank ↔ FASTQ ↔ PHYLIP) in a single call
- Traverse, root, prune, and annotate a Newick or Nexus phylogenetic tree programmatically
- Use **pysam** instead when working with SAM/BAM/CRAM alignment files and mapped reads
- Use **scikit-bio** instead when you need ecological diversity metrics (UniFrac, beta diversity) or ordination methods such as PCoA
- Use **gget** instead for quick gene lookups and one-liner NCBI queries without writing an Entrez pipeline

## Prerequisites

- **Python packages**: `biopython`, `numpy`, `matplotlib`
- **Optional tools**: MUSCLE or ClustalW installed locally (for `Bio.Align.Applications` wrappers)
- **NCBI access**: Set `Entrez.email` before any E-utilities call; obtain a free API key at https://www.ncbi.nlm.nih.gov/account/ for 10 req/s (default is 3 req/s)
- **Local BLAST**: BLAST+ installed separately (`conda install -c bioconda blast`) for offline searches

> **Check before installing**: The tool may already be available in the current environment (e.g., inside a `pixi` / `conda` env). Run `command -v python` first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via `pixi run python` rather than bare `python`.

```bash
pip install biopython numpy matplotlib
conda install -c bioconda blast  # optional, for local BLAST
```

## Quick Start

```python
from Bio import SeqIO, Entrez
from Bio.SeqUtils import gc_fraction

# Fetch a GenBank record and display basic stats
Entrez.email = "your.email@example.com"
handle = Entrez.efetch(db="nucleotide", id="NM_007294", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()

print(f"ID: {record.id}")
print(f"Length: {len(record.seq)} bp")
print(f"GC content: {gc_fraction(record.seq)*100:.1f}%")
print(f"Features: {len(record.features)}")
print(f"First feature: {record.features[0].type} at {record.features[0].location}")
# ID: NM_007294.4
# Length: 7207 bp
# GC content: 47.3%
# Features: 9
```

## Core API

### Module 1: SeqIO — File Parsing and Format Conversion

`Bio.SeqIO` reads and writes every major sequence format (FASTA, FASTQ, GenBank, EMBL, PHYLIP, Nexus). `SeqIO.parse()` returns an iterator of `SeqRecord` objects; `SeqIO.index()` builds an on-disk or in-memory dictionary for large files.

```python
from Bio import SeqIO

# Parse FASTA and FASTQ
fasta_records = list(SeqIO.parse("sequences.fasta", "fasta"))
print(f"FASTA: {len(fasta_records)} sequences")

# FASTQ: access quality scores
for rec in SeqIO.parse("reads.fastq", "fastq"):
    quals = rec.letter_annotations["phred_quality"]
    avg_q = sum(quals) / len(quals)
    print(f"  {rec.id}: {len(rec.seq)} bp, mean Q={avg_q:.1f}")
    break  # show one example

# Parse GenBank with feature access
for rec in SeqIO.parse("chromosome.gb", "genbank"):
    cdss = [f for f in rec.features if f.type == "CDS"]
    print(f"{rec.id}: {len(cdss)} CDS features")
    for cds in cdss[:3]:
        gene = cds.qualifiers.get("gene", ["unknown"])[0]
        print(f"  {gene}: {cds.location}")
```

```python
from Bio import SeqIO

# SeqIO.index() for random access without loading all records
# Useful for large reference FASTA files (genomes, nr database subsets)
idx = SeqIO.index("large_genome.fasta", "fasta")
print(f"Index contains {len(idx)} sequences")

# Retrieve specific sequences by ID in O(1)
target = idx["chr1"]
print(f"chr1: {len(target.seq):,} bp")
region = target.seq[1_000_000:1_001_000]
print(f"Region [1M-1M+1kb]: {region[:60]}...")

# Format conversion: GenBank → FASTA in one call
n = SeqIO.convert("annotation.gb", "genbank", "sequences.fasta", "fasta")
print(f"Converted {n} records to FASTA")
```

### Module 2: Seq and SeqRecord — Sequence Objects and Feature Annotations

`Bio.Seq` represents a biological sequence with in-place operations. `Bio.SeqRecord` wraps a `Seq` with an ID, description, and a dictionary of feature annotations. Features use `FeatureLocation` with strand (+1, -1).

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation
from Bio.SeqUtils import gc_fraction, MeltingTemp

dna = Seq("ATGAAACCCGGGTTTTAA")
print(f"GC content: {gc_fraction(dna)*100:.1f}%")
print(f"Reverse complement: {dna.reverse_complement()}")
print(f"Transcript: {dna.transcribe()}")
print(f"Translation: {dna.translate()}")
print(f"Translation (to stop): {dna.translate(to_stop=True)}")

# Tm calculation for a short primer
primer = Seq("ATGAAACCCGGG")
tm = MeltingTemp.Tm_Wallace(primer)
print(f"Tm (Wallace): {tm:.1f}°C")
```

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLo