biopython-molecular-biology
Molecular biology toolkit: sequence manipulation, FASTA/GenBank/PDB I/O, NCBI Entrez, BLAST automation, pairwise/MSA alignment, Bio.PDB, phylogenetic trees. Use for batch processing, custom pipelines, format conversion, PubMed/GenBank queries. For quick gene lookups use gget; for multi-service REST APIs use bioservices.
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/biopython-molecular-biology && cp -r /tmp/biopython-molecular-biology/skills/genomics-bioinformatics/biopython-molecular-biology ~/.claude/skills/biopython-molecular-biologySKILL.md
# Biopython: Computational Molecular Biology Toolkit
## Overview
Biopython is the standard open-source Python library for computational molecular biology, providing modular APIs for sequence handling, biological file parsing, NCBI database access, BLAST searches, protein structure analysis, and phylogenetics. It supports Python 3 and requires NumPy.
## When to Use
- Parse and convert biological file formats (FASTA, GenBank, FASTQ, PDB, mmCIF, PHYLIP)
- Fetch sequences or publications from NCBI databases (GenBank, PubMed, Protein) programmatically
- Run and parse BLAST searches (remote NCBI or local BLAST+)
- Perform pairwise or multiple sequence alignments with custom scoring
- Analyze 3D protein structures — distances, angles, DSSP, superimposition
- Build and visualize phylogenetic trees from sequence alignments
- Calculate sequence statistics (GC content, molecular weight, melting temperature)
- Batch-process thousands of sequences with custom filtering logic
- Use `pysam` instead for reading SAM/BAM/CRAM alignment files and working with mapped reads; use `scikit-bio` instead for advanced ecological diversity metrics
## Prerequisites
- **Python packages**: `biopython`, `numpy`, `matplotlib` (for tree visualization)
- **Data requirements**: Sequence files (FASTA, GenBank, FASTQ) or accession IDs for NCBI access
- **Environment**: Python 3.8+; NCBI Entrez requires email registration
```bash
pip install biopython numpy matplotlib
```
## Quick Start
```python
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqUtils import gc_fraction
# Parse a FASTA file and compute basic statistics
records = list(SeqIO.parse("sequences.fasta", "fasta"))
print(f"Sequences loaded: {len(records)}")
seq = records[0].seq
print(f"ID: {records[0].id}")
print(f"Length: {len(seq)} bp")
print(f"GC content: {gc_fraction(seq)*100:.1f}%")
print(f"Reverse complement: {seq.reverse_complement()[:30]}...")
print(f"Protein translation: {seq.translate()[:10]}...")
```
## Core API
### Module 1: Sequence Objects (Bio.Seq)
Create and manipulate DNA, RNA, and protein sequences.
```python
from Bio.Seq import Seq
# Create sequence and perform standard operations
dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
print(f"Length: {len(dna)} bp")
print(f"Complement: {dna.complement()}")
print(f"Reverse complement: {dna.reverse_complement()}")
print(f"Transcription: {dna.transcribe()}")
print(f"Translation: {dna.translate()}")
print(f"Translation (to stop): {dna.translate(to_stop=True)}")
# Length: 39 bp
# Translation: MAIVMGR*KGAR*
# Translation (to stop): MAIVMGR
```
```python
from Bio.Seq import Seq
# Alternative genetic codes (e.g., mitochondrial)
mito_dna = Seq("ATGGCCATTGTAATGGGCCGCTGA")
std_protein = mito_dna.translate(table=1) # Standard
mito_protein = mito_dna.translate(table=2) # Vertebrate mitochondrial
print(f"Standard: {std_protein}")
print(f"Mitochondrial: {mito_protein}")
# Find all start codons
coding_dna = Seq("ATGAAACCCATGGGGTTTAAATAG")
positions = [i for i in range(len(coding_dna) - 2) if coding_dna[i:i+3] == "ATG"]
print(f"ATG positions: {positions}")
# ATG positions: [0, 9]
```
### Module 2: Sequence I/O (Bio.SeqIO)
Read, write, and convert biological file formats.
```python
from Bio import SeqIO
# Parse FASTA file — returns SeqRecord iterator
records = list(SeqIO.parse("sequences.fasta", "fasta"))
print(f"Loaded {len(records)} sequences")
for rec in records[:3]:
print(f" {rec.id}: {len(rec.seq)} bp — {rec.description}")
# Parse GenBank — rich annotation access
for rec in SeqIO.parse("genome.gb", "genbank"):
print(f"{rec.id}: {len(rec.features)} features, {len(rec.seq)} bp")
for feat in rec.features[:5]:
print(f" {feat.type}: {feat.location}")
# Convert between formats
count = SeqIO.convert("input.gb", "genbank", "output.fasta", "fasta")
print(f"Converted {count} records: GenBank → FASTA")
```
```python
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
# Write sequences to file
records = [
SeqRecord(Seq("ATCGATCG"), id="seq1", description="test sequence 1"),
SeqRecord(Seq("GCTAGCTA"), id="seq2", description="test sequence 2"),
]
count = SeqIO.write(records, "output.fasta", "fasta")
print(f"Wrote {count} records to output.fasta")
# Filter sequences by length (streaming — memory efficient)
long_seqs = (rec for rec in SeqIO.parse("large_file.fasta", "fasta") if len(rec.seq) >= 200)
count = SeqIO.write(long_seqs, "filtered.fasta", "fasta")
print(f"Kept {count} sequences >= 200 bp")
# Index large FASTA for random access
idx = SeqIO.index("large_file.fasta", "fasta")
print(f"Indexed {len(idx)} sequences")
rec = idx["target_sequence_id"]
print(f"Retrieved: {rec.id}, {len(rec.seq)} bp")
```
### Module 3: NCBI Database Access (Bio.Entrez)
Programmatic search and download from NCBI databases.
```python
from Bio import Entrez, SeqIO
Entrez.email = "your.email@example.com"
# Entrez.api_key = "your_key" # Optional: 10 req/s instead of 3 req/s
# Search PubMed
handle = Entrez.esearch(db="pubmed", term="CRISPR Cas9 2024", retmax=5)
results = Entrez.read(handle)
handle.close()
print(f"Found {results['Count']} articles, retrieved {len(results['IdList'])} IDs")
print(f"IDs: {results['IdList']}")
# Fetch GenBank record by accession
handle = Entrez.efetch(db="nucleotide", id="EU490707", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()
print(f"{record.id}: {record.description}")
print(f"Length: {len(record.seq)} bp, Features: {len(record.features)}")
```
```python
from Bio import Entrez
import time
Entrez.email = "your.email@example.com"
# Batch download with rate limiting
handle = Entrez.esearch(db="protein", term="insulin[Protein Name] AND human[Organism]", retmax=20)
results = Entrez.read(handle)
handle.close()
# Fetch summaries in batch
ids = results["IdList"][:10]
handle = Entrez.esummary(db="protein", id=",".join(ids))
summaries = Entrez.read(handle)
handle.close()
for|
Opentrons Protocol API v2 for OT-2/Flex: Python protocols for pipetting, serial dilutions, PCR, plate replication; control thermocycler, heater-shaker, magnetic, temperature modules. Use pylabrobot for multi-vendor.
Interactive visualization with Plotly. 40+ chart types (scatter, line, heatmap, 3D, geographic) with hover, zoom, pan. Two APIs: Plotly Express (DataFrame) and Graph Objects (fine control). For static publication figures use matplotlib; for statistical grammar use seaborn.
Statistical visualization on matplotlib + pandas. Distributions (histplot, kdeplot, violin, box), relational (scatter, line), categorical, regression, correlation heatmaps. Auto aggregation/CIs. Use plotly for interactive; matplotlib for low-level.
Best practices for single-cell RNA-seq cell type annotation including marker-based, reference-based, and automated classification approaches.
Bayesian modeling with PyMC 5: priors, likelihood, NUTS/ADVI sampling, diagnostics (R-hat, ESS), LOO/WAIC comparison, prediction. Hierarchical, logistic, GP variants; predictive checks.
Time-to-event modeling with scikit-survival: Cox PH (elastic net), Random Survival Forests, Boosting, SVMs for censored data. C-index, Brier, time-dependent AUC; Kaplan-Meier, Nelson-Aalen, competing risks. Pipeline/GridSearchCV compatible. Use statsmodels for frequentist, pymc for Bayesian, lifelines for parametric.
>-