
biopython-molecular-biology

Molecular biology toolkit: sequence manipulation, FASTA/GenBank/PDB I/O, NCBI Entrez, BLAST automation, pairwise/MSA alignment, Bio.PDB, phylogenetic trees. Use for batch processing, custom pipelines, format conversion, PubMed/GenBank queries. For quick gene lookups use gget; for multi-service REST APIs use bioservices.

Install in Claude Code
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/biopython-molecular-biology && cp -r /tmp/biopython-molecular-biology/skills/genomics-bioinformatics/biopython-molecular-biology ~/.claude/skills/biopython-molecular-biology
Then open a new Claude Code session; the skill loads automatically.

SKILL.md

# Biopython: Computational Molecular Biology Toolkit

## Overview

Biopython is the standard open-source Python library for computational molecular biology, providing modular APIs for sequence handling, biological file parsing, NCBI database access, BLAST searches, protein structure analysis, and phylogenetics. It supports Python 3 and requires NumPy.

## When to Use

- Parse and convert biological file formats (FASTA, GenBank, FASTQ, PDB, mmCIF, PHYLIP)
- Fetch sequences or publications from NCBI databases (GenBank, PubMed, Protein) programmatically
- Run and parse BLAST searches (remote NCBI or local BLAST+)
- Perform pairwise or multiple sequence alignments with custom scoring
- Analyze 3D protein structures — distances, angles, DSSP, superimposition
- Build and visualize phylogenetic trees from sequence alignments
- Calculate sequence statistics (GC content, molecular weight, melting temperature)
- Batch-process thousands of sequences with custom filtering logic
- Not a fit for every task: use `pysam` for reading SAM/BAM/CRAM alignment files and working with mapped reads, and `scikit-bio` for advanced ecological diversity metrics

## Prerequisites

- **Python packages**: `biopython` (1.80 or newer for `gc_fraction`), `numpy`, `matplotlib` (for tree visualization)
- **Data requirements**: Sequence files (FASTA, GenBank, FASTQ) or accession IDs for NCBI access
- **Environment**: Python 3.8+; NCBI Entrez expects a contact email set via `Entrez.email` (an API key is optional and raises the rate limit)

```bash
pip install biopython numpy matplotlib
```

## Quick Start

```python
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqUtils import gc_fraction

# Parse a FASTA file and compute basic statistics
records = list(SeqIO.parse("sequences.fasta", "fasta"))
print(f"Sequences loaded: {len(records)}")

seq = records[0].seq
print(f"ID: {records[0].id}")
print(f"Length: {len(seq)} bp")
print(f"GC content: {gc_fraction(seq)*100:.1f}%")
print(f"Reverse complement: {seq.reverse_complement()[:30]}...")
print(f"Protein translation: {seq.translate()[:10]}...")
```
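The Quick Start reads from `sequences.fasta`, which may not exist yet. As a minimal sketch that runs without any input file, the snippet below computes GC content and a rough melting temperature for an in-memory sequence. The 12-mer is a made-up example, and `Tm_Wallace` is only the simple 2/4 rule of thumb, not necessarily the method you would choose for real primer design.

```python
from Bio.Seq import Seq
from Bio.SeqUtils import gc_fraction
from Bio.SeqUtils import MeltingTemp as mt

# Hypothetical 12-mer used purely for illustration
primer = Seq("ATGCATGCATGC")

# gc_fraction returns a value between 0 and 1
gc_percent = gc_fraction(primer) * 100

# Wallace rule: 2 degrees per A/T plus 4 degrees per G/C
tm_wallace = mt.Tm_Wallace(primer)

print(f"GC content: {gc_percent:.1f}%")
print(f"Wallace Tm: {tm_wallace:.1f} C")
```

`Bio.SeqUtils.MeltingTemp` also offers nearest-neighbor estimates, which are generally preferred for longer oligos.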

## Core API

### Module 1: Sequence Objects (Bio.Seq)

Create and manipulate DNA, RNA, and protein sequences.

```python
from Bio.Seq import Seq

# Create sequence and perform standard operations
dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
print(f"Length: {len(dna)} bp")
print(f"Complement: {dna.complement()}")
print(f"Reverse complement: {dna.reverse_complement()}")
print(f"Transcription: {dna.transcribe()}")
print(f"Translation: {dna.translate()}")
print(f"Translation (to stop): {dna.translate(to_stop=True)}")
# Length: 39 bp
# Translation: MAIVMGR*KGAR*
# Translation (to stop): MAIVMGR
```

```python
from Bio.Seq import Seq

# Alternative genetic codes (e.g., mitochondrial)
mito_dna = Seq("ATGGCCATTGTAATGGGCCGCTGA")
std_protein = mito_dna.translate(table=1)      # Standard
mito_protein = mito_dna.translate(table=2)     # Vertebrate mitochondrial
print(f"Standard:      {std_protein}")
print(f"Mitochondrial: {mito_protein}")

# Find all start codons
coding_dna = Seq("ATGAAACCCATGGGGTTTAAATAG")
positions = [i for i in range(len(coding_dna) - 2) if coding_dna[i:i+3] == "ATG"]
print(f"ATG positions: {positions}")
# ATG positions: [0, 9]
```
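The operations above can be combined to translate a sequence in all six reading frames. This is an illustrative sketch only: `six_frame_translations` is a hypothetical helper name, not a Biopython function, and it trims each frame to whole codons before calling `translate()`.

```python
from Bio.Seq import Seq


def six_frame_translations(dna):
    """Return {frame_label: protein} for three forward and three reverse frames.

    Hypothetical helper for illustration; not part of Biopython.
    """
    frames = {}
    for strand, seq in (("+", dna), ("-", dna.reverse_complement())):
        for offset in range(3):
            sub = seq[offset:]
            # Trim trailing bases so the length is a multiple of three
            sub = sub[: len(sub) - len(sub) % 3]
            frames[f"{strand}{offset + 1}"] = str(sub.translate())
    return frames


dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
frames = six_frame_translations(dna)
for label, protein in frames.items():
    print(f"{label}: {protein}")
```

Frame `+1` reproduces the plain `dna.translate()` result shown earlier; the other five frames are useful when the coding strand or phase is unknown.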

### Module 2: Sequence I/O (Bio.SeqIO)

Read, write, and convert biological file formats.

```python
from Bio import SeqIO

# Parse FASTA file — returns SeqRecord iterator
records = list(SeqIO.parse("sequences.fasta", "fasta"))
print(f"Loaded {len(records)} sequences")
for rec in records[:3]:
    print(f"  {rec.id}: {len(rec.seq)} bp — {rec.description}")

# Parse GenBank — rich annotation access
for rec in SeqIO.parse("genome.gb", "genbank"):
    print(f"{rec.id}: {len(rec.features)} features, {len(rec.seq)} bp")
    for feat in rec.features[:5]:
        print(f"  {feat.type}: {feat.location}")

# Convert between formats
count = SeqIO.convert("input.gb", "genbank", "output.fasta", "fasta")
print(f"Converted {count} records: GenBank → FASTA")
```

```python
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq

# Write sequences to file
records = [
    SeqRecord(Seq("ATCGATCG"), id="seq1", description="test sequence 1"),
    SeqRecord(Seq("GCTAGCTA"), id="seq2", description="test sequence 2"),
]
count = SeqIO.write(records, "output.fasta", "fasta")
print(f"Wrote {count} records to output.fasta")

# Filter sequences by length (streaming — memory efficient)
long_seqs = (rec for rec in SeqIO.parse("large_file.fasta", "fasta") if len(rec.seq) >= 200)
count = SeqIO.write(long_seqs, "filtered.fasta", "fasta")
print(f"Kept {count} sequences >= 200 bp")

# Index large FASTA for random access
idx = SeqIO.index("large_file.fasta", "fasta")
print(f"Indexed {len(idx)} sequences")
rec = idx["target_sequence_id"]
print(f"Retrieved: {rec.id}, {len(rec.seq)} bp")
```
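`SeqIO.parse` and `SeqIO.write` accept open handles as well as file names, so the same filter-and-write pattern can be tried without touching disk. The sketch below uses `io.StringIO` with two made-up records; the record names and the 10 bp cutoff are assumptions chosen for the example.

```python
from io import StringIO

from Bio import SeqIO

# Two made-up FASTA records held in memory
fasta_text = (
    ">seq1 short example\n"
    "ATGCATGC\n"
    ">seq2 longer example\n"
    "ATGCATGCATGCATGC\n"
)

# Parse from a text handle instead of a file path
records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))

# Keep only records of at least 10 bp (arbitrary cutoff for the demo)
kept = [rec for rec in records if len(rec.seq) >= 10]

# Write the survivors to another in-memory handle
out = StringIO()
written = SeqIO.write(kept, out, "fasta")

print(f"Parsed {len(records)}, wrote {written}")
print(out.getvalue())
```

This is handy in unit tests or when sequence text arrives from a network response rather than a file.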

### Module 3: NCBI Database Access (Bio.Entrez)

Programmatic search and download from NCBI databases.

```python
from Bio import Entrez, SeqIO

Entrez.email = "your.email@example.com"
# Entrez.api_key = "your_key"  # Optional: 10 req/s instead of 3 req/s

# Search PubMed
handle = Entrez.esearch(db="pubmed", term="CRISPR Cas9 2024", retmax=5)
results = Entrez.read(handle)
handle.close()
print(f"Found {results['Count']} articles, retrieved {len(results['IdList'])} IDs")
print(f"IDs: {results['IdList']}")

# Fetch GenBank record by accession
handle = Entrez.efetch(db="nucleotide", id="EU490707", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()
print(f"{record.id}: {record.description}")
print(f"Length: {len(record.seq)} bp, Features: {len(record.features)}")
```
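When many IDs need to be fetched, one common approach is to request them in fixed-size batches with a pause between requests. The sketch below shows only that batching and pacing logic using the standard library. `fetch_in_batches` and `fake_fetch` are hypothetical names, and the stand-in callable would be replaced by a real `Entrez.efetch` or `Entrez.esummary` call in practice. The default delay of 0.34 s is an assumption matching the limit of roughly 3 requests per second without an API key.

```python
import time


def fetch_in_batches(ids, fetch, batch_size=5, delay=0.34):
    """Call fetch(batch) on successive slices of ids, pausing between calls.

    Hypothetical helper for illustration; not part of Biopython.
    """
    results = []
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        results.append(fetch(batch))
        # Sleep only if another batch is still to come
        if start + batch_size < len(ids):
            time.sleep(delay)
    return results


def fake_fetch(batch):
    # Stand-in for a real request such as
    #   Entrez.efetch(db="protein", id=",".join(batch), rettype="fasta", retmode="text")
    return ",".join(batch)


batches = fetch_in_batches(
    ["1", "2", "3", "4", "5", "6", "7"], fake_fetch, batch_size=3, delay=0.0
)
print(batches)
# ['1,2,3', '4,5,6', '7']
```

Passing the request as a callable keeps the pacing logic testable offline and lets the same helper wrap different Entrez endpoints.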

```python
from Bio import Entrez
import time

Entrez.email = "your.email@example.com"

# Batch download with rate limiting
handle = Entrez.esearch(db="protein", term="insulin[Protein Name] AND human[Organism]", retmax=20)
results = Entrez.read(handle)
handle.close()

# Fetch summaries in batch
ids = results["IdList"][:10]
handle = Entrez.esummary(db="protein", id=",".join(ids))
summaries = Entrez.read(handle)
handle.close()

```