bio-batch-processing
The bio-batch-processing skill provides Python patterns for efficiently processing multiple biological sequence files with Biopython's SeqIO module. Use it when you need to iterate over directories of FASTA, FASTQ, or GenBank files, merge sequences across files while tracking their sources, or automate repetitive sequence operations at scale.
git clone --depth 1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills /tmp/bio-batch-processing && cp -r /tmp/bio-batch-processing/skills/bio-batch-processing ~/.claude/skills/bio-batch-processing
## Version Compatibility
Reference examples tested with: Biopython 1.83+
Before using these patterns, check that the installed version matches. If it differs:
- Python: run `pip show <package>`, then `help(module.function)` to check signatures

If code raises ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to the actual API rather than retrying.
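The version check described above can also be scripted. This is a minimal sketch, assuming the package is installed under the PyPI distribution name `biopython`; the helpers `installed_version` and `meets_minimum` are illustrative names, not part of Biopython:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


def meets_minimum(found, minimum):
    """Compare dotted numeric versions, e.g. '1.84' against '1.83'."""
    def as_tuple(text):
        # Keep only the leading numeric components; stop at suffixes such as 'dev0'.
        parts = []
        for piece in text.split('.'):
            if not piece.isdigit():
                break
            parts.append(int(piece))
        return tuple(parts)
    return as_tuple(found) >= as_tuple(minimum)


found = installed_version('biopython')  # assumed distribution name
if found is None:
    print('Biopython is not installed')
elif not meets_minimum(found, '1.83'):
    print(f'Biopython {found} is older than the tested 1.83; check signatures first')
else:
    print(f'Biopython {found} OK')
```

The comparison is deliberately simple and only handles plain dotted numbers, which covers Biopython's usual `1.NN` scheme.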
# Batch Processing
**"Process all my sequence files in a directory"** → Iterate, merge, split, convert, and generate summary statistics across multiple sequence files.
- Python: `SeqIO.parse()`, `Path.glob()` (Biopython, pathlib)
Process multiple sequence files efficiently using Biopython.
## Required Imports
```python
from pathlib import Path
from Bio import SeqIO
```
## Process Multiple Files
### Iterate Over Files in Directory
```python
from pathlib import Path

for fasta_file in Path('data/').glob('*.fasta'):
    records = list(SeqIO.parse(fasta_file, 'fasta'))
    print(f'{fasta_file.name}: {len(records)} sequences')
```
### Process All FASTQ Files
```python
for fq_file in Path('.').glob('*.fastq'):
    count = sum(1 for _ in SeqIO.parse(fq_file, 'fastq'))
    print(f'{fq_file.name}: {count} reads')
```
### Recursive File Search
```python
for gb_file in Path('data/').rglob('*.gb'):
    print(f'Found: {gb_file}')
```
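Directories often mix extensions such as `.fa`, `.fq`, and `.gbk`. The following is a minimal sketch of one way to pair each file with a `SeqIO` format name; the `SUFFIX_TO_FORMAT` table and the `find_sequence_files` helper are illustrative assumptions, not part of Biopython:

```python
from pathlib import Path

# Assumed suffix-to-format table; extend it to match your own naming habits.
SUFFIX_TO_FORMAT = {
    '.fasta': 'fasta', '.fa': 'fasta', '.fna': 'fasta',
    '.fastq': 'fastq', '.fq': 'fastq',
    '.gb': 'genbank', '.gbk': 'genbank',
}


def find_sequence_files(directory, recursive=False):
    """Yield (path, format_name) pairs for files with a known suffix, sorted by path."""
    root = Path(directory)
    candidates = root.rglob('*') if recursive else root.glob('*')
    for path in sorted(candidates):
        # Suffix matching is case-insensitive, so 'reads.FASTQ' is recognised too.
        fmt = SUFFIX_TO_FORMAT.get(path.suffix.lower())
        if path.is_file() and fmt is not None:
            yield path, fmt
```

Each `(path, fmt)` pair can then be passed to `SeqIO.parse(path, fmt)`.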
## Merge Files
### Merge All FASTA Files
```python
from pathlib import Path

def all_records(directory, pattern, fmt):
    # Stream records file by file; sorting keeps the merge order reproducible
    for filepath in sorted(Path(directory).glob(pattern)):
        yield from SeqIO.parse(filepath, fmt)

records = all_records('data/', '*.fasta', 'fasta')
count = SeqIO.write(records, 'merged.fasta', 'fasta')
print(f'Merged {count} records')
```
### Merge with Source Tracking
**Goal:** Combine sequences from multiple files into one, tagging each record with its source filename.
**Approach:** Stream records from each file through a generator that appends source metadata to the description.
**Reference (Biopython 1.83+):**
```python
def records_with_source(directory, pattern, fmt):
    for filepath in sorted(Path(directory).glob(pattern)):
        for record in SeqIO.parse(filepath, fmt):
            # Append the source filename so each record stays traceable after merging
            record.description = f'{record.description} [source={filepath.name}]'
            yield record

records = records_with_source('data/', '*.fasta', 'fasta')
SeqIO.write(records, 'merged_tracked.fasta', 'fasta')
```
### Merge Specific Files
```python
files = ['sample1.fasta', 'sample2.fasta', 'sample3.fasta']

def merge_files(file_list, fmt):
    # Yield records file by file, in the order the files are listed
    for filepath in file_list:
        yield from SeqIO.parse(filepath, fmt)

SeqIO.write(merge_files(files, 'fasta'), 'combined.fasta', 'fasta')
```
## Split Files
### Split by Number of Records
**Goal:** Divide a large sequence file into smaller chunks of N records each.
**Approach:** Consume the iterator in fixed-size batches using `islice`, writing each batch to a numbered output file.
**Reference (Biopython 1.83+):**
```python
from itertools import islice

def split_file(input_file, fmt, records_per_file, output_prefix):
    records = SeqIO.parse(input_file, fmt)
    file_num = 1
    while True:
        # Pull the next batch lazily; an empty batch means the input is exhausted
        batch = list(islice(records, records_per_file))
        if not batch:
            break
        output_file = f'{output_prefix}_{file_num}.{fmt}'
        SeqIO.write(batch, output_file, fmt)
        print(f'Wrote {len(batch)} records to {output_file}')
        file_num += 1

split_file('large.fasta', 'fasta', 1000, 'split')
```
### Split by Sequence ID Prefix
**Goal:** Group sequences into separate files based on a shared ID prefix (e.g., sample or chromosome).
**Approach:** Parse all records into a prefix-keyed dictionary, then write each group to its own file.
**Reference (Biopython 1.83+):**
```python
from collections import defaultdict

records_by_prefix = defaultdict(list)
for record in SeqIO.parse('input.fasta', 'fasta'):
    prefix = record.id.split('_')[0]
    records_by_prefix[prefix].append(record)

for prefix, records in records_by_prefix.items():
    SeqIO.write(records, f'{prefix}.fasta', 'fasta')
```
### One Sequence Per File
```python
for record in SeqIO.parse('multi.fasta', 'fasta'):
    SeqIO.write(record, f'{record.id}.fasta', 'fasta')
```
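Sequence IDs can contain characters such as `|`, `/`, or `:` that are unsafe in filenames. Below is a minimal sketch of a sanitiser to apply to `record.id` before writing. `safe_filename` is a hypothetical helper, and distinct IDs may collide after cleaning, so check for duplicates if that matters:

```python
import re


def safe_filename(record_id, suffix='.fasta'):
    """Turn a sequence ID into a filename containing only safe characters."""
    # Replace every run of characters other than letters, digits, '.', '_', '-'.
    cleaned = re.sub(r'[^A-Za-z0-9._-]+', '_', record_id).strip('._')
    # Fall back to a fixed name if nothing usable is left.
    return (cleaned or 'unnamed') + suffix


print(safe_filename('sp|P69905|HBA_HUMAN'))  # sp_P69905_HBA_HUMAN.fasta
print(safe_filename('chr1:100-200/1'))       # chr1_100-200_1.fasta
```

In the loop above, the output path would then be `safe_filename(record.id)` rather than the raw ID.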
## Batch Convert
### Convert All Files in Directory
```python
from pathlib import Path

out_dir = Path('fasta/')
out_dir.mkdir(exist_ok=True)  # SeqIO.convert does not create missing directories
for gb_file in Path('genbank/').glob('*.gb'):
    fasta_file = out_dir / gb_file.with_suffix('.fasta').name
    count = SeqIO.convert(str(gb_file), 'genbank', str(fasta_file), 'fasta')
    print(f'{gb_file.name} -> {fasta_file.name}: {count} records')
```
### Batch Convert with Summary
```python
from pathlib import Path

out_dir = Path('output/')
out_dir.mkdir(exist_ok=True)  # SeqIO.convert does not create missing directories
results = []
for input_file in Path('input/').glob('*.gb'):
    output_file = out_dir / input_file.with_suffix('.fasta').name
    count = SeqIO.convert(str(input_file), 'genbank', str(output_file), 'fasta')
    results.append({'file': input_file.name, 'records': count})

print(f'Converted {len(results)} files, {sum(r["records"] for r in results)} total records')
```
## Parallel Processing
### Using multiprocessing
```python
from multiprocessing import Pool
from pathlib import Path
from Bio import SeqIO

def process_file(filepath):
    records = list(SeqIO.parse(filepath, 'fasta'))
    return {'file': filepath.name, 'count': len(records), 'total_bp': sum(len(r.seq) for r in records)}

# The guard is required where workers start via spawn (Windows, macOS)
if __name__ == '__main__':
    files = list(Path('data/').glob('*.fasta'))
    with Pool(4) as pool:
        results = pool.map(process_file, files)
    for r in results:
        print(f'{r["file"]}: {r["count"]} seqs, {r["total_bp"]} bp')
```
### Using concurrent.futures
```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def count_records(filepath):
    return filepath.name, sum(1 for _ in SeqIO.parse(filepath, 'fasta'))

files = list(Path('data/').glob('*.fasta'))
with ThreadPoolExecutor(max_workers=4) as executor:
    results = executor.map(count_records, files)
    for name, count in results:
        print(f'{name}: {count}')
```
## Summary Statistics
### Aggregate Statistics
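One way to aggregate per-file statistics, given as a minimal sketch rather than the skill's own reference code. The helper names `length_stats` and `directory_stats` and the choice of metrics (count, total bases, min, max, mean, N50) are assumptions made for illustration:

```python
from pathlib import Path


def length_stats(lengths):
    """Summarise sequence lengths: count, total, min, max, mean, and N50."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        return {'count': 0, 'total_bp': 0, 'min': 0, 'max': 0, 'mean': 0.0, 'n50': 0}
    total = sum(ordered)
    # N50: walking from the longest sequence down, the length at which the
    # running total first reaches half of all bases.
    running, n50 = 0, 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {'count': len(ordered), 'total_bp': total, 'min': ordered[-1],
            'max': ordered[0], 'mean': total / len(ordered), 'n50': n50}


def directory_stats(directory, pattern='*.fasta', fmt='fasta'):
    """Map each matching filename to its length statistics."""
    # Imported here so that length_stats stays usable without Biopython installed.
    from Bio import SeqIO
    return {
        path.name: length_stats(len(record.seq) for record in SeqIO.parse(path, fmt))
        for path in sorted(Path(directory).glob(pattern))
    }
```

Calling `directory_stats('data/')` returns one dictionary per file. Only the lengths are held in memory, so this scales to files whose full records would not fit.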