bakta-genome-annotation

Bakta is a command-line pipeline for rapid annotation of bacterial and archaeal genomes and plasmids, combining Prodigal, tRNAscan-SE, and DIAMOND/HMM searches against a curated UniRef database to identify coding sequences, non-coding RNAs, CRISPR arrays, and other features. Use it when you need fast NCBI-compatible annotation outputs (GFF3, GenBank, JSON) with regularly updated databases and a circular genome plot for publication; use Prokka for legacy pipelines or non-bacterial organisms, or PGAP for formal NCBI GenBank submission.

Repository: SciAgent-Skills

Install in Claude Code

git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/bakta-genome-annotation && cp -r /tmp/bakta-genome-annotation/skills/genomics-bioinformatics/annotation/bakta-genome-annotation ~/.claude/skills/bakta-genome-annotation

Then open a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Bakta Genome Annotation

## Overview

Bakta is a command-line pipeline for rapid, standardized annotation of bacterial and archaeal genomes and plasmids. It combines Prodigal for CDS prediction, tRNAscan-SE, Aragorn, Barrnap, and Infernal for non-coding RNA, PILER-CR for CRISPR detection, and a tiered DIAMOND/HMM search against a curated UniRef100 + IPS/UPS database to assign gene names, EC numbers, GO terms, and COG categories. Bakta produces NCBI-compatible outputs (GFF3, GenBank, EMBL, INSDC-formatted FASTA) plus a JSON summary and a circular Circos plot. A typical 5 Mb genome takes 5–15 minutes on 8 CPUs.

## When to Use

- Annotating bacterial or archaeal genome assemblies (Illumina, PacBio, Nanopore) with NCBI-compatible locus tags and product names
- Annotating plasmids and other circular replicons separately with `--plasmid` and `--complete` flags
- Producing JSON-structured annotation outputs that can be parsed without GenBank or GFF3 detours
- Generating a publication-ready circular genome plot via the bundled `bakta_plot` command
- Annotating MAGs (metagenome-assembled genomes) with `--meta` to disable Prodigal training
- Use **Prokka** instead when you need viral/mitochondrial kingdoms or when you must reproduce a legacy Prokka pipeline exactly
- Use **PGAP** instead when submitting to NCBI GenBank with full standards compliance
- Use **Bakta** when you want faster runs, regularly updated UniRef-derived databases, AMRFinderPlus integration, and a JSON summary out of the box

## Prerequisites

- **Software**: Bakta ≥ 1.9, Python 3.8+, Prodigal, tRNAscan-SE, Aragorn, Barrnap, Infernal, DIAMOND, HMMER3, PILER-CR, BLAST+, AMRFinderPlus
- **Database**: Bakta DB (full ~70 GB, or light ~3 GB) downloaded once with `bakta_db download`
- **Python packages** (for output parsing): `biopython`, `pandas`, `matplotlib`
- **Input**: assembled genome in FASTA format (one or more contigs)
- **Hardware**: ≥ 16 GB RAM for full DB, ≥ 4 GB RAM for light DB; ≥ 8 CPUs recommended

> **Check before installing**: The tool may already be available in the current environment (e.g., inside a `pixi` / `conda` env). Run `command -v bakta` first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via `pixi run bakta` rather than bare `bakta`.

```bash
# Install Bakta via conda/mamba (recommended)
mamba install -c conda-forge -c bioconda bakta

# Verify installation
bakta --version
# bakta 1.9.4

# Download the light database (~3 GB, faster, fewer functional hits)
bakta_db download --output db/ --type light

# Or full database (~70 GB, comprehensive UniRef100 coverage)
# bakta_db download --output db/ --type full

# Install Python parsing dependencies
pip install biopython pandas matplotlib
```

## Quick Start

```bash
# Annotate a bacterial genome — results in results/ directory
bakta genome.fasta \
    --db db/bakta_db_light \
    --output results/ \
    --prefix sample1 \
    --threads 8

# Inspect the JSON summary for feature counts
python -c "
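import json
with open('results/sample1.json') as handle:
    d = json.load(handle)
feats = d['features']
print('Genus:', d['genome'].get('genus'))
# Genome size: look under 'stats' first (assumed location), then fall back to 'genome'
size = d.get('stats', {}).get('size', d['genome'].get('size'))
print('Length:', size, 'bp')
print('CDS:', sum(1 for feat in feats if feat['type'] == 'cds'))
print('tRNA:', sum(1 for feat in feats if feat['type'] == 'tRNA'))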
"
```
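To see every feature type rather than only CDS and tRNA, the same JSON can be tallied with a `Counter`. This is a small sketch that assumes only what the Quick Start already relies on: a top-level `features` list whose items carry a `type` key. The `example` dictionary is made-up data standing in for `results/sample1.json`.

```python
import json
from collections import Counter


def count_feature_types(annotation):
    """Count features per 'type' value in a parsed Bakta-style JSON dict."""
    return Counter(feat.get("type", "unknown") for feat in annotation.get("features", []))


def load_counts(json_path):
    """Read an annotation JSON file and return its feature-type counts."""
    with open(json_path) as handle:
        return count_feature_types(json.load(handle))


# Made-up stand-in for the parsed content of results/sample1.json
example = {"features": [{"type": "cds"}, {"type": "cds"}, {"type": "tRNA"}]}
for ftype, n in count_feature_types(example).most_common():
    print(f"{ftype}\t{n}")
```

For a real run, `load_counts("results/sample1.json")` would return the same kind of tally.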

## Workflow

### Step 1: Install Bakta and Download the Database

Install Bakta and prepare the reference database. The database download is one-time and reused across runs.

```bash
# Create a dedicated conda environment (avoids dependency conflicts)
mamba create -n bakta_env -c conda-forge -c bioconda bakta python=3.11 -y
mamba activate bakta_env

# Verify Bakta and its dependencies
bakta --version
# bakta 1.9.4

bakta --help | head -20

# Download the light database (sufficient for routine annotation)
mkdir -p db/
bakta_db download --output db/ --type light
# Downloads ~3 GB; expands to ~5 GB on disk

# Verify the database was extracted correctly
ls db/bakta_db_light/
# antifam.h3f  bakta.db  expert  oric.fna  pfam.h3f  rfam-go.tsv  ...

# (Optional) Update AMRFinderPlus DB used by Bakta for AMR gene calling
amrfinder -u

# Install Python parsing tools
pip install biopython pandas matplotlib
```
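The `ls` check above can also be scripted. The sketch below tests only for the entries visible in that listing, which is truncated, so it is a partial sanity check and not an integrity test. The directory path is the one used in this guide.

```python
from pathlib import Path

# Names taken from the (truncated) directory listing in this step;
# a complete database contains more files than these.
EXPECTED_DB_ENTRIES = [
    "antifam.h3f", "bakta.db", "expert", "oric.fna", "pfam.h3f", "rfam-go.tsv",
]


def missing_db_entries(db_dir, expected=EXPECTED_DB_ENTRIES):
    """Return the expected entries that are absent from db_dir."""
    root = Path(db_dir)
    return [name for name in expected if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_db_entries("db/bakta_db_light")
    print("All listed entries present" if not missing else f"Missing: {missing}")
```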

### Step 2: Prepare the Input Assembly

Bakta requires clean FASTA headers without spaces or special characters. Pre-clean and optionally filter short contigs.

```python
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

input_fasta = "genome.fasta"
records = list(SeqIO.parse(input_fasta, "fasta"))
print(f"Input assembly: {len(records)} contigs")
total_bases = sum(len(r) for r in records)
print(f"Total bases: {total_bases:,}")
print(f"Largest contig: {max(len(r) for r in records):,} bp")

# Bakta preferred: short, alphanumeric, unique IDs
cleaned = []
for i, rec in enumerate(records, 1):
    new_id = f"contig_{i:04d}"
    # Build a fresh record so the old description and annotations are dropped
    new_rec = SeqRecord(rec.seq, id=new_id, description="")
    cleaned.append(new_rec)

SeqIO.write(cleaned, "genome_clean.fasta", "fasta")
print(f"Wrote genome_clean.fasta with {len(cleaned)} contigs")
```
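Before renaming, it can help to see which headers would actually cause trouble. The sketch below flags headers with trailing descriptions, characters outside a conservative safe set, or duplicate IDs. The safe character set (`A-Z`, `a-z`, digits, `_`, `.`, `-`) is an assumption chosen for illustration, not a rule taken from the Bakta documentation, and the code uses only the standard library.

```python
import re
from collections import Counter

# Assumed "safe" alphabet for contig IDs (illustrative, conservative choice)
SAFE_ID = re.compile(r"^[A-Za-z0-9_.-]+$")


def problem_headers(fasta_lines):
    """Return (header, reason) pairs for FASTA headers that may need cleaning."""
    headers = [line[1:].rstrip("\n") for line in fasta_lines if line.startswith(">")]
    ids = [h.split()[0] if h.split() else "" for h in headers]
    duplicated = {seq_id for seq_id, n in Counter(ids).items() if n > 1}
    problems = []
    for header, seq_id in zip(headers, ids):
        if header != seq_id:
            problems.append((header, "description or whitespace after ID"))
        elif not SAFE_ID.match(seq_id):
            problems.append((header, "characters outside the assumed safe set"))
        if seq_id in duplicated:
            problems.append((header, "duplicate ID"))
    return problems


# Made-up headers for illustration
example = [">contig_0001\n", "ACGT\n", ">NODE 1 length=500\n", "ACGT\n", ">ctg|2\n", "ACGT\n"]
for header, reason in problem_headers(example):
    print(f"{header!r}: {reason}")
```

On a real assembly, pass an open file handle: `problem_headers(open("genome.fasta"))`.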

```bash
# Filter out short contigs (<200 bp) which contribute little to annotation
awk 'BEGIN{RS=">"; ORS=""} NR>1 {n=split($0, a, "\n"); seq=""; for(i=2;i<=n;i++) seq=seq a[i]; if (length(seq) >= 200) print ">" $0}' \
    genome_clean.fasta > genome_filtered.fasta

echo "Filtered assembly: $(grep -c '>' genome_filtered.fasta) contigs"
```
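The same length filter can be written in Python when `awk` is unavailable or a different cutoff is needed. This is a standard-library sketch, not part of Bakta; `filter_fasta` and `write_filtered` are hypothetical helper names, and the 60-column line width is an arbitrary choice.

```python
def filter_fasta(lines, min_len=200):
    """Yield (header, sequence) for records whose sequence has >= min_len bases."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # Emit the previous record if it is long enough
            if header is not None and sum(map(len, chunks)) >= min_len:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    # Emit the final record
    if header is not None and sum(map(len, chunks)) >= min_len:
        yield header, "".join(chunks)


def write_filtered(in_path, out_path, min_len=200, width=60):
    """Write records passing the length filter; return how many were kept."""
    kept = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for header, seq in filter_fasta(src, min_len):
            dst.write(f">{header}\n")
            for i in range(0, len(seq), width):
                dst.write(seq[i:i + width] + "\n")
            kept += 1
    return kept
```

Calling `write_filtered("genome_clean.fasta", "genome_filtered.fasta")` would mirror the shell command above, with the return value giving the number of contigs kept.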

### Step 3: Run Standard Bakta Annotation

Run Bakta with genus, species, and strain hints. The locus tag prefix is set explicitly here with `--locus-tag`; if the flag is omitted, Bakta generates a prefix automatically.
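When annotating many genomes, it can be convenient to assemble this command from Python. The sketch below builds the argument list using only the flags that appear in this step's shell command; `build_bakta_command` is a hypothetical helper, and the sketch prints the command without running it.

```python
import shlex


def build_bakta_command(fasta, db, output, prefix, genus=None, species=None,
                        strain=None, locus_tag=None, threads=8,
                        keep_contig_headers=False):
    """Return the bakta invocation as an argument list (nothing is executed)."""
    cmd = ["bakta", str(fasta), "--db", str(db), "--output", str(output),
           "--prefix", prefix, "--threads", str(threads)]
    # Optional taxonomy and locus-tag hints are added only when provided
    for flag, value in (("--genus", genus), ("--species", species),
                        ("--strain", strain), ("--locus-tag", locus_tag)):
        if value:
            cmd += [flag, value]
    if keep_contig_headers:
        cmd.append("--keep-contig-headers")
    return cmd


cmd = build_bakta_command(
    "genome_clean.fasta", "db/bakta_db_light", "annotation/", "E_coli_K12",
    genus="Escherichia", species="coli", strain="K12",
    locus_tag="ECOLI", keep_contig_headers=True,
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` instead of a shell string avoids quoting problems with unusual file names.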

```bash
# Standard annotation for a draft bacterial genome
bakta genome_clean.fasta \
    --db db/bakta_db_light \
    --output annotation/ \
    --prefix E_coli_K12 \
    --genus Escherichia \
    --species coli \
    --strain K12 \
    --locus-tag ECOLI \
    --threads 8 \
    --keep-contig-headers

# Expected runtime: 5–15 min for ~5 Mb genome on 8 CPUs (light DB)

echo "Bakta annotation outputs:"
ls annotation/
# E_coli_K12.embl   E_c
```
