Skip to main content
ClaudeWave
Skill199 repo starsupdated 16d ago

roary-pangenome

Compute the bacterial pan-genome from Prokka/Bakta GFF3 annotations with Roary's CD-HIT + BLAST + MCL clustering pipeline. Builds gene presence/absence matrices, core/soft-core/shell/cloud partitions, multi-FASTA core gene alignments (with `-e`), and a pan-genome reference. Use Panaroo for higher-accuracy pan-genomes from highly fragmented assemblies, PIRATE for paralog-aware clustering, or PPanGGOLiN for graph-based partitioning.

Install in Claude Code
Copy
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/roary-pangenome && cp -r /tmp/roary-pangenome/skills/genomics-bioinformatics/annotation/roary-pangenome ~/.claude/skills/roary-pangenome
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# Roary Pan-Genome Pipeline

## Overview

Roary is a high-throughput pan-genome pipeline for prokaryotes that takes per-sample GFF3 annotations (typically from Prokka or Bakta) and produces a clustered gene presence/absence matrix across the entire input set. It first reduces redundancy with CD-HIT iterative clustering, then performs an all-vs-all BLASTP within each pre-cluster, and finally applies MCL graph clustering to define orthologous gene families. The output partitions the gene space into core (≥ 99 %), soft-core (95–99 %), shell (15–95 %), and cloud (< 15 %) genes and optionally builds a concatenated core-gene alignment suitable for phylogenetic inference.

## When to Use

- Computing a pan-genome from a set of bacterial isolate annotations (10–10,000 genomes)
- Producing a `gene_presence_absence.csv` matrix for downstream GWAS, accessory-gene mining, or core-gene phylogenetics
- Building a concatenated core-gene multi-FASTA alignment for ML/Bayesian phylogenetic trees
- Generating a pan-genome reference FASTA to use as a non-redundant gene catalog
- Comparative genomics across closely related strains where >95 % nucleotide identity is expected
- Use **Panaroo** instead when assemblies are highly fragmented or annotations are noisy (Panaroo aggressively cleans annotation errors)
- Use **PIRATE** instead when paralog-aware clustering with multiple identity thresholds is needed
- Use **PPanGGOLiN** instead when graph-based, statistically grounded gene-family partitioning is preferred over fixed-frequency cutoffs

## Prerequisites

- **Software**: Roary ≥ 3.13, Perl 5, BLAST+, CD-HIT, MCL, BEDTools, PRANK or MAFFT (for `-e` core alignment), FastTree (optional)
- **Python packages** (for output parsing): `pandas`, `matplotlib`, `seaborn`, `biopython`, `dendropy`
- **Input**: per-sample GFF3 files with embedded FASTA at the end (Prokka/Bakta default output)
- **Hardware**: ≥ 8 GB RAM for ~50 genomes; 32 GB+ recommended for ~500 genomes
- **Naming**: each GFF3 filename becomes the sample column header in the output matrix; use `sample_id.gff` style

> **Check before installing**: The tool may already be available in the current environment (e.g., inside a `pixi` / `conda` env). Run `command -v roary` first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via `pixi run roary` rather than bare `roary`.

```bash
# Install Roary via conda/mamba (recommended)
mamba install -c conda-forge -c bioconda roary

# Verify installation
roary --version
# 3.13.0

# Verify dependent tools
which cd-hit blastp mcl bedtools mafft
# /opt/conda/bin/cd-hit
# /opt/conda/bin/blastp
# /opt/conda/bin/mcl
# /opt/conda/bin/bedtools
# /opt/conda/bin/mafft

# Install Python parsing dependencies
pip install pandas matplotlib seaborn biopython dendropy
```

## Quick Start

```bash
# Run Roary on all GFF3 files in current directory; emit core gene alignment
roary -e --mafft -p 8 -o pangenome -f roary_out/ *.gff

# Inspect summary statistics
cat roary_out/summary_statistics.txt
# Core genes        (99% <= strains <= 100%) 2823
# Soft core genes   (95% <= strains <  99%)   78
# Shell genes       (15% <= strains <  95%)  1542
# Cloud genes       ( 0% <= strains <  15%)  3104
# Total genes                                 7547
```

## Workflow

### Step 1: Install Roary and Verify Dependencies

Install Roary in a dedicated environment to avoid Perl module conflicts.

```bash
# Create a dedicated conda environment
mamba create -n roary_env -c conda-forge -c bioconda roary mafft fasttree python=3.11 -y
mamba activate roary_env

# Verify Roary
roary --version
# 3.13.0

# Confirm pipeline dependencies are reachable
for tool in cd-hit blastp mcl bedtools mafft FastTree; do
    if command -v $tool >/dev/null; then
        echo "OK: $tool"
    else
        echo "MISSING: $tool"
    fi
done

# Install Python parsing tools
pip install pandas matplotlib seaborn biopython dendropy
```

### Step 2: Prepare Per-Sample GFF3 Files from Prokka or Bakta

Roary requires GFF3 files with an embedded FASTA section (the `##FASTA` block). Both Prokka `.gff` and Bakta `.gff3` outputs satisfy this format.

```bash
# Verify GFF3 files include the embedded FASTA section
mkdir -p roary_input/
cp annotations/*/sample*.gff roary_input/

for GFF in roary_input/*.gff; do
    if grep -q "^##FASTA" "$GFF"; then
        echo "OK: $GFF"
    else
        echo "MISSING ##FASTA in: $GFF"
    fi
done

# Roary uses GFF filename (without .gff) as the sample column name.
# Rename or symlink to ensure unique, descriptive names.
cd roary_input/
ls *.gff | head
# strain_A.gff  strain_B.gff  strain_C.gff
```

```python
# Sanity-check sample size and unique names
from pathlib import Path

gff_files = sorted(Path("roary_input/").glob("*.gff"))
print(f"GFF files: {len(gff_files)}")

names = [g.stem for g in gff_files]
duplicates = [n for n in names if names.count(n) > 1]
assert not duplicates, f"Duplicate sample names: {set(duplicates)}"

# Show first 5
for g in gff_files[:5]:
    size_mb = g.stat().st_size / 1e6
    print(f"  {g.name}: {size_mb:.1f} MB")
```

### Step 3: Run Roary with Core Gene Alignment

The `-e --mafft` combination produces the concatenated core-gene alignment used downstream for phylogenetics.

```bash
# Run Roary on the prepared GFF directory
# -e: extract core gene alignment per gene, then concatenate
# --mafft: use MAFFT for alignment (faster than default PRANK)
# -p 8: 8 parallel processes
# -i 95: min BLASTP percentage identity (default 95)
# -cd 99: % isolates a gene must be in to be "core" (default 99)
roary -e --mafft \
    -p 8 \
    -i 95 \
    -cd 99 \
    -o pangenome \
    -f roary_out/ \
    roary_input/*.gff

# Expected runtime:
#   ~50 genomes:  10–20 min
#   ~500 genomes: 4–8 hours

ls roary_out/
# accessory_binary_genes.fa            gene_presence_absence.csv
# accessory_binary_genes.fa.newick     gene_presence_absence.Rtab
# accessory.header.embl                number_o
sciagent-skill-creatorSkill

|

opentrons-integrationSkill

Opentrons Protocol API v2 for OT-2/Flex: Python protocols for pipetting, serial dilutions, PCR, plate replication; control thermocycler, heater-shaker, magnetic, temperature modules. Use pylabrobot for multi-vendor.

plotly-interactive-visualizationSkill

Interactive visualization with Plotly. 40+ chart types (scatter, line, heatmap, 3D, geographic) with hover, zoom, pan. Two APIs: Plotly Express (DataFrame) and Graph Objects (fine control). For static publication figures use matplotlib; for statistical grammar use seaborn.

seaborn-statistical-visualizationSkill

Statistical visualization on matplotlib + pandas. Distributions (histplot, kdeplot, violin, box), relational (scatter, line), categorical, regression, correlation heatmaps. Auto aggregation/CIs. Use plotly for interactive; matplotlib for low-level.

single-cell-annotationSkill

Best practices for single-cell RNA-seq cell type annotation including marker-based, reference-based, and automated classification approaches.

pymc-bayesian-modelingSkill

Bayesian modeling with PyMC 5: priors, likelihood, NUTS/ADVI sampling, diagnostics (R-hat, ESS), LOO/WAIC comparison, prediction. Hierarchical, logistic, GP variants; predictive checks.

scikit-survival-analysisSkill

Time-to-event modeling with scikit-survival: Cox PH (elastic net), Random Survival Forests, Boosting, SVMs for censored data. C-index, Brier, time-dependent AUC; Kaplan-Meier, Nelson-Aalen, competing risks. Pipeline/GridSearchCV compatible. Use statsmodels for frequentist, pymc for Bayesian, lifelines for parametric.

statistical-analysisSkill

>-