busco-status-interpretation

Guide to interpreting BUSCO completeness statuses: why Duplicated BUSCOs count as complete, parsing output files, computing/comparing completeness across proteomes/genomes, common counting mistakes. Use when running BUSCO QC, comparing assemblies, or reporting completeness. See also: prokka-genome-annotation for annotation workflows feeding BUSCO.

Install in Claude Code
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/busco-status-interpretation && cp -r /tmp/busco-status-interpretation/skills/genomics-bioinformatics/qc/busco-status-interpretation ~/.claude/skills/busco-status-interpretation
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# BUSCO Status Interpretation Guide

## Overview

BUSCO (Benchmarking Universal Single-Copy Orthologs) is the standard tool for assessing genome, transcriptome, and proteome completeness by searching for conserved single-copy orthologs from the OrthoDB database. Correct interpretation of BUSCO output is essential for genome quality assessment, comparative genomics, and publication-ready reporting. The most common analytical error is excluding Duplicated BUSCOs from completeness counts, which artificially penalizes polyploid organisms and assemblies with legitimate gene duplications.

This guide covers BUSCO status categories, output file formats, parsing strategies, cross-proteome comparisons, lineage dataset selection, and common pitfalls in BUSCO interpretation.

---

## Key Concepts

### BUSCO Status Categories

BUSCO assigns each searched ortholog one of four statuses:

| Status | Abbreviation | Meaning | Count as Complete? |
|---|---|---|---|
| **Complete (single-copy)** | S | Found exactly once in the genome/proteome | YES |
| **Duplicated** | D | Found more than once (multiple copies) | YES |
| **Fragmented** | F | Partial match, likely incomplete gene model | NO |
| **Missing** | M | Not detected at all | NO |

The headline completeness percentage (C%) reported by BUSCO is always S + D combined. Individual category counts (S, D, F, M) are reported for transparency and should be included in publications.

### Why Duplicated Equals Complete

A Duplicated BUSCO means the ortholog IS present and fully intact in the genome or proteome -- it simply exists in more than one copy. This can occur through:

- Whole-genome duplication (common in plants, fish, and amphibians)
- Tandem or segmental duplication events
- Recent polyploidy
- Proteomes containing multiple isoforms per gene

The gene is not incomplete or absent. Excluding Duplicated BUSCOs from completeness counts would incorrectly penalize polyploid organisms, recently duplicated genomes, or proteomes that include isoform-level annotations. The correct completeness formula is always:

```
Completeness (%) = (Complete_single_copy + Duplicated) / Total_BUSCOs * 100
```

A high Duplicated fraction is not inherently problematic -- it is biologically informative. For example, salmonid genomes, which retain many gene duplicates from a relatively recent lineage-specific whole-genome duplication, typically show a strongly elevated Duplicated fraction, and this is expected.
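The formula above can be sketched as a small helper. This is a minimal illustration, not part of BUSCO itself: the function name `completeness_percent` and the example counts are hypothetical, chosen only to show how much the headline value shifts when Duplicated BUSCOs are wrongly dropped.

```python
def completeness_percent(single_copy: int, duplicated: int, total: int) -> float:
    """Return (S + D) / n * 100, the headline completeness value."""
    if total <= 0:
        raise ValueError("total must be a positive number of BUSCO groups")
    if single_copy < 0 or duplicated < 0 or single_copy + duplicated > total:
        raise ValueError("counts must be non-negative and sum to at most total")
    return (single_copy + duplicated) / total * 100


# Hypothetical counts for a 255-marker dataset: 200 single-copy, 40 duplicated.
correct = completeness_percent(200, 40, 255)
# The common mistake: leaving Duplicated BUSCOs out of the numerator.
s_only = completeness_percent(200, 0, 255)
print(f"correct C% = {correct:.1f}, S-only = {s_only:.1f}")
# prints: correct C% = 94.1, S-only = 78.4
```

With these made-up counts the S-only figure understates completeness by more than 15 percentage points, which is the size of error the guide warns about for duplication-rich genomes.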

### BUSCO Output Formats

BUSCO produces two primary output formats relevant to downstream analysis:

**Short summary format** -- a single-line notation found in `short_summary.*.txt`:

```
C:95.0%[S:90.0%,D:5.0%],F:3.0%,M:2.0%,n:255
```

Where C = Complete (S + D), S = Single-copy, D = Duplicated, F = Fragmented, M = Missing, and n = total BUSCO groups searched.
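One way to read this notation programmatically is a regular expression over the summary line. The sketch below assumes the line has exactly the field order shown above; the names `SUMMARY_RE` and `parse_short_summary` are illustrative, not part of BUSCO. It uses `search` rather than a full-line match so that leading whitespace or any trailing fields a given BUSCO version may add do not break parsing.

```python
import re

# Pattern for the one-line notation, e.g.
# C:95.0%[S:90.0%,D:5.0%],F:3.0%,M:2.0%,n:255
SUMMARY_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_short_summary(line: str) -> dict:
    """Extract C, S, D, F, M (percent, float) and n (int) from a summary line."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError(f"no BUSCO summary notation found in: {line!r}")
    groups = match.groupdict()
    result = {key: float(groups[key]) for key in ("C", "S", "D", "F", "M")}
    result["n"] = int(groups["n"])
    return result


summary = parse_short_summary("C:95.0%[S:90.0%,D:5.0%],F:3.0%,M:2.0%,n:255")
# Sanity check: C should equal S + D, up to rounding in the report.
assert abs(summary["S"] + summary["D"] - summary["C"]) < 0.2
print(summary)
```

The sanity check on `S + D` against `C` is a cheap guard against reading the wrong line or a mangled file.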

**Full table format** -- a TSV file (`full_table.tsv`) with per-ortholog results containing columns for BUSCO ID, Status, Sequence, Score, and Length. This file enables detailed per-gene analysis, filtering, and cross-species comparisons.
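When counting statuses from the full table, note that a Duplicated BUSCO is listed once per copy, and Missing rows carry only the ID and status, so counting rows instead of unique BUSCO IDs inflates both D and the total. The sketch below is a minimal illustration under those assumptions: the sample rows (IDs, sequence names, scores) are invented, and `count_statuses` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical protein-mode rows; every ID, name, and score is made up.
ROWS = [
    ("# Busco id", "Status", "Sequence", "Score", "Length"),
    ("101at2759", "Complete", "prot_1", "812.4", "420"),
    ("102at2759", "Duplicated", "prot_2", "640.0", "310"),
    ("102at2759", "Duplicated", "prot_3", "633.1", "309"),
    ("103at2759", "Fragmented", "prot_4", "150.2", "98"),
    ("104at2759", "Missing"),
]
SAMPLE_TABLE = "\n".join("\t".join(row) for row in ROWS)


def count_statuses(full_table_text: str) -> Counter:
    """Count each BUSCO ID once, regardless of how many rows it occupies."""
    status_by_id = {}
    for line in full_table_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"unexpected row: {line!r}")
        status_by_id[fields[0]] = fields[1]
    return Counter(status_by_id.values())


counts = count_statuses(SAMPLE_TABLE)
total = sum(counts.values())
complete = counts["Complete"] + counts["Duplicated"]
print(dict(counts), f"C% = {complete / total * 100:.1f}")
```

In this sample, naive row counting would report two Duplicated entries out of five rows, whereas there are four BUSCO groups and one of them is Duplicated.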

---

## Decision Framework

When deciding whether and how to use BUSCO for quality assessment:

```
Question: What are you assessing?
├── Genome assembly completeness
│   ├── Draft assembly → Run BUSCO in genome mode
│   └── Polished/final assembly → Run BUSCO in genome mode, report in publication
├── Transcriptome completeness
│   └── De novo assembly → Run BUSCO in transcriptome mode (expect higher D%)
├── Proteome / annotation completeness
│   └── Predicted proteins → Run BUSCO in protein mode
└── Comparing multiple assemblies
    └── Same lineage dataset across all → Use compare_proteome_completeness pattern
```
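The decision tree above can be sketched as a lookup from assessment target to BUSCO mode, plus a guard for the comparison branch. This is an illustration only: it assumes the BUSCO v5 command-line options `-i`, `-l`, `-m`, and `-o` and the mode values `genome`, `transcriptome`, and `proteins` (check `busco --help` for your installed version). The helper names and file paths are hypothetical, and this is not the `compare_proteome_completeness` pattern mentioned in the tree.

```python
# Assessment target -> value passed to BUSCO's -m option (assumed v5 names).
MODE_FOR_TARGET = {
    "genome": "genome",
    "transcriptome": "transcriptome",
    "proteome": "proteins",
}


def build_busco_command(target: str, input_path: str, lineage: str, out_name: str) -> list:
    """Build (but do not run) a BUSCO argument list for the given target."""
    if target not in MODE_FOR_TARGET:
        raise ValueError(f"unknown target {target!r}; expected one of {sorted(MODE_FOR_TARGET)}")
    mode = MODE_FOR_TARGET[target]
    return ["busco", "-i", input_path, "-l", lineage, "-m", mode, "-o", out_name]


def require_same_lineage(lineages: list) -> str:
    """Return the shared lineage name, or raise if the runs are not comparable."""
    unique = set(lineages)
    if len(unique) != 1:
        raise ValueError(f"runs use different lineage datasets: {sorted(unique)}")
    return unique.pop()


# Hypothetical input file and output name.
cmd = build_busco_command("proteome", "proteins.faa", "eukaryota_odb10", "busco_prot")
print(" ".join(cmd))
print(require_same_lineage(["eukaryota_odb10", "eukaryota_odb10"]))
```

Returning an argument list rather than a shell string keeps the command safe to pass to `subprocess.run` without quoting problems.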

### Lineage Dataset Selection

| Organism type | Recommended lineage | Example dataset | Notes |
|---|---|---|---|
| Broad eukaryotic screen | eukaryota | `eukaryota_odb10` | Low resolution, useful for initial checks |
| Vertebrate | vertebrata or class-level | `mammalia_odb10`, `actinopterygii_odb10` | Class-level gives better resolution |
| Insect | insecta or order-level | `diptera_odb10`, `hymenoptera_odb10` | Order-level preferred when available |
| Plant | viridiplantae or more specific | `embryophyta_odb10`, `eudicots_odb10` | Plants often show high D% due to polyploidy |
| Fungus | fungi or division-level | `ascomycota_odb10`, `basidiomycota_odb10` | Match to known phylogenetic placement |
| Bacterium | bacteria or phylum-level | `proteobacteria_odb10` | Use `--auto-lineage-prok` for unknown bacteria |

**General rule**: Use the most specific lineage dataset that encompasses your organism. More specific datasets contain more BUSCOs and provide higher resolution, but using a dataset that does not include your organism will produce misleadingly low scores.
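To make "higher resolution" concrete, the short calculation below shows how many percentage points a single BUSCO is worth in a small versus a large dataset. It uses the marker counts this guide quotes for `eukaryota_odb10` (255) and `mammalia_odb10` (9,226); the helper name `percent_per_busco` is illustrative.

```python
def percent_per_busco(n_markers: int) -> float:
    """Percentage points contributed by one BUSCO group in a dataset of n_markers."""
    if n_markers <= 0:
        raise ValueError("n_markers must be positive")
    return 100 / n_markers


for name, n_markers in [("eukaryota_odb10", 255), ("mammalia_odb10", 9226)]:
    step = percent_per_busco(n_markers)
    print(f"{name}: one BUSCO = {step:.3f} percentage points")
# prints:
# eukaryota_odb10: one BUSCO = 0.392 percentage points
# mammalia_odb10: one BUSCO = 0.011 percentage points
```

With the broad dataset, completeness can only move in steps of roughly 0.4 points, so small differences between two assemblies may reflect just one or two markers.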

---

## Best Practices

1. **Always report all four categories (S, D, F, M)**: Do not report only the headline C% value. Reviewers and readers need the breakdown to assess whether high completeness comes from single-copy genes (expected for most collapsed assemblies of non-polyploid genomes) or from duplicated genes (expected for polyploids and lineages with recent whole-genome duplications). This is now a standard expectation in genome papers.

2. **Use the same lineage dataset for all comparisons**: When comparing assemblies or proteomes, every run must use the identical lineage dataset and BUSCO version. Mixing lineage datasets (e.g., comparing one assembly run with `eukaryota_odb10` against another with `metazoa_odb10`) produces incomparable results.

3. **Choose the most specific lineage available**: More specific lineage datasets provide more BUSCO markers and finer resolution. A mammalian genome assessed with `eukaryota_odb10` (255 markers) gives a much coarser picture than the same genome assessed with `mammalia_odb10` (9,226 markers).

4. **Interpret Duplicated percentage in biological context**: High D% in plants or salmonids is expected due to known whole-genome duplication events. High D% in a bacterium, or in a diploid genome with no known duplication history, more likely indicates an assembly or sample problem (e.g., contamination or a mixed-strain sample in bacteria, uncollapsed haplotypes in diploids).

5. **Run BUSCO on the correct