
busco-status-interpretation

This skill provides guidance on interpreting BUSCO completeness assessment results, including why duplicated orthologs count toward completeness, how to parse output files, and strategies for comparing completeness across different proteomes and genomes. Use when running BUSCO quality control assessments, comparing genome or transcriptome assemblies, or preparing completeness metrics for publication, particularly when working with polyploid organisms or genomes with legitimate gene duplications that require correct interpretation.

Repository: SciAgent-Skills

Install in Claude Code


git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/busco-status-interpretation && cp -r /tmp/busco-status-interpretation/skills/genomics-bioinformatics/qc/busco-status-interpretation ~/.claude/skills/busco-status-interpretation

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# BUSCO Status Interpretation Guide

## Overview

BUSCO (Benchmarking Universal Single-Copy Orthologs) is a widely used tool for assessing genome, transcriptome, and proteome completeness by searching for conserved single-copy orthologs defined from the OrthoDB database. Correct interpretation of BUSCO output is essential for genome quality assessment, comparative genomics, and publication-ready reporting. A common analytical error is excluding Duplicated BUSCOs from completeness counts, which artificially penalizes polyploid organisms and assemblies with legitimate gene duplications.

This guide covers BUSCO status categories, output file formats, parsing strategies, cross-proteome comparisons, lineage dataset selection, and common pitfalls in BUSCO interpretation.

---

## Key Concepts

### BUSCO Status Categories

BUSCO assigns each searched ortholog one of four statuses:

| Status | Abbreviation | Meaning | Count as Complete? |
|---|---|---|---|
| **Complete (single-copy)** | S | Found exactly once in the genome/proteome | YES |
| **Duplicated** | D | Found more than once (multiple copies) | YES |
| **Fragmented** | F | Partial match, likely incomplete gene model | NO |
| **Missing** | M | Not detected at all | NO |

The headline completeness percentage (C%) reported by BUSCO is always S + D combined. Individual category counts (S, D, F, M) are reported for transparency and should be included in publications.

### Why Duplicated Equals Complete

A Duplicated BUSCO means the ortholog IS present and fully intact in the genome or proteome -- it simply exists in more than one copy. This can occur through:

- Whole-genome duplication (common in plants, fish, and amphibians)
- Tandem or segmental duplication events
- Recent polyploidy
- Proteomes containing multiple isoforms per gene

The gene is not incomplete or absent. Excluding Duplicated BUSCOs from completeness counts would incorrectly penalize polyploid organisms, recently duplicated genomes, or proteomes that include isoform-level annotations. The correct completeness formula is always:

```
Completeness (%) = (Complete_single_copy + Duplicated) / Total_BUSCOs * 100
```

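A high Duplicated fraction is not inherently problematic -- it is biologically informative. For example, salmonid genomes, which retain many gene pairs from a lineage-specific whole-genome duplication, typically show a much larger Duplicated fraction than other teleosts, and this is expected. The exact value depends on the lineage dataset used, so interpret D% against other assemblies from the same clade rather than against a fixed threshold.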
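The formula above can be sketched in a few lines of Python. This is a minimal illustration, not part of BUSCO itself: the function name `busco_completeness` and the counts in the usage example are hypothetical.

```python
def busco_completeness(single_copy: int, duplicated: int,
                       fragmented: int, missing: int) -> dict:
    """Return the percentage for each BUSCO category plus the combined C%.

    Duplicated BUSCOs are counted as complete, matching the formula above.
    """
    total = single_copy + duplicated + fragmented + missing
    if total == 0:
        raise ValueError("total BUSCO count is zero")

    def pct(count: int) -> float:
        return round(100.0 * count / total, 1)

    return {
        "C": pct(single_copy + duplicated),  # headline completeness = S + D
        "S": pct(single_copy),
        "D": pct(duplicated),
        "F": pct(fragmented),
        "M": pct(missing),
        "n": total,
    }


# Hypothetical counts for a polyploid proteome (illustrative only).
result = busco_completeness(single_copy=900, duplicated=600,
                            fragmented=60, missing=54)
print(result)
```

Because each percentage is rounded independently, the rounded S and D values may differ from the rounded C value by 0.1 when summed; compute C from the raw counts, as done here, rather than from the rounded parts.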

### BUSCO Output Formats

BUSCO produces two primary output formats relevant to downstream analysis:

**Short summary format** -- a single-line notation found in `short_summary.*.txt`:

```
C:95.0%[S:90.0%,D:5.0%],F:3.0%,M:2.0%,n:255
```

Where C = Complete (S + D), S = Single-copy, D = Duplicated, F = Fragmented, M = Missing, and n = total BUSCO groups searched.
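A minimal sketch of parsing this notation with a regular expression follows. The pattern assumes the notation exactly as shown above; the names `SUMMARY_RE` and `parse_short_summary_line` are illustrative, not part of BUSCO.

```python
import re

# Matches the one-line notation shown above, e.g.
# C:95.0%[S:90.0%,D:5.0%],F:3.0%,M:2.0%,n:255
SUMMARY_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_short_summary_line(text: str) -> dict:
    """Extract C, S, D, F, M (as floats) and n (as int) from summary text."""
    match = SUMMARY_RE.search(text)
    if match is None:
        raise ValueError("no BUSCO summary notation found")
    fields = {key: float(value) for key, value in match.groupdict().items()}
    fields["n"] = int(match.group("n"))
    return fields


parsed = parse_short_summary_line(
    "C:95.0%[S:90.0%,D:5.0%],F:3.0%,M:2.0%,n:255"
)
print(parsed)
```

Using `search` rather than `fullmatch` lets the same function accept the whole contents of a summary file, since the notation is embedded among other lines.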

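**Full table format** -- a TSV file (`full_table.tsv`) with per-ortholog results containing columns for BUSCO ID, Status, Sequence, Score, and Length; genome-mode runs additionally report the gene start, gene end, and strand of each hit. Header lines begin with `#`, a Duplicated BUSCO appears on one row per copy, and a Missing BUSCO has only its ID and status. This file enables detailed per-gene analysis, filtering, and cross-species comparisons.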
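The sketch below shows one way to tally statuses from a full table while counting each BUSCO ID once, since a Duplicated ortholog occupies several rows. It assumes only that the first two tab-separated columns are the BUSCO ID and the status; the function name `count_statuses` and the in-memory example rows (IDs, sequence names, scores) are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_statuses(handle) -> Counter:
    """Count BUSCO statuses, counting each BUSCO ID only once."""
    status_by_id = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and '#' header/comment lines.
        if not row or row[0].startswith("#"):
            continue
        busco_id, status = row[0], row[1]
        status_by_id[busco_id] = status  # repeated IDs collapse to one entry
    return Counter(status_by_id.values())


# Made-up rows in the column order described above (illustrative only).
example = io.StringIO(
    "# Busco id\tStatus\tSequence\tScore\tLength\n"
    "1001at2759\tComplete\tprotA\t512.3\t410\n"
    "1002at2759\tDuplicated\tprotB\t300.1\t250\n"
    "1002at2759\tDuplicated\tprotC\t298.7\t249\n"
    "1003at2759\tFragmented\tprotD\t120.0\t90\n"
    "1004at2759\tMissing\n"
)
counts = count_statuses(example)
total = sum(counts.values())
complete = counts["Complete"] + counts["Duplicated"]  # S + D
print(f"C: {100.0 * complete / total:.1f}% of n={total}")
```

For a real file, pass an open handle instead of the `io.StringIO` object, for example `open(path, newline="")`.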

---

## Decision Framework

When deciding whether and how to use BUSCO for quality assessment:

```
Question: What are you assessing?
├── Genome assembly completeness
│   ├── Draft assembly → Run BUSCO in genome mode
│   └── Polished/final assembly → Run BUSCO in genome mode, report in publication
├── Transcriptome completeness
│   └── De novo assembly → Run BUSCO in transcriptome mode (expect higher D%)
├── Proteome / annotation completeness
│   └── Predicted proteins → Run BUSCO in protein mode
└── Comparing multiple assemblies
    └── Same lineage dataset across all → Use compare_proteome_completeness pattern
```
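As a rough sketch, the first three branches of the tree map onto BUSCO's `-m` mode argument. The helper below only assembles an argument list and does not run anything; the function name `build_busco_command`, the input-type labels, and the file names in the example are assumptions made for illustration.

```python
# Map the kind of input (labels chosen for this sketch) to BUSCO's -m value.
MODE_BY_INPUT = {
    "genome": "genome",
    "transcriptome": "transcriptome",
    "proteome": "proteins",
}


def build_busco_command(input_path: str, input_type: str,
                        lineage: str, out_name: str, cpus: int = 4) -> list:
    """Assemble a BUSCO command line as an argument list (not executed)."""
    try:
        mode = MODE_BY_INPUT[input_type]
    except KeyError:
        raise ValueError(f"unknown input type: {input_type!r}") from None
    return [
        "busco",
        "-i", input_path,   # input FASTA
        "-l", lineage,      # lineage dataset
        "-m", mode,         # genome / transcriptome / proteins
        "-o", out_name,     # output run name
        "-c", str(cpus),    # number of CPU threads
    ]


# Hypothetical file and run names.
cmd = build_busco_command("proteins.faa", "proteome",
                          "embryophyta_odb10", "busco_prot")
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where BUSCO is installed. Keeping `lineage` an explicit argument makes it harder to mix datasets by accident when several assemblies are compared.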

### Lineage Dataset Selection

| Organism type | Recommended lineage | Example dataset | Notes |
|---|---|---|---|
| Broad eukaryotic screen | eukaryota | `eukaryota_odb10` | Low resolution, useful for initial checks |
| Vertebrate | vertebrata or class-level | `mammalia_odb10`, `actinopterygii_odb10` | Class-level gives better resolution |
| Insect | insecta or order-level | `diptera_odb10`, `hymenoptera_odb10` | Order-level preferred when available |
| Plant | viridiplantae or more specific | `embryophyta_odb10`, `eudicots_odb10` | Plants often show high D% due to polyploidy |
| Fungus | fungi or division-level | `ascomycota_odb10`, `basidiomycota_odb10` | Match to known phylogenetic placement |
| Bacterium | bacteria or phylum-level | `proteobacteria_odb10` | Use `--auto-lineage-prok` for unknown bacteria |

**General rule**: Use the most specific lineage dataset that encompasses your organism. More specific datasets contain more BUSCOs and provide higher resolution, but using a dataset that does not include your organism will produce misleadingly low scores.
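The general rule can be expressed as a small lookup, sketched below under two assumptions: the caller supplies the organism's clades ordered from most specific to broadest, and `available` is the set of dataset names installed or downloadable (here an illustrative subset taken from the table above; `busco --list-datasets` prints the real list). The function name `pick_lineage` is hypothetical.

```python
def pick_lineage(taxonomy_path, available, suffix="_odb10"):
    """Return the most specific available dataset covering the organism.

    taxonomy_path: clade names ordered from most specific to broadest.
    available: set of dataset names, e.g. {"mammalia_odb10", ...}.
    """
    for clade in taxonomy_path:
        dataset = f"{clade}{suffix}"
        if dataset in available:
            return dataset  # first hit is the most specific match
    raise LookupError("no available lineage dataset covers this organism")


# Illustrative subset of dataset names from the table above.
available = {
    "eukaryota_odb10",
    "mammalia_odb10",
    "actinopterygii_odb10",
    "embryophyta_odb10",
}

# Zebrafish: no family-level dataset in this subset, so the class-level one wins.
choice = pick_lineage(
    ["cyprinidae", "actinopterygii", "vertebrata", "eukaryota"], available
)
print(choice)
```

Because the search only walks the organism's own ancestry, it cannot return a dataset that excludes the organism, which is the failure mode the rule warns about.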

---

## Best Practices

1. **Always report all four categories (S, D, F, M)**: Do not report only the headline C% value. Reviewers and readers need the breakdown to assess whether high completeness comes from single-copy genes (expected for most genomes without recent whole-genome duplication) or duplicated genes (expected for polyploids). This is now a standard expectation in genome papers.

2. **Use the same lineage dataset for all comparisons**: When comparing assemblies or proteomes, every run must use the identical lineage dataset and BUSCO version. Mixing lineage datasets (e.g., comparing one assembly run with `eukaryota_odb10` against another with `metazoa_odb10`) produces incomparable results.

3. **Choose the most specific lineage available**: More specific lineage datasets provide more BUSCO markers and finer resolution. A mammalian genome assessed with `eukaryota_odb10` (255 markers) gives a much coarser picture than one assessed with `mammalia_odb10` (9,226 markers).

4. **Interpret Duplicated percentage in biological context**: High D% in many plants or in salmonid fish is expected due to known whole-genome duplication events. High D% in a diploid assembly with no known whole-genome duplication may indicate uncollapsed haplotypes, and high D% in a bacterial assembly may indicate contamination or a mixture of strains.

5. **Run BUSCO on the correct