tooluniverse-microbial-genome-characterization

This Claude Code skill retrieves, quality-controls, and structurally maps microbial genome assemblies using NCBI Datasets, returning assembly inventory, QC metrics (N50, GC content, contig count, assembly level), and replicon inventories (chromosomes and plasmids). Use it to discover available genomes for an organism, retrieve assembly statistics and accession details, compare assemblies by quality, identify reference genomes, or determine assembly completeness, but not for gene-level orthology or de novo assembly from raw reads.

Repository: ToolUniverse

Install in Claude Code

git clone --depth 1 https://github.com/mims-harvard/ToolUniverse /tmp/tooluniverse-microbial-genome-characterization && cp -r /tmp/tooluniverse-microbial-genome-characterization/plugin/skills/tooluniverse-microbial-genome-characterization ~/.claude/skills/tooluniverse-microbial-genome-characterization

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Microbial Genome Assembly Characterization & QC

Discover, quality-control, and structurally map genome ASSEMBLIES for any organism using the keyless NCBI Datasets genome tools. Organism/taxon in → assembly inventory, QC metrics, and chromosome/plasmid map out.

## LOOK UP, DON'T GUESS
When uncertain about an accession, assembly level, replicon count, or N50, CALL the tool. Never report assembly statistics from memory: accession versions and metrics can change between RefSeq releases. A live NCBI Datasets answer is always more reliable than a guess.

## COMPUTE, DON'T DESCRIBE
When comparing multiple assemblies or ranking by quality, retrieve each via the tools, then write and run Python (pandas) over the returned JSON to sort, score, and tabulate. Don't describe what you would compute — execute it and report actual numbers.

## When to Use This Skill

**Triggers**:
- "What genomes are available for [organism]?" / "Find the reference genome for [taxon]"
- "Assembly stats for GCF_000005845.2" / "What's the N50 / GC content of [accession]?"
- "How many plasmids does [strain] have?" / "List the replicons in [accession]"
- "Compare the assemblies for [species] — which is best quality?"
- "Is [accession] a complete genome or draft?"

**Use Cases**:
1. **Assembly discovery**: enumerate available assemblies for a taxon, optionally only reference-grade
2. **Assembly QC**: pull length, N50, contig count, GC%, level, RefSeq category for an accession
3. **Replicon mapping**: list chromosomes and plasmids with their RefSeq/GenBank accessions and lengths
4. **Assembly comparison**: rank candidate assemblies of one species by completeness and contiguity
5. **Reference selection**: identify the designated reference/representative genome for a taxon

**NOT this skill** (point elsewhere):
- Gene-level orthology, synteny, conservation → `tooluniverse-comparative-genomics`
- Plant gene structure / annotation → `tooluniverse-plant-genomics`
- De novo assembly from sequencing reads → no ToolUniverse tool exists; say so
- Pure taxonomy lookups (name to lineage) with no genome question → use NCBI taxonomy tools directly

---

## Tools (all keyless, verified live)

| Tool | Key params | Returns |
|------|-----------|---------|
| `NCBIDatasets_suggest_taxonomy` | `query` (organism name string) | candidate matches: `scientific_name`, `tax_id`, `rank`, `group_name` |
| `NCBIDatasets_get_taxonomy` | `tax_id` (string/int) | `organism_name`, `rank`, `lineage`, `children` |
| `NCBIDatasets_list_genomes_by_taxon` | `taxon` (name OR taxid), `limit`, `reference_only` (bool) | assembly list (accession, assembly_level, refseq_category, total_sequence_length, contig_n50, gc_percent, number_of_chromosomes, number_of_contigs); `metadata.total_available` = full count |
| `NCBIDatasets_get_genome_assembly` | `accession` (GCF_/GCA_) | full QC: total_sequence_length, number_of_chromosomes, number_of_contigs, contig_n50, scaffold_n50, gc_percent, assembly_level, assembly_status, refseq_category, release_date, submitter, annotation_provider |
| `NCBIDatasets_get_sequence_reports` | `accession` (GCF_/GCA_) | per-replicon list: chr_name, role, refseq_accession, genbank_accession, length, gc_percent |

> Param note: `get_taxonomy` requires `tax_id` (NOT `taxon`). `list_genomes_by_taxon` accepts either a name or a taxid in its `taxon` field. Always pass an accession to the assembly/sequence-report tools.
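
The parameter rules above can be checked before a call is issued. The sketch below is a minimal, hypothetical helper: the tool and parameter names are copied from the table, but `build_call` and the `{"name": ..., "arguments": ...}` payload shape are assumptions for illustration, not a documented ToolUniverse interface.

```python
# Parameter names copied from the tools table; the helper itself is hypothetical.
ALLOWED_PARAMS = {
    "NCBIDatasets_suggest_taxonomy": {"query"},
    "NCBIDatasets_get_taxonomy": {"tax_id"},
    "NCBIDatasets_list_genomes_by_taxon": {"taxon", "limit", "reference_only"},
    "NCBIDatasets_get_genome_assembly": {"accession"},
    "NCBIDatasets_get_sequence_reports": {"accession"},
}


def build_call(tool, **arguments):
    """Return a call payload, rejecting unknown tools or parameter names."""
    if tool not in ALLOWED_PARAMS:
        raise ValueError(f"unknown tool: {tool}")
    unexpected = set(arguments) - ALLOWED_PARAMS[tool]
    if unexpected:
        raise ValueError(f"{tool} does not accept {sorted(unexpected)}")
    # Assumed payload shape; adapt to however your client issues tool calls.
    return {"name": tool, "arguments": arguments}


call = build_call("NCBIDatasets_get_taxonomy", tax_id="562")
```

Passing `taxon="562"` to `NCBIDatasets_get_taxonomy` raises `ValueError` here, which catches the `tax_id` vs. `taxon` mix-up before any request is sent.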

---

## Workflow

### Phase 0 — Resolve the organism (skip if you already have an accession)
If the user gives an organism name, resolve it to a tax id first:

```
NCBIDatasets_suggest_taxonomy {"query": "Escherichia coli"}
```

Pick the candidate whose `scientific_name`/`rank` matches the user's intent (species vs. a specific strain). Optionally confirm lineage/children with `NCBIDatasets_get_taxonomy {"tax_id": "562"}`.

If the user already gave a GCF_/GCA_ accession, skip to Phase 2.
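
The routing in this phase can be sketched as follows. Both helper names are hypothetical, the candidate rows are illustrative rather than live tool output, and the regular expression assumes the usual versioned accession form (`GCF_`/`GCA_`, nine digits, a dot, a version number).

```python
import re

# Assumed pattern for a versioned assembly accession, e.g. GCF_000005845.2.
ACCESSION_RE = re.compile(r"^GC[AF]_\d{9}\.\d+$")


def is_assembly_accession(text):
    """True if the input already looks like a GCF_/GCA_ accession (skip to Phase 2)."""
    return bool(ACCESSION_RE.match(text.strip()))


def pick_candidate(candidates, name, rank="species"):
    """Return the first candidate matching name and rank (case-insensitive), else None."""
    wanted_name, wanted_rank = name.strip().lower(), rank.lower()
    for cand in candidates:
        if (cand.get("scientific_name", "").lower() == wanted_name
                and cand.get("rank", "").lower() == wanted_rank):
            return cand
    return None


# Illustrative rows shaped like the suggest_taxonomy fields listed in the tools table.
candidates = [
    {"scientific_name": "Escherichia", "tax_id": "561", "rank": "GENUS"},
    {"scientific_name": "Escherichia coli", "tax_id": "562", "rank": "SPECIES"},
]
chosen = pick_candidate(candidates, "Escherichia coli")
```

When `pick_candidate` returns `None`, show the candidate list to the user rather than guessing a tax id.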

### Phase 1 — Inventory the assemblies
List what exists for the taxon. Start with `reference_only: true` to surface the curated reference/representative genome(s); set it to `false` to see the full set.

```
NCBIDatasets_list_genomes_by_taxon {"taxon": "562", "limit": 5, "reference_only": true}
```

Read `metadata.total_available` for the true count (large taxa return thousands — the `data` array is only the first `limit` rows). Note each candidate's `assembly_level`, `refseq_category`, `contig_n50`, and `number_of_contigs`.
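
A small check makes the difference between the returned rows and the true count explicit. This is a sketch that assumes the response is a dict with a `data` list and a `metadata.total_available` integer, as described above; the function name and the sample numbers are made up.

```python
def inventory_summary(response):
    """Report how much of the taxon's assembly set the response actually holds."""
    rows = response.get("data", [])
    # Fall back to the row count if the metadata field is missing.
    total = response.get("metadata", {}).get("total_available", len(rows))
    return {
        "returned": len(rows),
        "total_available": total,
        "truncated": len(rows) < total,
    }


# Made-up response: two rows returned out of a much larger set.
response = {
    "data": [{"accession": "asm_a"}, {"accession": "asm_b"}],
    "metadata": {"total_available": 5000},
}
summary = inventory_summary(response)
```

If `truncated` is true, state the full count alongside the rows shown so the user does not mistake the first `limit` rows for the whole inventory.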

### Phase 2 — Select the assembly
Prefer, in order:
1. `refseq_category == "reference genome"` (NCBI's single designated reference)
2. `refseq_category == "representative genome"`
3. Highest `assembly_level` (Complete Genome > Chromosome > Scaffold > Contig)
4. Highest `contig_n50` and lowest `number_of_contigs` among same-level candidates
5. A GCF_ (RefSeq) accession over its paired GCA_ (GenBank) when both exist — RefSeq is the curated copy
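
The preference order above maps naturally onto a tuple sort key. The following is a minimal sketch under the assumption that each candidate is a flat dict using the field names from the tools table; the function names and the three candidates (including their placeholder accessions) are hypothetical.

```python
CATEGORY_RANK = {"reference genome": 0, "representative genome": 1}
LEVEL_RANK = {"Complete Genome": 0, "Chromosome": 1, "Scaffold": 2, "Contig": 3}


def selection_key(asm):
    """Smaller tuples are preferred; elements follow the five rules in order."""
    return (
        CATEGORY_RANK.get(asm.get("refseq_category"), 2),           # rules 1-2
        LEVEL_RANK.get(asm.get("assembly_level"), len(LEVEL_RANK)),  # rule 3
        -(asm.get("contig_n50") or 0),                               # rule 4: higher N50 first
        asm.get("number_of_contigs") or float("inf"),                # rule 4: fewer contigs first
        0 if str(asm.get("accession", "")).startswith("GCF_") else 1,  # rule 5
    )


def select_assembly(candidates):
    return min(candidates, key=selection_key)


# Hypothetical candidates; accessions and metrics are placeholders, not real records.
candidates = [
    {"accession": "GCA_draft", "refseq_category": None,
     "assembly_level": "Contig", "contig_n50": 50_000, "number_of_contigs": 120},
    {"accession": "GCF_complete", "refseq_category": None,
     "assembly_level": "Complete Genome", "contig_n50": 4_600_000, "number_of_contigs": 1},
    {"accession": "GCF_rep", "refseq_category": "representative genome",
     "assembly_level": "Scaffold", "contig_n50": 200_000, "number_of_contigs": 40},
]
best = select_assembly(candidates)
```

Here the representative genome wins even though another candidate is more complete, because RefSeq category is checked before assembly level.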

### Phase 3 — Pull assembly QC metrics
```
NCBIDatasets_get_genome_assembly {"accession": "GCF_000005845.2"}
```
Report: total length, # chromosomes, # contigs, contig N50, scaffold N50, GC%, assembly level, RefSeq category, release date, annotation provider.
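
One way to keep the report consistent is to drive it from a fixed field list. The sketch below assumes the tool response has been reduced to a flat dict keyed by the field names in the tools table; the function name, the labels, and the sample values are illustrative only.

```python
REPORT_FIELDS = [
    ("total_sequence_length", "Total length (bp)"),
    ("number_of_chromosomes", "Chromosomes"),
    ("number_of_contigs", "Contigs"),
    ("contig_n50", "Contig N50"),
    ("scaffold_n50", "Scaffold N50"),
    ("gc_percent", "GC%"),
    ("assembly_level", "Assembly level"),
    ("refseq_category", "RefSeq category"),
    ("release_date", "Release date"),
    ("annotation_provider", "Annotation provider"),
]


def format_qc_report(assembly):
    """Render one line per QC field, flagging fields the tool did not return."""
    lines = []
    for key, label in REPORT_FIELDS:
        value = assembly.get(key)
        lines.append(f"{label}: {'not reported' if value is None else value}")
    return "\n".join(lines)


# Placeholder values for illustration; real numbers must come from the tool call.
report = format_qc_report({"contig_n50": 1000, "assembly_level": "Scaffold"})
```

Missing fields are printed as "not reported" rather than filled in from memory.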

### Phase 4 — Map the replicons (chromosomes + plasmids)
```
NCBIDatasets_get_sequence_reports {"accession": "GCF_000005845.2"}
```
Each row is one replicon. Distinguish chromosomes from plasmids by `chr_name` / `role`: a row with a plasmid-style name such as `pO157` or `pOSAK1` is a plasmid; rows named `chromosome` are chromosomes. To answer "how many plasmids", count the assembled-molecule rows that are not chromosomes.
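
The counting rule can be sketched as below. It assumes each row is a dict with `chr_name` and `role`, that assembled replicons carry the role `assembled-molecule`, and that chromosome rows have a `chr_name` starting with `chromosome`. Genomes whose chromosomes are named differently (for example numbered ones) would need an adjusted test. The function name and sample rows are hypothetical.

```python
def split_replicons(rows):
    """Split assembled-molecule rows into (chromosomes, plasmids)."""
    chromosomes, plasmids = [], []
    for row in rows:
        if row.get("role") != "assembled-molecule":
            continue  # skip sequences that are not assembled replicons
        name = str(row.get("chr_name", "")).lower()
        if name.startswith("chromosome"):
            chromosomes.append(row)
        else:
            plasmids.append(row)
    return chromosomes, plasmids


# Hypothetical rows; the plasmid names are the examples used in the text.
rows = [
    {"chr_name": "chromosome", "role": "assembled-molecule"},
    {"chr_name": "pO157", "role": "assembled-molecule"},
    {"chr_name": "pOSAK1", "role": "assembled-molecule"},
    {"chr_name": "Un", "role": "unplaced-scaffold"},
]
chromosomes, plasmids = split_replicons(rows)
plasmid_names = [p["chr_name"] for p in plasmids]
```

Report both the count and the plasmid names, since a bare number hides which replicons were counted.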

### Phase 5 — Compare candidates (optional)
When the user wants the best of several assemblies, fetch each accession, build a pandas table, and sort by (assembly_level rank, then contig_n50 desc, then number_of_contigs asc). Report the winner with the metrics that decided it.
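
A minimal pandas sketch of that comparison follows. It assumes one flat record per fetched assembly with the field names from the tools table; `rank_assemblies`, the `level_rank` column, and the three records (including their placeholder accessions) are made up for illustration.

```python
import pandas as pd

LEVEL_RANK = {"Complete Genome": 0, "Chromosome": 1, "Scaffold": 2, "Contig": 3}


def rank_assemblies(records):
    """Sort by assembly level rank, then contig N50 descending, then contig count ascending."""
    df = pd.DataFrame(records)
    # Unknown levels sort after all known ones.
    df["level_rank"] = df["assembly_level"].map(LEVEL_RANK).fillna(len(LEVEL_RANK))
    return df.sort_values(
        by=["level_rank", "contig_n50", "number_of_contigs"],
        ascending=[True, False, True],
    ).reset_index(drop=True)


# Placeholder records; real metrics must come from get_genome_assembly calls.
records = [
    {"accession": "asm_scaffold", "assembly_level": "Scaffold",
     "contig_n50": 300_000, "number_of_contigs": 25},
    {"accession": "asm_complete_a", "assembly_level": "Complete Genome",
     "contig_n50": 4_000_000, "number_of_contigs": 3},
    {"accession": "asm_complete_b", "assembly_level": "Complete Genome",
     "contig_n50": 5_000_000, "number_of_contigs": 1},
]
ranked = rank_assemblies(records)
winner = ranked.iloc[0]["accession"]
```

The first row of `ranked` is the winner; quote its `assembly_level`, `contig_n50`, and `number_of_contigs` next to the runner-up's values to show what decided the ranking.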

---

## Interpretation Table

**Assembly level** (contiguity, best → worst):

| Level | Meaning |
|-------|---------|
| Complete Genome | Every replicon (each chromoso
