tooluniverse-fastq-qc
This skill performs quality control on raw sequencing reads using FastQC, MultiQC, and seqkit, then optionally trims adapters or low-quality bases with fastp or Cutadapt based on user confirmation. Use it when analyzing FASTQ files to assess read quality, detect contamination, and decide whether trimming is needed before downstream analysis like alignment or variant calling.
git clone --depth 1 https://github.com/mims-harvard/ToolUniverse /tmp/tooluniverse-fastq-qc && cp -r /tmp/tooluniverse-fastq-qc/plugin/skills/tooluniverse-fastq-qc ~/.claude/skills/tooluniverse-fastq-qc
# FASTQ Quality Control & Trimming Decisions
Run quality control on raw sequencing reads, interpret the report, and make
an evidence-based decision about whether to trim — using real local
command-line tools (FastQC, MultiQC, fastp, Cutadapt, seqkit).
## Honesty contract (read first)
This skill drives **real binaries**. It must never fabricate QC numbers.
1. **Preflight before anything.** Check whether the required tools are on
PATH. If a required tool is missing, emit the install plan and STOP.
Do not estimate, guess, or describe hypothetical QC results.
2. **Never auto-trim.** Trimming is a *decision*. QC-only is the default.
Only trim after inspecting adapter content / per-base quality, and only
when the user has confirmed `--mode trim`.
3. **Never overwrite raw FASTQs.** All outputs go to a separate `--workdir`.
The input directory is read-only. Trimmed reads are written as NEW files.
4. **If you cannot run, say so.** "FastQC is not installed; here is the
install plan" is the correct answer — not a made-up PASS/FAIL table.
## When to use vs. not
**Use this skill when the user wants to:**
- Run FastQC / fastp QC on one or more FASTQ (`.fastq`, `.fq`, `.gz`) files
- Interpret a FastQC report (per-base quality, adapter content, etc.)
- Decide whether adapter or quality trimming is needed before downstream work
- Summarize many samples into one MultiQC report
- Count reads, get length/GC stats, or subsample with seqkit
- Trim adapters/low-quality bases with fastp or Cutadapt (explicitly)
**Do NOT use this skill for (route elsewhere):**
- Differential expression / DEG / fold-change analysis -> `tooluniverse-rnaseq-deseq2`
- Read alignment, coverage depth, samtools, BWA -> `tooluniverse-sequence-analysis`
- Variant calling, VCF, VAF, mutation analysis -> `tooluniverse-variant-analysis`
- Single-cell / scRNA QC (per-cell metrics, scanpy) -> `tooluniverse-single-cell`
## Essential inputs to confirm
Before running, confirm with the user (ask if unstated):
1. **FASTQ paths** — exact path(s). One file = single-end; an R1+R2 pair =
paired-end (e.g. `*_R1.fastq.gz` / `*_R2.fastq.gz`).
2. **QC-only or trim?** Default is QC-only. Only trim on explicit request.
3. **Known adapters / primers?** Standard Illumina adapters are auto-detected
by fastp; amplicon/primer sequences usually need explicit Cutadapt removal.
4. **Organism** — only needed if a contamination / over-representation screen
is requested (needs a reference; see Limitations).
5. **Output directory** — a `--workdir` SEPARATE from the input folder.
6. **Read provenance** — are these raw, already-trimmed, or UMI-tagged reads?
Already-trimmed reads should NOT be trimmed again; UMIs must be handled
before trimming or you corrupt the UMI.
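The single-end versus paired-end check from item 1 can be sketched as below. This is a minimal illustration, not the bundled script's logic: the function name `classify_layout` and the `_R1` / `_R2` filename convention are assumptions drawn from the example pattern above, and real datasets may use other mate markers.

```python
import re
from pathlib import Path

# Assumed convention: an "_R1" / "_R2" token followed by "." or "_",
# as in the *_R1.fastq.gz / *_R2.fastq.gz example above.
_MATE = re.compile(r"_R([12])(?=[._])")

def classify_layout(paths):
    """Return ("single", [p]) or ("paired", [r1, r2]); raise on anything else."""
    paths = [Path(p) for p in paths]
    if len(paths) == 1:
        return "single", paths
    if len(paths) == 2:
        mates = {}
        for p in paths:
            m = _MATE.search(p.name)
            if m:
                mates[m.group(1)] = p
        if set(mates) == {"1", "2"}:
            # Mates must share a sample name once the mate number is removed.
            name1 = _MATE.sub("_R", mates["1"].name)
            name2 = _MATE.sub("_R", mates["2"].name)
            if name1 == name2:
                return "paired", [mates["1"], mates["2"]]
    # Ambiguous input is a question for the user, not something to guess.
    raise ValueError("expected one FASTQ or a matching R1/R2 pair")
```

Raising on ambiguous input, rather than guessing a pairing, keeps the step in line with the "ask if unstated" rule.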
## Preflight (do this first, every time)
The bundled script preflights for you, but the decision logic is:
```python
import shutil

# Print the resolved path of each tool, or MISSING if it is not on PATH.
for tool in ("fastqc", "fastp", "seqkit"):
    print(tool, shutil.which(tool) or "MISSING")
```
If `command -v fastqc` prints nothing, or `shutil.which("fastqc")` returns
`None`, the tool is absent. If a **required** tool (FastQC for QC; FastQC+fastp
for trim) is missing, emit:
```
mamba install -c bioconda -c conda-forge fastqc fastp seqkit multiqc
# or
conda install -c bioconda -c conda-forge fastqc fastp seqkit multiqc
```
and stop. Do not proceed to fabricate output.
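The stop-on-missing rule can be sketched per mode as follows. This is an assumption-laden illustration rather than the bundled script's code: the names `REQUIRED`, `missing_required`, and `preflight` are hypothetical, and seqkit is treated as optional because the text above lists only FastQC (and fastp for trim) as required.

```python
import shutil

# Required tools per mode, as stated above; seqkit is treated as optional.
REQUIRED = {"qc": ("fastqc",), "trim": ("fastqc", "fastp")}
INSTALL_PLAN = "mamba install -c bioconda -c conda-forge fastqc fastp seqkit multiqc"

def missing_required(mode, which=shutil.which):
    """Return the required tools for `mode` that are not on PATH."""
    return [tool for tool in REQUIRED[mode] if which(tool) is None]

def preflight(mode="qc", which=shutil.which):
    """Return (ok, message). The caller should stop when ok is False."""
    missing = missing_required(mode, which)
    if missing:
        return False, f"Missing: {', '.join(missing)}. Install with:\n{INSTALL_PLAN}"
    return True, "all required tools found"
```

The `which` parameter exists only so the lookup can be swapped out in a test; in normal use the default `shutil.which` applies.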
## Tool roles
| Tool | Role | Install (bioconda) |
|-----------|----------------------------------------------------------------|--------------------|
| FastQC | Per-file raw read QC; produces the module PASS/WARN/FAIL report | `fastqc` |
| MultiQC | Aggregates many FastQC (and fastp) reports into one summary | `multiqc` |
| fastp | All-in-one QC + adapter + quality trimming (fast, auto-detect) | `fastp` |
| Cutadapt | Explicit, precise adapter/primer removal (amplicons, custom) | `cutadapt` |
| seqkit | Read counts, length/GC stats, subsampling | `seqkit` |
Rule of thumb: **FastQC to diagnose, fastp to fix general adapter/quality,
Cutadapt to fix a known primer/adapter precisely, seqkit to count/stat.**
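For the "seqkit to count/stat" role, one possible way to collect read counts programmatically is to run `seqkit stats` with its tabular flag (`-T`) and parse the tab-separated output. The helper names below are hypothetical, and this is a sketch rather than what the bundled script necessarily does; it only ever reports what seqkit actually printed.

```python
import csv
import io
import subprocess

def parse_seqkit_tabular(text):
    """Parse tab-separated `seqkit stats -T` output into a list of dicts.

    Column names come from seqkit's own header row; values stay as strings.
    """
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

def seqkit_stats(fastq_paths):
    """Run `seqkit stats -T` on real files and return one dict per file."""
    result = subprocess.run(
        ["seqkit", "stats", "-T", *map(str, fastq_paths)],
        capture_output=True, text=True, check=True,  # raise if seqkit fails
    )
    return parse_seqkit_tabular(result.stdout)
```

Because the parser reads column names from the header, it does not hard-code which statistics seqkit reports.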
## Bundled orchestration script
`scripts/run_fastq_qc.py` does the preflight + run-if-available + plan-if-missing
flow, with workspace isolation built in.
```bash
# QC only (default) — never modifies reads
python scripts/run_fastq_qc.py \
--fastq reads/sample_R1.fastq.gz reads/sample_R2.fastq.gz \
--workdir /tmp/fastq_qc_run
# QC + trim (explicit) — fastp writes NEW trimmed files into --workdir
python scripts/run_fastq_qc.py \
--fastq reads/sample_R1.fastq.gz reads/sample_R2.fastq.gz \
--workdir /tmp/fastq_qc_run \
--mode trim
```
Behavior:
- **Preflights** FastQC (+ fastp in trim mode) and seqkit. If a required
tool is missing it prints the install plan and exits 0 — no fabricated QC.
- Runs **FastQC** (always) + **seqkit stats** (if present) into `--workdir`.
- In `--mode trim`, runs **fastp** writing `*.trimmed.fastq.gz` into
`--workdir/trimmed/` — raw inputs are never touched.
- **Refuses** to run if `--workdir` equals an input directory (overwrite guard).
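The overwrite guard and the trimmed-output naming described above might look roughly like this. It is a sketch under stated assumptions: the function names are hypothetical, comparing resolved paths is one possible way to implement the guard, and placing the fastp JSON/HTML reports directly in the workdir is a choice made for this example. The fastp options used (`-i`, `-I`, `-o`, `-O`, `-j`, `-h`, `--detect_adapter_for_pe`) are standard fastp flags.

```python
from pathlib import Path

def check_workdir(fastq_paths, workdir):
    """Raise if workdir is the directory that holds any input FASTQ."""
    work = Path(workdir).resolve()
    for fq in fastq_paths:
        if Path(fq).resolve().parent == work:
            raise ValueError(f"--workdir {work} contains input {fq}; choose another directory")
    return work

def trimmed_path(fastq, workdir):
    """Map reads/x.fastq.gz to <workdir>/trimmed/x.trimmed.fastq.gz."""
    name = Path(fastq).name
    for ext in (".fastq.gz", ".fq.gz", ".fastq", ".fq"):
        if name.endswith(ext):
            name = name[: -len(ext)]
            break
    return Path(workdir) / "trimmed" / f"{name}.trimmed.fastq.gz"

def fastp_command(fastq_paths, workdir):
    """Build (not run) a fastp argv for one single-end file or an R1/R2 pair."""
    work = check_workdir(fastq_paths, workdir)
    cmd = ["fastp", "-i", str(fastq_paths[0]), "-o", str(trimmed_path(fastq_paths[0], work))]
    if len(fastq_paths) == 2:
        cmd += ["-I", str(fastq_paths[1]), "-O", str(trimmed_path(fastq_paths[1], work)),
                "--detect_adapter_for_pe"]
    # Report locations are an assumption of this sketch.
    cmd += ["-j", str(work / "fastp.json"), "-h", str(work / "fastp.html")]
    return cmd
```

Building the argument list without executing it lets the command be shown to the user for confirmation before any trimming happens.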
For a project-level summary after FastQC, run MultiQC over the workdir:
```bash
multiqc /tmp/fastq_qc_run -o /tmp/fastq_qc_run/multiqc
```
## INTERPRETATION — FastQC module -> meaning -> action
This table is the core value-add. Map each FastQC module to what PASS/WARN/FAIL
means and what to actually do. (See `references/fastqc_interpretation.md` for the
long form with thresholds and worked cases.)
| FastQC module | Typical PASS | WARN / FAIL means | Suggested action |
|------------------------------|---------------------------|----------------------------------------------------------------|------------------|
| Per base sequence quality | All positions Q>