
multiqc-qc-reports

Aggregates QC from 150+ bioinformatics tools into one interactive HTML report. Scans FastQC, samtools, STAR, HISAT2, Trim Galore, featureCounts, Kallisto, Salmon, Picard, GATK logs; merges per-sample stats with plots. For NGS pipeline-wide QC. Use FastQC directly for single-sample; MultiQC for multi-sample reporting.

Install in Claude Code
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/multiqc-qc-reports && cp -r /tmp/multiqc-qc-reports/skills/genomics-bioinformatics/qc/multiqc-qc-reports ~/.claude/skills/multiqc-qc-reports
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# MultiQC — Multi-Sample QC Report Aggregator

## Overview

MultiQC automatically searches directories for QC log files from 150+ bioinformatics tools and aggregates statistics across all samples into a single interactive HTML report. It parses outputs from FastQC, samtools flagstat, STAR, HISAT2, Trim Galore, Salmon, Kallisto, featureCounts, Picard, GATK, and many more — eliminating the need to manually review per-sample QC files. Reports include interactive bar plots, scatter plots, heatmaps, and tables with configurable warnings and pass/fail thresholds.

## When to Use

- Reviewing QC metrics across 10+ samples at once after FastQC, alignment, or quantification
- Final QC checkpoint before differential expression or variant analysis
- Sharing QC summaries with collaborators or including in publications
- Identifying batch effects, outlier samples, or failed sequencing runs
- Combining QC from multi-step pipelines (trimming → alignment → quantification) into one view
- Use FastQC directly instead for initial single-sample QC exploration
- For custom QC metrics not from standard tools, use Python/R directly; MultiQC parses tool outputs only

## Prerequisites

- **Python packages**: `multiqc`
- **Input requirements**: Output files from bioinformatics tools (FastQC `.zip`, samtools `.flagstat`, STAR `Log.final.out`, etc.) — MultiQC finds them automatically
- **Environment**: Python 3.8+

> **Check before installing**: The tool may already be available in the current environment (e.g., inside a `pixi` / `conda` env). Run `command -v multiqc` first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via `pixi run multiqc` rather than bare `multiqc`.

```bash
pip install multiqc

# Verify
multiqc --version
# e.g. multiqc, version 1.25

# With conda (recommended for bioinformatics); Bioconda packages also need conda-forge
conda install -c conda-forge -c bioconda multiqc
```
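The "check before installing" step can also be scripted. The following is a minimal Python sketch that uses only the standard library; the helper name `multiqc_path` is hypothetical and not part of MultiQC.

```python
import shutil


def multiqc_path():
    """Return the full path of the `multiqc` executable, or None if absent.

    Hypothetical helper: mirrors `command -v multiqc` from the shell.
    """
    return shutil.which("multiqc")


if __name__ == "__main__":
    path = multiqc_path()
    if path:
        print(f"MultiQC already available at {path}; skip the install step")
    else:
        print("MultiQC not found on PATH; install it with pip or conda")
```

Inside a pixi project the bare executable may not be on `PATH`, so a `None` result there does not necessarily mean the tool is missing.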

## Workflow

### Step 1: Generate Tool-Specific QC Files

MultiQC aggregates existing output — first run your QC tools.

```bash
# FastQC on all FASTQ files
mkdir -p qc/fastqc
fastqc data/*.fastq.gz -o qc/fastqc/ -t 8

# samtools flagstat on all BAM files (quoted so paths with spaces work)
for bam in results/*.bam; do
    samtools flagstat "$bam" > "qc/$(basename "$bam" .bam).flagstat"
done
echo "QC files generated: $(find qc/ -type f | wc -l)"
```
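Before aggregating, it can help to confirm that every sample produced the expected QC files. The sketch below assumes the layout from the commands above (`qc/fastqc/<sample>_fastqc.zip` and `qc/<sample>.flagstat`) and that each FASTQ is named `<sample>.fastq.gz`; the function name `missing_qc_outputs` is hypothetical.

```python
from pathlib import Path


def missing_qc_outputs(samples, qc_dir):
    """List expected QC files that are absent for the given sample names.

    Assumes one FastQC archive and one flagstat file per sample, laid out
    as in the shell commands above. Illustrative helper, not a MultiQC API.
    """
    qc_dir = Path(qc_dir)
    missing = []
    for sample in samples:
        expected = [
            qc_dir / "fastqc" / f"{sample}_fastqc.zip",  # FastQC archive
            qc_dir / f"{sample}.flagstat",               # samtools flagstat text
        ]
        missing.extend(str(p) for p in expected if not p.is_file())
    return missing


if __name__ == "__main__":
    for path in missing_qc_outputs(["ctrl_rep1", "ctrl_rep2"], "qc"):
        print("missing:", path)
```

An empty result means every listed sample has both files; anything else is worth fixing before the MultiQC run, since a missing file simply drops that sample from the report section.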

### Step 2: Run MultiQC on a Directory

MultiQC recursively scans for recognized QC files.

```bash
# Basic run: scan current directory recursively
multiqc .

# Specify output directory and report name
multiqc . -o reports/ -n project_qc_report
# Output: reports/project_qc_report.html

# Scan specific subdirectories only
multiqc qc/fastqc/ results/star/ logs/trimming/ -o reports/
# Output: reports/multiqc_report.html (default report name)
```
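When MultiQC is launched from a Python driver script, the same invocations can be assembled as an argument list. This is a minimal sketch that uses only the `-o` and `-n` options shown above; the function name `build_multiqc_command` is hypothetical.

```python
import shlex


def build_multiqc_command(search_paths, outdir=None, report_name=None):
    """Build an argv list for a MultiQC run (hypothetical helper).

    search_paths: directories to scan; outdir maps to -o, report_name to -n.
    """
    cmd = ["multiqc", *search_paths]
    if outdir:
        cmd += ["-o", outdir]
    if report_name:
        cmd += ["-n", report_name]
    return cmd


if __name__ == "__main__":
    cmd = build_multiqc_command(
        ["qc/fastqc/", "results/star/"], "reports/", "project_qc_report"
    )
    # shlex.join quotes arguments so the printed line is safe to paste into a shell
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems and raises an error if MultiQC exits with a non-zero status.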

### Step 3: Configure Report Behavior

Use `multiqc_config.yaml` to set custom thresholds, sample naming, and module order.

```yaml
# multiqc_config.yaml — place in working directory
title: "RNA-seq QC Report — Project X"
subtitle: "Analysis date: 2026-02"
intro_text: "Quality control summary for all 48 samples."

# Sample name cleaning: truncate sample names at these strings.
# extra_fn_clean_exts adds to the built-in list; fn_clean_exts would replace it.
extra_fn_clean_exts:
  - ".fastq.gz"
  - "_R1"
  - ".sorted"

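# Thresholds for pass/warn/fail coloring in report tables.
# The column ID below is assumed; confirm it against the column in your own report.
table_cond_formatting_rules:
  percent_duplicates:
    pass:
      - lt: 30
    warn:
      - gt: 30
    fail:
      - gt: 40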

# Module run order (Trim Galore reports are parsed by the cutadapt module)
module_order:
  - fastqc
  - cutadapt
  - star
  - featurecounts
  - samtools
```

```bash
# Run with config file
multiqc . --config multiqc_config.yaml -o reports/
```
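To preview what the sample name cleaning entries in the config would do to a set of file names, a simplified model helps. The sketch below is not MultiQC's own code: it assumes truncate-style cleaning (drop the directory, then cut the name at the first occurrence of each listed string), and the function name `clean_sample_name` is hypothetical.

```python
import os


def clean_sample_name(path, clean_exts):
    """Approximate truncate-style sample name cleaning (illustrative only)."""
    name = os.path.basename(path)  # drop any directory prefix
    for ext in clean_exts:
        # Keep only the text before the first occurrence of the pattern
        name = name.split(ext, 1)[0]
    return name


if __name__ == "__main__":
    exts = [".fastq.gz", "_R1", ".sorted"]
    for path in ["data/ctrl_rep1_R1.fastq.gz", "results/treat_rep2.sorted.bam"]:
        print(path, "->", clean_sample_name(path, exts))
```

Checking names this way before a run makes it easier to spot patterns that cut too much, for example a `_R1` entry that would also truncate a sample called `batch_R10`.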

### Step 4: Use MultiQC Modules and Filters

Control which tools and samples are included.

```bash
# Run only specific modules
multiqc . --module fastqc --module samtools

# Exclude specific modules
multiqc . --exclude fastqc

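# Restrict the search to an explicit list of files (one path per line).
# Note: -n/--filename sets the report name; it does not filter inputs.
find . \( -name "*.flagstat" -o -name "*_fastqc.zip" \) > qc_files.txt
multiqc --file-list qc_files.txt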

# Ignore specific directories or files
multiqc . --ignore "tmp/" --ignore "*.bam"

# Rename samples from a two-column TSV file (current name, new name);
# sample_renames.tsv is a placeholder file name
multiqc . --replace-names sample_renames.tsv
```
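Glob patterns passed to `--ignore` are easy to get wrong, so it can be useful to preview their effect on a file listing first. The sketch below is an approximation, not MultiQC's own matching code: it assumes each pattern is tested against the full relative path and against every directory or file name in it, with a trailing slash stripped. The function name `is_ignored` is hypothetical.

```python
import fnmatch


def is_ignored(path, ignore_patterns):
    """Approximate whether a relative path would be skipped (illustrative)."""
    parts = path.strip("/").split("/")
    for pattern in ignore_patterns:
        bare = pattern.rstrip("/")  # "tmp/" should match a directory named "tmp"
        # Test the glob against the whole path and against each path component
        if fnmatch.fnmatch(path, pattern) or any(
            fnmatch.fnmatch(part, bare) for part in parts
        ):
            return True
    return False


if __name__ == "__main__":
    patterns = ["tmp/", "*.bam"]
    for path in ["tmp/a_fastqc.zip", "results/s1.bam", "qc/s1.flagstat"]:
        status = "ignored" if is_ignored(path, patterns) else "kept"
        print(f"{path}: {status}")
```

The authoritative check is still the MultiQC log itself, which reports how many files were searched and which modules found data.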

### Step 5: Export Data for Downstream Analysis

Extract machine-readable statistics from the MultiQC report.

```bash
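# Write the parsed data tables as JSON (-k/--data-format also accepts tsv, csv, yaml)
multiqc . -o reports/ --data-format json
ls reports/multiqc_data/
# multiqc_data.json, multiqc_general_stats.json, multiqc_fastqc.json, ...

# Without --data-format, the per-tool tables are tab-delimited .txt files
# (multiqc_fastqc.txt, multiqc_samtools_stats.txt, ...).
# Note: -p/--export writes static plot images, not data tables.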

# Extract general stats as pandas DataFrame
python3 - << 'EOF'
import json
import pandas as pd
with open("reports/multiqc_data/multiqc_general_stats.json") as f:
    data = json.load(f)
df = pd.DataFrame(data).T
print(df.head())
print(f"Shape: {df.shape}")
EOF
```
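Once the statistics are loaded, a common next step is flagging samples that exceed a threshold. The following sketch uses only the standard library and assumes the JSON maps each sample name to a dictionary of metrics, as the DataFrame code above implies. The sample values, the metric key `percent_duplicates`, and the function name `flag_samples` are all hypothetical; real exports may use longer, module-prefixed keys.

```python
import json

# Hypothetical excerpt shaped like multiqc_general_stats.json:
# one entry per sample, each mapping metric names to values.
EXAMPLE_JSON = """
{
  "ctrl_rep1":  {"percent_duplicates": 12.5, "percent_gc": 48.0},
  "ctrl_rep2":  {"percent_duplicates": 45.1, "percent_gc": 47.0},
  "treat_rep1": {"percent_gc": 49.0}
}
"""


def flag_samples(stats, metric, max_value):
    """Return sorted sample names whose metric exceeds max_value.

    Samples that lack the metric are skipped rather than flagged.
    """
    return sorted(
        sample
        for sample, metrics in stats.items()
        if metrics.get(metric) is not None and metrics[metric] > max_value
    )


if __name__ == "__main__":
    stats = json.loads(EXAMPLE_JSON)
    print(flag_samples(stats, "percent_duplicates", 40))
```

To run this against a real report, replace `json.loads(EXAMPLE_JSON)` with `json.load` on the exported file and use the metric keys that actually appear in it.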

### Step 6: Automate in Pipeline Scripts

Integrate MultiQC as the final step of any QC pipeline.

```bash
#!/bin/bash
# Complete RNA-seq QC pipeline → MultiQC summary
SAMPLES=(ctrl_rep1 ctrl_rep2 treat_rep1 treat_rep2)
OUTDIR="pipeline_output"
mkdir -p $OUTDIR/{fastqc,star,featurecounts,flagstat}

for sample in "${SAMPLES[@]}"; do
    # FastQC
    fastqc data/${sample}.fastq.gz -o $OUTDIR/fastqc/ -t 4
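    # STAR alignment (zcat decompresses the gzipped FASTQ; output dir must exist)
    mkdir -p "$OUTDIR/star/${sample}"
    STAR --runThreadN 8 --genomeDir refs/star_index \
         --readFilesIn data/${sample}.fastq.gz \
         --readFilesCommand zcat \
         --outSAMtype BAM SortedByCoordinate \
         --outFileNamePrefix $OUTDIR/star/${sample}/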
    # samtools flagstat
    samtools flagstat $OUTDIR/star/${sample}/Aligned.sortedByCoord.out.bam \
        > $OUTDIR/flagstat/${sample}.flagstat
done

# Final MultiQC report
multiqc $OUTDIR/ -o $OUTDIR/qc_report/ -n "full_pipeline_qc"
echo "Report ready: $OUTDIR/qc_report/full_pipeline_qc.html"
```

## Key Parameters

| Parameter | Default | Range/Options | Effect |
|-----------|---------|---------------|--------|
| `-o, --outdir` | `.` | directory path | Output directory for report and data |
| `-n, --filename` | `multiqc_report` | any string | Report filename (without extension) |
| `-m, --module` | all | tool name | Run only specified module(s) |
| `--ignore` | — | glob pattern | Ignore matching files or directories |