Skill · 286 repository stars · updated 5d ago

multiqc-qc-reports

MultiQC aggregates quality control metrics from over 150 bioinformatics tools including FastQC, samtools, STAR, HISAT2, Trim Galore, featureCounts, Kallisto, Salmon, Picard, and GATK into a single interactive HTML report with plots and statistics. Use it to review QC across multiple samples after alignment or quantification steps, identify batch effects or failed runs, and share QC summaries with collaborators.

Repository: SciAgent-Skills

Install in Claude Code

git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/multiqc-qc-reports && cp -r /tmp/multiqc-qc-reports/skills/genomics-bioinformatics/qc/multiqc-qc-reports ~/.claude/skills/multiqc-qc-reports

Then open a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# MultiQC — Multi-Sample QC Report Aggregator

## Overview

MultiQC automatically searches directories for QC log files from 150+ bioinformatics tools and aggregates statistics across all samples into a single interactive HTML report. It parses outputs from FastQC, samtools flagstat, STAR, HISAT2, Trim Galore, Salmon, Kallisto, featureCounts, Picard, GATK, and many more — eliminating the need to manually review per-sample QC files. Reports include interactive bar plots, scatter plots, heatmaps, and tables with configurable warnings and pass/fail thresholds.

## When to Use

- Reviewing QC metrics across 10+ samples at once after FastQC, alignment, or quantification
- Final QC checkpoint before differential expression or variant analysis
- Sharing QC summaries with collaborators or including in publications
- Identifying batch effects, outlier samples, or failed sequencing runs
- Combining QC from multi-step pipelines (trimming → alignment → quantification) into one view
- For initial single-sample QC exploration, use FastQC directly instead
- For custom QC metrics that do not come from standard tools, use Python/R directly; MultiQC only parses tool outputs

## Prerequisites

- **Python packages**: `multiqc`
- **Input requirements**: Output files from bioinformatics tools (FastQC `.zip`, samtools `.flagstat`, STAR `Log.final.out`, etc.) — MultiQC finds them automatically
- **Environment**: Python 3.8+

> **Check before installing**: The tool may already be available in the current environment (e.g., inside a `pixi` / `conda` env). Run `command -v multiqc` first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via `pixi run multiqc` rather than bare `multiqc`.

```bash
pip install multiqc

# Verify
multiqc --version
# MultiQC v1.25.0

# With conda (recommended for bioinformatics); Bioconda packages also need the conda-forge channel
conda install -c conda-forge -c bioconda multiqc
```
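The availability check described above can also be done from a Python driver script. The following is a minimal sketch, not part of MultiQC: the helper name `resolve_multiqc` is hypothetical, and recognising a pixi project by the presence of a `pixi.toml` file is an assumption.

```python
import shutil
from pathlib import Path


def resolve_multiqc(project_dir=".", which=shutil.which):
    """Return the argv prefix for invoking MultiQC, or None if it is not installed.

    Assumption: a pixi project is recognised by a pixi.toml file in project_dir.
    The `which` argument exists only so the lookup can be swapped out in tests.
    """
    if (Path(project_dir) / "pixi.toml").is_file() and which("pixi"):
        return ["pixi", "run", "multiqc"]
    if which("multiqc"):
        return ["multiqc"]
    return None


if __name__ == "__main__":
    prefix = resolve_multiqc()
    print(prefix if prefix else "multiqc not found; install it with pip or conda")
```

A `None` result means the install commands above are still needed.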

## Workflow

### Step 1: Generate Tool-Specific QC Files

MultiQC aggregates existing output — first run your QC tools.

```bash
# FastQC on all FASTQ files
mkdir -p qc/fastqc
fastqc data/*.fastq.gz -o qc/fastqc/ -t 8

# samtools flagstat on all BAM files
for bam in results/*.bam; do
    # Quote the paths so filenames with spaces do not break the loop
    samtools flagstat "$bam" > "qc/$(basename "$bam" .bam).flagstat"
done
echo "QC files generated: $(ls qc/ | wc -l)"
```

### Step 2: Run MultiQC on a Directory

MultiQC recursively scans for recognized QC files.

```bash
# Basic run: scan current directory recursively
multiqc .

# Specify output directory and report name
multiqc . -o reports/ -n project_qc_report

# Scan specific subdirectories only
multiqc qc/fastqc/ results/star/ logs/trimming/ -o reports/

# Output: reports/project_qc_report.html
echo "Report: reports/project_qc_report.html"
```
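Before running MultiQC it can help to confirm that the expected QC files are actually present under the directory being scanned. The sketch below is only an inventory aid: the three filename patterns are assumptions drawn from the examples in this document, and MultiQC's own search patterns cover far more tools.

```python
from pathlib import Path

# Assumed filename patterns, based on the examples in this document only.
QC_PATTERNS = {
    "fastqc": "*_fastqc.zip",
    "samtools flagstat": "*.flagstat",
    "STAR": "*Log.final.out",
}


def inventory_qc_files(root="."):
    """Count files under root (searched recursively) that match each pattern."""
    counts = {}
    for tool, pattern in QC_PATTERNS.items():
        counts[tool] = sum(1 for p in Path(root).rglob(pattern) if p.is_file())
    return counts


if __name__ == "__main__":
    for tool, n in inventory_qc_files(".").items():
        print(f"{tool}: {n} file(s)")
```

A count of zero for a tool that should have run usually points to a wrong search path rather than a MultiQC problem.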

### Step 3: Configure Report Behavior

Use `multiqc_config.yaml` to set custom thresholds, sample naming, and module order.

```yaml
# multiqc_config.yaml — place in working directory
title: "RNA-seq QC Report — Project X"
subtitle: "Analysis date: 2026-02"
intro_text: "Quality control summary for all 48 samples."

# Sample name cleaning: remove path prefixes and suffixes
fn_clean_exts:
  - ".fastq.gz"
  - "_R1"
  - ".sorted"

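# Thresholds for pass/warn/fail coloring in tables
# (the column ID below is an assumption; check the ID of the column in your own report)
table_cond_formatting_rules:
  percent_duplicates:
    pass:
      - lt: 30
    warn:
      - gt: 30
    fail:
      - gt: 40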

# Module run order
module_order:
  - fastqc
  - cutadapt  # Trim Galore reports are parsed by the cutadapt module
  - star
  - featurecounts
  - samtools
```

```bash
# Run with config file
multiqc . --config multiqc_config.yaml -o reports/
```
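To preview what the `fn_clean_exts` entries do to sample names, the cleaning can be approximated in a few lines of Python. This is a rough sketch that assumes the default truncate-style behaviour (everything from the first match onwards is dropped); the function name `clean_sample_name` is hypothetical, and MultiQC also applies its own built-in extension list, which is not reproduced here.

```python
def clean_sample_name(name, exts):
    """Truncate name at the first occurrence of each extension string, in order."""
    for ext in exts:
        if ext in name:
            name = name.split(ext)[0]
    return name


exts = [".fastq.gz", "_R1", ".sorted"]
print(clean_sample_name("ctrl_rep1_R1.fastq.gz", exts))  # ctrl_rep1
print(clean_sample_name("ctrl_rep1.sorted.bam", exts))   # ctrl_rep1
```

Both files collapse to the same sample name, which is what lets MultiQC line up FastQC and alignment metrics in one table row.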

### Step 4: Use MultiQC Modules and Filters

Control which tools and samples are included.

```bash
# Run only specific modules
multiqc . --module fastqc --module samtools

# Exclude specific modules
multiqc . --exclude fastqc

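# Restrict the search to an explicit list of files (one path per line);
# note that -n/--filename sets the report name and does not filter inputs
find . \( -name "*.flagstat" -o -name "*_fastqc.zip" \) > qc_files.txt
multiqc --file-list qc_files.txt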

# Ignore specific directories or files
multiqc . --ignore "tmp/" --ignore "*.bam"

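# Rename samples from a tab-separated file (column 1: text to find, column 2: replacement)
# sample_names.tsv is a hypothetical filename
multiqc . --replace-names sample_names.tsv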
```
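The `--replace-names` option reads a tab-separated file with two columns, the text to find and its replacement. A small sketch for generating that file from a Python mapping follows; the helper name, the sample names, and the `sample_names.tsv` filename are all hypothetical.

```python
import csv


def write_rename_table(mapping, path="sample_names.tsv"):
    """Write a two-column, tab-separated rename table: old name, new name."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(mapping.items())
    return path


if __name__ == "__main__":
    # Hypothetical mapping from sequencer IDs to readable sample names
    renames = {"sample_01": "ctrl_rep1", "sample_02": "treat_rep1"}
    print("Wrote", write_rename_table(renames))
```

The resulting file can then be passed as `multiqc . --replace-names sample_names.tsv`.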

### Step 5: Export Data for Downstream Analysis

Extract machine-readable statistics from the MultiQC report.

```bash
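# Write the data tables as JSON (options: tsv, csv, json, yaml);
# --force overwrites a previous report in the same directory
multiqc . -o reports/ --data-format json --force
# Generates: reports/multiqc_data/multiqc_data.json plus one table per tool
ls reports/multiqc_data/
# multiqc_general_stats.json, multiqc_fastqc.json, ...

# Note: --export saves the plots as static image files; the per-tool
# tables above are written without it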

# Extract general stats as pandas DataFrame
python3 - << 'EOF'
import json
import pandas as pd
with open("reports/multiqc_data/multiqc_general_stats.json") as f:
    data = json.load(f)
df = pd.DataFrame(data).T
print(df.head())
print(f"Shape: {df.shape}")
EOF
```
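Once the general statistics are loaded, failed or borderline samples can be flagged without opening the HTML report. The sketch below uses only the standard library and makes two assumptions: that the JSON maps each sample name to a dictionary of metrics, and that a duplication metric is stored under the key `percent_duplicates`. The values are invented for illustration.

```python
def flag_samples(stats, metric, warn, fail):
    """Classify each sample as pass/warn/fail for one metric (higher is worse)."""
    flags = {}
    for sample, metrics in stats.items():
        value = metrics.get(metric)
        if value is None:
            flags[sample] = "missing"
        elif value > fail:
            flags[sample] = "fail"
        elif value > warn:
            flags[sample] = "warn"
        else:
            flags[sample] = "pass"
    return flags


# Hypothetical data in the assumed {sample: {metric: value}} shape
example = {
    "ctrl_rep1": {"percent_duplicates": 12.5},
    "ctrl_rep2": {"percent_duplicates": 34.0},
    "treat_rep1": {"percent_duplicates": 55.1},
    "treat_rep2": {},
}
print(flag_samples(example, "percent_duplicates", warn=30, fail=40))
# {'ctrl_rep1': 'pass', 'ctrl_rep2': 'warn', 'treat_rep1': 'fail', 'treat_rep2': 'missing'}
```

In practice `example` would be replaced by the dictionary loaded with `json.load`, after checking the actual metric keys in the file.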

### Step 6: Automate in Pipeline Scripts

Integrate MultiQC as the final step of any QC pipeline.

```bash
#!/bin/bash
# Complete RNA-seq QC pipeline → MultiQC summary
set -euo pipefail  # stop on the first failed command or unset variable

SAMPLES=(ctrl_rep1 ctrl_rep2 treat_rep1 treat_rep2)
OUTDIR="pipeline_output"
mkdir -p "$OUTDIR"/{fastqc,star,featurecounts,flagstat}

for sample in "${SAMPLES[@]}"; do
    # FastQC
    fastqc "data/${sample}.fastq.gz" -o "$OUTDIR/fastqc/" -t 4
    # STAR alignment (gzipped input needs --readFilesCommand zcat)
    mkdir -p "$OUTDIR/star/${sample}"
    STAR --runThreadN 8 --genomeDir refs/star_index \
         --readFilesIn "data/${sample}.fastq.gz" \
         --readFilesCommand zcat \
         --outSAMtype BAM SortedByCoordinate \
         --outFileNamePrefix "$OUTDIR/star/${sample}/"
    # samtools flagstat
    samtools flagstat "$OUTDIR/star/${sample}/Aligned.sortedByCoord.out.bam" \
        > "$OUTDIR/flagstat/${sample}.flagstat"
done

# Final MultiQC report
multiqc "$OUTDIR/" -o "$OUTDIR/qc_report/" -n "full_pipeline_qc"
echo "Report ready: $OUTDIR/qc_report/full_pipeline_qc.html"
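When the pipeline is driven from Python rather than bash, the final MultiQC call can be assembled as an argument list and run with `subprocess`. This is a sketch under stated assumptions: the helper names `build_multiqc_cmd` and `run_multiqc` are hypothetical, and only the `-o`, `-n`, `--config`, and `--force` options are used.

```python
import shutil
import subprocess


def build_multiqc_cmd(search_dirs, outdir, name=None, config=None, force=False):
    """Assemble the MultiQC command line as a list of arguments."""
    cmd = ["multiqc", *search_dirs, "-o", outdir]
    if name:
        cmd += ["-n", name]
    if config:
        cmd += ["--config", config]
    if force:
        cmd.append("--force")  # overwrite an existing report
    return cmd


def run_multiqc(cmd):
    """Run the command, failing early if the executable is not on PATH."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH")
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    cmd = build_multiqc_cmd(["pipeline_output/"], "pipeline_output/qc_report/",
                            name="full_pipeline_qc")
    print(" ".join(cmd))  # call run_multiqc(cmd) to execute it
```

Passing a list instead of a shell string avoids quoting problems with paths that contain spaces.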

## Key Parameters

| Parameter | Default | Range/Options | Effect |
|-----------|---------|---------------|--------|
| `-o, --outdir` | `.` | directory path | Output directory for report and data |
| `-n, --filename` | `multiqc_report` | any string | Report filename (without extension) |
| `-m, --module` | all | tool name | Run only specified module(s) |
| `--ignore` | none | glob pattern | Ignore matching files or directories |
