fastp-fastq-preprocessing
All-in-one FASTQ QC and adapter trimming. Auto-detects Illumina adapters, filters low-quality reads, corrects paired-end overlaps, emits HTML+JSON QC in one pass. 3-10x faster than Trim Galore/Trimmomatic. First step before STAR, BWA-MEM2, or Salmon.
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/fastp-fastq-preprocessing && cp -r /tmp/fastp-fastq-preprocessing/skills/genomics-bioinformatics/qc/fastp-fastq-preprocessing ~/.claude/skills/fastp-fastq-preprocessing
SKILL.md
# fastp — Fast FASTQ Quality Control and Adapter Trimming
## Overview
fastp performs adapter trimming, quality filtering, and QC reporting for Illumina FASTQ files in a single multi-threaded pass. For paired-end data it detects adapter sequences automatically from read overlaps, so adapters usually do not need to be specified manually. fastp also corrects mismatches in paired-end overlap regions (with `--correction`), filters reads by quality score and length, removes polyX tails such as polyA in RNA-seq (with `--trim_poly_x`), and generates an interactive HTML report and a machine-readable JSON report. It is reported to run 3–10× faster than Trim Galore and Trimmomatic with comparable or better results, which has made it a common preprocessing step before alignment in WGS, RNA-seq, and ChIP-seq pipelines.
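To make the overlap idea concrete, the following is a minimal Python sketch of how an adapter can be located without knowing its sequence. It is an illustration under simplifying assumptions (exact matching by default, an assumed `min_overlap` of 30 bases), not fastp's actual implementation; the function names `reverse_complement`, `find_insert_length`, and `trim_by_overlap` are invented for this example.

```python
# Illustrative sketch only: fastp's real overlap analysis is more elaborate.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_insert_length(r1, r2, min_overlap=30, max_mismatches=0):
    """Return the insert length if it is shorter than the reads, else None."""
    rc2 = reverse_complement(r2)
    longest = min(len(r1), len(rc2)) - 1
    for n in range(longest, min_overlap - 1, -1):
        # If the insert has n bases, R1 starts with the insert and the
        # reverse complement of R2 ends with it.
        mismatches = sum(a != b for a, b in zip(r1[:n], rc2[len(rc2) - n:]))
        if mismatches <= max_mismatches:
            return n
    return None


def trim_by_overlap(r1, r2, min_overlap=30, max_mismatches=0):
    """Cut both reads at the insert boundary; everything after it is adapter."""
    n = find_insert_length(r1, r2, min_overlap, max_mismatches)
    if n is None:
        return r1, r2  # no short-insert overlap found, leave reads untouched
    return r1[:n], r2[:n]
```

When the insert is shorter than the read length, both reads run past the insert into the adapter, and the point where R1 and the reverse complement of R2 stop agreeing marks the adapter boundary. This is why no adapter sequence has to be supplied for paired-end data.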
## When to Use
- Trimming Illumina adapters and low-quality bases before alignment in any NGS pipeline (RNA-seq, WGS, WES, ChIP-seq, ATAC-seq)
- Generating per-sample QC reports (HTML + JSON) as the first step of a pipeline, before MultiQC aggregation
- Processing paired-end reads where adapter auto-detection from overlap is preferred over manual adapter specification
- Removing polyA tails from RNA-seq reads from 3′ end-enriched protocols (Smart-seq, QuantSeq)
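- Extracting UMIs into read names (`--umi`) or splitting output into multiple files (`--split`) for parallel downstream processing; fastp does not demultiplex samples by index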
- Use **Trim Galore** as an alternative when a FastQC per-base quality report is required alongside trimming (Trim Galore can run FastQC on the trimmed output)
- Use **Trimmomatic** as an alternative for fine-grained control of sliding-window trimming steps
## Prerequisites
- **Software**: fastp (conda or pre-compiled binary)
- **Input**: raw Illumina FASTQ files (single-end or paired-end, .fastq or .fastq.gz)
> **Check before installing**: The tool may already be available in the current environment (e.g., inside a `pixi` / `conda` env). Run `command -v fastp` first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via `pixi run fastp` rather than bare `fastp`.
```bash
# Install with conda
conda install -c bioconda fastp
# Or download pre-compiled binary (Linux)
wget https://github.com/OpenGene/fastp/releases/download/v0.24.0/fastp
chmod +x fastp
./fastp --version   # expected output: fastp 0.24.0
# Verify a conda installation
fastp --version
```
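For pipelines driven from Python, the availability check described in the note above can be sketched as follows. This is a minimal example, not part of fastp: the helper name `resolve_fastp_command` and its `in_pixi_project` flag are hypothetical, and the `which` parameter exists only so the lookup can be replaced in tests.

```python
import shutil
from typing import Callable, List, Optional


def resolve_fastp_command(
    in_pixi_project: bool = False,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[List[str]]:
    """Return the argv prefix for invoking fastp, or None if it is not installed."""
    # Inside a pixi project, prefer `pixi run fastp` over the bare executable.
    if in_pixi_project and which("pixi"):
        return ["pixi", "run", "fastp"]
    # Otherwise use fastp directly when it is on PATH.
    if which("fastp"):
        return ["fastp"]
    return None


if __name__ == "__main__":
    prefix = resolve_fastp_command()
    if prefix is None:
        print("fastp not found; install it with: conda install -c bioconda fastp")
    else:
        print("fastp available via:", " ".join(prefix))
```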
## Quick Start
```bash
# Paired-end adapter trimming with QC report
fastp \
-i sample_R1.fastq.gz \
-I sample_R2.fastq.gz \
-o sample_R1.trimmed.fastq.gz \
-O sample_R2.trimmed.fastq.gz \
-h sample_qc.html \
-j sample_qc.json \
--thread 8
echo "Trimmed reads in: sample_R1.trimmed.fastq.gz"
```
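When the Quick Start command is launched from a Python workflow, building the argument list explicitly avoids shell-quoting problems. The sketch below uses only the flags shown above; the helper names `build_fastp_pe_command` and `run_fastp`, and the output naming scheme derived from `sample`, are assumptions made for this example.

```python
import subprocess
from pathlib import Path
from typing import List


def build_fastp_pe_command(
    r1: str, r2: str, sample: str, out_dir: str = ".", threads: int = 8
) -> List[str]:
    """Assemble the paired-end fastp command from the Quick Start as an argv list."""
    out = Path(out_dir)
    return [
        "fastp",
        "-i", r1,
        "-I", r2,
        "-o", str(out / f"{sample}_R1.trimmed.fastq.gz"),
        "-O", str(out / f"{sample}_R2.trimmed.fastq.gz"),
        "-h", str(out / f"{sample}_qc.html"),
        "-j", str(out / f"{sample}_qc.json"),
        "--thread", str(threads),
    ]


def run_fastp(cmd: List[str]) -> None:
    """Run fastp; check=True raises CalledProcessError on a non-zero exit status."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    cmd = build_fastp_pe_command("sample_R1.fastq.gz", "sample_R2.fastq.gz", "sample")
    print(" ".join(cmd))  # inspect the command before calling run_fastp(cmd)
```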
## Workflow
### Step 1: Single-End Adapter Trimming
Run fastp on single-end FASTQ with automatic adapter detection.
```bash
# Single-end with auto adapter detection
fastp \
-i sample.fastq.gz \
-o sample.trimmed.fastq.gz \
-h sample_qc.html \
-j sample_qc.json \
--thread 8 \
--qualified_quality_phred 20 \
--length_required 36
echo "Input reads: $(zcat sample.fastq.gz | wc -l | awk '{print $1/4}')"
echo "Output reads: $(zcat sample.trimmed.fastq.gz | wc -l | awk '{print $1/4}')"
```
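The read counts printed by the `zcat | wc -l` lines above rely on each FASTQ record occupying exactly four lines. A portable Python equivalent, which also works where `zcat` does not accept `.gz` files, might look like this; the function name `count_fastq_reads` is hypothetical.

```python
import gzip


def count_fastq_reads(path: str) -> int:
    """Count records in a FASTQ file, assuming four lines per record."""
    # Open gzip-compressed and plain-text FASTQ files transparently.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4:
        raise ValueError(f"{path}: line count {n_lines} is not a multiple of 4")
    return n_lines // 4
```

Comparing `count_fastq_reads("sample.fastq.gz")` with `count_fastq_reads("sample.trimmed.fastq.gz")` gives the same before/after figures as the shell commands.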
### Step 2: Paired-End Adapter Trimming
Process paired-end FASTQ files with overlap-based adapter detection and correction.
```bash
# Paired-end with overlap-based adapter auto-detection
fastp \
-i sample_R1.fastq.gz \
-I sample_R2.fastq.gz \
-o sample_R1.trimmed.fastq.gz \
-O sample_R2.trimmed.fastq.gz \
-h sample_qc.html \
-j sample_qc.json \
--thread 8 \
--correction \
--detect_adapter_for_pe \
--qualified_quality_phred 20 \
--length_required 36
# Specify adapters explicitly (if auto-detection fails)
# fastp -i R1.fq.gz -I R2.fq.gz \
# --adapter_sequence AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
# --adapter_sequence_r2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
# -o R1.out.fq.gz -O R2.out.fq.gz
```
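The idea behind `--correction` can be illustrated with a short sketch: within the overlapping region, a mismatch is resolved in favour of the base with high quality when the other base has very low quality. The code below is a simplified illustration, not fastp's implementation. The thresholds `good_qual=30` and `bad_qual=14` and the function name `correct_overlap` are assumptions for this example, and the R2 segment is expected to be already reverse-complemented (with its qualities reversed) so that positions line up.

```python
from typing import List, Tuple


def correct_overlap(
    seq1: str,
    qual1: List[int],
    seq2_rc: str,
    qual2_rc: List[int],
    good_qual: int = 30,  # assumed threshold for a trustworthy base
    bad_qual: int = 14,   # assumed threshold for an untrustworthy base
) -> Tuple[str, str, int]:
    """Correct mismatches between two aligned overlap segments of equal length."""
    if not (len(seq1) == len(qual1) == len(seq2_rc) == len(qual2_rc)):
        raise ValueError("sequences and quality lists must have equal length")
    out1, out2 = list(seq1), list(seq2_rc)
    corrected = 0
    for i, (b1, b2) in enumerate(zip(seq1, seq2_rc)):
        if b1 == b2:
            continue
        if qual1[i] >= good_qual and qual2_rc[i] <= bad_qual:
            out2[i] = b1  # trust read 1 at this position
            corrected += 1
        elif qual2_rc[i] >= good_qual and qual1[i] <= bad_qual:
            out1[i] = b2  # trust read 2 at this position
            corrected += 1
        # Otherwise the mismatch is ambiguous and both bases are left unchanged.
    return "".join(out1), "".join(out2), corrected
```

Leaving ambiguous mismatches untouched is the conservative choice: a correction is made only when the evidence clearly favours one read.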
### Step 3: Quality Filtering and Read Length Trimming
Configure quality and length thresholds for stricter or more lenient filtering.
```bash
# Strict quality filtering (e.g., for variant calling)
fastp \
-i sample_R1.fastq.gz \
-I sample_R2.fastq.gz \
-o sample_R1.filtered.fastq.gz \
-O sample_R2.filtered.fastq.gz \
-h sample_qc.html \
-j sample_qc.json \
--thread 8 \
--qualified_quality_phred 25 \
--unqualified_percent_limit 20 \
--length_required 50 \
--max_len1 150 \
--max_len2 150 \
--low_complexity_filter \
--complexity_threshold 30
echo "Filtering complete. Check sample_qc.html for pass/fail rates."
```
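To see how the thresholds in this step interact, the per-read rules can be sketched in Python. This is a simplified model rather than fastp's code: it assumes Phred+33 quality encoding, ignores other filters such as the N-base limit, and uses hypothetical function names. The default values mirror the strict settings in the command above.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(ch) - offset for ch in quality]


def passes_quality_filter(
    quality: str,
    qualified_phred: int = 25,            # --qualified_quality_phred
    unqualified_percent_limit: int = 20,  # --unqualified_percent_limit
    length_required: int = 50,            # --length_required
) -> bool:
    """Return True if the read is long enough and has few enough low-quality bases."""
    if not quality or len(quality) < length_required:
        return False
    low = sum(q < qualified_phred for q in phred_scores(quality))
    return low * 100 / len(quality) <= unqualified_percent_limit


def complexity_percent(seq: str) -> float:
    """Percentage of bases that differ from the next base in the read."""
    if len(seq) < 2:
        return 0.0
    changes = sum(a != b for a, b in zip(seq, seq[1:]))
    return changes * 100 / (len(seq) - 1)
```

Under this model, a read passes `--low_complexity_filter` with `--complexity_threshold 30` when `complexity_percent(seq)` is at least 30, so homopolymer-rich reads score low and are discarded.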
### Step 4: RNA-seq polyA Tail Removal
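Remove polyA tails from reads produced by 3′-enriched RNA-seq protocols before alignment. `--trim_poly_x` trims a homopolymer tail of any base (polyA included) at the 3′ end once it reaches `--poly_x_min_len` bases.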
```bash
# Remove polyA tails (QuantSeq 3′ mRNA-seq)
fastp \
-i quantseq_R1.fastq.gz \
-o quantseq_R1.trimmed.fastq.gz \
-h quantseq_qc.html \
-j quantseq_qc.json \
--thread 8 \
--trim_poly_x \
--poly_x_min_len 10 \
--qualified_quality_phred 20 \
--length_required 25
# For Smart-seq2 paired-end with polyA
fastp \
-i smartseq_R1.fastq.gz \
-I smartseq_R2.fastq.gz \
-o smartseq_R1.trimmed.fastq.gz \
-O smartseq_R2.trimmed.fastq.gz \
--trim_poly_x --poly_x_min_len 10 \
--thread 8 \
-h smartseq_qc.html -j smartseq_qc.json
```
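The effect of `--trim_poly_x --poly_x_min_len 10` can be approximated with a few lines of Python. This sketch handles only an exact homopolymer run at the 3′ end, whereas fastp's own detection is more tolerant, and a full implementation would trim the quality string by the same amount. The function name `trim_poly_x` is chosen here for readability and is not a fastp API.

```python
def trim_poly_x(seq: str, min_len: int = 10) -> str:
    """Remove a 3' homopolymer tail if it is at least min_len bases long."""
    if not seq:
        return seq
    tail_base = seq[-1]
    # Length of the run of identical bases at the end of the read.
    run = len(seq) - len(seq.rstrip(tail_base))
    return seq[: len(seq) - run] if run >= min_len else seq


if __name__ == "__main__":
    print(trim_poly_x("ACGTTGCA" + "A" * 15))  # tail is long enough to be removed
    print(trim_poly_x("ACGTTGCAAA"))           # tail is too short, read unchanged
```

After trimming, short remnants are removed by `--length_required`, which is why the QuantSeq example above pairs polyX trimming with a minimum length of 25.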
### Step 5: Parse QC Report JSON for Pipeline Monitoring
Extract key QC metrics from fastp's JSON output for automated quality gates.
```python
import json


def parse_fastp_json(json_path: str) -> dict:
    """Extract headline QC metrics from a fastp JSON report."""
    with open(json_path) as f:
        data = json.load(f)
    before = data["summary"]["before_filtering"]
    after = data["summary"]["after_filtering"]
    return {
        "total_reads_in": before["total_reads"],
        "total_reads_out": after["total_reads"],
        # Percentage of input reads that survived filtering
        "pct_passed": after["total_reads"] / before["total_reads"] * 100,
        # q30_rate is stored as a fraction; convert to a percentage
        "q30_rate_before": before["q30_rate"] * 100,
        "q30_rate_after": after["q30_rate"] * 100,
        "mean_len_before": before["read1_mean_length"],
        # Assumed to mirror mean_len_before
        "mean_len_after": after["read1_mean_length"],
    }
```