fastp-fastq-preprocessing

All-in-one FASTQ QC and adapter trimming. Auto-detects Illumina adapters, filters low-quality reads, corrects paired-end overlaps, emits HTML+JSON QC in one pass. 3-10x faster than Trim Galore/Trimmomatic. First step before STAR, BWA-MEM2, or Salmon.

Install in Claude Code
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/fastp-fastq-preprocessing && cp -r /tmp/fastp-fastq-preprocessing/skills/genomics-bioinformatics/qc/fastp-fastq-preprocessing ~/.claude/skills/fastp-fastq-preprocessing
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# fastp — Fast FASTQ Quality Control and Adapter Trimming

## Overview

fastp performs adapter trimming, quality filtering, and QC reporting for Illumina FASTQ files in a single multi-threaded pass. For paired-end data it finds adapters from read-pair overlaps, and for single-end data it infers them from the read tails, so adapters rarely need to be specified manually. fastp also corrects mismatches in paired-end overlap regions, filters reads by quality score and length, removes polyX tails (polyA for RNA-seq), and generates interactive HTML and machine-readable JSON QC reports. It is reported to run 3–10× faster than Trim Galore and Trimmomatic with comparable or better results, which has made it a common first preprocessing step before alignment in WGS, RNA-seq, and ChIP-seq pipelines.

## When to Use

- Trimming Illumina adapters and low-quality bases before alignment in any NGS pipeline (RNA-seq, WGS, WES, ChIP-seq, ATAC-seq)
- Generating per-sample QC reports (HTML + JSON) as the first step of a pipeline, before MultiQC aggregation
- Processing paired-end reads where adapter auto-detection from overlap is preferred over manual adapter specification
- Removing polyA tails from RNA-seq reads from 3′ end-enriched protocols (Smart-seq, QuantSeq)
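- Moving UMIs from reads into read names (`--umi`) or splitting output into multiple files (`--split`) for parallel downstream steps; fastp does not demultiplex samples by index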
- Use **Trim Galore** as an alternative when FastQC's detailed per-base quality report is required alongside trimming
- Use **Trimmomatic** as an alternative for fine-grained control of sliding-window trimming steps

## Prerequisites

- **Software**: fastp (conda or pre-compiled binary)
- **Input**: raw Illumina FASTQ files (single-end or paired-end, .fastq or .fastq.gz)

> **Check before installing**: The tool may already be available in the current environment (e.g., inside a `pixi` / `conda` env). Run `command -v fastp` first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via `pixi run fastp` rather than bare `fastp`.

```bash
# Install with conda
conda install -c bioconda fastp

# Or download pre-compiled binary (Linux)
wget https://github.com/OpenGene/fastp/releases/download/v0.24.0/fastp
chmod +x fastp
./fastp --version
# fastp 0.24.0

# Verify the conda install (or a binary that has been placed on PATH)
fastp --version
```
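
For scripted pipelines, the availability check described above can be sketched in Python. This is a minimal illustration, not part of fastp: `resolve_fastp_command` is a hypothetical helper, and treating a `pixi.toml` file in the project directory as the sign of a pixi project is an assumption.

```python
import shutil
from pathlib import Path
from typing import Callable, List, Optional


def resolve_fastp_command(
    project_dir: str = ".",
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the argv prefix for invoking fastp (hypothetical helper).

    Assumption: a pixi project is recognised by a pixi.toml file in
    project_dir. Inside such a project, `pixi run fastp` is preferred
    over a bare `fastp`.
    """
    in_pixi_project = (Path(project_dir) / "pixi.toml").is_file()
    if in_pixi_project and which("pixi"):
        return ["pixi", "run", "fastp"]
    if which("fastp"):
        return ["fastp"]
    raise FileNotFoundError("fastp not found; install it with conda or a binary")


if __name__ == "__main__":
    try:
        print(resolve_fastp_command())
    except FileNotFoundError as exc:
        print(f"install needed: {exc}")
```

The `which` parameter exists only so the lookup can be swapped out in tests; normal callers can rely on the default.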

## Quick Start

```bash
# Paired-end adapter trimming with QC report
fastp \
    -i sample_R1.fastq.gz \
    -I sample_R2.fastq.gz \
    -o sample_R1.trimmed.fastq.gz \
    -O sample_R2.trimmed.fastq.gz \
    -h sample_qc.html \
    -j sample_qc.json \
    --thread 8

echo "Trimmed reads in: sample_R1.trimmed.fastq.gz"
```

## Workflow

### Step 1: Single-End Adapter Trimming

Run fastp on single-end FASTQ with automatic adapter detection.

```bash
# Single-end with auto adapter detection
fastp \
    -i sample.fastq.gz \
    -o sample.trimmed.fastq.gz \
    -h sample_qc.html \
    -j sample_qc.json \
    --thread 8 \
    --qualified_quality_phred 20 \
    --length_required 36

# gzip -dc is used instead of zcat because zcat on macOS does not read .gz files
echo "Input reads:   $(gzip -dc sample.fastq.gz | wc -l | awk '{print $1/4}')"
echo "Output reads:  $(gzip -dc sample.trimmed.fastq.gz | wc -l | awk '{print $1/4}')"
```
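
Where a shell is not available, the same before/after read count can be sketched in Python. `count_fastq_reads` is a hypothetical helper, and it assumes well-formed FASTQ with exactly four lines per record, the same assumption the shell one-liner makes.

```python
import gzip
from pathlib import Path


def count_fastq_reads(path: str) -> int:
    """Count FASTQ records, assuming exactly 4 lines per record."""
    # Plain and gzip-compressed files are both read in text mode
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


if __name__ == "__main__":
    # File names follow the single-end example; missing files are skipped
    for name in ("sample.fastq.gz", "sample.trimmed.fastq.gz"):
        if Path(name).exists():
            print(f"{name}: {count_fastq_reads(name)} reads")
```

Unlike the shell version, this raises an error on a file whose line count is not a multiple of four, which can catch a truncated download early.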

### Step 2: Paired-End Adapter Trimming

Process paired-end FASTQ files with overlap-based adapter detection and correction.

```bash
# Paired-end with overlap-based adapter auto-detection
fastp \
    -i sample_R1.fastq.gz \
    -I sample_R2.fastq.gz \
    -o sample_R1.trimmed.fastq.gz \
    -O sample_R2.trimmed.fastq.gz \
    -h sample_qc.html \
    -j sample_qc.json \
    --thread 8 \
    --correction \
    --detect_adapter_for_pe \
    --qualified_quality_phred 20 \
    --length_required 36

# Specify adapters explicitly (if auto-detection fails)
# fastp -i R1.fq.gz -I R2.fq.gz \
#   --adapter_sequence AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
#   --adapter_sequence_r2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
#   -o R1.out.fq.gz -O R2.out.fq.gz
```
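
When the paired-end command has to be run across many samples, it can help to assemble the argument list programmatically. The sketch below is one possible way to do that: `build_fastp_pe_command` and the `<sample>_R1.fastq.gz` naming scheme are assumptions taken from the example file names above, and only flags already shown in this step are used. It falls back to explicit adapters when both sequences are supplied.

```python
import shlex
from typing import List, Optional


def build_fastp_pe_command(
    sample: str,
    threads: int = 8,
    adapter_r1: Optional[str] = None,
    adapter_r2: Optional[str] = None,
) -> List[str]:
    """Build the paired-end fastp argv for one sample (hypothetical helper)."""
    cmd = [
        "fastp",
        "-i", f"{sample}_R1.fastq.gz",
        "-I", f"{sample}_R2.fastq.gz",
        "-o", f"{sample}_R1.trimmed.fastq.gz",
        "-O", f"{sample}_R2.trimmed.fastq.gz",
        "-h", f"{sample}_qc.html",
        "-j", f"{sample}_qc.json",
        "--thread", str(threads),
        "--correction",
        "--qualified_quality_phred", "20",
        "--length_required", "36",
    ]
    if adapter_r1 and adapter_r2:
        # Explicit adapters replace auto-detection
        cmd += ["--adapter_sequence", adapter_r1,
                "--adapter_sequence_r2", adapter_r2]
    else:
        cmd.append("--detect_adapter_for_pe")
    return cmd


if __name__ == "__main__":
    # Sample names are placeholders; this prints commands without running them
    for sample in ("sampleA", "sampleB"):
        print(shlex.join(build_fastp_pe_command(sample)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; building a list rather than a shell string avoids quoting problems with unusual sample names.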

### Step 3: Quality Filtering and Read Length Trimming

Configure quality and length thresholds for stricter or more lenient filtering.

```bash
# Strict quality filtering (e.g., for variant calling)
fastp \
    -i sample_R1.fastq.gz \
    -I sample_R2.fastq.gz \
    -o sample_R1.filtered.fastq.gz \
    -O sample_R2.filtered.fastq.gz \
    -h sample_qc.html \
    -j sample_qc.json \
    --thread 8 \
    --qualified_quality_phred 25 \
    --unqualified_percent_limit 20 \
    --length_required 50 \
    --max_len1 150 \
    --max_len2 150 \
    --low_complexity_filter \
    --complexity_threshold 30

echo "Filtering complete. Check sample_qc.html for pass/fail rates."
```
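
To make the thresholds concrete, the sketch below applies the documented meaning of `--qualified_quality_phred`, `--unqualified_percent_limit`, and `--length_required` to a single quality string. It is an illustration under stated assumptions, not fastp's implementation: it assumes Phred+33 encoding, ignores trimming and the complexity filter, and both function names are hypothetical.

```python
def unqualified_percent(quality: str, qualified_phred: int = 25,
                        offset: int = 33) -> float:
    """Percentage of bases whose Phred score is below qualified_phred."""
    if not quality:
        return 0.0
    low = sum(1 for ch in quality if ord(ch) - offset < qualified_phred)
    return 100.0 * low / len(quality)


def passes_quality_filter(quality: str, qualified_phred: int = 25,
                          unqualified_percent_limit: float = 20,
                          length_required: int = 50) -> bool:
    """Simplified pass/fail rule for one read (illustrative only)."""
    if len(quality) < length_required:
        return False  # too short
    return unqualified_percent(quality, qualified_phred) <= unqualified_percent_limit


if __name__ == "__main__":
    # 'I' encodes Q40 and '#' encodes Q2 under Phred+33
    good = "I" * 80 + "#" * 20   # 20% low-quality bases, at the limit
    bad = "I" * 79 + "#" * 21    # 21% low-quality bases, over the limit
    print(passes_quality_filter(good), passes_quality_filter(bad))
```

Raising `qualified_phred` or lowering `unqualified_percent_limit` both make the filter stricter, which is the direction taken in the variant-calling example above.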

### Step 4: RNA-seq polyA Tail Removal

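Remove polyA tails from reads produced by 3′-enriched RNA-seq protocols before alignment. Because `--trim_poly_x` trims any homopolymer run at the 3′ end, it also covers polyA.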

```bash
# Remove polyA tails (QuantSeq 3′ mRNA-seq)
fastp \
    -i quantseq_R1.fastq.gz \
    -o quantseq_R1.trimmed.fastq.gz \
    -h quantseq_qc.html \
    -j quantseq_qc.json \
    --thread 8 \
    --trim_poly_x \
    --poly_x_min_len 10 \
    --qualified_quality_phred 20 \
    --length_required 25

# For Smart-seq2 paired-end with polyA
fastp \
    -i smartseq_R1.fastq.gz \
    -I smartseq_R2.fastq.gz \
    -o smartseq_R1.trimmed.fastq.gz \
    -O smartseq_R2.trimmed.fastq.gz \
    --trim_poly_x --poly_x_min_len 10 \
    --thread 8 \
    -h smartseq_qc.html -j smartseq_qc.json
```
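
The effect of `--poly_x_min_len` can be pictured with a deliberately simplified sketch: a trailing run of one repeated base is removed only if it is at least the minimum length. `trim_poly_x_tail` is a hypothetical function that matches exact runs only; fastp's own detection is understood to tolerate occasional mismatches, which this sketch does not model.

```python
def trim_poly_x_tail(seq: str, min_len: int = 10) -> str:
    """Remove a trailing homopolymer run of at least min_len bases.

    Simplification: only an exact run of the final base is considered.
    """
    if not seq:
        return seq
    last = seq[-1]
    # Length of the run of `last` at the 3' end of the read
    run = len(seq) - len(seq.rstrip(last))
    return seq[:-run] if run >= min_len else seq


if __name__ == "__main__":
    print(trim_poly_x_tail("ACGTACGT" + "A" * 12))  # tail of 12 is trimmed
    print(trim_poly_x_tail("ACGTACGT" + "A" * 4))   # tail of 4 is kept
```

A read that is shortened this way may then fall below `--length_required`, which is why the QuantSeq example pairs polyX trimming with a length threshold.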

### Step 5: Parse QC Report JSON for Pipeline Monitoring

Extract key QC metrics from fastp's JSON output for automated quality gates.
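
A quality gate of this kind needs only the `summary` section of the report. The sketch below shows one possible shape for it; `qc_gate_failures` is a hypothetical helper, both thresholds are placeholder values rather than recommendations, and the numbers in the demo dictionary are made up for illustration.

```python
from typing import Dict, List


def qc_gate_failures(summary: Dict[str, dict],
                     min_pct_passed: float = 70.0,
                     min_q30_after: float = 0.85) -> List[str]:
    """Return human-readable reasons a sample fails the gate (empty if it passes).

    Thresholds are hypothetical defaults; tune them per assay.
    """
    before = summary["before_filtering"]
    after = summary["after_filtering"]
    if before["total_reads"] == 0:
        return ["no input reads"]

    failures = []
    pct_passed = 100.0 * after["total_reads"] / before["total_reads"]
    if pct_passed < min_pct_passed:
        failures.append(f"only {pct_passed:.1f}% of reads passed filtering")
    # q30_rate is stored as a fraction between 0 and 1
    if after["q30_rate"] < min_q30_after:
        failures.append(f"Q30 rate after filtering is {after['q30_rate']:.2f}")
    return failures


if __name__ == "__main__":
    # Made-up values with the same keys the fastp summary uses
    example = {
        "before_filtering": {"total_reads": 1000, "q30_rate": 0.88},
        "after_filtering": {"total_reads": 950, "q30_rate": 0.93},
    }
    print(qc_gate_failures(example) or "PASS")
```

Returning a list of reasons rather than a boolean makes it easier to log why a sample was rejected. The report parser that follows extracts the same fields into a flat dictionary.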

```python
import json
from pathlib import Path

def parse_fastp_json(json_path: str) -> dict:
    """Extract headline QC metrics from a fastp JSON report."""
    with Path(json_path).open() as f:
        data = json.load(f)

    before = data["summary"]["before_filtering"]
    after = data["summary"]["after_filtering"]

    # Guard against empty input so the pass rate never divides by zero
    reads_in = before["total_reads"]
    pct_passed = after["total_reads"] / reads_in * 100 if reads_in else 0.0

    return {
        "total_reads_in":  reads_in,
        "total_reads_out": after["total_reads"],
        "pct_passed":      pct_passed,
        "q30_rate_before": before["q30_rate"] * 100,  # fraction -> percent
        "q30_rate_after":  after["q30_rate"] * 100,
        "mean_len_before": before["read1_mean_length"],
        "mean_len_after":  after["read1_mean_length"],
    }
```