salmon-rna-quantification
Ultra-fast RNA-seq transcript/gene quantification via quasi-mapping (no BAM). Builds a k-mer index from transcriptome FASTA, quantifies in minutes. Outputs TPM/count tables (quant.sf) with optional GC- and sequence-bias correction. Integrates with tximeta/tximport for DESeq2/edgeR. Use STAR when a genome-aligned BAM is needed.
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/salmon-rna-quantification && cp -r /tmp/salmon-rna-quantification/skills/genomics-bioinformatics/rnaseq/salmon-rna-quantification ~/.claude/skills/salmon-rna-quantification
# Salmon — Fast RNA-seq Quantification
## Overview
Salmon quantifies transcript abundance from RNA-seq reads using quasi-mapping: reads are matched to a k-mer index of the transcriptome without full genome alignment. This makes Salmon roughly 20–50× faster than alignment-based workflows while still producing accurate TPM and estimated count values. Salmon models the fragment length distribution automatically, and corrects for sequence-specific bias (`--seqBias`) and GC-content bias (`--gcBias`) when those flags are set. The per-sample `quant.sf` files are imported with `tximeta`/`tximport` (R) for DESeq2 or edgeR, or assembled into a count matrix for `pydeseq2` (Python). For improved accuracy, decoy-aware indexing adds the full genome as decoy sequence so that spurious quasi-mappings can be identified and discarded.
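To make the relationship between the two reported quantities concrete, the sketch below applies the standard TPM definition (reads per effective base, rescaled to sum to one million) to estimated counts and effective lengths. The transcript names and numbers are made up for illustration, and `tpm_from_counts` is a hypothetical helper, not part of Salmon.

```python
def tpm_from_counts(num_reads, effective_lengths):
    """Convert estimated read counts to TPM using effective lengths."""
    # Reads per effective base for each transcript.
    rates = [n / length for n, length in zip(num_reads, effective_lengths)]
    total = sum(rates)
    if total == 0:
        raise ValueError("no reads assigned to any transcript")
    # Rescale so that the values sum to one million.
    return [rate / total * 1e6 for rate in rates]


# Toy values standing in for the NumReads and EffectiveLength columns.
names = ["tx1", "tx2", "tx3"]
num_reads = [100.0, 100.0, 50.0]
effective_lengths = [1000.0, 2000.0, 500.0]

tpm = tpm_from_counts(num_reads, effective_lengths)
for name, value in zip(names, tpm):
    print(f"{name}\t{value:.1f}")
```

Note that `tx1` and `tx2` have the same estimated count but different TPM, because TPM normalizes by effective length before rescaling.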
## When to Use
- Performing fast RNA-seq quantification when you do not need a genome-aligned BAM file
- Running large-scale RNA-seq studies where alignment speed is a bottleneck (Salmon is 20–50× faster than STAR + featureCounts)
- Computing TPM and estimated counts from bulk RNA-seq for differential expression with DESeq2 or edgeR
- Correcting for GC-content bias and sequence-specific bias with `--gcBias --seqBias` (the fragment length distribution is modeled without an extra flag)
- Estimating transcript-level uncertainty via bootstrap resampling with `--numBootstraps`
- Use **STAR** instead when you need a genome-aligned BAM for downstream tools (variant calling, deeptools, IGV visualization)
- Use **Kallisto** as an alternative for similar speed; Salmon provides better bias correction and decoy-aware indexing
## Prerequisites
- **Software**: Salmon ≥ 1.10 (conda or pre-compiled binary)
- **Reference**: transcriptome FASTA (cDNA sequences, e.g., GENCODE or Ensembl) + genome FASTA for decoy-aware indexing
- **Python packages**: `pandas` for parsing output; `pydeseq2` for differential expression
> **Check before installing**: The tool may already be available in the current environment (e.g., inside a `pixi` / `conda` env). Run `command -v salmon` first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via `pixi run salmon` rather than bare `salmon`.
```bash
# Install with conda (recommended)
conda install -c bioconda salmon
# Verify
salmon --version
# salmon 1.10.3
# Or download pre-compiled binary
wget https://github.com/COMBINE-lab/salmon/releases/download/v1.10.0/salmon-1.10.0_linux_x86_64.tar.gz
tar xzvf salmon-1.10.0_linux_x86_64.tar.gz
export PATH="$PWD/salmon-latest_linux_x86_64/bin:$PATH"
```
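When Salmon is driven from a Python script, the availability check described above can be sketched as follows. This assumes that a `pixi.toml` in the working directory marks a pixi project; `resolve_salmon` is a hypothetical helper name, not a Salmon or pixi API.

```python
import shutil
from pathlib import Path
from typing import Callable, List, Optional


def resolve_salmon(
    which: Callable[[str], Optional[str]] = shutil.which,
    in_pixi_project: Optional[bool] = None,
) -> Optional[List[str]]:
    """Return the command prefix for invoking Salmon, or None if it is missing."""
    if in_pixi_project is None:
        # Assumption: a pixi.toml in the current directory means a pixi project.
        in_pixi_project = Path("pixi.toml").exists()
    if in_pixi_project and which("pixi"):
        return ["pixi", "run", "salmon"]
    if which("salmon"):
        return ["salmon"]
    return None


prefix = resolve_salmon()
if prefix is None:
    print("salmon not found; run the install commands first")
else:
    print("Invoke Salmon as:", " ".join(prefix))
```

The `which` parameter exists only so the lookup can be swapped out in tests; in normal use the defaults are enough.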
## Quick Start
```bash
# 1. Build transcriptome index (~5 min)
salmon index -t transcriptome.fa -i salmon_index/ -p 8
# 2. Quantify paired-end reads (~2-5 min per sample)
salmon quant \
-i salmon_index/ \
-l A \
-1 sample_R1.fastq.gz \
-2 sample_R2.fastq.gz \
-p 8 \
--gcBias --validateMappings \
-o results/sample1/
# Output: results/sample1/quant.sf
head results/sample1/quant.sf
```
## Workflow
### Step 1: Download Transcriptome Reference
Fetch a transcript FASTA from GENCODE or Ensembl (cDNA sequences only — not genome).
```bash
# Human transcriptome from GENCODE (recommended)
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/gencode.v47.transcripts.fa.gz
gunzip gencode.v47.transcripts.fa.gz
# Count transcripts
grep -c "^>" gencode.v47.transcripts.fa
# Prints the number of transcripts (the exact count depends on the GENCODE release)
echo "Reference ready."
ls -lh gencode.v47.transcripts.fa
```
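GENCODE transcript FASTA headers are pipe-delimited, with the transcript ID in the first field and the gene ID in the second. The sketch below uses that layout to count transcripts and build a transcript-to-gene map, which is useful later for gene-level summaries. The IDs are fabricated toy values and `read_tx2gene` is a hypothetical helper.

```python
import io


def read_tx2gene(handle):
    """Map transcript ID -> gene ID from GENCODE-style FASTA headers."""
    tx2gene = {}
    for line in handle:
        if not line.startswith(">"):
            continue  # skip sequence lines
        fields = line[1:].strip().split("|")
        if len(fields) >= 2:
            tx2gene[fields[0]] = fields[1]
    return tx2gene


# Toy FASTA with made-up IDs that mimic the GENCODE header layout.
toy_fasta = io.StringIO(
    ">ENST_TOY_1.1|ENSG_TOY_A.1|-|-|GENEA-201|GENEA|12|protein_coding|\n"
    "ACGTACGTACGT\n"
    ">ENST_TOY_2.1|ENSG_TOY_A.1|-|-|GENEA-202|GENEA|8|protein_coding|\n"
    "ACGTACGT\n"
    ">ENST_TOY_3.1|ENSG_TOY_B.1|-|-|GENEB-201|GENEB|4|lncRNA|\n"
    "ACGT\n"
)

tx2gene = read_tx2gene(toy_fasta)
print(f"Transcripts: {len(tx2gene)}")
print(f"Genes: {len(set(tx2gene.values()))}")
```

For the real reference, pass `open("gencode.v47.transcripts.fa")` instead of the in-memory toy file.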
### Step 2: Build Salmon Index
Index the transcriptome for quasi-mapping. Add genome decoys for improved accuracy.
```bash
# Standard index (fast, sufficient for most analyses)
salmon index \
-t gencode.v47.transcripts.fa \
-i salmon_index/ \
-p 8
echo "Standard index complete."
# Decoy-aware index (recommended for accuracy — uses full genome as decoy)
# Step 1: create decoy list from genome chromosome names
grep "^>" GRCh38.primary_assembly.genome.fa | cut -d " " -f 1 | sed 's/>//' > decoys.txt
# Step 2: concatenate transcriptome + genome
cat gencode.v47.transcripts.fa GRCh38.primary_assembly.genome.fa > gentrome.fa
# Step 3: build decoy-aware index
salmon index \
-t gentrome.fa \
-d decoys.txt \
-i salmon_decoy_index/ \
-p 8
echo "Decoy-aware index complete."
```
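The decoy list is just the first whitespace-delimited token of each genome FASTA header, and the decoy sequences are expected to come after the transcripts in the combined file. A minimal Python sketch of both ideas, with toy records and hypothetical helper names (`decoy_names`, `decoys_come_last`):

```python
def first_token(header):
    """Sequence name: text after '>' up to the first whitespace."""
    parts = header[1:].split()
    return parts[0] if parts else ""


def decoy_names(genome_lines):
    """Equivalent of the grep | cut | sed pipeline used to build decoys.txt."""
    return [first_token(line) for line in genome_lines if line.startswith(">")]


def decoys_come_last(gentrome_lines, decoys):
    """True if no transcript record appears after the first decoy record."""
    decoy_set = set(decoys)
    seen_decoy = False
    for line in gentrome_lines:
        if not line.startswith(">"):
            continue
        is_decoy = first_token(line) in decoy_set
        if seen_decoy and not is_decoy:
            return False
        seen_decoy = seen_decoy or is_decoy
    return seen_decoy


# Toy records standing in for the genome and the concatenated gentrome.
genome = [">chr1 1\n", "ACGT\n", ">chr2 2\n", "GGCC\n"]
gentrome = [">tx1\n", "AC\n"] + genome

decoys = decoy_names(genome)
print(decoys)
print(decoys_come_last(gentrome, decoys))
```

Running the order check before `salmon index` is cheap compared with rebuilding a decoy-aware index after a concatenation mistake.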
### Step 3: Quantify Single-End Reads
Run Salmon on single-end FASTQ files.
```bash
# Single-end quantification
salmon quant \
-i salmon_index/ \
-l A \
-r sample1.fastq.gz \
-p 8 \
--seqBias \
--validateMappings \
-o results/sample1/
echo "Mapping rate: $(grep 'Mapping rate' results/sample1/logs/salmon_quant.log | tail -1)"
echo "Output: results/sample1/quant.sf"
```
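For batch runs it can be handy to pull the mapping rate out of the log as a number rather than a raw line. The sketch below assumes the log reports it as `Mapping rate = <value>%`; the example log line is illustrative, not captured from a real run, and `mapping_rate` is a hypothetical helper.

```python
import re
from typing import Optional

# Assumed log format: "... Mapping rate = 87.1234%"
MAPPING_RATE = re.compile(r"Mapping rate\s*=\s*([0-9.]+)%")


def mapping_rate(log_text: str) -> Optional[float]:
    """Return the last reported mapping rate in percent, or None if absent."""
    matches = MAPPING_RATE.findall(log_text)
    return float(matches[-1]) if matches else None


# Illustrative log content; a real run writes results/<sample>/logs/salmon_quant.log.
example_log = (
    "[2024-01-01 12:00:00.000] [jointLog] [info] Counted 1000 total reads\n"
    "[2024-01-01 12:00:01.000] [jointLog] [info] Mapping rate = 87.1234%\n"
)

rate = mapping_rate(example_log)
print(f"Mapping rate: {rate}%")
```

Taking the last match mirrors the `tail -1` in the shell command above.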
### Step 4: Quantify Paired-End Reads with Bias Correction
Run Salmon on paired-end FASTQ files with recommended bias correction flags.
```bash
# Paired-end with GC bias + sequence bias correction
salmon quant \
-i salmon_decoy_index/ \
-l A \
-1 sample1_R1.fastq.gz \
-2 sample1_R2.fastq.gz \
-p 8 \
--gcBias \
--seqBias \
--validateMappings \
--numBootstraps 100 \
-o results/sample1/
# quant.sf columns: Name, Length, EffectiveLength, TPM, NumReads
head results/sample1/quant.sf
```
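With more than a handful of samples, the same command is usually generated per sample rather than typed by hand. The sketch below only builds and prints the commands, using the flags shown above; the sample names and the `<sample>_R1.fastq.gz` / `<sample>_R2.fastq.gz` naming are assumptions, and `build_quant_cmd` is a hypothetical helper.

```python
import shlex
from typing import List


def build_quant_cmd(
    sample: str,
    index_dir: str = "salmon_decoy_index/",
    fastq_dir: str = ".",
    out_root: str = "results",
    threads: int = 8,
) -> List[str]:
    """Build the paired-end `salmon quant` argument list for one sample."""
    return [
        "salmon", "quant",
        "-i", index_dir,
        "-l", "A",
        "-1", f"{fastq_dir}/{sample}_R1.fastq.gz",
        "-2", f"{fastq_dir}/{sample}_R2.fastq.gz",
        "-p", str(threads),
        "--gcBias",
        "--seqBias",
        "--validateMappings",
        "--numBootstraps", "100",
        "-o", f"{out_root}/{sample}/",
    ]


# Assumed sample names; adjust to the actual FASTQ file prefixes.
samples = ["ctrl_1", "ctrl_2", "treat_1", "treat_2"]
commands = [build_quant_cmd(sample) for sample in samples]
for cmd in commands:
    print(shlex.join(cmd))
```

To actually run the jobs, each list can be passed to `subprocess.run(cmd, check=True)`; passing a list avoids shell quoting issues with unusual file names.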
### Step 5: Load and Summarize Quantification Output
Parse `quant.sf` to inspect a single sample, then assemble a transcript-level TPM matrix across samples.
```python
import pandas as pd
from pathlib import Path
# Load single-sample output
quant = pd.read_csv("results/sample1/quant.sf", sep="\t")
print(f"Transcripts quantified: {len(quant)}")
print(f"Total estimated reads: {quant['NumReads'].sum():.0f}")
print(f"Transcripts with TPM > 1: {(quant['TPM'] > 1).sum()}")
print(quant.sort_values("TPM", ascending=False).head())
# Build a multi-sample TPM matrix
samples = ["ctrl_1", "ctrl_2", "treat_1", "treat_2"]
tpm_matrix = pd.DataFrame({
s: pd.read_csv(f"results/{s}/quant.sf", sep="\t").set_index("Name")["TPM"]
for s in samples
})
print(f"\nTPM matrix: {tpm_matrix.shape}")
tpm_matrix.to_csv("tpm_matrix.tsv", sep="\t")
```
### Step 6: Aggregate to Gene Level and Run DESeq2
Summarize transcript-level estimates to the gene level before running DESeq2.
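A minimal sketch of the aggregation half of this step: sum `TPM` and `NumReads` over the transcripts of each gene. This is plain summation under a user-supplied transcript-to-gene map, not the length-aware summarization that `tximport`/`tximeta` perform; the toy table, the map, and the `gene_level` helper are all made up for illustration.

```python
import csv
import io
from collections import defaultdict


def gene_level(quant_handle, tx2gene):
    """Sum TPM and NumReads per gene from a quant.sf-style table."""
    totals = defaultdict(lambda: {"TPM": 0.0, "NumReads": 0.0})
    for row in csv.DictReader(quant_handle, delimiter="\t"):
        gene = tx2gene.get(row["Name"])
        if gene is None:
            continue  # transcript missing from the map; skipped in this sketch
        totals[gene]["TPM"] += float(row["TPM"])
        totals[gene]["NumReads"] += float(row["NumReads"])
    return dict(totals)


# Toy quant.sf content with the same five columns Salmon writes.
toy_quant = io.StringIO(
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "tx1\t1200\t1000.0\t400000.0\t100.0\n"
    "tx2\t2200\t2000.0\t200000.0\t100.0\n"
    "tx3\t700\t500.0\t400000.0\t50.0\n"
)
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

genes = gene_level(toy_quant, tx2gene)
for gene, values in sorted(genes.items()):
    print(f"{gene}\t{values['TPM']:.1f}\t{values['NumReads']:.1f}")
```

For a real sample, open `results/<sample>/quant.sf` in place of the in-memory table and repeat per sample to build the gene-by-sample count matrix.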