salmon-rna-quantification

Ultra-fast RNA-seq transcript/gene quantification via quasi-mapping (no BAM). Builds a k-mer index from transcriptome FASTA, quantifies in minutes. Outputs TPM/count tables (quant.sf) with optional GC- and sequence-bias correction. Integrates with tximeta/tximport for DESeq2/edgeR. Use STAR when a genome-aligned BAM is needed.

Install in Claude Code
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/salmon-rna-quantification && cp -r /tmp/salmon-rna-quantification/skills/genomics-bioinformatics/rnaseq/salmon-rna-quantification ~/.claude/skills/salmon-rna-quantification
Then open a new Claude Code session; the skill loads automatically.

SKILL.md

# Salmon — Fast RNA-seq Quantification

## Overview

Salmon quantifies transcript abundance from RNA-seq reads using quasi-mapping: reads are matched against a k-mer index of the transcriptome, with no full genome alignment. This makes Salmon 20–50× faster than alignment-based tools while producing accurate TPM and estimated count values. Salmon learns the fragment length distribution automatically and, when the flags are set, corrects for sequence-specific bias (`--seqBias`) and GC-content bias (`--gcBias`). Output `quant.sf` files load directly into `tximeta`/`tximport` (R), or can be parsed with `pandas` and passed to `pydeseq2` (Python) for differential expression analysis. For improved accuracy, decoy-aware indexing adds the full genome as decoy sequence so that reads originating outside the annotated transcriptome are not spuriously assigned to transcripts.

## When to Use

- Performing fast RNA-seq quantification when you do not need a genome-aligned BAM file
- Running large-scale RNA-seq studies where alignment speed is a bottleneck (Salmon is 20-50× faster than STAR + featureCounts)
- Computing TPM and estimated counts from bulk RNA-seq for differential expression with DESeq2 or edgeR
- Correcting for GC bias, fragment length, and sequence context bias with `--gcBias --seqBias`
- Estimating transcript-level uncertainty via bootstrap resampling with `--numBootstraps`
- Use **STAR** instead when you need a genome-aligned BAM for downstream tools (variant calling, deeptools, IGV visualization)
- Use **Kallisto** as an alternative for similar speed; Salmon provides better bias correction and decoy-aware indexing

## Prerequisites

- **Software**: Salmon ≥ 1.10 (conda or pre-compiled binary)
- **Reference**: transcriptome FASTA (cDNA sequences, e.g., GENCODE or Ensembl) + genome FASTA for decoy-aware indexing
- **Python packages**: `pandas` for parsing output; `pydeseq2` for differential expression

> **Check before installing**: The tool may already be available in the current environment (e.g., inside a `pixi` / `conda` env). Run `command -v salmon` first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via `pixi run salmon` rather than bare `salmon`.

```bash
# Install with conda (recommended)
conda install -c conda-forge -c bioconda salmon

# Verify
salmon --version
# salmon 1.10.3

# Or download pre-compiled binary
wget https://github.com/COMBINE-lab/salmon/releases/download/v1.10.0/salmon-1.10.0_linux_x86_64.tar.gz
tar xzvf salmon-1.10.0_linux_x86_64.tar.gz
export PATH="$PWD/salmon-latest_linux_x86_64/bin:$PATH"
```
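The "check before installing" step can also be scripted. The following is a minimal sketch, not part of the skill itself: `find_salmon` is a hypothetical helper name, and the `which` parameter exists only so the lookup can be swapped out in tests.

```python
import shutil
from typing import Callable, Optional


def find_salmon(which: Callable[[str], Optional[str]] = shutil.which) -> Optional[str]:
    """Return the path of the salmon executable on PATH, or None if absent.

    Hypothetical helper: it mirrors `command -v salmon` and does not
    know about pixi environments.
    """
    return which("salmon")


def install_hint(path: Optional[str]) -> str:
    """Turn the lookup result into a short human-readable message."""
    if path:
        return f"salmon found at {path}; skip the install commands"
    return "salmon not found on PATH; install it before continuing"
```

Calling `print(install_hint(find_salmon()))` reports whether installation can be skipped. Inside a pixi project the bare lookup may fail even though `pixi run salmon` works, so treat a negative result as a hint rather than proof.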

## Quick Start

```bash
# 1. Build transcriptome index (~5 min)
salmon index -t transcriptome.fa -i salmon_index/ -p 8

# 2. Quantify paired-end reads (~2-5 min per sample)
salmon quant \
    -i salmon_index/ \
    -l A \
    -1 sample_R1.fastq.gz \
    -2 sample_R2.fastq.gz \
    -p 8 \
    --gcBias --validateMappings \
    -o results/sample1/

# Output: results/sample1/quant.sf
head results/sample1/quant.sf
```

## Workflow

### Step 1: Download Transcriptome Reference

Fetch a transcript FASTA from GENCODE or Ensembl (cDNA sequences only — not genome).

```bash
# Human transcriptome from GENCODE (recommended)
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/gencode.v47.transcripts.fa.gz
gunzip gencode.v47.transcripts.fa.gz

# Count transcripts
grep -c "^>" gencode.v47.transcripts.fa
# Prints the transcript count (a few hundred thousand for human GENCODE)

echo "Reference ready."
ls -lh gencode.v47.transcripts.fa
```
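Gene-level aggregation later in the workflow needs a transcript-to-gene map, and the GENCODE transcript FASTA already carries one in its headers. The sketch below assumes the usual GENCODE header layout, in which `|`-separated fields start with the transcript ID followed by the gene ID; the function names are hypothetical, and Ensembl cDNA headers use a different layout that this does not handle.

```python
import gzip
from typing import Dict, Iterable


def tx2gene_from_headers(lines: Iterable[str]) -> Dict[str, str]:
    """Map transcript ID -> gene ID from GENCODE-style FASTA lines.

    Assumes headers look like ">TRANSCRIPT_ID|GENE_ID|...". Sequence
    lines and headers without a "|" separator are skipped.
    """
    mapping: Dict[str, str] = {}
    for line in lines:
        if not line.startswith(">"):
            continue
        fields = line[1:].strip().split("|")
        if len(fields) >= 2:
            mapping[fields[0]] = fields[1]
    return mapping


def tx2gene_from_fasta(path: str) -> Dict[str, str]:
    """Read a plain or gzip-compressed FASTA and build the mapping."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return tx2gene_from_headers(handle)
```

For example, `tx2gene_from_fasta("gencode.v47.transcripts.fa")` returns a dictionary keyed by transcript ID. Whether those keys match the `Name` column of `quant.sf` depends on how the index was built: `salmon index` keeps the full header up to the first whitespace unless its `--gencode` option is used to cut names at the first `|`.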

### Step 2: Build Salmon Index

Index the transcriptome for quasi-mapping. Add genome decoys for improved accuracy.

```bash
# Standard index (fast, sufficient for most analyses)
salmon index \
    -t gencode.v47.transcripts.fa \
    -i salmon_index/ \
    -p 8
echo "Standard index complete."

# Decoy-aware index (recommended for accuracy — uses full genome as decoy)
# Step 1: create decoy list from genome chromosome names
grep "^>" GRCh38.primary_assembly.genome.fa | cut -d " " -f 1 | sed 's/>//' > decoys.txt

# Step 2: concatenate transcriptome + genome
cat gencode.v47.transcripts.fa GRCh38.primary_assembly.genome.fa > gentrome.fa

# Step 3: build decoy-aware index
salmon index \
    -t gentrome.fa \
    -d decoys.txt \
    -i salmon_decoy_index/ \
    -p 8
echo "Decoy-aware index complete."
```
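The decoy list is simply the set of genome sequence names with the leading `>` and any trailing description removed. As an illustration of what the `grep | cut | sed` pipeline above produces, here is a rough Python equivalent; `decoy_names` and `write_decoys` are hypothetical helper names, and the sketch splits on any whitespace, which is slightly broader than `cut -d " "`.

```python
from typing import Iterable, List


def decoy_names(fasta_lines: Iterable[str]) -> List[str]:
    """Collect sequence names from FASTA header lines.

    For each line starting with ">", keep the first whitespace-delimited
    token without the ">" prefix. Empty headers are ignored.
    """
    names: List[str] = []
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                names.append(tokens[0])
    return names


def write_decoys(genome_fasta: str, out_path: str = "decoys.txt") -> int:
    """Write one decoy name per line; return how many names were written."""
    with open(genome_fasta, "rt") as handle:
        names = decoy_names(handle)
    with open(out_path, "wt") as out:
        out.write("\n".join(names) + "\n")
    return len(names)
```

Running `write_decoys("GRCh38.primary_assembly.genome.fa")` should give the same `decoys.txt` as the shell pipeline for headers that use spaces as separators.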

### Step 3: Quantify Single-End Reads

Run Salmon on single-end FASTQ files.

```bash
# Single-end quantification
salmon quant \
    -i salmon_index/ \
    -l A \
    -r sample1.fastq.gz \
    -p 8 \
    --seqBias \
    --validateMappings \
    -o results/sample1/

# Print the mapping rate line reported in the Salmon log
grep 'Mapping rate' results/sample1/logs/salmon_quant.log | tail -1
echo "Output: results/sample1/quant.sf"
```
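For automated QC across many samples, the mapping rate can be read from a structured file rather than grepped from the log. This sketch assumes the Salmon output directory contains `aux_info/meta_info.json` with a `percent_mapped` field, as recent releases write; the 50% cutoff and the function names are hypothetical and should be adapted to the library type.

```python
import json
from pathlib import Path

# Hypothetical QC cutoff, not a Salmon default.
MIN_PERCENT_MAPPED = 50.0


def percent_mapped(meta_json: str) -> float:
    """Extract the mapping rate (in percent) from the text of meta_info.json."""
    return float(json.loads(meta_json)["percent_mapped"])


def is_low_mapping(meta_json: str, min_percent: float = MIN_PERCENT_MAPPED) -> bool:
    """Return True when the mapping rate falls below the chosen cutoff."""
    return percent_mapped(meta_json) < min_percent


def check_sample(quant_dir: str) -> bool:
    """Read <quant_dir>/aux_info/meta_info.json and flag low mapping rates."""
    meta_path = Path(quant_dir) / "aux_info" / "meta_info.json"
    return is_low_mapping(meta_path.read_text())
```

`check_sample("results/sample1")` returns `True` for a sample that needs a closer look, for instance because of a wrong reference, contamination, or heavy adapter content.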

### Step 4: Quantify Paired-End Reads with Bias Correction

Run Salmon on paired-end FASTQ files with recommended bias correction flags.

```bash
# Paired-end with GC bias + sequence bias correction
salmon quant \
    -i salmon_decoy_index/ \
    -l A \
    -1 sample1_R1.fastq.gz \
    -2 sample1_R2.fastq.gz \
    -p 8 \
    --gcBias \
    --seqBias \
    --validateMappings \
    --numBootstraps 100 \
    -o results/sample1/

# quant.sf columns: Name, Length, EffectiveLength, TPM, NumReads
head results/sample1/quant.sf
```
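When many samples share the same settings, it can help to assemble the `salmon quant` command programmatically. The sketch below only builds the argument list from the flags shown in Steps 3 and 4; `salmon_quant_argv` is a hypothetical helper, and the `{sample}_R1.fastq.gz` naming in the usage note is an assumption modeled on the file names above.

```python
from typing import List, Optional


def salmon_quant_argv(
    index: str,
    out_dir: str,
    reads_1: str,
    reads_2: Optional[str] = None,
    threads: int = 8,
    gc_bias: bool = True,
    seq_bias: bool = True,
    bootstraps: int = 0,
) -> List[str]:
    """Build the argv for `salmon quant` with automatic library type (-l A).

    Single-end input uses -r; paired-end input uses -1/-2.
    """
    argv = ["salmon", "quant", "-i", index, "-l", "A"]
    if reads_2 is None:
        argv += ["-r", reads_1]
    else:
        argv += ["-1", reads_1, "-2", reads_2]
    argv += ["-p", str(threads), "--validateMappings"]
    if gc_bias:
        argv.append("--gcBias")
    if seq_bias:
        argv.append("--seqBias")
    if bootstraps > 0:
        argv += ["--numBootstraps", str(bootstraps)]
    argv += ["-o", out_dir]
    return argv
```

A loop such as `for s in samples: subprocess.run(salmon_quant_argv("salmon_decoy_index/", f"results/{s}/", f"{s}_R1.fastq.gz", f"{s}_R2.fastq.gz", bootstraps=100), check=True)` then runs each sample in turn and stops on the first failure.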

### Step 5: Load and Summarize Quantification Output

Parse `quant.sf` to inspect a single sample, then combine samples into a transcript-level TPM matrix. Gene-level aggregation for differential expression is covered in Step 6.

```python
import pandas as pd
from pathlib import Path

# Load single-sample output
quant = pd.read_csv("results/sample1/quant.sf", sep="\t")
print(f"Transcripts quantified: {len(quant)}")
print(f"Total estimated reads: {quant['NumReads'].sum():.0f}")
print(f"Transcripts with TPM > 1: {(quant['TPM'] > 1).sum()}")
print(quant.sort_values("TPM", ascending=False).head())

# Build a multi-sample TPM matrix
samples = ["ctrl_1", "ctrl_2", "treat_1", "treat_2"]
tpm_matrix = pd.DataFrame({
    s: pd.read_csv(f"results/{s}/quant.sf", sep="\t").set_index("Name")["TPM"]
    for s in samples
})
print(f"\nTPM matrix: {tpm_matrix.shape}")
tpm_matrix.to_csv("tpm_matrix.tsv", sep="\t")
```
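The `TPM` column is derived from `NumReads` and `EffectiveLength`: each transcript's reads-per-effective-base rate is divided by the sum of all rates and scaled to one million. The sketch below recomputes it in plain Python as a sanity check; `tpm_from_counts` is a hypothetical helper, and recomputed values can differ from the file in the last printed digits because `quant.sf` stores rounded numbers.

```python
from typing import Iterable, List


def tpm_from_counts(num_reads: Iterable[float], effective_lengths: Iterable[float]) -> List[float]:
    """Recompute TPM from estimated read counts and effective lengths.

    rate_i = NumReads_i / EffectiveLength_i
    TPM_i  = rate_i / sum(rate) * 1e6
    Returns zeros if no transcript has any reads.
    """
    rates = [n / length for n, length in zip(num_reads, effective_lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0 for _ in rates]
    return [rate / total * 1e6 for rate in rates]
```

With the `quant` DataFrame loaded above, `tpm_from_counts(quant["NumReads"], quant["EffectiveLength"])` should closely track `quant["TPM"]`. The same formula explains why TPM values always sum to one million per sample and therefore are not suitable as raw input for count-based tools such as DESeq2.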

### Step 6: Aggregate to Gene Level and Run DESeq2

Summarize transcript-level estimates t