featurecounts-rna-counting
Counts RNA-seq reads overlapping GTF gene features. Takes sorted STAR BAMs plus GTF; outputs a per-gene tab-delimited matrix across samples. Handles strandedness (0/1/2), paired-end, multi-sample batch counting in one command, and outputs assignment statistics. Use Salmon for alignment-free quantification; use featureCounts when STAR BAMs already exist.
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/featurecounts-rna-counting && cp -r /tmp/featurecounts-rna-counting/skills/genomics-bioinformatics/rnaseq/featurecounts-rna-counting ~/.claude/skills/featurecounts-rna-countingSKILL.md
# featureCounts — RNA-seq Read Counting
## Overview
featureCounts (part of the Subread package) assigns sequencing reads in BAM files to genomic features defined in a GTF/GFF annotation. It counts how many reads overlap each gene (or exon, intron, or custom feature), producing a gene × sample count matrix suitable for differential expression analysis with DESeq2 or edgeR. featureCounts processes multiple BAM files in a single command, reporting read assignment statistics (assigned, unassigned by category) alongside the count matrix. It is the standard counting step after STAR alignment in RNA-seq pipelines.
## When to Use
- Generating gene-level count matrices from STAR-aligned BAM files for DESeq2 or edgeR
- Counting reads from multiple samples simultaneously in a single featureCounts command
- Handling stranded RNA-seq libraries where sense/antisense assignment matters
- Producing exon-level or custom-feature counts (e.g., for splicing analysis with DEXSeq)
- Verifying strandedness of an RNA-seq library when protocol documentation is unavailable
- Use **Salmon** instead when no BAM file exists and fast pseudoalignment is preferred
- Use **HTSeq-count** as an alternative with slower but more flexible counting modes
## Prerequisites
- **Software**: Subread package (contains `featureCounts`)
- **Input**: Sorted BAM files from STAR or HISAT2, plus a matching GTF annotation file
> **Check before installing**: The tool may already be available in the current environment (e.g., inside a `pixi` / `conda` env). Run `command -v featureCounts` first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via `pixi run featureCounts` rather than bare `featureCounts`.
```bash
# Install with conda (recommended)
conda install -c bioconda subread
# Verify
featureCounts -v
# featureCounts v2.0.6
# Alternative: install via apt (Ubuntu/Debian)
sudo apt-get install subread
```
## Quick Start
```bash
# Count reads for multiple samples (unstranded paired-end RNA-seq)
featureCounts \
-a gencode.v47.annotation.gtf \
-o counts/gene_counts.txt \
-T 8 \
-p --countReadPairs \
results/sample1/Aligned.sortedByCoord.out.bam \
results/sample2/Aligned.sortedByCoord.out.bam
echo "Count matrix: counts/gene_counts.txt"
head -3 counts/gene_counts.txt
```
## Workflow
### Step 1: Prepare BAM Files and GTF
Ensure BAM files are sorted and indexed, and the GTF matches the genome assembly.
```bash
# Verify BAM files are sorted
samtools view -H results/sample1/Aligned.sortedByCoord.out.bam | grep "SO:"
# Expected: SO:coordinate
# List BAMs to count
ls results/*/Aligned.sortedByCoord.out.bam | head -5
# Download GENCODE GTF (same version used for STAR indexing)
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/gencode.v47.primary_assembly.annotation.gtf.gz
gunzip gencode.v47.primary_assembly.annotation.gtf.gz
echo "GTF lines: $(wc -l < gencode.v47.primary_assembly.annotation.gtf)"
```
### Step 2: Determine Library Strandedness
Test strandedness using a small read count to set the `-s` parameter correctly.
```bash
# Quick strandedness check: count 1 sample with all 3 modes
# Compare assigned rates: highest = correct mode
for strand in 0 1 2; do
echo "=== Strandedness -s $strand ==="
featureCounts \
-a gencode.v47.primary_assembly.annotation.gtf \
-o /tmp/test_s${strand}.txt \
-T 4 \
-p --countReadPairs \
-s $strand \
results/sample1/Aligned.sortedByCoord.out.bam 2>&1 \
| grep "Successfully assigned"
done
# Rule: 0=unstranded if similar rates; 1 or 2 if one is much higher
```
### Step 3: Count Unstranded Paired-End RNA-seq
Standard configuration for unstranded libraries (most polyA-selected RNA-seq).
```bash
mkdir -p counts
# Multi-sample batch counting: pass all BAMs as positional arguments
featureCounts \
-a gencode.v47.primary_assembly.annotation.gtf \
-o counts/gene_counts.txt \
-T 8 \
-p \
--countReadPairs \
-s 0 \
-t exon \
-g gene_id \
results/ctrl_1/Aligned.sortedByCoord.out.bam \
results/ctrl_2/Aligned.sortedByCoord.out.bam \
results/treat_1/Aligned.sortedByCoord.out.bam \
results/treat_2/Aligned.sortedByCoord.out.bam
echo "Count matrix: $(wc -l < counts/gene_counts.txt) genes"
# Also generates: counts/gene_counts.txt.summary (assignment stats)
cat counts/gene_counts.txt.summary
```
### Step 4: Count Stranded Libraries
For strand-specific libraries (TruSeq Stranded, QuantSeq), set the correct strandedness.
```bash
# Reverse-stranded library (most TruSeq Stranded protocols): -s 2
featureCounts \
-a gencode.v47.primary_assembly.annotation.gtf \
-o counts/gene_counts_stranded.txt \
-T 8 \
-p --countReadPairs \
-s 2 \
results/*/Aligned.sortedByCoord.out.bam
# Forward-stranded (e.g., Lexogen QuantSeq, Takara SMARTer): -s 1
# featureCounts ... -s 1 ...
echo "Stranded count complete."
head -2 counts/gene_counts_stranded.txt
```
### Step 5: Load Count Matrix into Python for DESeq2
Parse the featureCounts output file and prepare for differential expression.
```python
import pandas as pd
# featureCounts output has 6 metadata columns before count columns
counts_raw = pd.read_csv("counts/gene_counts.txt", sep="\t", comment="#")
print(f"Columns: {list(counts_raw.columns)}")
# Metadata columns: Geneid, Chr, Start, End, Strand, Length
# Count columns start at index 6
count_cols = counts_raw.columns[6:] # BAM file paths as column names
counts = counts_raw.set_index("Geneid")[count_cols].copy()
# Rename columns to sample names (strip path and file extension)
import re
counts.columns = [re.sub(r".*/|Aligned\.sortedByCoord\.out\.bam", "", col)
for col in counts.columns]
print(f"Count matrix shape: {counts.shape}") # (genes × samples)
print(f"Samples: {list(counts.columns)}")
print(f"Genes with counts > 0: {(counts.sum(axis=1) > 0).sum()}")
counts.t|
Opentrons Protocol API v2 for OT-2/Flex: Python protocols for pipetting, serial dilutions, PCR, plate replication; control thermocycler, heater-shaker, magnetic, temperature modules. Use pylabrobot for multi-vendor.
Interactive visualization with Plotly. 40+ chart types (scatter, line, heatmap, 3D, geographic) with hover, zoom, pan. Two APIs: Plotly Express (DataFrame) and Graph Objects (fine control). For static publication figures use matplotlib; for statistical grammar use seaborn.
Statistical visualization on matplotlib + pandas. Distributions (histplot, kdeplot, violin, box), relational (scatter, line), categorical, regression, correlation heatmaps. Auto aggregation/CIs. Use plotly for interactive; matplotlib for low-level.
Best practices for single-cell RNA-seq cell type annotation including marker-based, reference-based, and automated classification approaches.
Bayesian modeling with PyMC 5: priors, likelihood, NUTS/ADVI sampling, diagnostics (R-hat, ESS), LOO/WAIC comparison, prediction. Hierarchical, logistic, GP variants; predictive checks.
Time-to-event modeling with scikit-survival: Cox PH (elastic net), Random Survival Forests, Boosting, SVMs for censored data. C-index, Brier, time-dependent AUC; Kaplan-Meier, Nelson-Aalen, competing risks. Pipeline/GridSearchCV compatible. Use statsmodels for frequentist, pymc for Bayesian, lifelines for parametric.
>-