Skip to main content
ClaudeWave
Skill199 estrellas del repoactualizado 16d ago

featurecounts-rna-counting

Counts RNA-seq reads overlapping GTF gene features. Takes sorted STAR BAMs plus GTF; outputs a per-gene tab-delimited matrix across samples. Handles strandedness (0/1/2), paired-end, multi-sample batch counting in one command, and outputs assignment statistics. Use Salmon for alignment-free quantification; use featureCounts when STAR BAMs already exist.

Instalar en Claude Code
Copiar
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/featurecounts-rna-counting && cp -r /tmp/featurecounts-rna-counting/skills/genomics-bioinformatics/rnaseq/featurecounts-rna-counting ~/.claude/skills/featurecounts-rna-counting
Después abre una sesión nueva de Claude Code; el skill carga automáticamente.

SKILL.md

# featureCounts — RNA-seq Read Counting

## Overview

featureCounts (part of the Subread package) assigns sequencing reads in BAM files to genomic features defined in a GTF/GFF annotation. It counts how many reads overlap each gene (or exon, intron, or custom feature), producing a gene × sample count matrix suitable for differential expression analysis with DESeq2 or edgeR. featureCounts processes multiple BAM files in a single command, reporting read assignment statistics (assigned, unassigned by category) alongside the count matrix. It is the standard counting step after STAR alignment in RNA-seq pipelines.

## When to Use

- Generating gene-level count matrices from STAR-aligned BAM files for DESeq2 or edgeR
- Counting reads from multiple samples simultaneously in a single featureCounts command
- Handling stranded RNA-seq libraries where sense/antisense assignment matters
- Producing exon-level or custom-feature counts (e.g., for splicing analysis with DEXSeq)
- Verifying strandedness of an RNA-seq library when protocol documentation is unavailable
- Use **Salmon** instead when no BAM file exists and fast pseudoalignment is preferred
- Use **HTSeq-count** as an alternative with slower but more flexible counting modes

## Prerequisites

- **Software**: Subread package (contains `featureCounts`)
- **Input**: Sorted BAM files from STAR or HISAT2, plus a matching GTF annotation file

> **Check before installing**: The tool may already be available in the current environment (e.g., inside a `pixi` / `conda` env). Run `command -v featureCounts` first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via `pixi run featureCounts` rather than bare `featureCounts`.

```bash
# Install with conda (recommended)
conda install -c bioconda subread

# Verify
featureCounts -v
# featureCounts v2.0.6

# Alternative: install via apt (Ubuntu/Debian)
sudo apt-get install subread
```

## Quick Start

```bash
# Count reads for multiple samples (unstranded paired-end RNA-seq)
featureCounts \
    -a gencode.v47.annotation.gtf \
    -o counts/gene_counts.txt \
    -T 8 \
    -p --countReadPairs \
    results/sample1/Aligned.sortedByCoord.out.bam \
    results/sample2/Aligned.sortedByCoord.out.bam

echo "Count matrix: counts/gene_counts.txt"
head -3 counts/gene_counts.txt
```

## Workflow

### Step 1: Prepare BAM Files and GTF

Ensure BAM files are sorted and indexed, and the GTF matches the genome assembly.

```bash
# Verify BAM files are sorted
samtools view -H results/sample1/Aligned.sortedByCoord.out.bam | grep "SO:"
# Expected: SO:coordinate

# List BAMs to count
ls results/*/Aligned.sortedByCoord.out.bam | head -5

# Download GENCODE GTF (same version used for STAR indexing)
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/gencode.v47.primary_assembly.annotation.gtf.gz
gunzip gencode.v47.primary_assembly.annotation.gtf.gz

echo "GTF lines: $(wc -l < gencode.v47.primary_assembly.annotation.gtf)"
```

### Step 2: Determine Library Strandedness

Test strandedness using a small read count to set the `-s` parameter correctly.

```bash
# Quick strandedness check: count 1 sample with all 3 modes
# Compare assigned rates: highest = correct mode
for strand in 0 1 2; do
    echo "=== Strandedness -s $strand ==="
    featureCounts \
        -a gencode.v47.primary_assembly.annotation.gtf \
        -o /tmp/test_s${strand}.txt \
        -T 4 \
        -p --countReadPairs \
        -s $strand \
        results/sample1/Aligned.sortedByCoord.out.bam 2>&1 \
        | grep "Successfully assigned"
done
# Rule: 0=unstranded if similar rates; 1 or 2 if one is much higher
```

### Step 3: Count Unstranded Paired-End RNA-seq

Standard configuration for unstranded libraries (most polyA-selected RNA-seq).

```bash
mkdir -p counts

# Multi-sample batch counting: pass all BAMs as positional arguments
featureCounts \
    -a gencode.v47.primary_assembly.annotation.gtf \
    -o counts/gene_counts.txt \
    -T 8 \
    -p \
    --countReadPairs \
    -s 0 \
    -t exon \
    -g gene_id \
    results/ctrl_1/Aligned.sortedByCoord.out.bam \
    results/ctrl_2/Aligned.sortedByCoord.out.bam \
    results/treat_1/Aligned.sortedByCoord.out.bam \
    results/treat_2/Aligned.sortedByCoord.out.bam

echo "Count matrix: $(wc -l < counts/gene_counts.txt) genes"
# Also generates: counts/gene_counts.txt.summary (assignment stats)
cat counts/gene_counts.txt.summary
```

### Step 4: Count Stranded Libraries

For strand-specific libraries (TruSeq Stranded, QuantSeq), set the correct strandedness.

```bash
# Reverse-stranded library (most TruSeq Stranded protocols): -s 2
featureCounts \
    -a gencode.v47.primary_assembly.annotation.gtf \
    -o counts/gene_counts_stranded.txt \
    -T 8 \
    -p --countReadPairs \
    -s 2 \
    results/*/Aligned.sortedByCoord.out.bam

# Forward-stranded (e.g., Lexogen QuantSeq, Takara SMARTer): -s 1
# featureCounts ... -s 1 ...

echo "Stranded count complete."
head -2 counts/gene_counts_stranded.txt
```

### Step 5: Load Count Matrix into Python for DESeq2

Parse the featureCounts output file and prepare for differential expression.

```python
import pandas as pd

# featureCounts output has 6 metadata columns before count columns
counts_raw = pd.read_csv("counts/gene_counts.txt", sep="\t", comment="#")
print(f"Columns: {list(counts_raw.columns)}")

# Metadata columns: Geneid, Chr, Start, End, Strand, Length
# Count columns start at index 6
count_cols = counts_raw.columns[6:]  # BAM file paths as column names
counts = counts_raw.set_index("Geneid")[count_cols].copy()

# Rename columns to sample names (strip path and file extension)
import re
counts.columns = [re.sub(r".*/|Aligned\.sortedByCoord\.out\.bam", "", col)
                  for col in counts.columns]

print(f"Count matrix shape: {counts.shape}")  # (genes × samples)
print(f"Samples: {list(counts.columns)}")
print(f"Genes with counts > 0: {(counts.sum(axis=1) > 0).sum()}")
counts.t
sciagent-skill-creatorSkill

|

opentrons-integrationSkill

Opentrons Protocol API v2 for OT-2/Flex: Python protocols for pipetting, serial dilutions, PCR, plate replication; control thermocycler, heater-shaker, magnetic, temperature modules. Use pylabrobot for multi-vendor.

plotly-interactive-visualizationSkill

Interactive visualization with Plotly. 40+ chart types (scatter, line, heatmap, 3D, geographic) with hover, zoom, pan. Two APIs: Plotly Express (DataFrame) and Graph Objects (fine control). For static publication figures use matplotlib; for statistical grammar use seaborn.

seaborn-statistical-visualizationSkill

Statistical visualization on matplotlib + pandas. Distributions (histplot, kdeplot, violin, box), relational (scatter, line), categorical, regression, correlation heatmaps. Auto aggregation/CIs. Use plotly for interactive; matplotlib for low-level.

single-cell-annotationSkill

Best practices for single-cell RNA-seq cell type annotation including marker-based, reference-based, and automated classification approaches.

pymc-bayesian-modelingSkill

Bayesian modeling with PyMC 5: priors, likelihood, NUTS/ADVI sampling, diagnostics (R-hat, ESS), LOO/WAIC comparison, prediction. Hierarchical, logistic, GP variants; predictive checks.

scikit-survival-analysisSkill

Time-to-event modeling with scikit-survival: Cox PH (elastic net), Random Survival Forests, Boosting, SVMs for censored data. C-index, Brier, time-dependent AUC; Kaplan-Meier, Nelson-Aalen, competing risks. Pipeline/GridSearchCV compatible. Use statsmodels for frequentist, pymc for Bayesian, lifelines for parametric.

statistical-analysisSkill

>-