Skill · 199 repo stars · updated 16d ago

star-rna-seq-aligner

Splice-aware RNA-seq aligner producing sorted BAM and splice junction tables. Builds genome index, runs two-pass alignment for better junctions. Outputs sorted BAM, junctions (SJ.out.tab), stats (Log.final.out), optional gene counts. Use Salmon for fast pseudoalignment; STAR when a BAM is needed for variant calling, IGV, or ENCODE pipelines.

Install in Claude Code
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/star-rna-seq-aligner && cp -r /tmp/star-rna-seq-aligner/skills/genomics-bioinformatics/alignment/star-rna-seq-aligner ~/.claude/skills/star-rna-seq-aligner
Then open a new Claude Code session; the skill loads automatically.

SKILL.md

# STAR — Spliced RNA-seq Aligner

## Overview

STAR (Spliced Transcripts Alignment to a Reference) aligns RNA-seq reads to a genome in a splice-aware manner, identifying novel and annotated splice junctions in a single pass. It generates coordinate-sorted BAM files compatible with samtools, IGV, deeptools, and GATK. STAR's 2-pass mode re-aligns reads using junctions discovered in the first pass, improving sensitivity for novel splice sites. With `--quantMode GeneCounts`, STAR simultaneously produces gene-level read count tables without requiring a separate featureCounts or HTSeq step.

## When to Use

- Aligning bulk RNA-seq reads to a reference genome when downstream tools require a BAM file (variant calling, visualization, deeptools)
- Running ENCODE-compliant RNA-seq pipelines that mandate genome alignment
- Discovering novel splice junctions and alternative splicing events in the dataset
- Generating gene count tables alongside BAM alignment in a single step with `--quantMode GeneCounts`
- Processing long reads or reads with high mismatch rates by tuning `--outFilterMismatchNmax`
- Use **Salmon** instead when you only need transcript/gene quantification and do not need a BAM file — Salmon is 20-50× faster

## Prerequisites

- **Software**: STAR ≥ 2.7.0 (conda or compiled binary)
- **Reference files**: genome FASTA + GTF annotation (same assembly)
- **RAM**: 30–32 GB for human/mouse genome index; 8–16 GB for smaller genomes
- **Disk**: ~25 GB for human genome index, ~5–10 GB per sample BAM

> **Check before installing**: The tool may already be available in the current environment (e.g., inside a `pixi` / `conda` env). Run `command -v STAR` first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via `pixi run STAR` rather than bare `STAR`.

```bash
# Install with conda (recommended)
conda install -c bioconda star

# Verify (prints the version string, e.g. 2.7.11a)
STAR --version

# Or compile from source
git clone https://github.com/alexdobin/STAR
cd STAR/source && make STAR
```
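
The availability check described above can be scripted. The following is a minimal sketch, not part of the original skill: the function name `resolve_star` is hypothetical, and it assumes a pixi project can be recognized by a `pixi.toml` file in the current directory (pixi can also be configured through `pyproject.toml`, which this sketch does not detect).

```bash
# Hypothetical helper: print the command that should be used to invoke STAR.
# Assumption: a pixi project is detected by a pixi.toml file in the current directory.
resolve_star() {
    if [ -f pixi.toml ] && command -v pixi >/dev/null 2>&1; then
        echo "pixi run STAR"
    elif command -v STAR >/dev/null 2>&1; then
        echo "STAR"
    else
        echo "STAR not found; install it with: conda install -c bioconda star" >&2
        return 1
    fi
}
```

A possible use is `STAR_CMD=$(resolve_star) && $STAR_CMD --version`, leaving `$STAR_CMD` unquoted so that `pixi run STAR` splits into separate words.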

## Quick Start

```bash
# 1. Generate genome index (~30 min, run once)
mkdir -p genome/star_index results/sample   # in case this STAR version does not create them
STAR --runMode genomeGenerate \
     --runThreadN 8 \
     --genomeDir genome/star_index \
     --genomeFastaFiles genome/GRCh38.fa \
     --sjdbGTFfile genome/gencode.v47.gtf \
     --sjdbOverhang 100    # ideally max(ReadLength) - 1; the default of 100 works for most data

# 2. Align paired-end reads (~10-20 min)
STAR --runThreadN 8 \
     --genomeDir genome/star_index \
     --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
     --readFilesCommand zcat \
     --outSAMtype BAM SortedByCoordinate \
     --outFileNamePrefix results/sample/

# 3. Index the BAM
samtools index results/sample/Aligned.sortedByCoord.out.bam
```
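
The `--sjdbOverhang` value in step 1 depends on read length. As a rough sketch, the longest read among the first few thousand records of a gzipped FASTQ can be measured and reduced by one. The helper name `sjdb_overhang` and the default sample size of 10000 reads are assumptions made for illustration.

```bash
# Hypothetical helper: estimate --sjdbOverhang as (longest sampled read length - 1).
# Usage: sjdb_overhang reads.fastq.gz [n_reads]
# Assumption: the input is gzip-compressed, 4-line-per-record FASTQ.
sjdb_overhang() {
    local fastq="$1" n_reads="${2:-10000}"
    gzip -dc "$fastq" \
        | head -n "$((n_reads * 4))" \
        | awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
               END { if (max < 2) exit 1; print max - 1 }'
}
```

For example, `--sjdbOverhang "$(sjdb_overhang sample_R1.fastq.gz)"` could replace the fixed value when building an index for a specific library. Only the start of the file is sampled, so trimmed libraries with a few longer reads further down may be underestimated.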

## Workflow

### Step 1: Prepare Reference Files

Download a genome FASTA and matching GTF annotation (same assembly version).

```bash
# Download GRCh38 genome and GENCODE annotation
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/GRCh38.primary_assembly.genome.fa.gz
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/gencode.v47.primary_assembly.annotation.gtf.gz

gunzip GRCh38.primary_assembly.genome.fa.gz gencode.v47.primary_assembly.annotation.gtf.gz
mkdir -p genome/star_index

echo "Genome and GTF ready."
ls -lh GRCh38.primary_assembly.genome.fa gencode.v47.primary_assembly.annotation.gtf
```

### Step 2: Generate Genome Index

Build the STAR genome index — required once per genome/read-length combination.

```bash
# Standard human genome index (requires ~32 GB RAM)
STAR --runMode genomeGenerate \
     --runThreadN 16 \
     --genomeDir genome/star_index/ \
     --genomeFastaFiles GRCh38.primary_assembly.genome.fa \
     --sjdbGTFfile gencode.v47.primary_assembly.annotation.gtf \
     --sjdbOverhang 100

# For small genomes, scale --genomeSAindexNbases down to
# min(14, log2(GenomeLength)/2 - 1); for E. coli (~4.6 Mb) this is about 10
# STAR --runMode genomeGenerate \
#      --genomeSAindexNbases 10 \
#      --genomeDir genome/ecoli_index/ ...

echo "Index complete: $(ls genome/star_index/ | wc -l) files"
```
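
For small genomes, the STAR manual suggests scaling `--genomeSAindexNbases` to roughly min(14, log2(GenomeLength)/2 - 1). The sketch below derives that value from a FASTA file. The function name `sa_index_nbases` is hypothetical, and rounding down is a conservative choice made here rather than something the source prescribes.

```bash
# Hypothetical helper: suggest a --genomeSAindexNbases value for a FASTA file.
# Computes min(14, floor(log2(total_bases) / 2 - 1)).
sa_index_nbases() {
    local fasta="$1"
    awk '!/^>/ { total += length($0) }          # sum sequence lines, skip headers
         END {
             if (total == 0) exit 1             # empty or header-only file
             n = int(log(total) / log(2) / 2 - 1)
             if (n > 14) n = 14
             print n
         }' "$fasta"
}
```

It could be used as `--genomeSAindexNbases "$(sa_index_nbases genome.fa)"`, where `genome.fa` is a placeholder for an uncompressed FASTA.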

### Step 3: Align RNA-seq Reads

Align single-end or paired-end FASTQ files to the indexed genome.

```bash
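mkdir -p results/sample1   # in case this STAR version does not create the output directory

# Single-end alignment (run this or the paired-end command below, not both: they share an output prefix)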
STAR --runThreadN 8 \
     --genomeDir genome/star_index/ \
     --readFilesIn sample1.fastq.gz \
     --readFilesCommand zcat \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes NH HI AS NM MD \
     --outFileNamePrefix results/sample1/

# Paired-end alignment
STAR --runThreadN 8 \
     --genomeDir genome/star_index/ \
     --readFilesIn sample1_R1.fastq.gz sample1_R2.fastq.gz \
     --readFilesCommand zcat \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes NH HI AS NM MD \
     --outFileNamePrefix results/sample1/

echo "BAM: results/sample1/Aligned.sortedByCoord.out.bam"
```
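
The commands above hard-code `--readFilesCommand zcat`, which suits gzipped input on most Linux systems. On some systems (macOS, for instance) `zcat` does not handle `.gz` files, and other inputs may be bzip2-compressed or uncompressed. A small sketch that picks the command from the file extension follows; the helper name `read_files_command` is hypothetical.

```bash
# Hypothetical helper: choose a --readFilesCommand value from the FASTQ extension.
read_files_command() {
    case "$1" in
        *.gz)  echo "gunzip -c" ;;    # gzip-compressed FASTQ
        *.bz2) echo "bunzip2 -c" ;;   # bzip2-compressed FASTQ
        *)     echo "cat" ;;          # uncompressed FASTQ
    esac
}
```

In an alignment call this would be written as `--readFilesCommand $(read_files_command sample1_R1.fastq.gz)`, left unquoted so that `gunzip -c` is passed as two words.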

### Step 4: Run 2-Pass Alignment for Improved Sensitivity

Two-pass mode collects splice junctions from the first pass and uses them as annotation for the second pass.

```bash
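mkdir -p pass1/sample1 results/sample1   # in case this STAR version does not create output directories

# First pass: collect splice junctions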
STAR --runThreadN 8 \
     --genomeDir genome/star_index/ \
     --readFilesIn sample1_R1.fastq.gz sample1_R2.fastq.gz \
     --readFilesCommand zcat \
     --outSAMtype None \
     --outFileNamePrefix pass1/sample1/

# Second pass — realign with all junctions from pass 1
SJ_FILES=$(ls pass1/*/SJ.out.tab | tr '\n' ' ')

STAR --runThreadN 8 \
     --genomeDir genome/star_index/ \
     --readFilesIn sample1_R1.fastq.gz sample1_R2.fastq.gz \
     --readFilesCommand zcat \
     --sjdbFileChrStartEnd $SJ_FILES \
     --outSAMtype BAM SortedByCoordinate \
     --outFileNamePrefix results/sample1/

# Alternative: single-command 2-pass (replaces both commands above; junctions come from this sample only)
STAR --runThreadN 8 \
     --genomeDir genome/star_index/ \
     --readFilesIn sample1_R1.fastq.gz sample1_R2.fastq.gz \
     --readFilesCommand zcat \
     --twopassMode Basic \
     --outSAMtype BAM SortedByCoordinate \
     --outFileNamePrefix results/sample1/
```
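
When first-pass junctions from many samples are pooled, a common practice is to filter them before the second pass. The sketch below is one such filter, not something the original skill prescribes: it keeps only unannotated junctions (column 6 equal to 0) supported by at least a chosen number of uniquely mapped reads (column 7), and drops duplicates across samples. The function name `filter_junctions` and the threshold are assumptions.

```bash
# Hypothetical helper: merge SJ.out.tab files and keep well-supported novel junctions.
# Usage: filter_junctions MIN_UNIQUE_READS SJ.out.tab [SJ.out.tab ...] > filtered.tab
# SJ.out.tab columns used here: 1 chromosome, 2 intron start, 3 intron end,
# 4 strand, 6 annotated flag (0 = novel), 7 uniquely mapped read count.
filter_junctions() {
    local min_reads="$1"; shift
    awk -v min="$min_reads" '
        # print the first qualifying row for each chromosome/start/end/strand key
        $6 == 0 && $7 >= min && !seen[$1, $2, $3, $4]++
    ' "$@"
}
```

For instance, `filter_junctions 3 pass1/*/SJ.out.tab > pass1/filtered.SJ.tab` produces a single file that can be passed to `--sjdbFileChrStartEnd` in place of the per-sample list. Annotated junctions are dropped on the assumption that they are already in the index through the GTF.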

### Step 5: Check Alignment Statistics

Parse the alignment log to assess mapping rate and read
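
A minimal sketch of pulling headline figures out of `Log.final.out` is shown here. It assumes the `label |<tab>value` layout and the field labels written by STAR 2.7.x; the function name `star_log_summary` is hypothetical.

```bash
# Hypothetical helper: print selected fields of a STAR Log.final.out as label<TAB>value.
# Assumption: each line has the form "<label> |<TAB><value>".
star_log_summary() {
    local log="$1"
    awk -F'|' '
        /Number of input reads|Uniquely mapped reads %|% of reads mapped to multiple loci|% of reads unmapped: too short/ {
            key = $1; val = $2
            gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim padding around the label
            gsub(/^[ \t]+|[ \t]+$/, "", val)   # trim the tab before the value
            print key "\t" val
        }' "$log"
}
```

Running `star_log_summary results/sample1/Log.final.out` prints one line per selected field, which is convenient for collecting statistics across samples in a loop.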