ClaudeWave
Skill · 199 repo stars · updated 16d ago

gtars

Rust-backed Python library for fast genomic token arithmetic and BED processing. High-performance BED I/O, interval set ops (intersect, merge, complement, subtract), region tokenization against a universe, universe construction. Use for preprocessing large BED collections and ML token vocabularies.

Install in Claude Code
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/gtars && cp -r /tmp/gtars/skills/genomics-bioinformatics/interval-ops/gtars ~/.claude/skills/gtars
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# GTARS: Fast Genomic Token Arithmetic and BED File Processing

## Overview

GTARS is a Python library with a Rust-backed core for high-performance genomic interval operations. It provides BED file I/O, set-theoretic interval operations (intersection, union, merge, complement, subtraction), genomic region tokenization against a reference universe, and utilities for building consensus universe BED files. GTARS is designed for workflows that process hundreds to thousands of BED files, serving as a preprocessing engine for ML pipelines (including geniml) and general bioinformatics pipelines.

## When to Use

- Read and write large BED files efficiently, leveraging Rust-backed parsing for speed over pure Python alternatives
- Compute genomic interval intersections, merges, complements, or subtractions between BED file pairs or sets
- Tokenize a collection of genomic regions against a fixed universe vocabulary for ML input preparation
- Build consensus universe BED files from a collection of sample BED files
- Count overlap statistics between two BED files without launching bedtools processes
- Preprocess ATAC-seq, ChIP-seq, or ENCODE peak files before feeding into geniml or other ML tools
- Not for alignment-level work: for BAM/SAM reading with CIGAR-level detail, use `pysam-genomic-files` instead

## Prerequisites

- **Python packages**: `gtars`, `numpy` (optional, for array conversion)
- **Data requirements**: BED format files (3+ columns: chr, start, end); optionally a universe BED file for tokenization
- **Environment**: Python 3.8+; Rust toolchain not required (pre-built wheels available on PyPI)

```bash
pip install gtars
```
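
Before handing files to the library, it can help to confirm that they meet the 3-column requirement above. The following is a minimal pure-Python sketch, not part of gtars: `parse_bed_lines` is a hypothetical helper, and it assumes standard BED conventions (whitespace-separated fields, 0-based half-open coordinates, optional `track`/`browser`/`#` header lines).

```python
from typing import Iterator, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), 0-based half-open


def parse_bed_lines(lines: List[str]) -> Iterator[Interval]:
    """Yield (chrom, start, end) from BED text, skipping headers and blank lines."""
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if len(fields) < 3:
            raise ValueError(f"line {lineno}: expected 3+ columns, got {len(fields)}")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if start < 0 or end < start:
            raise ValueError(f"line {lineno}: invalid coordinates {start}-{end}")
        yield chrom, start, end


# Toy input: one header line, one 4-column record, one blank line, one 3-column record
example = ["track name=peaks", "chr1\t1000\t2000\tpeak_1", "", "chr2\t300\t800"]
records = list(parse_bed_lines(example))
print(records)
# [('chr1', 1000, 2000), ('chr2', 300, 800)]
```

Extra columns (name, score, strand) are ignored here; only the first three are validated.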

## Quick Start

```python
from gtars import GenomicIntervalSet

# Load a BED file and inspect
gis = GenomicIntervalSet("peaks.bed")
print(f"Loaded {len(gis)} intervals")
print(f"First interval: {gis[0]}")
# First interval: chr1:1000-2000

# Intersect two BED files
gis2 = GenomicIntervalSet("other_peaks.bed")
overlap = gis.intersect(gis2)
print(f"Overlapping intervals: {len(overlap)}")
```

## Core API

### Module 1: GenomicIntervalSet — BED File I/O

Primary data structure for loading, indexing, and writing genomic intervals.

```python
from gtars import GenomicIntervalSet

# Load from file
gis = GenomicIntervalSet("peaks.bed")
print(f"Intervals loaded: {len(gis)}")
print(f"Chromosomes present: {gis.chromosomes}")

# Access by index
interval = gis[0]
print(f"chr={interval.chr}, start={interval.start}, end={interval.end}")

# Iterate all intervals
for iv in gis:
    width = iv.end - iv.start
    if width > 5000:
        print(f"Wide interval: {iv.chr}:{iv.start}-{iv.end} ({width} bp)")
        break
```

```python
from gtars import GenomicIntervalSet, GenomicInterval

# Create from a list of GenomicInterval objects
intervals = [
    GenomicInterval("chr1", 100, 500),
    GenomicInterval("chr1", 600, 1200),
    GenomicInterval("chr2", 300, 800),
]
gis = GenomicIntervalSet(intervals)
print(f"Created GenomicIntervalSet with {len(gis)} intervals")

# Write to BED file
gis.to_bed("output_intervals.bed")
print("Saved output_intervals.bed")
```

### Module 2: Interval Set Operations

Compute intersections, unions, merges, subtractions, and complements between BED sets.

```python
from gtars import GenomicIntervalSet

peaks_a = GenomicIntervalSet("condition_A.bed")
peaks_b = GenomicIntervalSet("condition_B.bed")

# Intersection: intervals present in both sets
shared = peaks_a.intersect(peaks_b)
print(f"Shared intervals: {len(shared)}")

# Subtraction: intervals in A not overlapping B
a_only = peaks_a.subtract(peaks_b)
print(f"A-specific intervals: {len(a_only)}")

# Union: pool both sets, then merge overlapping intervals
combined = peaks_a.union(peaks_b)
print(f"Union intervals: {len(combined)}")
```
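
To make the semantics of these operations concrete on a small input, here is a pure-Python reference sketch. It is not the gtars implementation, and the exact behavior of `intersect` and `subtract` in the library is not specified here: this sketch assumes intersection returns the overlapping segment of each overlapping pair, and subtraction drops any interval of A that overlaps B at all. The helper names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), 0-based half-open


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two half-open intervals on the same chromosome share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def intersect_segments(a_set: List[Interval], b_set: List[Interval]) -> List[Interval]:
    """Overlapping segment for every overlapping (a, b) pair. Quadratic; toy sizes only."""
    out = []
    for a in a_set:
        for b in b_set:
            if overlaps(a, b):
                out.append((a[0], max(a[1], b[1]), min(a[2], b[2])))
    return sorted(out)


def subtract_whole(a_set: List[Interval], b_set: List[Interval]) -> List[Interval]:
    """Intervals of A that overlap nothing in B (assumed semantics)."""
    return [a for a in a_set if not any(overlaps(a, b) for b in b_set)]


set_a = [("chr1", 100, 500), ("chr1", 600, 1200), ("chr2", 300, 800)]
set_b = [("chr1", 400, 700), ("chr2", 900, 1000)]

shared_ref = intersect_segments(set_a, set_b)
a_only_ref = subtract_whole(set_a, set_b)
print(shared_ref)  # [('chr1', 400, 500), ('chr1', 600, 700)]
print(a_only_ref)  # [('chr2', 300, 800)]
```

A reference like this is only useful for checking results on a handful of intervals; the nested loop does not scale to real peak files.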

```python
from gtars import GenomicIntervalSet

# Merge overlapping/adjacent intervals within a single set
gis = GenomicIntervalSet("fragmented_peaks.bed")
print(f"Before merge: {len(gis)} intervals")

merged = gis.merge()
print(f"After merge:  {len(merged)} intervals")

# Complement: genome-wide intervals NOT covered by peaks
# Requires chromosome sizes (GRCh38/hg38 lengths shown)
chrom_sizes = {"chr1": 248956422, "chr2": 242193529, "chrX": 156040895}
complement = gis.complement(chrom_sizes)
print(f"Complement (uncovered) intervals: {len(complement)}")
```
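
The merge and complement steps can be sketched in plain Python as a sort followed by a single sweep. This is an illustrative reference rather than gtars's own code; it assumes, as the comment above suggests, that adjacent (bookended) intervals are merged along with overlapping ones. The function names and the toy chromosome sizes are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), 0-based half-open


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or adjacent intervals per chromosome."""
    merged: List[Interval] = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def complement_intervals(
    intervals: List[Interval], chrom_sizes: Dict[str, int]
) -> List[Interval]:
    """Gaps not covered by `intervals`, limited to chromosomes in `chrom_sizes`."""
    covered: Dict[str, List[Tuple[int, int]]] = {chrom: [] for chrom in chrom_sizes}
    for chrom, start, end in merge_intervals(intervals):
        if chrom in covered:
            covered[chrom].append((start, end))
    gaps: List[Interval] = []
    for chrom, size in chrom_sizes.items():
        cursor = 0  # first base not yet accounted for
        for start, end in covered[chrom]:
            if start > cursor:
                gaps.append((chrom, cursor, start))
            cursor = max(cursor, end)
        if cursor < size:
            gaps.append((chrom, cursor, size))
    return gaps


toy = [("chr1", 100, 500), ("chr1", 400, 700), ("chr1", 700, 900), ("chr2", 0, 50)]
toy_sizes = {"chr1": 1000, "chr2": 50}

merged_ref = merge_intervals(toy)
gaps_ref = complement_intervals(toy, toy_sizes)
print(merged_ref)  # [('chr1', 100, 900), ('chr2', 0, 50)]
print(gaps_ref)    # [('chr1', 0, 100), ('chr1', 900, 1000)]
```

Note that a chromosome listed in the sizes but absent from the input contributes one gap spanning its full length, which is why the size dictionary determines the scope of the complement.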

### Module 3: Tokenization

Convert genomic intervals to integer token IDs against a reference universe vocabulary.

```python
from gtars import GenomicIntervalSet, Tokenizer

# Initialize tokenizer with a universe BED file
tokenizer = Tokenizer("universe.bed")
print(f"Universe vocabulary size: {len(tokenizer)}")

# Tokenize a single BED file → list of token IDs
gis = GenomicIntervalSet("sample_peaks.bed")
tokens = tokenizer.tokenize(gis)
print(f"Token IDs: {tokens[:10]} ...")
print(f"Total tokens: {len(tokens)}")
```

```python
import numpy as np
from gtars import Tokenizer, GenomicIntervalSet

tokenizer = Tokenizer("universe.bed")

# Tokenize and convert to numpy array for ML
gis = GenomicIntervalSet("sample_peaks.bed")
tokens = tokenizer.tokenize(gis)
token_array = np.array(tokens, dtype=np.int64)  # integer dtype so an empty token list still works as an index
print(f"Token array shape: {token_array.shape}, dtype: {token_array.dtype}")

# Build a binary presence/absence vector over the full universe
vocab_size = len(tokenizer)
presence_vector = np.zeros(vocab_size, dtype=np.float32)
presence_vector[token_array] = 1.0
print(f"Presence vector shape: {presence_vector.shape}")
print(f"Fraction of universe covered: {presence_vector.mean():.4f}")
```
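
Conceptually, tokenizing against a universe means mapping each query interval to the IDs of the universe regions it overlaps. The sketch below illustrates that idea in pure Python with binary search. It is an assumption about the general technique, not a description of how gtars assigns IDs: here a token ID is simply the position of a region in the universe list, and the universe is assumed to be non-overlapping. `build_index` and `tokenize_regions` are hypothetical names.

```python
from bisect import bisect_left, bisect_right
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), 0-based half-open
ChromIndex = Tuple[List[int], List[int], List[int]]  # starts, ends, token IDs


def build_index(universe: List[Interval]) -> Dict[str, ChromIndex]:
    """Per-chromosome sorted starts/ends/IDs; token ID = position in `universe`."""
    by_chrom: Dict[str, List[Tuple[int, int, int]]] = {}
    for token_id, (chrom, start, end) in enumerate(universe):
        by_chrom.setdefault(chrom, []).append((start, end, token_id))
    index: Dict[str, ChromIndex] = {}
    for chrom, regions in by_chrom.items():
        regions.sort()
        index[chrom] = (
            [r[0] for r in regions],
            [r[1] for r in regions],
            [r[2] for r in regions],
        )
    return index


def tokenize_regions(query: List[Interval], index: Dict[str, ChromIndex]) -> List[int]:
    """Sorted unique IDs of universe regions overlapped by any query interval."""
    tokens = set()
    for chrom, q_start, q_end in query:
        if chrom not in index:
            continue  # chromosome absent from the universe: no tokens
        starts, ends, ids = index[chrom]
        lo = bisect_right(ends, q_start)  # skip regions ending at or before q_start
        hi = bisect_left(starts, q_end)   # stop at regions starting at or after q_end
        tokens.update(ids[lo:hi])
    return sorted(tokens)


toy_universe = [("chr1", 0, 100), ("chr1", 200, 300), ("chr1", 400, 500), ("chr2", 0, 1000)]
toy_query = [("chr1", 50, 250), ("chr2", 10, 20), ("chr3", 0, 10)]

toy_tokens = tokenize_regions(toy_query, build_index(toy_universe))
presence = [1 if i in toy_tokens else 0 for i in range(len(toy_universe))]
print(toy_tokens)  # [0, 1, 3]
print(presence)    # [1, 1, 0, 1]
```

Because the universe is non-overlapping, region ends are sorted whenever starts are, which is what makes the two binary searches valid.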

### Module 4: Universe Building

Construct a consensus non-overlapping universe from a collection of BED files.

```python
from gtars import UniverseBuilder

# Build universe from multiple BED files
bed_files = ["sample_1.bed", "sample_2.bed", "sample_3.bed", "sample_4.bed"]
builder = UniverseBuilder()
universe = builder.build(bed_files)

print(f"Universe regions: {len(universe)}")
universe.to_bed("consensus_universe.bed")
print("Saved consensus_universe.bed")
```
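
How `UniverseBuilder` arrives at its consensus is not described here. One common way to derive a non-overlapping universe is a coverage cutoff: keep every stretch of the genome covered by at least some number of samples. The sketch below shows that approach in pure Python as an illustration only; `coverage_universe` and its `min_support` parameter are hypothetical and are not gtars options. It assumes intervals within each sample do not overlap one another.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), 0-based half-open


def coverage_universe(samples: List[List[Interval]], min_support: int = 1) -> List[Interval]:
    """Non-overlapping regions covered by at least `min_support` samples."""
    events: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for sample in samples:
        for chrom, start, end in sample:
            events[chrom].append((start, 1))   # coverage rises at start
            events[chrom].append((end, -1))    # coverage falls at end
    universe: List[Interval] = []
    for chrom in sorted(events):
        depth = 0
        region_start = None
        # Tuple sorting places -1 before +1 at the same position, so
        # bookended intervals are not counted as overlapping.
        for pos, delta in sorted(events[chrom]):
            depth += delta
            if depth >= min_support and region_start is None:
                region_start = pos
            elif depth < min_support and region_start is not None:
                universe.append((chrom, region_start, pos))
                region_start = None
    return universe


sample_1 = [("chr1", 100, 300), ("chr1", 500, 700)]
sample_2 = [("chr1", 200, 400), ("chr1", 650, 800)]
sample_3 = [("chr1", 250, 350)]
all_samples = [sample_1, sample_2, sample_3]

union_universe = coverage_universe(all_samples, min_support=1)
consensus_universe = coverage_universe(all_samples, min_support=2)
print(union_universe)      # [('chr1', 100, 400), ('chr1', 500, 800)]
print(consensus_universe)  # [('chr1', 200, 350), ('chr1', 650, 700)]
```

Raising the cutoff trades vocabulary coverage for reproducibility: a higher `min_support` yields fewer, narrower regions that more samples agree on.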
