gtars

GTARS is a Rust-backed Python library for high-performance genomic interval operations on BED files. Use it to efficiently read and write large BED collections, compute set-theoretic operations like intersections and merges, tokenize genomic regions against a reference universe for machine learning pipelines, and preprocess ATAC-seq or ChIP-seq peak files before feeding them into downstream analysis tools.

Repository: SciAgent-Skills

Install in Claude Code

git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/gtars && cp -r /tmp/gtars/skills/genomics-bioinformatics/interval-ops/gtars ~/.claude/skills/gtars

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# GTARS: Fast Genomic Token Arithmetic and BED File Processing

## Overview

GTARS is a Python library with a Rust-backed core for high-performance genomic interval operations. It provides BED file I/O, set-theoretic interval operations (intersection, union, merge, complement, subtract), genomic region tokenization against a reference universe, and utilities for building consensus universe BED files. GTARS is designed for workflows that process hundreds to thousands of BED files efficiently, serving as a preprocessing engine for ML pipelines (including geniml) and general bioinformatics pipelines.

## When to Use

- Read and write large BED files efficiently, using Rust-backed parsing that is faster than pure-Python alternatives
- Compute genomic interval intersections, merges, complements, or subtractions between BED file pairs or sets
- Tokenize a collection of genomic regions against a fixed universe vocabulary for ML input preparation
- Build consensus universe BED files from a collection of sample BED files
- Count overlap statistics between two BED files without launching bedtools processes
- Preprocess ATAC-seq, ChIP-seq, or ENCODE peak files before feeding into geniml or other ML tools
- Look elsewhere for alignment-level work: for full BED/BAM/SAM reading with CIGAR-level detail, use `pysam-genomic-files` instead

## Prerequisites

- **Python packages**: `gtars`, `numpy` (optional, for array conversion)
- **Data requirements**: BED format files (3+ columns: chr, start, end); optionally a universe BED file for tokenization
- **Environment**: Python 3.8+; Rust toolchain not required (pre-built wheels available on PyPI)

```bash
pip install gtars
```
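
The data requirement above comes down to text lines whose first three fields are chromosome, start, and end. As an illustration that does not use GTARS at all, the sketch below parses BED3 lines in pure Python. The helper name `parse_bed3` is hypothetical, and the sketch assumes the standard BED convention of 0-based, half-open coordinates.

```python
from typing import Iterator, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), 0-based half-open


def parse_bed3(lines: List[str]) -> Iterator[Interval]:
    """Yield (chrom, start, end) from BED lines, ignoring any extra columns."""
    for line in lines:
        line = line.strip()
        # Skip blank lines and the comment/header lines some BED files carry.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if len(fields) < 3:
            raise ValueError(f"expected at least 3 columns, got: {line!r}")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if start < 0 or end < start:
            raise ValueError(f"invalid coordinates: {line!r}")
        yield chrom, start, end


example = ["# comment", "chr1\t100\t500\tpeak_1", "chr2\t300\t800"]
parsed = list(parse_bed3(example))
print(parsed)
```

A check like this can be useful for validating input files before handing them to a faster parser.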

## Quick Start

```python
from gtars import GenomicIntervalSet

# Load a BED file and inspect
gis = GenomicIntervalSet("peaks.bed")
print(f"Loaded {len(gis)} intervals")
print(f"First interval: {gis[0]}")
# Example output: First interval: chr1:1000-2000

# Intersect two BED files
gis2 = GenomicIntervalSet("other_peaks.bed")
overlap = gis.intersect(gis2)
print(f"Overlapping intervals: {len(overlap)}")
```

## Core API

### Module 1: GenomicIntervalSet — BED File I/O

Primary data structure for loading, indexing, and writing genomic intervals.

```python
from gtars import GenomicIntervalSet

# Load from file
gis = GenomicIntervalSet("peaks.bed")
print(f"Intervals loaded: {len(gis)}")
print(f"Chromosomes present: {gis.chromosomes}")

# Access by index
interval = gis[0]
print(f"chr={interval.chr}, start={interval.start}, end={interval.end}")

# Iterate all intervals
for iv in gis:
    width = iv.end - iv.start
    if width > 5000:
        print(f"Wide interval: {iv.chr}:{iv.start}-{iv.end} ({width} bp)")
        break
```

```python
from gtars import GenomicIntervalSet, GenomicInterval

# Create from a list of GenomicInterval objects
intervals = [
    GenomicInterval("chr1", 100, 500),
    GenomicInterval("chr1", 600, 1200),
    GenomicInterval("chr2", 300, 800),
]
gis = GenomicIntervalSet(intervals)
print(f"Created GenomicIntervalSet with {len(gis)} intervals")

# Write to BED file
gis.to_bed("output_intervals.bed")
print("Saved output_intervals.bed")
```

### Module 2: Interval Set Operations

Compute intersections, unions, merges, subtractions, and complements between BED sets.

```python
from gtars import GenomicIntervalSet

peaks_a = GenomicIntervalSet("condition_A.bed")
peaks_b = GenomicIntervalSet("condition_B.bed")

# Intersection: intervals present in both sets
shared = peaks_a.intersect(peaks_b)
print(f"Shared intervals: {len(shared)}")

# Subtraction: intervals in A not overlapping B
a_only = peaks_a.subtract(peaks_b)
print(f"A-specific intervals: {len(a_only)}")

# Union: combine both sets, then merge overlapping intervals
combined = peaks_a.union(peaks_b)
print(f"Union intervals: {len(combined)}")
```
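
To make the semantics of these operations concrete, here is a pure-Python reference sketch for a single chromosome. It follows one common, bedtools-like definition in which intersection returns the overlapping portions and subtraction removes the covered portions. Whether GTARS returns clipped portions or whole intervals is not stated here, so treat that as an assumption. The function names `intersect_intervals` and `subtract_intervals` are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end), 0-based half-open, one chromosome


def intersect_intervals(a: List[Span], b: List[Span]) -> List[Span]:
    """Return the overlapping portion of every (a, b) pair (quadratic time)."""
    out = []
    for a_start, a_end in a:
        for b_start, b_end in b:
            start, end = max(a_start, b_start), min(a_end, b_end)
            if start < end:
                out.append((start, end))
    return sorted(out)


def subtract_intervals(a: List[Span], b: List[Span]) -> List[Span]:
    """Return the portions of each interval in a not covered by any interval in b."""
    out = []
    b_sorted = sorted(b)
    for a_start, a_end in a:
        cursor = a_start  # left edge of the part of a not yet handled
        for b_start, b_end in b_sorted:
            if b_end <= cursor or b_start >= a_end:
                continue  # no overlap with the remaining part of a
            if b_start > cursor:
                out.append((cursor, b_start))
            cursor = max(cursor, b_end)
            if cursor >= a_end:
                break
        if cursor < a_end:
            out.append((cursor, a_end))
    return out


peaks_a = [(100, 500), (600, 1200)]
peaks_b = [(400, 700)]
print(intersect_intervals(peaks_a, peaks_b))
print(subtract_intervals(peaks_a, peaks_b))
```

The quadratic loop keeps the definition easy to read; a sorted sweep or an interval tree would be the usual choice for real data.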

```python
from gtars import GenomicIntervalSet

# Merge overlapping/adjacent intervals within a single set
gis = GenomicIntervalSet("fragmented_peaks.bed")
print(f"Before merge: {len(gis)} intervals")

merged = gis.merge()
print(f"After merge:  {len(merged)} intervals")

# Complement: genome-wide intervals NOT covered by peaks
# Requires chromosome sizes (the values below are GRCh38/hg38 lengths)
chrom_sizes = {"chr1": 248956422, "chr2": 242193529, "chrX": 156040895}
complement = gis.complement(chrom_sizes)
print(f"Complement (uncovered) intervals: {len(complement)}")
```
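
The merge and complement steps can be sketched in pure Python as follows. This is a reference illustration under stated assumptions, not the GTARS implementation: coordinates are 0-based half-open, adjacent (book-ended) intervals are merged along with overlapping ones, and the complement only covers chromosomes listed in the sizes dictionary. The names `merge_intervals` and `complement_intervals` are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), 0-based half-open


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or adjacent intervals on each chromosome."""
    merged: List[Interval] = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


def complement_intervals(
    intervals: List[Interval], chrom_sizes: Dict[str, int]
) -> List[Interval]:
    """Return the gaps not covered by intervals, per listed chromosome."""
    merged = merge_intervals(intervals)
    gaps: List[Interval] = []
    for chrom, size in chrom_sizes.items():
        cursor = 0  # first position not yet known to be covered
        for c, start, end in merged:
            if c != chrom:
                continue
            if start > cursor:
                gaps.append((chrom, cursor, start))
            cursor = max(cursor, end)
        if cursor < size:
            gaps.append((chrom, cursor, size))
    return gaps


fragments = [("chr1", 100, 500), ("chr1", 400, 900), ("chr1", 900, 1000), ("chr2", 10, 20)]
print(merge_intervals(fragments))
print(complement_intervals([("chr1", 100, 500)], {"chr1": 1000, "chr2": 50}))
```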

### Module 3: Tokenization

Convert genomic intervals to integer token IDs against a reference universe vocabulary.

```python
from gtars import GenomicIntervalSet, Tokenizer

# Initialize tokenizer with a universe BED file
tokenizer = Tokenizer("universe.bed")
print(f"Universe vocabulary size: {len(tokenizer)}")

# Tokenize a single BED file → list of token IDs
gis = GenomicIntervalSet("sample_peaks.bed")
tokens = tokenizer.tokenize(gis)
print(f"Token IDs: {tokens[:10]} ...")
print(f"Total tokens: {len(tokens)}")
```
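
Conceptually, tokenization maps each query interval to the universe regions it overlaps. The pure-Python sketch below illustrates that idea under several assumptions: the token ID is the region's position in the universe list, universe regions do not overlap each other, a query that spans several universe regions emits one token per region, and a query that overlaps nothing emits no token. A real tokenizer may assign IDs or handle unmatched regions differently. The names `build_index` and `tokenize_regions` are hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), 0-based half-open
ChromIndex = Tuple[List[int], List[int], List[int]]  # starts, ends, token IDs


def build_index(universe: List[Interval]) -> Dict[str, ChromIndex]:
    """Group universe regions by chromosome, sorted by start position."""
    by_chrom: Dict[str, List[Tuple[int, int, int]]] = {}
    for token_id, (chrom, start, end) in enumerate(universe):
        by_chrom.setdefault(chrom, []).append((start, end, token_id))
    index: Dict[str, ChromIndex] = {}
    for chrom, regions in by_chrom.items():
        regions.sort()
        index[chrom] = (
            [r[0] for r in regions],
            [r[1] for r in regions],
            [r[2] for r in regions],
        )
    return index


def tokenize_regions(query: List[Interval], index: Dict[str, ChromIndex]) -> List[int]:
    """Return the token ID of every universe region overlapped by each query."""
    tokens: List[int] = []
    for chrom, q_start, q_end in query:
        if chrom not in index:
            continue
        starts, ends, ids = index[chrom]
        # Non-overlapping regions sorted by start also have sorted ends,
        # so both boundaries can be found by binary search.
        lo = bisect_right(ends, q_start)  # first region ending after q_start
        hi = bisect_left(starts, q_end)   # first region starting at or after q_end
        tokens.extend(ids[lo:hi])
    return tokens


universe = [("chr1", 0, 100), ("chr1", 200, 300), ("chr1", 400, 500), ("chr2", 0, 50)]
query = [("chr1", 250, 450), ("chr2", 60, 70), ("chr3", 0, 10)]
tokens = tokenize_regions(query, build_index(universe))
print(tokens)
```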

```python
import numpy as np
from gtars import GenomicIntervalSet, Tokenizer

tokenizer = Tokenizer("universe.bed")

# Tokenize and convert to a NumPy array for ML
gis = GenomicIntervalSet("sample_peaks.bed")
tokens = tokenizer.tokenize(gis)
token_array = np.array(tokens)
print(f"Token array shape: {token_array.shape}, dtype: {token_array.dtype}")

# Build a binary presence/absence vector over the full universe
# (assumes token IDs are integers in [0, vocab_size); repeated IDs collapse to a single 1.0)
vocab_size = len(tokenizer)
presence_vector = np.zeros(vocab_size, dtype=np.float32)
presence_vector[token_array] = 1.0
print(f"Presence vector shape: {presence_vector.shape}")
print(f"Fraction of universe covered: {presence_vector.mean():.4f}")
```
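
Because every sample is tokenized against the same universe, token lists from different samples are directly comparable. As a small illustration with made-up token IDs (no GTARS call involved), the sketch below computes the Jaccard similarity of two token lists, which equals the Jaccard similarity of their presence vectors. The function name `jaccard_similarity` is hypothetical.

```python
from typing import Iterable


def jaccard_similarity(tokens_a: Iterable[int], tokens_b: Iterable[int]) -> float:
    """Jaccard similarity of two token collections; duplicates are ignored."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty samples: defined as 0.0 in this sketch
    return len(set_a & set_b) / len(union)


# Hypothetical token IDs for two samples tokenized against one universe.
sample_a = [0, 3, 3, 7, 12]
sample_b = [3, 7, 8]
similarity = jaccard_similarity(sample_a, sample_b)
print(f"Jaccard similarity: {similarity:.2f}")
```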

### Module 4: Universe Building

Construct a consensus non-overlapping universe from a collection of BED files.

```python
from gtars import UniverseBuilder

# Build universe from multiple BED files
bed_files = ["sample_1.bed", "sample_2.bed", "sample_3.bed", "sample_4.bed"]
builder = UniverseBuilder()
universe = builder.build(bed_files)

print(f"Universe regions: {len(universe)}")
universe.to_bed("consensus_universe.bed")
print("Saved consensus_universe.bed")
```
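
One simple way to define a consensus universe is by coverage: keep every stretch of the genome covered by at least a minimum number of samples. The sketch below implements that definition in pure Python with a sweep over interval boundaries. The method `UniverseBuilder` actually uses is not described here, so this is only an illustration of the general idea. The name `coverage_universe` and the `min_samples` parameter are hypothetical, and each sample is assumed to contain no self-overlapping intervals.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), 0-based half-open


def coverage_universe(
    samples: List[List[Interval]], min_samples: int = 2
) -> List[Interval]:
    """Return regions covered by at least min_samples samples."""
    events: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for sample in samples:
        for chrom, start, end in sample:
            events[chrom].append((start, 1))   # coverage rises at start
            events[chrom].append((end, -1))    # coverage falls at end
    universe: List[Interval] = []
    for chrom in sorted(events):
        depth = 0
        region_start = None
        # At equal positions, -1 sorts before +1, which matches half-open ends.
        for pos, delta in sorted(events[chrom]):
            depth += delta
            if depth >= min_samples and region_start is None:
                region_start = pos
            elif depth < min_samples and region_start is not None:
                universe.append((chrom, region_start, pos))
                region_start = None
    return universe


samples = [
    [("chr1", 100, 500)],
    [("chr1", 300, 700)],
    [("chr1", 400, 600), ("chr2", 0, 10)],
]
consensus = coverage_universe(samples, min_samples=2)
print(consensus)
```

Raising `min_samples` yields a smaller, more conservative universe; setting it to 1 keeps every covered position.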

```python
from gtars impo