Skill · 199 repo stars · updated 16d ago

geniml

Python library for genomic interval ML. Train and apply region2vec embeddings that turn BED regions into vectors, index interval datasets for ML, search the embedding space with BEDSpace, and evaluate embedding quality. Use it for chromatin accessibility clustering, regulatory element classification, and cross-sample region comparison.

Install in Claude Code
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/geniml && cp -r /tmp/geniml/skills/genomics-bioinformatics/interval-ops/geniml ~/.claude/skills/geniml
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# Geniml: Genomic Interval Machine Learning

## Overview

Geniml is a Python library that bridges genomic interval biology and machine learning. It provides region2vec for learning dense vector representations of genomic regions from BED files, BEDSpace for nearest-neighbor search in embedding space, dataset classes for ML-ready genomic interval loading, and evaluation utilities for embedding quality. Geniml is designed for researchers who want to apply modern ML techniques to chromatin accessibility, histone modification, or other region-based genomic data.

## When to Use

- Learn dense embeddings of genomic regions from a collection of BED files to enable ML-based analysis (region2vec)
- Cluster chromatin accessibility peaks or histone modification sites by embedding similarity
- Search for genomic regions similar to a query region using approximate nearest-neighbor search (BEDSpace)
- Build training datasets for ML models from BED-format genomic intervals with a PyTorch-compatible interface
- Compare embedding quality across training runs or datasets using quantitative metrics
- Integrate genomic region representations into custom neural network architectures
- For basic BED file parsing and set operations without ML, use `gtars` or `pysam-genomic-files` instead

## Prerequisites

- **Python packages**: `geniml`, `torch`, `numpy`, `pandas`, `anndata`
- **Data requirements**: BED files (minimum 3 columns: chr, start, end); optionally a pre-built universe file
- **Environment**: Python 3.8+; GPU optional but recommended for region2vec training on large datasets

```bash
pip install geniml torch numpy pandas anndata
```
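
Before handing files to any of the modules below, it can help to confirm that they meet the three-column minimum stated above. The following is a minimal standard-library sketch, not part of geniml; `read_bed3` is a hypothetical helper name, and the choice to skip `#`, `track`, and `browser` lines is an assumption about what the input may contain.

```python
import io

def read_bed3(handle):
    """Yield (chrom, start, end) tuples from a BED-formatted text stream."""
    for line_no, line in enumerate(handle, start=1):
        line = line.strip()
        # Skip blank lines and the comment/header lines some tools emit
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if len(fields) < 3:
            raise ValueError(f"line {line_no}: expected >= 3 columns, got {len(fields)}")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if start < 0 or end < start:
            raise ValueError(f"line {line_no}: invalid interval {start}-{end}")
        yield chrom, start, end

# In-memory example; pass open("peaks.bed") for a real file
example = io.StringIO("chr1\t100000\t101000\tpeak_1\nchr2\t50000\t51200\n")
regions = list(read_bed3(example))
print(regions)
```

Extra columns (name, score, strand) are ignored, so the same check works for BED6 or narrowPeak-style files.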

## Quick Start

```python
from geniml.region2vec import Region2VecExModel
from geniml.io import RegionSet

# Load a collection of BED files and train region2vec embeddings
region_sets = [RegionSet("sample1.bed"), RegionSet("sample2.bed"), RegionSet("sample3.bed")]
model = Region2VecExModel("path/to/universe.bed")
model.train(region_sets, epochs=10, batch_size=32)

# Get embedding for a specific region
embedding = model.encode("chr1", 1000000, 1500000)
print(f"Embedding shape: {embedding.shape}")
# Embedding shape: (100,)
```

## Core API

### Module 1: RegionSet — Genomic Interval I/O

Load BED files into geniml's primary data structure for downstream operations.

```python
from geniml.io import RegionSet

# Load a BED file
rs = RegionSet("peaks.bed")
print(f"Loaded {len(rs)} regions")
print(f"First region: {rs[0]}")          # Region object: chr, start, end
print(f"Chromosomes: {set(r.chr for r in rs)}")

# Convert to list of Region objects
regions = list(rs)
for r in regions[:3]:
    print(f"  {r.chr}:{r.start}-{r.end}")
```

```python
from geniml.io import RegionSet

# Create RegionSet from a list of (chr, start, end) tuples
regions_data = [
    ("chr1", 100000, 101000),
    ("chr1", 200000, 201500),
    ("chr2",  50000,  51200),
]
rs = RegionSet(regions_data)
print(f"RegionSet with {len(rs)} regions from list")

# Access by index
r = rs[0]
print(f"chr={r.chr}, start={r.start}, end={r.end}, width={r.end - r.start}")
```

### Module 2: Universe Building

A universe defines the set of consensus regions used as the vocabulary for region2vec. Build it from a collection of BED files.

```python
from geniml.universe import UniverseBuilder
from geniml.io import RegionSet

# Collect BED files representing diverse samples
bed_files = ["sample1.bed", "sample2.bed", "sample3.bed", "sample4.bed"]
region_sets = [RegionSet(f) for f in bed_files]

# Build universe (consensus non-overlapping regions)
builder = UniverseBuilder()
universe = builder.build(region_sets)

# Save universe to BED file
universe.to_bed("universe.bed")
print(f"Universe size: {len(universe)} consensus regions")
```

```python
from geniml.universe import UniverseBuilder
from geniml.io import RegionSet

# Build universe with custom parameters
builder = UniverseBuilder(
    fraction=0.5,    # Region must appear in >= 50% of samples to be included
    merge_dist=0,    # Merge adjacent regions within this distance (bp)
)
bed_files = [f"sample_{i}.bed" for i in range(1, 11)]
region_sets = [RegionSet(f) for f in bed_files]
universe = builder.build(region_sets)
print(f"Filtered universe: {len(universe)} regions (fraction >= 0.5)")
```
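
To make the `fraction` idea concrete, here is a toy coverage sweep in plain Python. It is an illustration of the concept only and is not claimed to match what geniml does internally: `coverage_universe` is a hypothetical name, coordinates are assumed half-open, and intervals within one sample are assumed not to overlap each other.

```python
from collections import defaultdict

def coverage_universe(samples, fraction=0.5):
    """Return merged intervals covered by at least `fraction` of the samples.

    `samples` is a list of samples, each a list of (chrom, start, end) tuples.
    """
    min_support = fraction * len(samples)
    # Net change in coverage at each position, per chromosome
    deltas = defaultdict(lambda: defaultdict(int))
    for sample in samples:
        for chrom, start, end in sample:
            deltas[chrom][start] += 1
            deltas[chrom][end] -= 1

    universe = []
    for chrom in sorted(deltas):
        depth = 0
        region_start = None
        for pos in sorted(deltas[chrom]):
            depth += deltas[chrom][pos]
            if depth >= min_support and region_start is None:
                region_start = pos          # coverage rose to the threshold
            elif depth < min_support and region_start is not None:
                universe.append((chrom, region_start, pos))  # coverage fell below it
                region_start = None
    return universe

samples = [
    [("chr1", 100, 200), ("chr1", 500, 600)],
    [("chr1", 150, 250)],
    [("chr1", 180, 300), ("chr1", 550, 650)],
    [("chr2", 10, 50)],
]
consensus = coverage_universe(samples, fraction=0.5)
print(consensus)
```

With four samples and `fraction=0.5`, a position needs support from at least two samples, so the lone `chr2` interval is dropped.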

### Module 3: Region2Vec — Training Embeddings

Train word2vec-style embeddings on genomic regions, treating each BED file as a "document" and each region as a "word."

```python
from geniml.region2vec import Region2VecExModel
from geniml.io import RegionSet

# Initialize model with a pre-built universe
model = Region2VecExModel(universe="universe.bed", embedding_dim=100)

# Load training data (collection of BED files = corpus)
bed_files = [f"atac_{i}.bed" for i in range(1, 51)]
region_sets = [RegionSet(f) for f in bed_files]

# Train
model.train(
    region_sets,
    epochs=20,
    batch_size=64,
    window_size=5,
    min_count=1,
)
print("Training complete")

# Save trained model
model.save("region2vec_model/")
print("Model saved to region2vec_model/")
```
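
The "document of words" analogy depends on first mapping each BED region onto universe regions, which act as the vocabulary. The sketch below shows one way that mapping could look using only the standard library; it is an assumption-laden illustration rather than geniml's tokenizer. `index_universe` and `tokenize` are hypothetical names, and the binary search relies on the universe being non-overlapping, as described in the universe section.

```python
import bisect

def index_universe(universe):
    """Group universe regions by chromosome as sorted (starts, ends, ids) lists."""
    grouped = {}
    for token_id, (chrom, start, end) in enumerate(universe):
        grouped.setdefault(chrom, []).append((start, end, token_id))
    index = {}
    for chrom, entries in grouped.items():
        entries.sort()
        index[chrom] = (
            [s for s, _, _ in entries],
            [e for _, e, _ in entries],
            [t for _, _, t in entries],
        )
    return index

def tokenize(regions, index):
    """Map each region to the IDs of the universe regions it overlaps."""
    tokens = []
    for chrom, start, end in regions:
        if chrom not in index:
            continue  # region lies outside the vocabulary
        starts, ends, ids = index[chrom]
        lo = bisect.bisect_right(ends, start)   # first universe region ending after `start`
        hi = bisect.bisect_left(starts, end)    # first universe region starting at or after `end`
        tokens.extend(ids[lo:hi])
    return tokens

universe = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 0, 50)]
index = index_universe(universe)
sample_a = [("chr1", 150, 350), ("chr2", 10, 20)]
sample_b = [("chr1", 200, 300), ("chr3", 5, 10)]
corpus = [tokenize(sample_a, index), tokenize(sample_b, index)]
print(corpus)
```

Each inner list plays the role of one "document"; a word2vec-style trainer would consume these token sequences.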

```python
from geniml.io import RegionSet
from geniml.region2vec import Region2VecExModel

# Load a pre-trained model
model = Region2VecExModel.load("region2vec_model/")

# Encode a single genomic region
embedding = model.encode("chr1", 1_000_000, 1_500_000)
print(f"Single region embedding shape: {embedding.shape}")
# Single region embedding shape: (100,)

# Encode an entire BED file → matrix of region embeddings
rs = RegionSet("query_peaks.bed")
embeddings = model.encode_region_set(rs)
print(f"BED file embeddings shape: {embeddings.shape}")
# BED file embeddings shape: (N_regions, 100)
```
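
Once a BED file has been encoded to an `(N_regions, dim)` matrix, one simple way to compare samples is to average the region vectors into a single vector per file and take the cosine similarity. The sketch below uses toy 3-dimensional vectors and plain Python lists so it runs without geniml; `mean_pool` and `cosine_similarity` are hypothetical helper names, and mean pooling is an assumed choice, not the only option.

```python
import math

def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy stand-ins for per-region embedding matrices of three BED files
sample_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
sample_b = [[1.0, 1.0, 0.0]]
sample_c = [[0.0, 0.0, 2.0]]

vec_a, vec_b, vec_c = (mean_pool(s) for s in (sample_a, sample_b, sample_c))
sim_ab = cosine_similarity(vec_a, vec_b)
sim_ac = cosine_similarity(vec_a, vec_c)
print(f"a vs b: {sim_ab:.3f}, a vs c: {sim_ac:.3f}")
```

In practice the same two functions could be applied to the rows of a NumPy array; lists are used here only to keep the example dependency-free.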

### Module 4: BEDSpace — Embedding Nearest-Neighbor Search

Index a corpus of BED file embeddings for fast similarity search.
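
As a conceptual reference for what such an index returns, the following sketch performs exact (brute-force) cosine search over a small dictionary of file-level vectors. It is not BEDSpace and does not use an approximate index; `build_search_index` and `search` are hypothetical names, and the file labels and 2-dimensional vectors are made up for illustration.

```python
import heapq
import math

def normalize(vector):
    """Scale a vector to unit length; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)

def build_search_index(embeddings):
    """Pre-normalize every vector so a dot product equals cosine similarity."""
    return {label: normalize(vec) for label, vec in embeddings.items()}

def search(index, query, k=3):
    """Return the k (similarity, label) pairs most similar to `query`."""
    q = normalize(query)
    scored = (
        (sum(a * b for a, b in zip(q, vec)), label)
        for label, vec in index.items()
    )
    return heapq.nlargest(k, scored)

# Made-up file-level embeddings
embeddings = {
    "liver.bed": [1.0, 0.0],
    "kidney.bed": [0.9, 0.1],
    "brain.bed": [0.0, 1.0],
}
search_index = build_search_index(embeddings)
hits = search(search_index, [1.0, 0.05], k=2)
for score, label in hits:
    print(f"{label}: {score:.3f}")
```

Brute-force search scales linearly with corpus size, which is the cost that approximate nearest-neighbor indexes are meant to reduce.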

```python
from geniml.bedspace import BEDSpace
from geniml.region2vec import Region2VecExModel
from geniml.io import RegionSet

# Build a BEDSpace index from a set of BED files
model = Region2VecExModel.load("region2ve