Skill284 repo starsupdated 4d ago

arboreto-grn-inference

This Claude Code skill performs gene regulatory network inference from expression matrices using GRNBoost2 (gradient boosting) or GENIE3 (Random Forest algorithms). Use it when inferring transcription factor-to-target relationships from bulk or single-cell RNA-seq data, generating the regulatory adjacency matrix for pySCENIC pipeline workflows, or comparing network structure across experimental conditions. The skill handles parallelized computation via Dask, scaling from standard computers to high-performance clusters.

View source Repository: SciAgent-Skills

Install in Claude Code

Copy

git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/arboreto-grn-inference && cp -r /tmp/arboreto-grn-inference/skills/genomics-bioinformatics/arboreto-grn-inference ~/.claude/skills/arboreto-grn-inference

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Arboreto GRN Inference

## Overview

Arboreto infers gene regulatory networks (GRNs) from gene expression data using parallelized tree-based regression. For each target gene, it trains a regression model with all other genes (or a specified TF list) as features and emits TF-target-importance triplets. It provides two interchangeable algorithms -- GRNBoost2 (gradient boosting, fast) and GENIE3 (Random Forest, classic) -- sharing identical input/output formats. Computation is Dask-parallelized, scaling from laptop cores to HPC clusters.

## When to Use

- Inferring transcription factor-to-target gene regulatory relationships from bulk RNA-seq expression data
- Building gene regulatory networks from single-cell RNA-seq count matrices (cells as rows, genes as columns)
- Generating the adjacency matrix (Step 1) of the pySCENIC regulatory analysis pipeline
- Comparing regulatory network structure across experimental conditions (e.g., control vs treatment)
- Producing consensus regulatory networks by running inference across multiple random seeds
- Validating GRN results by comparing GRNBoost2 and GENIE3 outputs on the same dataset
- For downstream regulon identification and activity scoring, use arboreto output with pySCENIC
- For single-cell preprocessing (QC, normalization, clustering) before GRN inference, use **scanpy-scrna-seq**

## Prerequisites

- **Python packages**: `arboreto`, `pandas`, `numpy`, `dask`, `distributed`, `scikit-learn`, `scipy`
- **Data requirements**: Gene expression matrix (genes as columns, observations as rows) in TSV/CSV; optionally a TF list file (one gene name per line)
- **Environment**: Python 3.8+; optional `networkx`, `matplotlib` for visualization

```bash
pip install arboreto distributed networkx matplotlib
```

## Quick Start

Complete GRN inference in a single block. The `if __name__ == '__main__':` guard is required because Dask spawns worker processes via multiprocessing.

```python
import pandas as pd
from arboreto.algo import grnboost2
from arboreto.utils import load_tf_names

if __name__ == '__main__':
    # Load expression matrix (observations x genes)
    expression_matrix = pd.read_csv('expression_data.tsv', sep='\t')
    tf_names = load_tf_names('tf_list.txt')  # optional TF filter

    # Infer GRN (uses all local cores by default)
    network = grnboost2(expression_data=expression_matrix,
                        tf_names=tf_names, seed=777)

    # Filter top links and save
    top_network = network[network['importance'] > 1.0]
    top_network.to_csv('grn_output.tsv', sep='\t', index=False, header=False)
    print(f"Inferred {len(network)} links, kept {len(top_network)} above threshold")
    # Example: Inferred 185432 links, kept 12876 above threshold
```

## Workflow

### Step 1: Load Expression Data

Arboreto accepts a pandas DataFrame (recommended) or NumPy array. Rows are observations (cells or samples), columns are genes. Gene names must be column headers.

```python
import pandas as pd

# From TSV (genes as columns, observations as rows)
expression_matrix = pd.read_csv('expression_data.tsv', sep='\t')
print(f"Shape: {expression_matrix.shape}  "
      f"({expression_matrix.shape[0]} observations x {expression_matrix.shape[1]} genes)")
# Example: Shape: (5000, 18654)  (5000 observations x 18654 genes)

# From AnnData (e.g., after scanpy preprocessing)
import anndata as ad
adata = ad.read_h5ad('preprocessed.h5ad')
expression_matrix = pd.DataFrame(
    adata.X.toarray() if hasattr(adata.X, 'toarray') else adata.X,
    columns=adata.var_names.tolist()
)
print(f"Converted AnnData: {expression_matrix.shape}")
# Example: Converted AnnData: (5000, 18654)
```

### Step 2: Load Transcription Factor List

Providing a TF list restricts regulators to known transcription factors, reducing computation time and improving biological relevance. If omitted, all genes are treated as potential regulators.

```python
from arboreto.utils import load_tf_names

# From file (one TF name per line)
tf_names = load_tf_names('human_tfs.txt')
print(f"Loaded {len(tf_names)} transcription factors")
# Example: Loaded 1639 transcription factors

# Or define directly
tf_names = ['MYC', 'TP53', 'SOX2', 'NANOG', 'POU5F1']

# Verify TFs exist in expression matrix columns
tf_in_data = [tf for tf in tf_names if tf in expression_matrix.columns]
print(f"TFs found in expression data: {len(tf_in_data)}/{len(tf_names)}")
# Example: TFs found in expression data: 1583/1639
```

### Step 3: Configure Dask Client (Optional)

By default arboreto creates an internal Dask client using all local cores. Create an explicit client for resource control, monitoring, or cluster deployment.

```python
from distributed import LocalCluster, Client

# Custom local client with resource limits
local_cluster = LocalCluster(
    n_workers=8,
    threads_per_worker=1,   # avoid GIL contention in scikit-learn
    memory_limit='4GB'       # per worker
)
client = Client(local_cluster)
print(f"Dashboard: {client.dashboard_link}")
# Example: Dashboard: http://127.0.0.1:8787/status
```

### Step 4: Run GRN Inference

Call `grnboost2()` (recommended) or `genie3()`. Both share the same signature and output format.

```python
from arboreto.algo import grnboost2

if __name__ == '__main__':
    network = grnboost2(
        expression_data=expression_matrix,
        tf_names=tf_names,         # 'all' if no TF list
        client_or_address=client,  # omit to use default local scheduler
        seed=777,                  # for reproducibility
        verbose=True               # print progress
    )
    print(f"Inferred {len(network)} regulatory links")
    print(network.head())
    # Example:
    #       TF  target  importance
    # 0   MYC   CDK4       3.214
    # 1   MYC   CCND1      2.871
    # 2  TP53  CDKN1A      2.654
```

### Step 5: Filter Results

Raw output contains links for every TF-target pair with non-zero importance. Filter to retain high-confidence regulatory interactions.

```python
# Strategy 1: Impo