arboreto-grn-inference
GRN inference from expression via GRNBoost2 (gradient boosting) or GENIE3 (Random Forest). Load matrix, filter by TFs, infer TF-target-importance links, save network. Dask-parallelized to single-cell scale. Core SCENIC component.
git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/arboreto-grn-inference && cp -r /tmp/arboreto-grn-inference/skills/genomics-bioinformatics/arboreto-grn-inference ~/.claude/skills/arboreto-grn-inferenceSKILL.md
# Arboreto GRN Inference
## Overview
Arboreto infers gene regulatory networks (GRNs) from gene expression data using parallelized tree-based regression. For each target gene, it trains a regression model with all other genes (or a specified TF list) as features and emits TF-target-importance triplets. It provides two interchangeable algorithms -- GRNBoost2 (gradient boosting, fast) and GENIE3 (Random Forest, classic) -- sharing identical input/output formats. Computation is Dask-parallelized, scaling from laptop cores to HPC clusters.
## When to Use
- Inferring transcription factor-to-target gene regulatory relationships from bulk RNA-seq expression data
- Building gene regulatory networks from single-cell RNA-seq count matrices (cells as rows, genes as columns)
- Generating the adjacency matrix (Step 1) of the pySCENIC regulatory analysis pipeline
- Comparing regulatory network structure across experimental conditions (e.g., control vs treatment)
- Producing consensus regulatory networks by running inference across multiple random seeds
- Validating GRN results by comparing GRNBoost2 and GENIE3 outputs on the same dataset
- For downstream regulon identification and activity scoring, use arboreto output with pySCENIC
- For single-cell preprocessing (QC, normalization, clustering) before GRN inference, use **scanpy-scrna-seq**
## Prerequisites
- **Python packages**: `arboreto`, `pandas`, `numpy`, `dask`, `distributed`, `scikit-learn`, `scipy`
- **Data requirements**: Gene expression matrix (genes as columns, observations as rows) in TSV/CSV; optionally a TF list file (one gene name per line)
- **Environment**: Python 3.8+; optional `networkx`, `matplotlib` for visualization
```bash
pip install arboreto distributed networkx matplotlib
```
## Quick Start
Complete GRN inference in a single block. The `if __name__ == '__main__':` guard is required because Dask spawns worker processes via multiprocessing.
```python
import pandas as pd
from arboreto.algo import grnboost2
from arboreto.utils import load_tf_names
if __name__ == '__main__':
# Load expression matrix (observations x genes)
expression_matrix = pd.read_csv('expression_data.tsv', sep='\t')
tf_names = load_tf_names('tf_list.txt') # optional TF filter
# Infer GRN (uses all local cores by default)
network = grnboost2(expression_data=expression_matrix,
tf_names=tf_names, seed=777)
# Filter top links and save
top_network = network[network['importance'] > 1.0]
top_network.to_csv('grn_output.tsv', sep='\t', index=False, header=False)
print(f"Inferred {len(network)} links, kept {len(top_network)} above threshold")
# Example: Inferred 185432 links, kept 12876 above threshold
```
## Workflow
### Step 1: Load Expression Data
Arboreto accepts a pandas DataFrame (recommended) or NumPy array. Rows are observations (cells or samples), columns are genes. Gene names must be column headers.
```python
import pandas as pd
# From TSV (genes as columns, observations as rows)
expression_matrix = pd.read_csv('expression_data.tsv', sep='\t')
print(f"Shape: {expression_matrix.shape} "
f"({expression_matrix.shape[0]} observations x {expression_matrix.shape[1]} genes)")
# Example: Shape: (5000, 18654) (5000 observations x 18654 genes)
# From AnnData (e.g., after scanpy preprocessing)
import anndata as ad
adata = ad.read_h5ad('preprocessed.h5ad')
expression_matrix = pd.DataFrame(
adata.X.toarray() if hasattr(adata.X, 'toarray') else adata.X,
columns=adata.var_names.tolist()
)
print(f"Converted AnnData: {expression_matrix.shape}")
# Example: Converted AnnData: (5000, 18654)
```
### Step 2: Load Transcription Factor List
Providing a TF list restricts regulators to known transcription factors, reducing computation time and improving biological relevance. If omitted, all genes are treated as potential regulators.
```python
from arboreto.utils import load_tf_names
# From file (one TF name per line)
tf_names = load_tf_names('human_tfs.txt')
print(f"Loaded {len(tf_names)} transcription factors")
# Example: Loaded 1639 transcription factors
# Or define directly
tf_names = ['MYC', 'TP53', 'SOX2', 'NANOG', 'POU5F1']
# Verify TFs exist in expression matrix columns
tf_in_data = [tf for tf in tf_names if tf in expression_matrix.columns]
print(f"TFs found in expression data: {len(tf_in_data)}/{len(tf_names)}")
# Example: TFs found in expression data: 1583/1639
```
### Step 3: Configure Dask Client (Optional)
By default arboreto creates an internal Dask client using all local cores. Create an explicit client for resource control, monitoring, or cluster deployment.
```python
from distributed import LocalCluster, Client
# Custom local client with resource limits
local_cluster = LocalCluster(
n_workers=8,
threads_per_worker=1, # avoid GIL contention in scikit-learn
memory_limit='4GB' # per worker
)
client = Client(local_cluster)
print(f"Dashboard: {client.dashboard_link}")
# Example: Dashboard: http://127.0.0.1:8787/status
```
### Step 4: Run GRN Inference
Call `grnboost2()` (recommended) or `genie3()`. Both share the same signature and output format.
```python
from arboreto.algo import grnboost2
if __name__ == '__main__':
network = grnboost2(
expression_data=expression_matrix,
tf_names=tf_names, # 'all' if no TF list
client_or_address=client, # omit to use default local scheduler
seed=777, # for reproducibility
verbose=True # print progress
)
print(f"Inferred {len(network)} regulatory links")
print(network.head())
# Example:
# TF target importance
# 0 MYC CDK4 3.214
# 1 MYC CCND1 2.871
# 2 TP53 CDKN1A 2.654
```
### Step 5: Filter Results
Raw output contains links for every TF-target pair with non-zero importance. Filter to retain high-confidence regulatory interactions.
```python
# Strategy 1: Impo|
Opentrons Protocol API v2 for OT-2/Flex: Python protocols for pipetting, serial dilutions, PCR, plate replication; control thermocycler, heater-shaker, magnetic, temperature modules. Use pylabrobot for multi-vendor.
Interactive visualization with Plotly. 40+ chart types (scatter, line, heatmap, 3D, geographic) with hover, zoom, pan. Two APIs: Plotly Express (DataFrame) and Graph Objects (fine control). For static publication figures use matplotlib; for statistical grammar use seaborn.
Statistical visualization on matplotlib + pandas. Distributions (histplot, kdeplot, violin, box), relational (scatter, line), categorical, regression, correlation heatmaps. Auto aggregation/CIs. Use plotly for interactive; matplotlib for low-level.
Best practices for single-cell RNA-seq cell type annotation including marker-based, reference-based, and automated classification approaches.
Bayesian modeling with PyMC 5: priors, likelihood, NUTS/ADVI sampling, diagnostics (R-hat, ESS), LOO/WAIC comparison, prediction. Hierarchical, logistic, GP variants; predictive checks.
Time-to-event modeling with scikit-survival: Cox PH (elastic net), Random Survival Forests, Boosting, SVMs for censored data. C-index, Brier, time-dependent AUC; Kaplan-Meier, Nelson-Aalen, competing risks. Pipeline/GridSearchCV compatible. Use statsmodels for frequentist, pymc for Bayesian, lifelines for parametric.
>-