Skill286 estrellas del repoactualizado 5d ago

arboreto-grn-inference

This Claude Code skill performs gene regulatory network inference from expression matrices using GRNBoost2 (gradient boosting) or GENIE3 (Random Forest algorithms). Use it when inferring transcription factor-to-target relationships from bulk or single-cell RNA-seq data, generating the regulatory adjacency matrix for pySCENIC pipeline workflows, or comparing network structure across experimental conditions. The skill handles parallelized computation via Dask, scaling from standard computers to high-performance clusters.

Ver fuente Repositorio: SciAgent-Skills

Instalar en Claude Code

Copiar

git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/arboreto-grn-inference && cp -r /tmp/arboreto-grn-inference/skills/genomics-bioinformatics/arboreto-grn-inference ~/.claude/skills/arboreto-grn-inference

Después abre una sesión nueva de Claude Code; el skill carga automáticamente.

Definición

SKILL.md

# Arboreto GRN Inference

## Overview

Arboreto infers gene regulatory networks (GRNs) from gene expression data using parallelized tree-based regression. For each target gene, it trains a regression model with all other genes (or a specified TF list) as features and emits TF-target-importance triplets. It provides two interchangeable algorithms -- GRNBoost2 (gradient boosting, fast) and GENIE3 (Random Forest, classic) -- sharing identical input/output formats. Computation is Dask-parallelized, scaling from laptop cores to HPC clusters.

## When to Use

- Inferring transcription factor-to-target gene regulatory relationships from bulk RNA-seq expression data
- Building gene regulatory networks from single-cell RNA-seq count matrices (cells as rows, genes as columns)
- Generating the adjacency matrix (Step 1) of the pySCENIC regulatory analysis pipeline
- Comparing regulatory network structure across experimental conditions (e.g., control vs treatment)
- Producing consensus regulatory networks by running inference across multiple random seeds
- Validating GRN results by comparing GRNBoost2 and GENIE3 outputs on the same dataset
- For downstream regulon identification and activity scoring, use arboreto output with pySCENIC
- For single-cell preprocessing (QC, normalization, clustering) before GRN inference, use **scanpy-scrna-seq**

## Prerequisites

- **Python packages**: `arboreto`, `pandas`, `numpy`, `dask`, `distributed`, `scikit-learn`, `scipy`
- **Data requirements**: Gene expression matrix (genes as columns, observations as rows) in TSV/CSV; optionally a TF list file (one gene name per line)
- **Environment**: Python 3.8+; optional `networkx`, `matplotlib` for visualization

```bash
pip install arboreto distributed networkx matplotlib
```

## Quick Start

Complete GRN inference in a single block. The `if __name__ == '__main__':` guard is required because Dask spawns worker processes via multiprocessing.

```python
import pandas as pd
from arboreto.algo import grnboost2
from arboreto.utils import load_tf_names

if __name__ == '__main__':
    # Load expression matrix (observations x genes)
    expression_matrix = pd.read_csv('expression_data.tsv', sep='\t')
    tf_names = load_tf_names('tf_list.txt')  # optional TF filter

    # Infer GRN (uses all local cores by default)
    network = grnboost2(expression_data=expression_matrix,
                        tf_names=tf_names, seed=777)

    # Filter top links and save
    top_network = network[network['importance'] > 1.0]
    top_network.to_csv('grn_output.tsv', sep='\t', index=False, header=False)
    print(f"Inferred {len(network)} links, kept {len(top_network)} above threshold")
    # Example: Inferred 185432 links, kept 12876 above threshold
```

## Workflow

### Step 1: Load Expression Data

Arboreto accepts a pandas DataFrame (recommended) or NumPy array. Rows are observations (cells or samples), columns are genes. Gene names must be column headers.

```python
import pandas as pd

# From TSV (genes as columns, observations as rows)
expression_matrix = pd.read_csv('expression_data.tsv', sep='\t')
print(f"Shape: {expression_matrix.shape}  "
      f"({expression_matrix.shape[0]} observations x {expression_matrix.shape[1]} genes)")
# Example: Shape: (5000, 18654)  (5000 observations x 18654 genes)

# From AnnData (e.g., after scanpy preprocessing)
import anndata as ad
adata = ad.read_h5ad('preprocessed.h5ad')
expression_matrix = pd.DataFrame(
    adata.X.toarray() if hasattr(adata.X, 'toarray') else adata.X,
    columns=adata.var_names.tolist()
)
print(f"Converted AnnData: {expression_matrix.shape}")
# Example: Converted AnnData: (5000, 18654)
```

### Step 2: Load Transcription Factor List

Providing a TF list restricts regulators to known transcription factors, reducing computation time and improving biological relevance. If omitted, all genes are treated as potential regulators.

```python
from arboreto.utils import load_tf_names

# From file (one TF name per line)
tf_names = load_tf_names('human_tfs.txt')
print(f"Loaded {len(tf_names)} transcription factors")
# Example: Loaded 1639 transcription factors

# Or define directly
tf_names = ['MYC', 'TP53', 'SOX2', 'NANOG', 'POU5F1']

# Verify TFs exist in expression matrix columns
tf_in_data = [tf for tf in tf_names if tf in expression_matrix.columns]
print(f"TFs found in expression data: {len(tf_in_data)}/{len(tf_names)}")
# Example: TFs found in expression data: 1583/1639
```

### Step 3: Configure Dask Client (Optional)

By default arboreto creates an internal Dask client using all local cores. Create an explicit client for resource control, monitoring, or cluster deployment.

```python
from distributed import LocalCluster, Client

# Custom local client with resource limits
local_cluster = LocalCluster(
    n_workers=8,
    threads_per_worker=1,   # avoid GIL contention in scikit-learn
    memory_limit='4GB'       # per worker
)
client = Client(local_cluster)
print(f"Dashboard: {client.dashboard_link}")
# Example: Dashboard: http://127.0.0.1:8787/status
```

### Step 4: Run GRN Inference

Call `grnboost2()` (recommended) or `genie3()`. Both share the same signature and output format.

```python
from arboreto.algo import grnboost2

if __name__ == '__main__':
    network = grnboost2(
        expression_data=expression_matrix,
        tf_names=tf_names,         # 'all' if no TF list
        client_or_address=client,  # omit to use default local scheduler
        seed=777,                  # for reproducibility
        verbose=True               # print progress
    )
    print(f"Inferred {len(network)} regulatory links")
    print(network.head())
    # Example:
    #       TF  target  importance
    # 0   MYC   CDK4       3.214
    # 1   MYC   CCND1      2.871
    # 2  TP53  CDKN1A      2.654
```

### Step 5: Filter Results

Raw output contains links for every TF-target pair with non-zero importance. Filter to retain high-confidence regulatory interactions.

```python
# Strategy 1: Impo

Del mismo repositorio

sciagent-skill-creatorSkill

opentrons-integrationSkill

Opentrons Protocol API v2 for OT-2/Flex: Python protocols for pipetting, serial dilutions, PCR, plate replication; control thermocycler, heater-shaker, magnetic, temperature modules. Use pylabrobot for multi-vendor.

plotly-interactive-visualizationSkill

Interactive visualization with Plotly. 40+ chart types (scatter, line, heatmap, 3D, geographic) with hover, zoom, pan. Two APIs: Plotly Express (DataFrame) and Graph Objects (fine control). For static publication figures use matplotlib; for statistical grammar use seaborn.

seaborn-statistical-visualizationSkill

Statistical visualization on matplotlib + pandas. Distributions (histplot, kdeplot, violin, box), relational (scatter, line), categorical, regression, correlation heatmaps. Auto aggregation/CIs. Use plotly for interactive; matplotlib for low-level.

single-cell-annotationSkill

Best practices for single-cell RNA-seq cell type annotation including marker-based, reference-based, and automated classification approaches.

pymc-bayesian-modelingSkill

Bayesian modeling with PyMC 5: priors, likelihood, NUTS/ADVI sampling, diagnostics (R-hat, ESS), LOO/WAIC comparison, prediction. Hierarchical, logistic, GP variants; predictive checks.

scikit-survival-analysisSkill

Time-to-event modeling with scikit-survival: Cox PH (elastic net), Random Survival Forests, Boosting, SVMs for censored data. C-index, Brier, time-dependent AUC; Kaplan-Meier, Nelson-Aalen, competing risks. Pipeline/GridSearchCV compatible. Use statsmodels for frequentist, pymc for Bayesian, lifelines for parametric.

statistical-analysisSkill