Skill284 estrellas del repoactualizado 4d ago

scikit-survival-analysis

This Claude Code skill provides time-to-event modeling using scikit-survival, a library for analyzing censored survival data with Cox proportional hazards models, Random Survival Forests, boosting, and survival SVMs. Use it when predicting outcomes where events may not yet have occurred, evaluating survival predictions with censoring-aware metrics like C-index and Brier score, or estimating non-parametric survival curves for clinical trials, reliability studies, and competing risks analyses.

Ver fuente Repositorio: SciAgent-Skills

Instalar en Claude Code

Copiar

git clone --depth 1 https://github.com/jaechang-hits/SciAgent-Skills /tmp/scikit-survival-analysis && cp -r /tmp/scikit-survival-analysis/skills/biostatistics/scikit-survival-analysis ~/.claude/skills/scikit-survival-analysis

Después abre una sesión nueva de Claude Code; el skill carga automáticamente.

Definición

SKILL.md

# scikit-survival -- Survival Analysis

## Overview

scikit-survival is a Python library for time-to-event analysis built on scikit-learn. It handles right-censored data (observations where the event has not yet occurred) using Cox models, ensemble methods, survival SVMs, and non-parametric estimators. All models follow the scikit-learn `fit/predict` API and integrate with Pipelines, cross-validation, and GridSearchCV.

## When to Use

- Modeling time-to-event outcomes with right-censored data (clinical trials, reliability)
- Fitting Cox proportional hazards models (standard or elastic net penalized)
- Building ensemble survival models (Random Survival Forest, Gradient Boosting)
- Training survival SVMs for margin-based learning on medium-sized datasets
- Evaluating survival predictions with censoring-aware metrics (C-index, Brier score, AUC)
- Estimating non-parametric survival curves (Kaplan-Meier, Nelson-Aalen)
- Analyzing competing risks with cumulative incidence functions
- High-dimensional survival data with automatic feature selection (CoxNet L1/L2)
- For **simpler parametric models** (Weibull, log-normal AFT) or statistical tests (log-rank), use `lifelines`
- For **deep learning survival models**, use `pycox` or `torchlife`

## Prerequisites

```bash
pip install scikit-survival scikit-learn pandas numpy matplotlib
```

**Python**: >= 3.9. **Dependencies**: scikit-learn, numpy, scipy, pandas, joblib, osqp (for some SVM solvers).

**Data format**: Survival outcomes are NumPy structured arrays with `(event, time)` fields. Events are boolean (True = event occurred, False = censored). Times are positive floats.

## Quick Start

```python
from sksurv.datasets import load_breast_cancer
from sksurv.ensemble import RandomSurvivalForest
from sksurv.metrics import concordance_index_ipcw
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rsf = RandomSurvivalForest(n_estimators=100, random_state=42)
rsf.fit(X_train, y_train)

risk_scores = rsf.predict(X_test)
c_index = concordance_index_ipcw(y_train, y_test, risk_scores)[0]
print(f"C-index: {c_index:.3f}")  # e.g., 0.68

# Individual survival curves
surv_fns = rsf.predict_survival_function(X_test[:2])
for fn in surv_fns:
    print(f"5-year survival: {fn(365 * 5):.3f}")
```

## Core API

### Module 1: Data Preparation

Create structured survival arrays and preprocess features.

```python
import numpy as np
import pandas as pd
from sksurv.util import Surv
from sksurv.preprocessing import OneHotEncoder, encode_categorical
from sksurv.datasets import load_gbsg2, load_breast_cancer
from sklearn.preprocessing import StandardScaler

# Create survival outcome from arrays
event = np.array([True, False, True, True, False])
time = np.array([120.0, 365.0, 200.0, 90.0, 400.0])
y = Surv.from_arrays(event=event, time=time)
print(y.dtype)  # [('event', '?'), ('time', '<f8')]

# From DataFrame columns
# y = Surv.from_dataframe("event_col", "time_col", df)

# Load built-in datasets
# Available: load_gbsg2, load_breast_cancer, load_veterans_lung_cancer,
#            load_whas500, load_aids, load_flchain
X, y = load_gbsg2()
print(f"Shape: {X.shape}, Events: {y['event'].sum()}, "
      f"Censoring rate: {1 - y['event'].mean():.1%}")

# Encode categoricals (survival-aware one-hot)
X_encoded = encode_categorical(X)  # auto-detect and encode all categorical cols

# Standardize (critical for Cox and SVM models)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_encoded)
```

```python
from sksurv.io import loadarff

# Load ARFF format (Weka format)
data = loadarff("survival_data.arff")
X_arff, y_arff = data[0], data[1]  # DataFrame, structured array
```

### Module 2: Cox Proportional Hazards

Semi-parametric model: h(t|x) = h_0(t) * exp(beta^T x). Interpretable coefficients as log hazard ratios.

```python
from sksurv.linear_model import CoxPHSurvivalAnalysis, CoxnetSurvivalAnalysis, IPCRidge

# Standard Cox PH model
cox = CoxPHSurvivalAnalysis(alpha=0.0, ties="breslow")
cox.fit(X_train, y_train)
print(f"Coefficients: {cox.coef_}")  # log hazard ratios
# Hazard ratio interpretation: exp(coef) = HR for 1-unit increase
risk_scores = cox.predict(X_test)  # Higher = higher risk

# Survival function for individual patients
surv_funcs = cox.predict_survival_function(X_test[:3])
for fn in surv_funcs:
    print(f"5-year survival: {fn(365 * 5):.3f}")
```

```python
# Penalized Cox (elastic net) -- for high-dimensional data (p > n)
coxnet = CoxnetSurvivalAnalysis(
    l1_ratio=0.9,           # 0=Ridge, 1=Lasso, between=Elastic Net
    alpha_min_ratio=0.01,   # smallest alpha / largest alpha ratio
    n_alphas=100,           # steps in regularization path
)
coxnet.fit(X_train, y_train)

# Feature selection: non-zero coefficients
selected = np.where(coxnet.coef_ != 0)[0]
print(f"Selected {len(selected)} / {X_train.shape[1]} features")

# IPCRidge: accelerated failure time model (predicts log survival time)
ipcridge = IPCRidge(alpha=1.0)
ipcridge.fit(X_train, y_train)
log_survival_time = ipcridge.predict(X_test)
```

### Module 3: Ensemble Methods

Non-parametric tree-based models for complex non-linear relationships.

```python
from sksurv.ensemble import (
    RandomSurvivalForest,
    GradientBoostingSurvivalAnalysis,
    ComponentwiseGradientBoostingSurvivalAnalysis,
    ExtraSurvivalTrees,
)

# Random Survival Forest -- robust, minimal tuning
rsf = RandomSurvivalForest(
    n_estimators=200, min_samples_split=10, min_samples_leaf=15,
    max_features="sqrt", random_state=42, n_jobs=-1,
)
rsf.fit(X_train, y_train)
risk = rsf.predict(X_test)

# Gradient Boosting -- best performance, needs tuning
gbs = GradientBoostingSurvivalAnalysis(
    loss="coxph",           # "coxph" or "ipcwls" (AFT)
    n_estimators=300, learning_rate=0.05, max_depth=3,
    subsample=0.8, dropout_rate=0.1, random_state=42,
)
gbs.fit(X_train, y_train)

# ComponentwiseGB --

Del mismo repositorio

sciagent-skill-creatorSkill

opentrons-integrationSkill

Opentrons Protocol API v2 for OT-2/Flex: Python protocols for pipetting, serial dilutions, PCR, plate replication; control thermocycler, heater-shaker, magnetic, temperature modules. Use pylabrobot for multi-vendor.

plotly-interactive-visualizationSkill

Interactive visualization with Plotly. 40+ chart types (scatter, line, heatmap, 3D, geographic) with hover, zoom, pan. Two APIs: Plotly Express (DataFrame) and Graph Objects (fine control). For static publication figures use matplotlib; for statistical grammar use seaborn.

seaborn-statistical-visualizationSkill

Statistical visualization on matplotlib + pandas. Distributions (histplot, kdeplot, violin, box), relational (scatter, line), categorical, regression, correlation heatmaps. Auto aggregation/CIs. Use plotly for interactive; matplotlib for low-level.

single-cell-annotationSkill

Best practices for single-cell RNA-seq cell type annotation including marker-based, reference-based, and automated classification approaches.

pymc-bayesian-modelingSkill

Bayesian modeling with PyMC 5: priors, likelihood, NUTS/ADVI sampling, diagnostics (R-hat, ESS), LOO/WAIC comparison, prediction. Hierarchical, logistic, GP variants; predictive checks.

statistical-analysisSkill

statsmodels-statistical-modelingSkill

Python statistical modeling: regression (OLS, WLS, GLM), discrete (Logit, Poisson, NegBin), time series (ARIMA, SARIMAX, VAR), with rigorous inference, diagnostics, and hypothesis tests. Use scikit-learn for ML; statistical-analysis for test choice.