ml-strategy
This Claude Code skill implements a machine-learning trading strategy using scikit-learn models (RandomForest, GradientBoosting, or Ridge) with walk-forward validation to predict price direction from OHLCV data. Use it when you need to generate trading signals based on engineered features like momentum, volatility, and volume ratios while avoiding look-ahead bias through time-series cross-validation on any liquid asset's candlestick data.
git clone --depth 1 https://github.com/HKUDS/Vibe-Trading /tmp/ml-strategy && cp -r /tmp/ml-strategy/agent/src/skills/ml-strategy ~/.claude/skills/ml-strategy
SKILL.md
# Machine-Learning Predictive Strategy
## Purpose
Use sklearn machine-learning models (`RandomForest` / `GradientBoosting` / `Ridge`) to predict the direction of future returns and generate trading signals. Walk-forward training is used to avoid future data leakage, and feature engineering extracts useful factors from OHLCV data.
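As a rough illustration of how walk-forward training keeps future rows out of the training set, the sketch below yields only the index boundaries for each prediction step. The function name `walk_forward_windows` is hypothetical; the parameter names mirror those used by `walk_forward_predict` in the full example, and the retrain schedule is an assumption modelled on it.

```python
from typing import Iterator, Tuple


def walk_forward_windows(
    n_rows: int,
    min_train_size: int = 252,
    retrain_freq: int = 20,
    window_type: str = "expanding",
    sliding_size: int = 504,
) -> Iterator[Tuple[int, int, int]]:
    """Yield (train_start, train_end, predict_row) index triples.

    Training rows are the half-open range [train_start, train_end), which
    always ends at or before predict_row, so the model never sees the row
    it is asked to predict.
    """
    train_start, train_end = 0, min_train_size
    for i in range(min_train_size, n_rows):
        # Refresh the training window every retrain_freq rows
        if (i - min_train_size) % retrain_freq == 0:
            train_end = i
            if window_type == "sliding":
                train_start = max(0, i - sliding_size)  # fixed lookback
            else:
                train_start = 0  # expanding: use all history
        yield train_start, train_end, i


expanding = list(walk_forward_windows(300))
sliding = list(walk_forward_windows(1000, window_type="sliding"))
```

Between retrains the window stays fixed while `predict_row` advances, which is why `train_end` can lag behind the row being predicted by up to `retrain_freq - 1` rows.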
## Signal Logic
1. **Validate input**: check OHLCV columns, minimum row count, NaN ratio — skip symbols that fail
2. **Feature engineering**: build multi-dimensional factors from raw OHLCV data (momentum, volatility, RSI, moving-average ratios, volume ratio, and more). All features are sanitized (inf removed, division-by-zero guarded)
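3. **Label construction**: a future N-day return > 0 is the positive class (`1`) and < 0 is the negative class (`0`). The rule leaves a return of exactly 0 unassigned, and the final N rows have no future return to label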
4. **Walk-forward training**: use an expanding or sliding window, train on historical data only, and roll forward day by day for prediction
5. **Signal generation**: map `predict_proba[:, 1]` to `[-1.0, 1.0]`, or use discrete signals from `predict` in `{-1, 0, 1}`. Output is guaranteed clean (no NaN, clipped to range)
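Steps 3 and 5 can be sketched as two small helpers. This is a minimal illustration rather than the skill's own code: the names `make_labels` and `proba_to_signal`, the `horizon` parameter, and the linear mapping `2 * p - 1` are all assumptions. Rows with a zero or unknown forward return are left as NaN so that a training loop can drop them.

```python
import numpy as np
import pandas as pd


def make_labels(close: pd.Series, horizon: int = 5) -> pd.Series:
    """Binary direction labels from the forward `horizon`-day return."""
    fwd_ret = close.shift(-horizon) / close - 1.0
    labels = pd.Series(np.nan, index=close.index)
    labels.loc[fwd_ret > 0] = 1.0  # positive class
    labels.loc[fwd_ret < 0] = 0.0  # negative class
    # Zero returns and the last `horizon` rows (no future price) stay NaN
    return labels


def proba_to_signal(p_up: pd.Series) -> pd.Series:
    """Map P(up) in [0, 1] linearly to a signal in [-1, 1] (assumed mapping)."""
    signal = 2.0 * p_up - 1.0
    # Missing probabilities become a flat (0.0) signal; clip as a safety net
    return signal.fillna(0.0).clip(-1.0, 1.0)


close = pd.Series([100, 101, 99, 99, 102])
labels = make_labels(close, horizon=1)
signal = proba_to_signal(pd.Series([0.0, 0.5, 1.0, np.nan, 0.75]))
```

Under this mapping a probability of 0.5 produces a flat signal, so the signal's magnitude can be read as the model's confidence in the predicted direction.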
## Complete SignalEngine Example
This is the recommended full pipeline. Copy and customise it; input validation and inf/NaN guards are built in.
```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def validate_data(df: pd.DataFrame, min_rows: int = 300) -> bool:
    """Check that OHLCV data meets minimum quality for ML training.

    Args:
        df: DataFrame with DatetimeIndex.
        min_rows: Minimum number of rows required.

    Returns:
        True if data is usable.
    """
    required = {"open", "high", "low", "close", "volume"}
    if not required.issubset(df.columns):
        return False
    if len(df) < min_rows:
        return False
    if df["close"].isnull().mean() > 0.2:
        return False
    return True


def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """Build a machine-learning feature matrix from OHLCV data.

    All features are guarded against division-by-zero and sanitized
    (inf replaced with NaN) so downstream code never sees inf values.

    Args:
        df: DataFrame containing open, high, low, close, and volume columns.

    Returns:
        DataFrame with feature columns prefixed by 'f_'.
    """
    c = df["close"]
    v = df["volume"]
    ret = c.pct_change()

    features = pd.DataFrame(index=df.index)
    features["f_ret_5d"] = c.pct_change(5)
    features["f_ret_20d"] = c.pct_change(20)
    features["f_vol_20d"] = ret.rolling(20).std()
    features["f_ma_ratio"] = c / c.rolling(20).mean()
    features["f_volume_ratio"] = v / v.rolling(20).mean()

    # RSI(14): guard against loss=0 in zero-volatility periods, which produces inf
    delta = c.diff()
    gain = delta.clip(lower=0).rolling(14).mean()
    loss = (-delta.clip(upper=0)).rolling(14).mean()
    rs = gain / loss.replace(0, np.nan)
    features["f_rsi_14"] = 100 - (100 / (1 + rs))

    # Bollinger Band position: guard against bb_upper == bb_lower when std=0
    ma20 = c.rolling(20).mean()
    std20 = c.rolling(20).std()
    bb_upper = ma20 + 2 * std20
    bb_lower = ma20 - 2 * std20
    bb_range = (bb_upper - bb_lower).replace(0, np.nan)
    features["f_bb_position"] = (c - bb_lower) / bb_range

    # Intraday features
    features["f_high_low_ratio"] = (df["high"] - df["low"]) / c
    features["f_close_open_ratio"] = (c - df["open"]) / df["open"]
    features["f_skew_20d"] = ret.rolling(20).skew()

    # Sanitize: replace all inf with NaN (NaN handled by walk-forward)
    features = features.replace([np.inf, -np.inf], np.nan)
    return features


def walk_forward_predict(
    features: pd.DataFrame,
    labels: pd.Series,
    min_train_size: int = 252,
    retrain_freq: int = 20,
    model_type: str = "random_forest",
    window_type: str = "expanding",
    sliding_size: int = 504,
) -> pd.Series:
    """Walk-forward training and prediction to avoid future data leakage.

    Args:
        features: Feature matrix aligned with labels by row index.
        labels: Binary labels (0/1), representing the direction of future N-day returns.
        min_train_size: Minimum training-set size in trading days.
        retrain_freq: Retrain the model every N days.
        model_type: One of "random_forest" / "gradient_boosting" / "ridge".
        window_type: "expanding" uses all history; "sliding" uses a fixed lookback.
        sliding_size: Lookback window size when window_type is "sliding".

    Returns:
        Predicted signal series with range [-1.0, 1.0], no NaN values.
    """
    predictions = pd.Series(0.0, index=features.index)
    model = None
    scaler = None

    for i in range(min_train_size, len(features)):
        # Retrain every retrain_freq days
        if model is None or (i - min_train_size) % retrain_freq == 0:
            start = max(0, i - sliding_size) if window_type == "sliding" else 0
            # Caveat: with N-day forward labels and N > 1, the labels of the
            # last N - 1 rows before i depend on prices after row i. Pass in
            # labels with those rows masked as NaN, or trim them here.
            X_train = features.iloc[start:i].values
            y_train = labels.iloc[start:i].values

            # Drop rows with NaN
            valid = ~(np.isnan(X_train).any(axis=1) | np.isnan(y_train))
            X_train = X_train[valid]
            y_train = y_train[valid]

            if len(X_train) < 50:
                continue

            # Standardization: fit only on training set
            scaler = StandardScaler()
            X_train = scaler.fit_transform(X_train)

            # Build the model
            if model_type == "random_forest":
                model = RandomForestClassifier(
                    n_estimators=100, max_depth=5, random_state=42,
                )
            elif model_type == "gradient_boosting":
                model = GradientBoostingClassifier(
                    n_estimators=100, max_depth=3, learning_rate=0.05,
                    random_state=42,
                )
            elif model_type == "ridge":
                # NOTE: the source text is cut off at this branch. Everything
                # from here to the end of the function is a reconstruction
                # under stated assumptions, not the original code. An
                # L2-penalised LogisticRegression is assumed for "ridge"
                # because it is imported above; its hyperparameters are guesses.
                model = LogisticRegression(C=1.0, max_iter=1000)
            else:
                raise ValueError(f"Unknown model_type: {model_type}")

            model.fit(X_train, y_train)

        # Predict row i with the most recently trained model (assumed step)
        if model is None or scaler is None:
            continue
        x_i = features.iloc[[i]].values
        if np.isnan(x_i).any():
            continue  # keep the default 0.0 signal for incomplete rows

        proba = model.predict_proba(scaler.transform(x_i))
        classes = list(model.classes_)
        # If the training window held no up-labels, treat P(up) as 0
        p_up = proba[0, classes.index(1)] if 1 in classes else 0.0
        # Assumed mapping: P(up) in [0, 1] -> signal in [-1, 1]
        predictions.iloc[i] = 2.0 * p_up - 1.0

    return predictions.fillna(0.0).clip(-1.0, 1.0)
```