torchforge-rl-training
torchforge is Meta's PyTorch-native reinforcement learning library that decouples algorithm design from distributed training infrastructure using the Monarch actor system. Use it when implementing experimental RL algorithms like GRPO or DAPO that require clean abstractions, easy experimentation across single to multi-GPU setups, and integration with TorchTitan for model parallelism, though production deployments should prefer stable alternatives like miles or verl.
git clone --depth 1 https://github.com/Orchestra-Research/AI-Research-SKILLs /tmp/torchforge-rl-training && cp -r /tmp/torchforge-rl-training/06-post-training/torchforge ~/.claude/skills/torchforge-rl-trainingSKILL.md
# torchforge: PyTorch-Native Agentic RL Library
torchforge is Meta's PyTorch-native RL library that separates infrastructure concerns from algorithm concerns. It enables rapid RL research by letting you focus on algorithms while handling distributed training, inference, and weight sync automatically.
## When to Use torchforge
**Choose torchforge when you need:**
- Clean separation between RL algorithms and infrastructure
- PyTorch-native abstractions (no Ray dependency)
- Easy algorithm experimentation (GRPO, DAPO, SAPO in ~100 lines)
- Scalable training with Monarch actor system
- Integration with TorchTitan for model parallelism
**Consider alternatives when:**
- You need production-ready stability → use **miles** or **verl**
- You want Megatron-native training → use **slime**
- torchforge is experimental and APIs may change
## Key Features
- **Algorithm isolation**: Implement RL algorithms without touching infrastructure
- **Scalability**: From single GPU to thousands via Monarch
- **Modern stack**: TorchTitan (training), vLLM (inference), TorchStore (sync)
- **Loss functions**: GRPO, DAPO, CISPO, GSPO, SAPO built-in
## Architecture Overview
```
┌─────────────────────────────────────────────────────────┐
│ Application Layer (Your Code) │
│ - Define reward models, loss functions, sampling │
└─────────────────────┬───────────────────────────────────┘
│
┌─────────────────────▼───────────────────────────────────┐
│ Forge API Layer │
│ - Episode, Group dataclasses │
│ - Service interfaces (async/await) │
└─────────────────────┬───────────────────────────────────┘
│
┌─────────────────────▼───────────────────────────────────┐
│ Distributed Services (Monarch) │
│ ├── Trainer (TorchTitan FSDP) │
│ ├── Generator (vLLM inference) │
│ ├── Reference Model (frozen KL baseline) │
│ └── Reward Actors (compute rewards) │
└─────────────────────────────────────────────────────────┘
```
## Installation
```bash
# Create environment
conda create -n forge python=3.12
conda activate forge
# Install (handles PyTorch nightly + dependencies)
./scripts/install.sh
# Verify
python -c "import torch, forge, vllm; print('OK')"
```
### ROCm Installation
```bash
./scripts/install_rocm.sh
```
## Quick Start
### SFT Training (2+ GPUs)
```bash
python -m apps.sft.main --config apps/sft/llama3_8b.yaml
```
### GRPO Training (3+ GPUs)
```bash
python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
```
---
## Workflow 1: GRPO Training for Math Reasoning
Use this workflow for training reasoning models with group-relative advantages.
### Prerequisites Checklist
- [ ] 3+ GPUs (GPU0: trainer, GPU1: ref_model, GPU2: generator)
- [ ] Model from HuggingFace Hub
- [ ] Training dataset (GSM8K, MATH, etc.)
### Step 1: Create Configuration
```yaml
# config/grpo_math.yaml
model: "Qwen/Qwen2.5-7B-Instruct"
dataset:
path: "openai/gsm8k"
split: "train"
streaming: true
training:
batch_size: 4
learning_rate: 1e-6
seq_len: 4096
dtype: bfloat16
gradient_accumulation_steps: 4
grpo:
n_samples: 8 # Responses per prompt
clip_low: 0.2
clip_high: 0.28
beta: 0.1 # KL penalty coefficient
temperature: 0.7
services:
generator:
procs: 1
num_replicas: 1
with_gpus: true
trainer:
procs: 1
num_replicas: 1
with_gpus: true
ref_model:
procs: 1
num_replicas: 1
with_gpus: true
```
### Step 2: Define Reward Function
```python
# rewards.py
# Reward functions are in forge.data.rewards
from forge.data.rewards import MathReward, ThinkingReward
import re
# Or define your own reward function
class CustomMathReward:
def __call__(self, prompt: str, response: str, target: str) -> float:
# Extract answer from response
match = re.search(r'\\boxed{([^}]+)}', response)
if not match:
return 0.0
answer = match.group(1).strip()
return 1.0 if answer == target else 0.0
```
### Step 3: Launch Training
```bash
python -m apps.grpo.main --config config/grpo_math.yaml
```
### Step 4: Monitor Progress
- [ ] Check W&B dashboard for loss curves
- [ ] Verify entropy is decreasing (policy becoming more deterministic)
- [ ] Monitor KL divergence (should stay bounded)
---
## Workflow 2: Custom Loss Function
Use this workflow to implement new RL algorithms.
### Step 1: Create Loss Class
```python
# src/forge/losses/custom_loss.py
import torch
import torch.nn as nn
class CustomLoss(nn.Module):
def __init__(self, clip_range: float = 0.2, beta: float = 0.1):
super().__init__()
self.clip_range = clip_range
self.beta = beta
def forward(
self,
logprobs: torch.Tensor,
ref_logprobs: torch.Tensor,
advantages: torch.Tensor,
padding_mask: torch.Tensor,
) -> torch.Tensor:
# Compute importance ratio
ratio = torch.exp(logprobs - ref_logprobs)
# Clipped policy gradient
clipped_ratio = torch.clamp(
ratio,
1 - self.clip_range,
1 + self.clip_range
)
pg_loss = -torch.min(ratio * advantages, clipped_ratio * advantages)
# KL penalty
kl = ref_logprobs - logprobs
# Apply mask and aggregate
masked_loss = (pg_loss + self.beta * kl) * padding_mask
loss = masked_loss.sum() / padding_mask.sum()
return loss
```
### Step 2: Integrate into Application
```python
# apps/custom/main.py
from forge.losses.custom_loss import CustomLoss
loss_fn = CustomLoss(clip_range=0.2, beta=0.1)
# In training loop
loss = loss_fn(
logprobs=logprobs,
ref_logprobs=ref_logprobs,
advantages=advantages,
padding_mask=padding_mask,
)
```
---
## Workflow 3: Multi-GPU DOrchestrates end-to-end autonomous AI research projects using a two-loop architecture. The inner loop runs rapid experiment iterations with clear optimization targets. The outer loop synthesizes results, identifies patterns, and steers research direction. Routes to domain-specific skills for execution, supports continuous agent operation via Claude Code /loop and OpenClaw heartbeat, and produces research presentations and papers. Use when starting a research project, running autonomous experiments, or managing a multi-hypothesis research effort.
Implements and trains LLMs using Lightning AI's LitGPT with 20+ pretrained architectures (Llama, Gemma, Phi, Qwen, Mistral). Use when need clean model implementations, educational understanding of architectures, or production fine-tuning with LoRA/QLoRA. Single-file implementations, no abstraction layers.
State-space model with O(n) complexity vs Transformers' O(n²). 5× faster inference, million-token sequences, no KV cache. Selective SSM with hardware-aware design. Mamba-1 (d_state=16) and Mamba-2 (d_state=128, multi-head). Models 130M-2.8B on HuggingFace.
Educational GPT implementation in ~300 lines. Reproduces GPT-2 (124M) on OpenWebText. Clean, hackable code for learning transformers. By Andrej Karpathy. Perfect for understanding GPT architecture from scratch. Train on Shakespeare (CPU) or OpenWebText (multi-GPU).
RNN+Transformer hybrid with O(n) inference. Linear time, infinite context, no KV cache. Train like GPT (parallel), infer like RNN (sequential). Linux Foundation AI project. Production at Windows, Office, NeMo. RWKV-7 (March 2025). Models up to 14B parameters.
Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs with Float8, torch.compile, and distributed checkpointing.
Fast tokenizers optimized for research and production. Rust-based implementation tokenizes 1GB in <20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track alignments, handle padding/truncation. Integrates seamlessly with transformers. Use when you need high-performance tokenization or custom tokenizer training.
Language-independent tokenizer treating text as raw Unicode. Supports BPE and Unigram algorithms. Fast (50k sentences/sec), lightweight (6MB memory), deterministic vocabulary. Used by T5, ALBERT, XLNet, mBART. Train on raw text without pre-tokenization. Use when you need multilingual support, CJK languages, or reproducible tokenization.