Skill · 9.6k repo stars · updated 1mo ago

fine-tuning-openvla-oft

This skill provides fine-tuning and evaluation workflows for OpenVLA-OFT and OpenVLA-OFT+, which adapt pretrained vision-language models for robot action generation using LoRA adaptation and continuous action prediction heads instead of discrete tokenization. Use it when reproducing the OpenVLA-OFT paper results, training custom robot policies on LIBERO simulation or ALOHA real-world setups, deploying server-client inference architectures, or troubleshooting normalization and cross-GPU training issues.

Install in Claude Code
git clone --depth 1 https://github.com/Orchestra-Research/AI-Research-SKILLs /tmp/fine-tuning-openvla-oft && cp -r /tmp/fine-tuning-openvla-oft/18-multimodal/openvla-oft ~/.claude/skills/fine-tuning-openvla-oft
Then open a new Claude Code session; the skill loads automatically.

SKILL.md

# OpenVLA-OFT

Fine-tuning and evaluation workflows for OpenVLA-OFT and OpenVLA-OFT+ from the official `openvla-oft` codebase. Covers blank-machine setup plus LoRA-based adaptation of OpenVLA for robot action generation with continuous action prediction heads.

## Quick start

Clone the public repo, follow the official setup, then evaluate a pretrained LIBERO checkpoint:

```bash
git clone https://github.com/moojink/openvla-oft.git
cd openvla-oft
python experiments/robot/libero/run_libero_eval.py \
  --pretrained_checkpoint moojink/openvla-7b-oft-finetuned-libero-spatial \
  --task_suite_name libero_spatial \
  --center_crop True \
  --num_trials_per_task 50 \
  --seed 7
```

## Core concepts

**What OpenVLA-OFT changes**: Standard OpenVLA tokenizes continuous actions into discrete bins, losing precision. OFT replaces this with dedicated continuous action heads (L1 regression or diffusion) while keeping the VLA backbone frozen and adapting via LoRA.
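
To make the precision argument concrete, here is a minimal sketch of the round-trip error that uniform binning introduces. It assumes 256 uniform bins over a normalized action range of [-1, 1]; the bin count, the range, and all function names are illustrative assumptions, not code from `openvla-oft`.

```python
# Illustrative only: bin count, range, and names are assumptions.
N_BINS = 256
LOW, HIGH = -1.0, 1.0
BIN_WIDTH = (HIGH - LOW) / N_BINS


def discretize(action: float) -> int:
    """Map a continuous action to the index of the uniform bin containing it."""
    idx = int((action - LOW) / BIN_WIDTH)
    return min(max(idx, 0), N_BINS - 1)  # clamp so HIGH lands in the last bin


def undiscretize(idx: int) -> float:
    """Recover the bin-center value for a bin index."""
    return LOW + (idx + 0.5) * BIN_WIDTH


def roundtrip_error(action: float) -> float:
    """Absolute error introduced by tokenizing and then detokenizing an action."""
    return abs(undiscretize(discretize(action)) - action)


# Sweep the range; the worst case sits at the bin edges (half a bin width).
worst = max(roundtrip_error(LOW + i * (HIGH - LOW) / 10_000) for i in range(10_001))
print(f"bin width: {BIN_WIDTH:.6f}, worst round-trip error: {worst:.6f}")
```

Under these assumptions the error floor is half a bin width per action dimension. A regression or diffusion head has no such built-in floor, which is the motivation stated above.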

**OFT vs OFT+ variants**:

| Variant | FiLM | Images | Typical use |
|---------|------|--------|-------------|
| OFT | Off | 2 (front + wrist) | LIBERO simulation |
| OFT+ | On | 3 (high + left + right wrist) | ALOHA real-world |
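
The two rows of the table can be captured as small presets, which helps catch a mismatch between the camera list and the image count before launching a run. This is a sketch: `num_images_in_input` is the name used in this document, while `use_film`, the camera labels, and the `VariantPreset` class are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantPreset:
    """Settings that distinguish the two variants in the table above."""
    use_film: bool            # hypothetical field name for the FiLM switch
    num_images_in_input: int  # name taken from this document
    cameras: tuple            # hypothetical camera labels


PRESETS = {
    "OFT": VariantPreset(False, 2, ("front", "wrist")),
    "OFT+": VariantPreset(True, 3, ("high", "left_wrist", "right_wrist")),
}


def get_preset(name: str) -> VariantPreset:
    """Look up a preset and check that the camera list matches the image count."""
    preset = PRESETS[name]
    if len(preset.cameras) != preset.num_images_in_input:
        raise ValueError(f"{name}: camera count does not match num_images_in_input")
    return preset


for name in PRESETS:
    print(name, get_preset(name))
```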

**Key architecture choices**:
- **LoRA adaptation**: Rank-32 LoRA on VLA backbone (no full fine-tuning needed)
- **Continuous actions**: L1 regression head (default) or diffusion head
- **FiLM conditioning**: Feature-wise Linear Modulation for stronger language grounding in OFT+
- **Multi-image input**: Configurable 2 or 3 camera streams via `num_images_in_input`
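
Two of the choices above reduce to short formulas. The sketch below counts the trainable parameters a rank-32 LoRA pair adds to a single weight matrix, and applies the generic FiLM form `gamma * x + beta` to a feature vector. The 4096x4096 projection size is an assumed example, and the FiLM form is the textbook one; the `openvla-oft` implementation may differ in detail.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int = 32) -> int:
    """Parameters in the low-rank pair A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def film(features, gamma, beta):
    """Feature-wise linear modulation: scale and shift each feature channel.

    In a FiLM layer, gamma and beta are predicted from the conditioning input
    (here that would be the language embedding); they are passed in directly
    to keep the sketch self-contained.
    """
    return [g * x + b for x, g, b in zip(features, gamma, beta)]


# Assumed example: one 4096 x 4096 projection matrix.
full = 4096 * 4096
lora = lora_trainable_params(4096, 4096)
print(f"full: {full:,}  LoRA rank-32: {lora:,}  ratio: {lora / full:.4%}")

print(film([1.0, 2.0, 3.0], gamma=[2.0, 0.5, 1.0], beta=[0.0, 1.0, -1.0]))
# -> [2.0, 2.0, 2.0]
```

For the assumed matrix size, the LoRA pair holds 262,144 parameters against 16,777,216 in the full matrix, which is why full fine-tuning is not needed.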

## Compute requirements

| Task | GPU | VRAM | Notes |
|------|-----|------|-------|
| LIBERO evaluation | 1x A100/A40 | ~16 GB | Single GPU |
| ALOHA evaluation | 1x A100/A40 | ~18 GB | Single GPU |
| LIBERO fine-tuning | 8x A100 | ~27 GB/GPU | Paper default |
| ALOHA fine-tuning (OFT+) | 8x A100 | ~35 GB/GPU | FiLM + 3 images |
| LoRA merge | 1x any GPU | ~16 GB | One-time step |
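
A simple preflight check can encode the table so that a job fails fast on an undersized GPU. This is a sketch: the task keys and the 2 GB headroom are assumptions, and the VRAM figures are the approximate values from the table above.

```python
# Approximate per-GPU VRAM needs in GB, copied from the table above.
REQUIRED_VRAM_GB = {
    "libero_eval": 16,
    "aloha_eval": 18,
    "libero_finetune": 27,
    "aloha_finetune": 35,
    "lora_merge": 16,
}


def has_enough_vram(task: str, available_gb: float, headroom_gb: float = 2.0) -> bool:
    """Return True if one GPU with `available_gb` should fit `task`.

    `headroom_gb` is an assumed safety margin, not a measured value.
    """
    return available_gb >= REQUIRED_VRAM_GB[task] + headroom_gb


for task in REQUIRED_VRAM_GB:
    print(f"{task}: 24 GB -> {has_enough_vram(task, 24)}, 40 GB -> {has_enough_vram(task, 40)}")
```

With PyTorch installed, `torch.cuda.get_device_properties(0).total_memory / 1024**3` gives a value to pass as `available_gb`.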

## Expected performance benchmarks

Official results (paper setup, seed=7, 50 trials per task):

| Task Suite | Task-Specific | Combined Policy | Notes |
|-----------|--------------|-----------------|-------|
| LIBERO-Spatial | 97.2% | 96.8% | Easiest suite |
| LIBERO-Object | 97.4% | 97.0% | Object manipulation |
| LIBERO-Goal | 95.8% | 95.4% | May peak at 50k-100k steps |
| LIBERO-10 | 98.0% | 98.0% | Long-horizon tasks |
| **Average** | **97.1%** | **96.8%** | Near-equivalent |

Reproduction notes: results are tied to Python 3.10.14, PyTorch 2.2.0, an NVIDIA A100, and a custom Transformers fork.
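
When comparing a reproduced number against the table, it helps to know how much variation the trial count alone allows. The sketch below computes a Wilson score interval for a success rate. It assumes 10 tasks per suite (so 500 episodes at 50 trials per task) and treats episodes as independent; both are simplifying assumptions, and the function name is illustrative.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial success rate (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Assumed: 10 tasks x 50 trials = 500 episodes; 486 successes is 97.2%.
low, high = wilson_interval(486, 500)
print(f"97.2% over 500 episodes: 95% interval = [{low:.1%}, {high:.1%}]")
```

A reproduced rate that falls inside this interval is consistent with the reported figure at this sample size.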

## When to use vs alternatives

**Use OpenVLA-OFT when:**
- The target task is robot action generation with visual and language conditioning
- LoRA-based adaptation of `openvla/openvla-7b` is preferred
- You need official LIBERO or ALOHA workflows from the OpenVLA-OFT paper
- You want continuous action heads (L1 regression or diffusion) instead of tokenized actions

**Use alternatives when:**
- You need a different VLA architecture (use `fine-tuning-serving-openpi` for pi0/pi0.5 models)
- You need the NVIDIA Cosmos Policy stack (use `evaluating-cosmos-policy`)
- You need general LLM fine-tuning without robot action heads

---

## Workflow 1: Set up environment

Copy this checklist and track progress:

```text
Setup Progress:
- [ ] Step 1: Create conda env and install PyTorch
- [ ] Step 2: Install openvla-oft package in editable mode
- [ ] Step 3: Install FlashAttention2
- [ ] Step 4: Verify critical versions
```

**Step 1: Create conda env, clone repo, and install PyTorch**

```bash
conda create -n openvla-oft python=3.10 -y
conda activate openvla-oft
git clone https://github.com/moojink/openvla-oft.git
cd openvla-oft
pip3 install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0
pip3 install robosuite==1.4.0
```

**Step 2: Install package**

```bash
pip install -e .
```

**Step 3: Install FlashAttention2**

```bash
pip install packaging ninja
pip install "flash-attn==2.5.5" --no-build-isolation
```

**Step 4: Verify versions**

```python
import torch, transformers, peft
print(f"PyTorch: {torch.__version__}")         # Expected: 2.2.0
print(f"Transformers: {transformers.__version__}")
print(f"PEFT: {peft.__version__}")             # Expected: 0.11.1
```
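
The check above imports each package, which fails outright if one is missing. As an alternative sketch, the installed versions can be read from package metadata and compared against the pins used in the setup steps. The helper names are illustrative; the version pins are the ones stated in this document.

```python
from importlib import metadata

# Pins taken from the setup steps and version comments in this document.
EXPECTED = {
    "torch": "2.2.0",
    "torchvision": "0.17.0",
    "torchaudio": "2.2.0",
    "robosuite": "1.4.0",
    "flash-attn": "2.5.5",
    "peft": "0.11.1",
}


def base_version(version: str) -> str:
    """Drop a local build suffix such as '+cu121' before comparing."""
    return version.split("+", 1)[0]


def check_versions(expected=EXPECTED):
    """Return {package: (installed_or_None, expected, ok)} without importing anything heavy."""
    report = {}
    for name, want in expected.items():
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            have = None
        ok = have is not None and base_version(have) == want
        report[name] = (have, want, ok)
    return report


if __name__ == "__main__":
    for name, (have, want, ok) in check_versions().items():
        status = "OK" if ok else "MISMATCH"
        print(f"{status:8s} {name}: installed={have} expected={want}")
```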

---

## Workflow 2: Evaluate pretrained checkpoints on LIBERO

```text
LIBERO Eval Progress:
- [ ] Step 1: Install LIBERO dependencies
- [ ] Step 2: Choose checkpoint and task suite
- [ ] Step 3: Run evaluation
- [ ] Step 4: Parse and validate results
```

**Step 1: Install LIBERO**

```bash
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
pip install -e LIBERO
pip install -r experiments/robot/libero/libero_requirements.txt
```

**Step 2: Choose checkpoint**

| Checkpoint | Task suite |
|-----------|------------|
| `moojink/openvla-7b-oft-finetuned-libero-spatial` | `libero_spatial` |
| `moojink/openvla-7b-oft-finetuned-libero-object` | `libero_object` |
| `moojink/openvla-7b-oft-finetuned-libero-goal` | `libero_goal` |
| `moojink/openvla-7b-oft-finetuned-libero-10` | `libero_10` |
| `moojink/openvla-7b-oft-finetuned-libero-spatial-object-goal-10` | Combined |

**Step 3: Run evaluation**

```bash
python experiments/robot/libero/run_libero_eval.py \
  --pretrained_checkpoint moojink/openvla-7b-oft-finetuned-libero-spatial \
  --task_suite_name libero_spatial \
  --center_crop True \
  --num_trials_per_task 50 \
  --seed 7
```
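
To evaluate every suite in turn, the checkpoint table and the command above can be combined into a small command builder. This sketch only builds and prints the commands; the function and constant names are illustrative, and the flags are the ones shown in this document.

```python
import shlex

# Checkpoint-to-suite mapping from the table in Step 2.
SUITE_CHECKPOINTS = {
    "libero_spatial": "moojink/openvla-7b-oft-finetuned-libero-spatial",
    "libero_object": "moojink/openvla-7b-oft-finetuned-libero-object",
    "libero_goal": "moojink/openvla-7b-oft-finetuned-libero-goal",
    "libero_10": "moojink/openvla-7b-oft-finetuned-libero-10",
}
COMBINED_CHECKPOINT = "moojink/openvla-7b-oft-finetuned-libero-spatial-object-goal-10"


def build_eval_command(suite: str, use_combined: bool = False,
                       trials: int = 50, seed: int = 7) -> list:
    """Build the argv list for one LIBERO evaluation run."""
    if suite not in SUITE_CHECKPOINTS:
        raise ValueError(f"unknown task suite: {suite}")
    checkpoint = COMBINED_CHECKPOINT if use_combined else SUITE_CHECKPOINTS[suite]
    return [
        "python", "experiments/robot/libero/run_libero_eval.py",
        "--pretrained_checkpoint", checkpoint,
        "--task_suite_name", suite,
        "--center_crop", "True",
        "--num_trials_per_task", str(trials),
        "--seed", str(seed),
    ]


# Dry run: print one command per suite instead of executing it.
for suite in SUITE_CHECKPOINTS:
    print(shlex.join(build_eval_command(suite)))
```

To execute a command, pass the list to `subprocess.run(cmd, check=True)` from the repository root.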

**Step 4: Parse results**

```python
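import re

def parse_libero_log(log_path):
    """Print and return per-task success rates from a LIBERO eval log."""
    with open(log_path, encoding="utf-8") as f:
        content = f.read()
    # Matches lines of the form "Task <name>: <successes>/<trials> successes".
    matches = re.findall(r"Task (.+?): (\d+)/(\d+) successes", content)
    rates = {}
    for task, successes, trials in matches:
        # Guard against a task that logged zero trials.
        rate = int(successes) / int(trials) if int(trials) else 0.0
        rates[task] = rate
        print(f"  {task}: {rate:.0%} ({successes}/{trials})")
    return rates

parse_libero_log("experiments/logs/latest.log")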
```
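
Per-task rates are usually rolled up into one suite-level number for comparison with the benchmark table. The sketch below pools successes and trials across tasks using the same log pattern as above. The sample log lines are hypothetical, and real logs may be formatted differently.

```python
import re

PATTERN = re.compile(r"Task (.+?): (\d+)/(\d+) successes")

# Hypothetical log excerpt, for illustration only.
SAMPLE_LOG = """\
Task pick up the black bowl: 49/50 successes
Task put the bowl on the plate: 48/50 successes
"""


def suite_success_rate(log_text: str) -> float:
    """Pool successes and trials over all matched tasks into one rate."""
    total_successes = 0
    total_trials = 0
    for _task, successes, trials in PATTERN.findall(log_text):
        total_successes += int(successes)
        total_trials += int(trials)
    if total_trials == 0:
        raise ValueError("no task results found in log text")
    return total_successes / total_trials


rate = suite_success_rate(SAMPLE_LOG)
print(f"suite success rate: {rate:.1%}")  # 97/100 -> 97.0%
```

Pooling weights every episode equally, which matches a plain average of task rates only when all tasks use the same number of trials.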

---

## Workflow 3: Fine-tune on LIBERO

> **Detailed reference**: See [references/libero-workflow.md](references/libero-workflow.md) for the full