ClaudeWave
Skill · 1.1k repo stars · updated today

finetuning

This skill helps practitioners select and diagnose fine-tuning techniques for language models by matching reward shape to method (SFT, LoRA, DPO/KTO/ORPO, RL variants). Use it when choosing a training approach, evaluating why a training run underperformed, or designing reward signals. It provides a decision tree prioritizing verifiable rewards for RL, preference pairs for DPO, demonstrations for SFT, and filtered SFT for hybrid approaches, along with diagnostics for common failure patterns and guidance on literature review before committing compute.

Install in Claude Code
git clone --depth 1 https://github.com/evo-hq/evo /tmp/finetuning && cp -r /tmp/finetuning/plugins/evo/skills/finetuning ~/.claude/skills/finetuning
Then open a new Claude Code session; the skill loads automatically.

SKILL.md

# Finetuning

Priors, not rules. The only firm guardrails: a held-out eval you never train on, no leakage, and trusting evo's recorded numbers over the run's self-report. Anything else can be overridden if the eval gate supports it.

## Pick the technique by reward shape

Decide on the reward first, technique second. Choosing the comfortable technique over the matching one is the most common failure.

| Reward shape | Technique |
|---|---|
| Verifiable (exact match, unit tests, parser-decidable) | **RL** (GRPO / RLOO / PPO) — reward includes format, so the model learns to emit verifier-acceptable shape |
| Preference pairs (chosen vs rejected) | **DPO / KTO / ORPO** — cheaper than full RL, no rollouts |
| Demonstrations only (curated traces, chat data) | **SFT** — install format/tone/capability the base lacks |
| Have a scorer + want SFT stability | **RFT** — sample, filter by reward, SFT on survivors |

"SFT-then-RL" is not a law. For a competent base model on a verifiable benchmark, RL-from-base often beats SFT-then-RL end-to-end.
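Read as code, the table is a plain lookup from reward shape to technique family. A minimal sketch; the enum and function names are hypothetical and not part of evo:

```python
from enum import Enum


class RewardShape(Enum):
    # Hypothetical labels for the four rows of the table above.
    VERIFIABLE = "verifiable"
    PREFERENCE_PAIRS = "preference_pairs"
    DEMONSTRATIONS = "demonstrations"
    SCORER_WITH_SFT_STABILITY = "scorer_with_sft_stability"


TECHNIQUE_BY_REWARD = {
    RewardShape.VERIFIABLE: "RL (GRPO / RLOO / PPO)",
    RewardShape.PREFERENCE_PAIRS: "DPO / KTO / ORPO",
    RewardShape.DEMONSTRATIONS: "SFT",
    RewardShape.SCORER_WITH_SFT_STABILITY: "RFT",
}


def pick_technique(shape: RewardShape) -> str:
    """Return the technique family the table pairs with this reward shape."""
    return TECHNIQUE_BY_REWARD[shape]
```

The lookup is only the structural prior: it takes the reward shape as input, so the reward has to be decided before the technique.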

## Research the literature before the first commit

The decision tree above is the structural prior. The empirical answer for *this* model on *this* benchmark usually has a recent paper, blog, or HF Space recipe behind it -- and what beats baseline on a 4B base model in 2026 is not what the agent's pre-training data captures. Before picking the technique for `exp_0001` (the first experiment after baseline), invoke `evo:ideator` with a `literature` brief:

```
Task(
    subagent_type="evo:ideator",
    prompt="brief=literature\n"
           "model_family=<e.g. Qwen3-4B-Base, Llama-3.1-8B-Base>\n"
           "benchmark=<name + URL/paper if known>\n"
           "objective=<one line: what beats baseline looks like>\n"
           "constraints=<budget, data sources allowed, gated models forbidden, etc>"
)
```

The ideator returns ranked proposals with references (arXiv, HF Hub, GitHub, blogs). Read them before picking from the reward-shape table. A paper showing GRPO-from-base works on `<model_family>` for a similar verifiable benchmark beats applying the table cold.

Run this **once before `exp_0001`**, and again whenever the optimize loop hits a plateau (the "stuck across distinct techniques" diagnostic below). Not every subsequent experiment needs a literature pass -- the table + diagnostics carry the rest.

## Before committing the budget: smoke-run

Run the full pipeline on ~10 examples for ~1 minute. It must produce a checkpoint the benchmark can load AND a non-zero eval on a held-out item. If not, the recipe is broken: fix it, don't scale it. Dtype mismatch, tokenizer/template drift, OOM at this batch size, an empty artifacts dir despite falling loss: all of these surface on 10 examples. Running longer doesn't surface them differently, just more expensively.
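The two smoke-run conditions can be asserted mechanically at the end of the short run. A minimal sketch, assuming hypothetical `load_checkpoint` and `score_item` callables that you supply; ideally they are the same loader and scorer the benchmark uses:

```python
from pathlib import Path
from typing import Callable, Sequence


def smoke_check(
    checkpoint_dir: str,
    load_checkpoint: Callable[[str], object],      # hypothetical: the benchmark's loader
    score_item: Callable[[object, object], float],  # hypothetical: the benchmark's scorer
    held_out_items: Sequence[object],
) -> None:
    """Raise if the smoke-run lacks a loadable checkpoint or a non-zero eval."""
    ckpt = Path(checkpoint_dir)
    # Catches the "empty artifacts dir despite falling loss" failure.
    if not ckpt.is_dir() or not any(ckpt.iterdir()):
        raise RuntimeError(f"no artifacts in {ckpt}: recipe is broken")
    # Loading surfaces dtype and tokenizer/template problems early.
    model = load_checkpoint(checkpoint_dir)
    scores = [score_item(model, item) for item in held_out_items]
    # At least one held-out item must score above zero.
    if not any(s > 0 for s in scores):
        raise RuntimeError("eval is zero on every held-out item: recipe is broken")
```

Call it as the last step of the smoke-run script so a broken recipe fails loudly before any real budget is spent.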

## Long training: checkpoint, mid-eval, early-stop in-script

Training for an hour and getting one number at the end is the wrong granularity for evo's tree search. By the time you know the recipe failed, you've spent the budget. Build the verification *into* the training script, not around it.

Pattern for any training run expected to exceed ~30 min wall-clock:

1. **Periodic checkpoint** every N steps (e.g. every 0.25 epoch or every 200 steps, whichever comes first).
2. **Mini-eval after each checkpoint** on a small held-out subset (5–10 items, not the full held-out — that's reserved for the final committed score). Same scorer as the real eval; the model just sees fewer items.
3. **Early-stop on regression**: track best mid-eval score; stop if it hasn't improved in `patience` checkpoints (typically 2). Don't burn 60 more minutes once the trajectory has flattened or reverted.
4. **Save the BEST checkpoint, not the last.** Early-stop means the current model is probably past its peak; the checkpoint you commit should be the one that scored highest mid-training, not whatever the trainer happened to leave behind.
5. **Log every mid-eval score to your tracker** (see `## Stream training metrics live`). The user watching the live dashboard sees the trajectory build up step-by-step instead of staring at the loss curve hoping it transfers.

HuggingFace TRL: trainers accept `transformers` `TrainerCallback`s, so implement one with `on_step_end` that saves a checkpoint, runs the mini-eval via vLLM or HF transformers, compares to `best_score`, and sets `control.should_training_stop = True` on stall. The pattern is one ~30-line class.
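One way the callback could look is sketched below. `run_mini_eval` and `save_best` are hypothetical user-supplied callables (not part of TRL or evo), and the sketch assumes a single training process. It hands the in-memory model to both callables rather than reloading a checkpoint from disk; that is a simplification, not necessarily how the original recipe does it.

```python
try:
    from transformers import TrainerCallback
except ImportError:
    # Fallback so the sketch can be exercised without transformers installed.
    TrainerCallback = object


class MidEvalEarlyStop(TrainerCallback):
    """Mini-eval every `eval_every` steps; keep the best; stop on stall."""

    def __init__(self, run_mini_eval, save_best, eval_every=200, patience=2):
        # run_mini_eval(model, step) -> float, higher is better (hypothetical).
        # save_best(model, step) persists the best checkpoint (hypothetical).
        self.run_mini_eval = run_mini_eval
        self.save_best = save_best
        self.eval_every = eval_every
        self.patience = patience
        self.best_score = None
        self.best_step = None
        self.stale_checks = 0

    def on_step_end(self, args, state, control, **kwargs):
        step = state.global_step
        if step == 0 or step % self.eval_every != 0:
            return control
        model = kwargs.get("model")  # the Trainer passes the live model here
        score = self.run_mini_eval(model, step)
        if self.best_score is None or score > self.best_score:
            # New best: record it and persist this checkpoint, not the last one.
            self.best_score, self.best_step = score, step
            self.stale_checks = 0
            self.save_best(model, step)
        else:
            self.stale_checks += 1
            if self.stale_checks >= self.patience:
                control.should_training_stop = True
        return control
```

Pass an instance through the trainer's `callbacks=[...]` argument. Logging each mid-eval score to the tracker would go inside `run_mini_eval`.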

Keep vLLM warm across mid-evals when you can (one serve process, reload adapter between checkpoints) — cold-starting vLLM every 200 steps adds 5 min of overhead per checkpoint.

Use a tighter mini-eval subset than the full held-out. The mini-eval is a *signal*, not the score that gets committed. If the mini-eval scores ≥ baseline on its subset, run the full held-out as the eval-gate scoring pass at the end. If it doesn't, early-stop.
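A minimal sketch of the mini-eval subset and the gate decision. The function names are hypothetical, and the fixed seed is an assumption added here so that every checkpoint, and the baseline, is scored on the same items:

```python
import random


def pick_mini_eval(held_out_ids, k=8, seed=0):
    """Return a fixed, reproducible subset of held-out ids for mid-training evals."""
    ids = sorted(held_out_ids)  # sort first so input order does not change the draw
    rng = random.Random(seed)   # private RNG; leaves global random state untouched
    return rng.sample(ids, min(k, len(ids)))


def should_run_full_eval(mini_score, baseline_mini_score):
    """Run the full held-out pass only if the mini-eval is at least at baseline."""
    return mini_score >= baseline_mini_score
```

`baseline_mini_score` is the baseline model's score on the same subset; comparing against the baseline's full held-out score would mix two different item sets.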

This is Pattern B in the design tradeoff against multi-node staging (Pattern A: break the training into multiple committed evo nodes, one per stage). Pattern B keeps the experiment as one evo node with the verification logic inside the script; it's simpler to write and avoids per-stage vLLM spin-up, at the cost of less tree-search introspection. Prefer multi-stage as separate nodes when you want the orchestrator to be able to branch alternative continuations from any mid-training checkpoint.

## Cap retries at training scale

`evo run` allows up to `max_attempts=3` retries per experiment by default. That budget was designed for second-scale benchmarks where retrying after an edit-bug fix is free. At training scale (~hours per attempt), it's the wrong tradeoff — by attempt 2 you've spent more compute than just trying a fresh hypothesis would cost.

For training-heavy workspaces, set the cap to 1 once at init:

```bash
evo config set max-attempts 1
```

One attempt, one shot. Regression → `evo discard` → new branch from parent with a different hypothesis. This pairs with the in-script early-stop above: each attempt is single-shot, but its interna