ClaudeWave
Skill · 1.2k repo stars · updated today

ship

The ship skill converts a finished evo optimization run into a clean, mergeable code change by distilling the winning experiment to its minimal diff, removing debug code and unnecessary edits, then either opening a pull request or merging directly depending on repository configuration. Use it when an optimization run is complete and you want to land the best result as production-ready code that maintainers would accept.

Install in Claude Code
git clone --depth 1 https://github.com/evo-hq/evo /tmp/ship && mkdir -p ~/.claude/skills && cp -r /tmp/ship/plugins/evo/skills/ship ~/.claude/skills/ship
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# Ship

Turn a finished evo run into a change a maintainer would merge.

The optimize loop leaves a tree of committed experiments. The winning worktree
diff is not mergeable as-is: it carries debug prints, search-process churn,
over-broad edits, and sometimes a test that was relaxed to clear a gate. Shipping
is the step that re-derives the *minimal clean change* reproducing the winning
behaviour, lands it the way the repo expects (PR or merge), and reports how
mergeable it is.

Correctness is the floor, not the goal. The score says the behaviour works; this
skill decides whether the *diff* is fit to merge.

## Invocation

```bash
/evo:ship            # ship the auto-selected winner
/evo:ship exp_0042   # ship a specific experiment instead
```

## Stage 1 -- Select the winner

Pick the experiment to ship, then confirm it with the user before touching their
tree.

```bash
evo status    # current best valid score + counts
evo report    # top valid experiments table + score chart
```

- The default winner is the highest-scoring valid result in the graph history,
  not the frontier. `evo frontier` is for choosing where to branch next; it can
  exclude an exhausted branch whose score is still the right thing to ship. An
  explicit `exp_id` argument overrides auto-selection.
- A shippable winner must be valid: `committed`, or `pruned` with
  `prune_kind=exhausted`, with a commit and score, no `gate_result === false`,
  and no invalid-pruned ancestor. Never select `discarded`, `failed`, `active`,
  `evaluated`, legacy-pruned nodes with no `prune_kind`, `prune_kind=invalid`, or
  descendants of invalid-pruned nodes. If no valid candidate exists, stop and
  report why nothing is safe to ship.
- Resolve the run's root (baseline) node, then show the cumulative change:
  ```bash
  evo diff <root_id> <winner_id>   # target-scoped cumulative diff, baseline -> winner
  ```
  For changes outside the benchmark target, diff the commits directly
  (`git diff <baseline_commit> <winner_commit>`); each node carries `.commit`.
- Present a one-screen summary: winner id, score baseline -> winner (delta),
  the winning hypothesis, and a diffstat. Get a go before proceeding.
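
The validity rules above can be sketched as a small predicate plus a "highest score wins" pick. This is a minimal sketch, not evo's implementation: the function names, the argument order, and the flat "id score" input format are all hypothetical, and only the status and `prune_kind` values come from the rules above.

```bash
# Hypothetical helper: is one node a valid ship candidate?
# Arguments (an assumed flattening of the node, not evo's real schema):
#   $1 status            committed | pruned | discarded | failed | active | evaluated
#   $2 prune_kind        exhausted | invalid | "" (legacy prune with no kind)
#   $3 gate_result       true | false | "" (no gate recorded)
#   $4 commit            commit hash, or "" if missing
#   $5 score             numeric score, or "" if missing
#   $6 invalid_ancestor  yes | no (any invalid-pruned node in the lineage)
is_shippable() {
  status=$1 prune_kind=$2 gate_result=$3 commit=$4 score=$5 invalid_ancestor=$6

  # Must carry both a commit and a score.
  [ -n "$commit" ] && [ -n "$score" ] || return 1
  # A failed gate or an invalid-pruned ancestor disqualifies the node.
  [ "$gate_result" != "false" ] || return 1
  [ "$invalid_ancestor" != "yes" ] || return 1

  case "$status" in
    committed) return 0 ;;
    pruned)    [ "$prune_kind" = "exhausted" ] ;;  # legacy and invalid prunes fail here
    *)         return 1 ;;                          # discarded, failed, active, evaluated
  esac
}

# Hypothetical helper: read "id score" lines for already-validated nodes on
# stdin and print the id with the highest score.
pick_winner() {
  LC_ALL=C sort -k2,2nr | head -n 1 | cut -d' ' -f1
}
```

An explicit `exp_id` would bypass `pick_winner`, but it should still pass `is_shippable` before anything is shipped.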

## Stage 2 -- Distill to a mergeable change

Work on a fresh branch off the user's current HEAD, not in the experiment
worktree. Re-derive the change so it stands on its own:

- **Scope restraint.** Keep only the files and lines the behaviour needs. Drop
  experiment scaffolding, debug logging, commented-out attempts, and churn the
  search introduced and then abandoned. Smaller, local diffs merge; sprawl does
  not.
- **Test integrity.** If the search weakened, skipped, or deleted a test to clear
  a gate, restore it. New behaviour that changes outputs needs a test that
  covers it. Never ship a green benchmark that rode on a loosened test -- call it
  out instead.
- **Mechanical cleanliness.** Match the repo's formatter and linter. No stray
  whitespace, no reordered imports unless the repo does that.
- **Codebase adherence.** Match surrounding naming, error handling, and structure.
  The diff should read like the file it lands in.

Then confirm the behaviour survived the distillation:

```bash
evo run <winner_id> --check    # or the project's benchmark / test command
```

If the distilled change no longer reproduces the winning score, do not paper over
it -- report the gap (which part of the experiment diff was load-bearing) and let
the user decide. Best-effort means being honest about what could not be cleaned
up, not silently shipping the raw worktree.
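
Part of the scope and test-integrity review in this stage can be mechanised. The sketch below flags test files touched by a change and added lines that look like debug residue. Both helper names and both pattern lists are assumptions for illustration; they are heuristics to adapt to the repo's conventions and do not replace reading the diff.

```bash
# Hypothetical helper: read changed paths on stdin (for example from
# `git diff --name-only`) and print the ones that look like test files.
# The naming patterns are assumptions; adjust them to the repo.
touched_tests() {
  grep -E '(^|/)(tests?|__tests__)/|(^|/)test_[^/]*$|_test\.[^/]+$|\.(test|spec)\.[^/]+$' || true
}

# Hypothetical helper: read a unified diff on stdin and print added lines
# that match common debug markers. "+++" file headers are skipped.
debug_residue() {
  grep -E '^\+[^+]' | grep -E 'print\(|console\.log\(|pdb\.set_trace|TODO|XXX' || true
}

# Usage, on the distilled branch:
#   git diff --name-only <baseline_commit> HEAD | touched_tests
#   git diff <baseline_commit> HEAD | debug_residue
```

Any path printed by `touched_tests` deserves an explicit note in the report, whether the test was strengthened, weakened, or newly added.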

## Stage 3 -- Land

Detect how the repo expects changes to arrive:

```bash
git remote -v
```

- **Remote present** -> open a pull request. Commit the distilled change on its
  branch, push, and `gh pr create` with the mergeability report (Stage 4) as the
  body. Do not push or open the PR without the user's go.
- **No remote** -> merge the distilled change into the user's working branch as a
  single clean commit. Do not force, do not rewrite existing history.
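
The branch between the two landing modes can be sketched as follows. This is a dry-run sketch under stated assumptions: `landing_mode` and `land_plan` are hypothetical names, the first listed remote is assumed to be the push target, and a squash merge is assumed to be how the single clean commit is produced. It prints the commands instead of running them, so nothing is pushed before the user's go.

```bash
# Print "pr" when at least one remote is configured, otherwise "merge".
# $1: output of `git remote` (one name per line; empty when there is none).
landing_mode() {
  if [ -n "$1" ]; then echo pr; else echo merge; fi
}

# Print (not run) the commands for the chosen mode.
# $1: distilled branch, $2: report file, $3: output of `git remote`.
land_plan() {
  branch=$1 report=$2 remotes=$3
  case "$(landing_mode "$remotes")" in
    pr)
      # Assumption: the first listed remote is the one to push to.
      remote=$(printf '%s\n' "$remotes" | head -n 1)
      echo "git push -u $remote $branch"
      echo "gh pr create --head $branch --title '<title>' --body-file $report"
      ;;
    merge)
      # Assumption: squash-merge, then commit with the provenance message.
      echo "git merge --squash $branch"
      echo "git commit -F <message-file>"
      ;;
  esac
}

# Usage:
#   land_plan ship/exp_0042 report.md "$(git remote)"
```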

The landed commit message carries provenance: the winning experiment id, the
score delta, and the one-line hypothesis. State what changed and why it is safe;
do not narrate the search process.
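
One possible shape for such a message, as a sketch: the hypothesis becomes the subject line, and the experiment id and score delta go in the body. The function name and the exact layout are assumptions, not a format evo prescribes.

```bash
# Hypothetical helper: build a commit message that carries provenance.
# $1: experiment id, $2: baseline score, $3: winner score, $4: one-line hypothesis.
ship_commit_message() {
  exp_id=$1 baseline=$2 winner=$3 hypothesis=$4
  # Signed delta, computed in awk because POSIX shell has no float arithmetic.
  delta=$(awk -v b="$baseline" -v w="$winner" 'BEGIN { printf "%+.4g", w - b }')
  printf '%s\n\nevo experiment: %s\nscore: %s -> %s (%s)\n' \
    "$hypothesis" "$exp_id" "$baseline" "$winner" "$delta"
}

# Usage:
#   ship_commit_message exp_0042 0.50 0.75 'Replace O(n^2) dedup with a hash set' > msg.txt
#   git commit -F msg.txt
```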

## Stage 4 -- Mergeability report (advisory)

Always produce the report. It never blocks the merge -- it tells the user, and a
future reviewer, how mergeable the change is across the axes a maintainer judges
on:

- **Technique** -- what the change actually does to move the score, named
  concretely (the algorithm, data structure, or mechanism), not the search
  story. Distilled from the winning hypothesis: "replaced the O(n^2) dedup with a
  hash set", not "exp_0042 improved throughput". This is what a reviewer reads
  first.
- **Behavioural correctness** -- score baseline -> shipped (delta); benchmark
  status after distillation.
- **Regression safety** -- full test suite result on the distilled change.
- **Scope** -- files touched, diff size, whether the change stays local.
- **Test correctness** -- explicit yes/no on whether any test was modified,
  weakened, or removed, with detail; whether new behaviour is covered.
- **Mechanical cleanliness** -- formatter / linter status.
- **Codebase adherence** -- a note on style/convention fit.

Lead with a plain-language summary: what changed and why it is safe to merge. On
a remote repo this is the PR body. With no remote, print it and save it alongside
the run so the user can paste it into a review later.
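
The report can be kept to a fixed skeleton so that no axis is silently dropped. A minimal sketch, assuming the findings are already collected in shell variables. The function name, the variable names, and the Markdown layout are all hypothetical. Unset axes print as "not assessed" rather than disappearing.

```bash
# Hypothetical helper: print the mergeability report as Markdown.
# Each axis is read from a shell variable; unset ones show "not assessed".
mergeability_report() {
  cat <<EOF
## Summary
${SUMMARY:-not assessed}

- **Technique**: ${TECHNIQUE:-not assessed}
- **Behavioural correctness**: ${CORRECTNESS:-not assessed}
- **Regression safety**: ${REGRESSION:-not assessed}
- **Scope**: ${SCOPE:-not assessed}
- **Test correctness**: ${TESTS:-not assessed}
- **Mechanical cleanliness**: ${LINT:-not assessed}
- **Codebase adherence**: ${STYLE:-not assessed}
EOF
}

# Usage:
#   SUMMARY='Swaps the quadratic dedup for a hash set.'
#   TESTS='no tests modified; new path covered by one added test'
#   mergeability_report > report.md
```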

## Guardrails (firm)

Everything above is method you can adapt to the repo. These are not:

- Never weaken, skip, or delete a test to make the change land. If the experiment
  did, restore it and report it.
- Never ship invalid-pruned, legacy-pruned, discarded, failed, active,
  evaluated, gate-failed, or invalid-lineage nodes. Among pruned nodes, only
  those with `prune_kind=exhausted` remain ship candidates.
- Never push or open a PR without the user's explicit go.
- Never rewrite or force-overwrite existing history on

discover · Skill

Initialize evo for the current repository by exploring the codebase, proposing unexplored optimization dimensions, constructing the benchmark inside a baseline worktree, and running the first experiment. Use when the user invokes /evo:discover, mentions setting up evo, wants to instrument a codebase for autonomous optimization, or asks to start a new evo run on a project.

infra-setup · Skill

Non-user-invocable provider/setup reference for evo backend switching, prerequisite checks, and auth/install guidance.

optimize · Skill

Run the evo optimization loop with parallel subagents until interrupted.

report · Skill

Read-only evo run reporting. Use when the user invokes /evo:report, asks what happened overnight, asks what improved recently, asks for the best/frontier candidates, asks for a quick score chart without opening the dashboard, or wants the scatter plot in chat output. Never run benchmarks, gates, Slurm commands, evo run, or ad-hoc verification scripts for report requests.

subagent · Skill

Protocol that evo optimization subagents follow when dispatched from /optimize. Auto-loaded by spawned subagents via their host's skill loader. The orchestrator may also invoke this skill to understand the brief shape its dispatched subagents expect + what they're required to emit -- useful when writing briefs or debugging a subagent's behavior.

finetuning · Skill

This skill should be used when picking or diagnosing a training move (SFT, LoRA, DPO/KTO/ORPO, RFT, GRPO/PPO/RLOO, RLHF), or when the user mentions fine-tuning, post-training, training recipe, reward design, or weight updates. Decision tree by reward shape, smoke-run gate, three failure diagnostics, five false-progress patterns. Provider recipes and I/O contract in references/.