optimize

The optimize skill runs the evo optimization loop. Each round, an orchestrator identifies failure patterns, writes one brief per subagent, and spawns the subagents in parallel to explore different improvement strategies within their assigned briefs. Use it to iterate on code optimization automatically: the loop continues until a configurable number of consecutive rounds produce no improvement or the process is interrupted.

Repository: evo

Install in Claude Code

git clone --depth 1 https://github.com/evo-hq/evo /tmp/optimize && cp -r /tmp/optimize/scripts/rlm_eval/baseline_ ~/.claude/skills/optimize

Then start a new Claude Code session; the skill loads automatically.

Definition

baseline_SKILL.md

Run the `evo` optimization loop. Each round, the orchestrator writes structured briefs and spawns parallel subagents that execute within them. Each subagent is semi-autonomous: it reads the pointer traces, forms the concrete edit, runs experiments, and can iterate within its branch. Runs until interrupted or the stall limit is reached.

## Host conventions

This skill runs on any host that implements the Agent Skills spec. When the body uses generic phrases, apply the host's best-fit equivalent:

- **"spawn N subagents in parallel"** -- use your host's parallel-subagent or background-task tool if you have one (e.g. `Agent` with `run_in_background`, `spawn_agent` + `wait_agent`, `spawn_agents_on_csv` for batch). Respect the host's concurrency cap -- if N exceeds it, run in batches. If the host has no parallel-subagent tool, run them serially and note the reduced round width in the final summary.
- **Slash commands shown in user-facing copy** (e.g. `/evo:optimize`) -- translate to your host's mention syntax when speaking to the user (e.g. `$evo optimize` on Codex -- plugin namespace then skill name, separated by a space).
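
The batching rule for hosts with a concurrency cap can be sketched as below. This is a minimal illustration, not part of evo: `batch_by_cap` and the brief labels are hypothetical names, and how a host actually dispatches each batch is left out.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def batch_by_cap(items: Sequence[T], cap: int) -> List[List[T]]:
    """Split items into consecutive batches of at most `cap` elements."""
    if cap < 1:
        raise ValueError("cap must be at least 1")
    return [list(items[i:i + cap]) for i in range(0, len(items), cap)]


# Hypothetical round: 5 briefs on a host that allows 2 concurrent subagents.
briefs = ["brief-1", "brief-2", "brief-3", "brief-4", "brief-5"]
batches = batch_by_cap(briefs, cap=2)
print(batches)
```

With `cap=1` the same helper degenerates to the serial fallback: one brief per batch.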

## Configuration

These defaults can be overridden via arguments: `/optimize [subagents=N] [budget=N] [stall=N]`

- **subagents**: number of parallel subagents per round (default: 5)
- **budget**: max iterations each subagent can run within its branch (default: 5)
- **stall**: consecutive rounds with no improvement before auto-stopping (default: 5)
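
As a rough sketch of how these overrides combine with the defaults, the following parses a `key=N` argument string. The `OptimizeConfig` class and `parse_optimize_args` function are hypothetical; the skill itself does not prescribe how a host passes or parses arguments.

```python
from dataclasses import dataclass


@dataclass
class OptimizeConfig:
    subagents: int = 5  # parallel subagents per round
    budget: int = 5     # max iterations per subagent branch
    stall: int = 5      # rounds without improvement before stopping


def parse_optimize_args(arg_string: str) -> OptimizeConfig:
    """Parse 'subagents=N budget=N stall=N' tokens; omitted keys keep defaults."""
    config = OptimizeConfig()
    for token in arg_string.split():
        key, sep, value = token.partition("=")
        if not sep or key not in ("subagents", "budget", "stall"):
            raise ValueError(f"unrecognized argument: {token!r}")
        number = int(value)  # raises ValueError on non-integer input
        if number < 1:
            raise ValueError(f"{key} must be at least 1")
        setattr(config, key, number)
    return config


config = parse_optimize_args("subagents=3 stall=2")
print(config)
```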

## Prerequisites

- Workspace must be initialized (`evo status` should succeed)
- A baseline experiment must be committed (run `/discover` first)
- All benchmark dependencies must be available in the environment
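
The first prerequisite can be checked programmatically. The sketch below assumes that "succeed" means `evo status` exits with code 0, which the source does not state explicitly; `workspace_initialized` is a hypothetical helper, and the `run` parameter exists only so the check can be exercised without the CLI installed.

```python
import subprocess
from typing import Callable


def workspace_initialized(
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Return True if `evo status` runs and exits with code 0 (assumed success signal)."""
    try:
        result = run(["evo", "status"], capture_output=True, text=True)
    except FileNotFoundError:
        # The evo executable is not on PATH.
        return False
    return result.returncode == 0
```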

## Architecture

```
Orchestrator (this agent):
- Reads state, identifies failure patterns cross-cutting the tree
- Writes one brief per subagent: objective + parent + boundaries + pointer traces
- Verifies briefs are diverse (no two attacking the same surface)
- Collects results, prunes dead branches, adjusts strategy

Subagent A (brief, budget: N iterations):
- Reads its pointer traces, forms the concrete edit
- Creates experiment, edits target, runs benchmark, analyzes
- If budget remains and sees a promising follow-up, continues
- Can run up to N serial experiments on its own branch
- Returns: what it tried, what worked, what it learned

Subagent B (different brief, budget: N iterations):
- Same protocol, non-overlapping objective
...
```

Both layers read traces; the depth differs. The orchestrator scans for cross-cutting patterns (which failures are common, which branches plateau) -- enough to pick N non-overlapping briefs. Subagents read their pointer traces in depth, enough to commit to a concrete edit. Structured briefs are what prevent parallel subagents from duplicating each other's work.

**Trace instrumentation style**: `.evo/meta.json`'s `instrumentation_mode` records `sdk` vs `inline`. Subagents must stay consistent with it (see `skills/subagent/SKILL.md` for details).
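
Reading the recorded mode might look like the sketch below. It assumes `instrumentation_mode` is a top-level key in `.evo/meta.json`; the source names the key and its two values but not the file's full layout, and `read_instrumentation_mode` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path

VALID_MODES = {"sdk", "inline"}


def read_instrumentation_mode(workspace: Path) -> str:
    """Return the instrumentation mode recorded in <workspace>/.evo/meta.json."""
    meta_path = workspace / ".evo" / "meta.json"
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    mode = meta.get("instrumentation_mode")  # assumed to be a top-level key
    if mode not in VALID_MODES:
        raise ValueError(f"unexpected instrumentation_mode: {mode!r}")
    return mode


# Demo against a throwaway workspace so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    (workspace / ".evo").mkdir()
    (workspace / ".evo" / "meta.json").write_text(
        json.dumps({"instrumentation_mode": "sdk"}), encoding="utf-8"
    )
    mode = read_instrumentation_mode(workspace)

print(mode)
```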

## The Loop

Repeat until interrupted or stall limit reached:
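
The stall condition can be sketched as a counter that resets on every improving round. This is an illustration under assumptions: higher scores are treated as better, each round is summarized by its best score, and `rounds_until_stall` is a hypothetical name rather than anything evo provides.

```python
from typing import Iterable


def rounds_until_stall(
    baseline: float, round_best_scores: Iterable[float], stall_limit: int = 5
) -> int:
    """Count the rounds that run before `stall_limit` consecutive rounds fail to improve."""
    best = baseline
    stalled = 0
    rounds = 0
    for score in round_best_scores:
        rounds += 1
        if score > best:  # assumption: higher is better
            best = score
            stalled = 0
        else:
            stalled += 1
        if stalled >= stall_limit:
            break
    return rounds


# Illustrative scores only, not measured results.
scores = [0.55, 0.55, 0.54, 0.60, 0.60, 0.59]
print(rounds_until_stall(0.50, scores, stall_limit=2))
```
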

### 1. Read current state

```bash
evo scratchpad # full state: tree, best path, frontier, annotations, diffs, gates, what-not-to-try
evo frontier # explorable nodes (JSON)
evo status # one-line summary
evo annotations # all annotations (filterable with --task/--exp)
evo path <id> # root-to-node chain with scores
evo diff <id> # diff vs parent
evo diff <id> <other> # diff between any two experiments
evo gate list <id> # effective gates for a node (inherited from ancestors)
```

On the first iteration, also read `.evo/project.md` to understand the optimization surface.

### 2. Analyze state and write subagent briefs

From the scratchpad, frontier, traces, and annotations, determine:
- Which frontier nodes are most promising
- What failure patterns are most common and impactful
- What strategies have been tried and their outcomes
- Which branches are plateauing or exhausted
- What gates exist on each frontier node (`evo gate list <id>`) -- subagents must satisfy these

**Read the "Awaiting Decision" section of the scratchpad.** Evaluated nodes (ran, bad outcome, not yet discarded) are a cross-agent signal: if three subagents in the last round produced evaluated nodes that all failed the same gate, surface the pattern -- maybe the gate is too tight, maybe the approach has a shared flaw. Either tell the next round to avoid it, or propose a brief that attacks it directly. Without this cross-cutting read, each subagent rediscovers the same wall independently.
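
One way to surface such a shared wall is to count, per gate, how many evaluated nodes failed it. The node shape below (`id`, `failed_gates`) is a hypothetical stand-in for whatever the scratchpad actually reports, and the threshold of three mirrors the example in the paragraph above.

```python
from collections import Counter
from typing import Dict, List


def shared_gate_failures(evaluated_nodes: List[Dict], threshold: int = 3) -> List[str]:
    """Return gates failed by at least `threshold` distinct evaluated nodes."""
    counts: Counter = Counter()
    for node in evaluated_nodes:
        # set() so a node listing the same gate twice is counted once.
        for gate in set(node.get("failed_gates", [])):
            counts[gate] += 1
    return sorted(gate for gate, n in counts.items() if n >= threshold)


# Hypothetical evaluated nodes from the previous round.
nodes = [
    {"id": "exp_a", "failed_gates": ["latency"]},
    {"id": "exp_b", "failed_gates": ["latency", "accuracy"]},
    {"id": "exp_c", "failed_gates": ["latency"]},
    {"id": "exp_d", "failed_gates": []},
]
print(shared_gate_failures(nodes))
```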

Then write **one brief per subagent** with these four fields:

1. **Objective** -- one sentence describing the bottleneck to attack and the evidence for it. Should name *where in the system's behavior* the gain is hiding (e.g., "tool-use error recovery fails after the first bad call across tasks 2, 5, 7") but **must not name specific files, functions, or concrete edits** -- that's the subagent's job after it reads the code.
2. **Parent node** -- which experiment to branch from.
3. **Boundaries / anti-patterns** -- what this subagent should NOT try, explicitly called out with reasons. Include approaches already tried and discarded (from "What Not To Try"), gates it must not regress, and anything adjacent subagents in this round are doing (so it doesn't duplicate).
4. **Pointer traces** -- task IDs the subagent should study first, with a one-line reason each.
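
The four fields map naturally onto a small record. The sketch below is one possible in-memory shape, not the format evo or the subagent skill requires; the `Brief` class, its `render` layout, and the example values (including the parent ID) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Brief:
    objective: str                  # bottleneck + evidence, no files or functions
    parent: str                     # experiment to branch from
    boundaries: List[str]           # what NOT to try, with reasons
    pointer_traces: Dict[str, str]  # task ID -> one-line reason

    def render(self) -> str:
        """Format the brief as plain text for handing to a subagent."""
        lines = [
            f"Objective: {self.objective}",
            f"Parent node: {self.parent}",
            "Boundaries / anti-patterns:",
        ]
        lines += [f"- {b}" for b in self.boundaries]
        lines.append("Pointer traces:")
        lines += [f"- {task}: {reason}" for task, reason in self.pointer_traces.items()]
        return "\n".join(lines)


brief = Brief(
    objective="tool-use error recovery fails after the first bad call across tasks 2, 5, 7",
    parent="exp_0003",  # hypothetical experiment ID
    boundaries=["Do not add blanket retries: already tried and discarded"],
    pointer_traces={"2": "first failure after a bad call", "5": "same pattern, longer run"},
)
text = brief.render()
print(text)
```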

Be specific and bounded. Vague briefs like "improve accuracy" cause subagents to duplicate each other's work; structured briefs prevent it.

**Diversity check (before spawning).** Re-read the N briefs side by side. If two briefs:
- point at the same objective phrased differently, OR
- cite overlapping pointer traces without meaningfully different framings, OR
- attack the same area of the system,

merge or re-scope one of them. The frontier/pruning logic handles t
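
Only the second criterion, overlapping pointer traces, lends itself to a mechanical pre-check; judging whether two objectives or target areas coincide remains the orchestrator's call. A minimal sketch, with `overlapping_pairs` and the brief labels as hypothetical names:

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def overlapping_pairs(
    traces_by_brief: Dict[str, Set[str]], max_shared: int = 0
) -> List[Tuple[str, str, Set[str]]]:
    """Return brief pairs sharing more than `max_shared` pointer traces."""
    flagged = []
    for (name_a, traces_a), (name_b, traces_b) in combinations(traces_by_brief.items(), 2):
        shared = traces_a & traces_b
        if len(shared) > max_shared:
            flagged.append((name_a, name_b, shared))
    return flagged


# Hypothetical round of three briefs keyed by label, each with its task IDs.
round_traces = {
    "A": {"2", "5", "7"},
    "B": {"5", "9"},
    "C": {"1"},
}
flagged = overlapping_pairs(round_traces)
print(flagged)
```

A flagged pair is a prompt to re-read the two briefs, not an automatic merge: overlapping traces are acceptable when the framings differ meaningfully.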

More from this repository

discover (Skill)

Initialize evo for the current repository by exploring the codebase, proposing unexplored optimization dimensions, constructing the benchmark inside a baseline worktree, and running the first experiment. Use when the user invokes /evo:discover, mentions setting up evo, wants to instrument a codebase for autonomous optimization, or asks to start a new evo run on a project.

infra-setup (Skill)

Non-user-invocable provider/setup reference for evo backend switching, prerequisite checks, and auth/install guidance.

report (Skill)

Read-only evo run reporting. Use when the user invokes /evo:report, asks what happened overnight, asks what improved recently, asks for the best/frontier candidates, asks for a quick score chart without opening the dashboard, or wants the scatter plot in chat output. Never run benchmarks, gates, Slurm commands, evo run, or ad-hoc verification scripts for report requests.

subagent (Skill)

Protocol that evo optimization subagents follow when dispatched from /optimize. Auto-loaded by spawned subagents via their host's skill loader. The orchestrator may also invoke this skill to understand the brief shape its dispatched subagents expect + what they're required to emit -- useful when writing briefs or debugging a subagent's behavior.

finetuning (Skill)

This skill should be used when picking or diagnosing a training move (SFT, LoRA, DPO/KTO/ORPO, RFT, GRPO/PPO/RLOO, RLHF), or when the user mentions fine-tuning, post-training, training recipe, reward design, or weight updates. Decision tree by reward shape, smoke-run gate, three failure diagnostics, five false-progress patterns. Provider recipes and I/O contract in references/.

ship (Skill)

Land the winning experiment from an evo run as a clean, mergeable change -- open a PR when the repo has a remote, otherwise merge into the working branch. Distills the best-scoring experiment down to the minimal diff that reproduces its behaviour, shaped for the qualities a maintainer merges on (scope discipline, test integrity, style adherence), then attaches an advisory mergeability report. Use when the user invokes /evo:ship, asks to land/merge/ship the best result, or wants to turn a finished optimization into a pull request.