evolve
The evolve skill runs autonomous improvement loops that iteratively measure code quality against a rubric, identify the highest-priority issue, execute fixes via the rpi skill, and repeat until stopping conditions are met. Use it when you need continuous, hands-off code refinement that prioritizes work by impact, such as closing coverage gaps, reducing complexity, fixing failing tests, addressing security vulnerabilities, or resolving directive mismatches across a codebase.
git clone --depth 1 https://github.com/boshu2/agentops /tmp/evolve && cp -r /tmp/evolve/images/gemini/skills/evolve ~/.claude/skills/evolve
SKILL.md
# /evolve — Goal-Driven Compounding Loop
> **Cross-vendor analog:** Anthropic Managed Agents Outcomes (May 2026). Both close the loop "agent runs → grader scores against a rubric → agent retries"; AgentOps does it locally against any model.
> Measure what's wrong. Fix the worst thing. Measure again. Compound.
**The loop runs as this skill (skills-are-the-runtime).** The `evolve` skill selects work
and invokes complete `/rpi --auto` cycles; that *is* the loop. The `evolve` CLI command (and
`ao rpi loop --supervisor`) is a terminal-native **wrapper command** for humans or
non-skill runtimes, not the default expression of the loop; both wrappers reuse the same v2
RPI loop engine. The substrate dispatches the whole `evolve` skill loop as one
unit and never drives the loop's insides. The `evolve`/`ao rpi` CLI wrappers are
being retired (ag-iowf).
**Operator cadence:** post-mortem finished work, analyze the current repo state,
select or create the next highest-value work item, let `rpi` handle research,
planning, pre-mortem, implementation, and validation, then harvest follow-ups
and repeat until a kill switch, max-cycle cap, regression breaker, or real
dormancy stops the run.
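The stop conditions named above can be pictured as a single check made at the top of each cycle. This is a minimal sketch, not the skill's actual logic: the function name `stop_reason`, its positional arguments, the use of `0` for "unlimited", and the order in which the conditions are checked are all assumptions made for illustration.

```bash
# stop_reason CYCLE MAX_CYCLES KILL REGRESSION DORMANT
#   CYCLE       completed cycles so far
#   MAX_CYCLES  cycle cap; 0 stands in for "unlimited" in this sketch
#   KILL, REGRESSION, DORMANT  0/1 flags computed elsewhere (hypothetical inputs)
# Prints the reason to stop, or "continue" when another cycle should run.
stop_reason() {
  if [ "$3" -eq 1 ]; then
    echo "kill-switch"
  elif [ "$2" -gt 0 ] && [ "$1" -ge "$2" ]; then
    echo "max-cycles"
  elif [ "$4" -eq 1 ]; then
    echo "regression-breaker"
  elif [ "$5" -eq 1 ]; then
    echo "dormancy"
  else
    echo "continue"
  fi
}
```

For example, `stop_reason 5 5 0 0 0` reports `max-cycles`, while `stop_reason 2 0 0 0 0` reports `continue` because no cap is set and no breaker has fired.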
Always-on autonomous loop over `rpi`. Work selection order:
1. **Harvested `.agents/rpi/next-work.jsonl` work** (freshest concrete follow-up)
2. **Open ready beads work** (`bd ready`)
3. **Failing goals and directive gaps** (`ao goals measure`)
4. **Testing improvements** (missing/thin coverage, missing regression tests)
5. **Validation tightening and bug-hunt passes** (gates, audits, bug sweeps)
6. **Complexity / TODO / FIXME / drift / dead code / stale docs / stale research mining**
7. **Concrete feature suggestions** derived from repo purpose when no sharper work exists
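The ladder above behaves like a first-match search: the highest layer that has any pending work wins. The sketch below illustrates only that ordering. The layer labels, the `select_work_source` function, and the idea of passing one pending-item count per layer are hypothetical; how each count would be obtained (for example from `bd ready` or `ao goals measure`) is outside the sketch.

```bash
# Ladder layers in priority order, highest first (labels are illustrative).
LADDER="harvested beads goals testing validation mining features"

# select_work_source COUNT...
# Takes one pending-item count per layer, in ladder order.
# Prints the first layer whose count is non-zero, or "none" if all are empty.
select_work_source() {
  for layer in $LADDER; do
    count="${1:-0}"
    [ "$#" -gt 0 ] && shift
    if [ "$count" -gt 0 ]; then
      printf '%s\n' "$layer"
      return 0
    fi
  done
  printf 'none\n'
  return 1
}
```

With counts `0 3 1 0 0 0 0`, the function prints `beads`: harvested work is empty, so ready beads outrank the failing goal even though both are pending.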
**Work generators** that feed the selection ladder (auto-invoked, skip with `--no-lifecycle`):
- `Skill(skill="test", args="coverage")` → files with <40% coverage become queue items (Step 3.4)
- `Skill(skill="refactor", args="--sweep all --dry-run")` → functions with CC > 20 become queue items (Step 3.6)
- `Skill(skill="deps", args="audit")` → deps with CVSS >= 7.0 or 2+ major versions behind become queue items (Step 3.5)
- `Skill(skill="perf", args="profile --quick")` → perf findings become queue items when hot paths detected (Step 3.5)
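The numeric thresholds in the generator list can be written as small predicates. This is a sketch under assumptions: the function names are hypothetical, and parsing the metrics out of each skill's output is not shown. The perf generator is left out because the list gives no numeric threshold for it.

```bash
# Each predicate exits 0 when the finding should become a queue item.

# coverage_needs_work PERCENT: files with <40% coverage qualify.
coverage_needs_work() {
  awk -v c="$1" 'BEGIN { exit !(c < 40) }'
}

# complexity_needs_work CC: functions with cyclomatic complexity > 20 qualify.
complexity_needs_work() {
  [ "$1" -gt 20 ]
}

# dep_needs_work CVSS MAJORS_BEHIND: CVSS >= 7.0 or 2+ major versions behind qualifies.
dep_needs_work() {
  awk -v cvss="$1" -v behind="$2" 'BEGIN { exit !(cvss >= 7.0 || behind >= 2) }'
}
```

The boundaries follow the list as written: exactly 40% coverage and a complexity of exactly 20 do not qualify, while a CVSS score of exactly 7.0 does.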
**Dormancy is last resort.** Empty current queues mean "run the generator layers", not "stop". Only go dormant after the queue layers and generator layers come up empty across multiple consecutive passes.
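One way to picture the dormancy rule is a counter of consecutive empty passes that resets whenever any layer produces work. The threshold of 3 below is an assumption; the text only says "multiple consecutive passes". The names `DORMANCY_PASSES`, `empty_passes`, and `record_pass` are hypothetical.

```bash
# Hypothetical threshold: how many empty passes in a row justify dormancy.
DORMANCY_PASSES="${DORMANCY_PASSES:-3}"
empty_passes=0

# record_pass FOUND
# FOUND is the number of items found across queue and generator layers this pass.
# Exits 0 only when the loop should go dormant.
record_pass() {
  if [ "$1" -gt 0 ]; then
    empty_passes=0                        # any work resets the streak
  else
    empty_passes=$((empty_passes + 1))    # another fully empty pass
  fi
  [ "$empty_passes" -ge "$DORMANCY_PASSES" ]
}
```

Call `record_pass` in the current shell rather than in a subshell or pipeline, since the streak is kept in a shell variable.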
**Live skill edit immune system:** if an evolve cycle edits
`skills/<slug>/SKILL.md`, run
`ao skills edit seal --skill <slug> --actor "${AGENT_NAME:-agent}"` before the
cycle hands off. The seal creates the rollback commit and records the
`Skill-Edit` trailers used by the daily digest. Critical skills listed in
`docs/contracts/critical-skills.txt` reject unattended edits; use
`--allow-critical` only when Bo is supervising that critical edit.
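A cycle could apply the seal rule by scanning the paths it changed for `skills/<slug>/SKILL.md` and sealing each slug it finds. The sketch below assumes edits can be detected with `git diff --name-only HEAD`, which is an assumption about the cycle's state, not something the skill specifies. The helper names are hypothetical; the `ao skills edit seal` invocation is copied from the text above. It never passes `--allow-critical`, so edits to critical skills are left to be rejected.

```bash
# edited_skill_slugs: reads changed file paths on stdin and prints the unique
# slugs of skills whose SKILL.md appears in the list.
edited_skill_slugs() {
  sed -n 's|^skills/\([^/]*\)/SKILL\.md$|\1|p' | sort -u
}

# seal_edited_skills: seals every skill whose SKILL.md differs from HEAD.
# Assumes it runs inside the repository, before the cycle hands off.
seal_edited_skills() {
  git diff --name-only HEAD | edited_skill_slugs | while IFS= read -r slug; do
    ao skills edit seal --skill "$slug" --actor "${AGENT_NAME:-agent}"
  done
}
```

Keeping the path filter separate from the seal call makes the filter easy to check on its own with a hand-written list of paths.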
```bash
/evolve # Run until kill switch, max-cycles, or real dormancy
/evolve --max-cycles=5 # Cap at 5 cycles
/evolve --dry-run # Show what would be worked on, don't execute
/evolve --beads-only # Skip goals measurement, work beads backlog only
/evolve --quality # Quality-first mode: prioritize post-mortem findings
/evolve --quality --max-cycles=10 # Quality mode with cycle cap
/evolve --compile # Mine → Defrag warmup before first cycle
/evolve --compile --max-cycles=5 # Warm knowledge base then run 5 cycles
/evolve --test-first # Default strict-quality /rpi execution path
/evolve --no-test-first # Explicit opt-out from test-first mode
```
## Delineation vs Nightly Knowledge Compounding
| Lane | Runs | Mutates code? | Mutates corpus? | Outer loop? | Budget |
|------|------|---------------|-----------------|-------------|--------|
| `$curate --mode=dream` | nightly, private local | **No** | **Yes (heavy)** | **Yes (convergence)** | wall-clock + plateau |
| `evolve` | daytime, operator-driven | Yes (via `rpi`) | Yes (light) | Yes | cycle cap |
**The old dream skill is retired**; out-of-session compounding moved to Gas City and the current skill surface is `$curate --mode=dream`. `/evolve` owns the live daytime code-compounding lane. Both still share the fitness-measurement substrate via `corpus.Compute` / `ao goals measure`.
## Flags
| Flag | Default | Description |
|------|---------|-------------|
| `--max-cycles=N` | unlimited | Stop after `N` completed cycles |
| `--dry-run` | off | Show planned cycle actions without executing |
| `--beads-only` | off | Skip goal measurement and run backlog-only selection |
| `--skip-baseline` | off | Skip first-run baseline snapshot |
| `--quality` | off | Prioritize harvested post-mortem findings |
| `--compile` | off | Run `ao mine` + `ao defrag` warmup before cycle 1 |
| `--test-first` | on | Pass strict-quality defaults through to `rpi` |
| `--no-test-first` | off | Explicitly disable test-first passthrough to `rpi` |
| `--no-lifecycle` | off | Skip lifecycle work generators in Steps 3.4-3.6 (/test, /deps, /perf, /refactor). Falls back to manual scanning. |
| `--mode=burst\|loop` | burst | Run mode. `loop` is operator-loop mode, in which STOP is refused. See [loop-mode.md](references/loop-mode.md). |
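The defaults in the table can be read as the starting state that flags then override. The sketch below covers a subset of the flags to show that pattern. It is illustrative only: `parse_evolve_flags` and the variable names are hypothetical, `0` stands in for "unlimited", and the skill itself may resolve flags differently.

```bash
# parse_evolve_flags FLAG...
# Resets the settings to the table defaults, then applies each flag in order.
parse_evolve_flags() {
  MAX_CYCLES=0      # 0 stands in for "unlimited" in this sketch
  DRY_RUN=off
  TEST_FIRST=on
  LIFECYCLE=on
  MODE=burst
  for arg in "$@"; do
    case "$arg" in
      --max-cycles=*)           MAX_CYCLES="${arg#--max-cycles=}" ;;
      --dry-run)                DRY_RUN=on ;;
      --test-first)             TEST_FIRST=on ;;
      --no-test-first)          TEST_FIRST=off ;;
      --no-lifecycle)           LIFECYCLE=off ;;
      --mode=burst|--mode=loop) MODE="${arg#--mode=}" ;;
      --mode=*)                 printf 'invalid mode: %s\n' "$arg" >&2; return 2 ;;
      *)                        ;;  # flags outside this subset are ignored here
    esac
  done
  return 0
}
```

For instance, `parse_evolve_flags --max-cycles=5 --no-test-first` leaves `MODE` at `burst` and `LIFECYCLE` at `on` while setting `MAX_CYCLES=5` and `TEST_FIRST=off`.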
## Execution Steps
**YOU MUST EXECUTE THIS WORKFLOW. Do not just describe it.**
**FULLY AUTONOMOUS.** Read `references/autonomous-execution.md`. Every `rpi` uses `--auto`. Do NOT ask the user anything. Each cycle = complete 3-phase `rpi` run.
For broad AgentOps 3.0 domain evolution across skills, CLI, hooks, docs, tests,
beads, and knowledge, first read
[references/domain-evolution-bootstrap.md](references/domain-evolution-bootstrap.md).
It supplies the BDD/DDD/Hexagonal/TDD/XP control surface and the clean-room
skill-factory guardrails.
### Step 0: Setup
**Stale-checkout survey guard (run FIRST).** Before any tree-reading survey: `git fetch origin && git status -sb`. If the checkout is behind/diverged AND it is a t