darwinian-evolver
Darwinian Evolver runs an LLM-driven evolutionary search loop to optimize prompts, regex patterns, SQL queries, or code snippets against a fitness function. Use it when you have a measurable scoring criterion and need to explore many variants automatically, typically over 50 to 500 LLM calls. Do not use it for differentiable optimization targets, minor tweaks better handled manually, or purely subjective fitness signals.
git clone --depth 1 https://github.com/NousResearch/hermes-agent /tmp/darwinian-evolver && cp -r /tmp/darwinian-evolver/optional-skills/research/darwinian-evolver ~/.claude/skills/darwinian-evolver
# Darwinian Evolver
Run Imbue's [darwinian_evolver](https://github.com/imbue-ai/darwinian_evolver) — an
LLM-driven evolutionary search loop — to optimize a **prompt, regex, SQL query,
or small code snippet** against a fitness function.
Status: thin wrapper around the upstream tool. The skill installs it, walks the
agent through writing a `Problem` definition (organism + evaluator + mutator),
and drives the loop via the upstream CLI or a small custom Python driver.
**License:** the upstream tool is **AGPL-3.0**. The skill ONLY ever invokes it
via the upstream CLI or a `subprocess`/`uv run` call (mere aggregation). Do NOT
import upstream classes into Hermes itself.
## When to Use
- User says "optimize this prompt", "evolve a regex for X", "auto-improve this
code/SQL", "search for a better instruction".
- You have a scorer (exact match, regex pass-rate, unit test, LLM-judge, runtime
metric) AND a starting candidate (organism). If you don't have a scorer, stop
and define one first — that's the hard part.
- Cost is OK: a typical run is 50–500 LLM calls. On gpt-4o-mini that's pennies;
on Claude Sonnet it can be a few dollars.
Do **not** use this when:
- The optimization target is differentiable (use gradient descent / DSPy).
- You only need to try 2–3 variants — just write them by hand.
- The fitness signal is purely subjective with no measurable criterion.
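
As a rough illustration of what "having a scorer" means here, the sketch below computes an exact-match pass rate over labelled cases. The names `pass_rate` and `cases` are hypothetical and not part of the upstream tool; any function that maps a candidate to a number in `[0, 1]` would serve the same role.

```python
from typing import Callable


def pass_rate(candidate: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Fraction of (input, expected) cases the candidate answers exactly.

    Hypothetical helper for illustration; not part of darwinian_evolver.
    """
    if not cases:
        raise ValueError("a scorer needs at least one labelled case")
    passed = sum(1 for text, expected in cases if candidate(text) == expected)
    return passed / len(cases)


if __name__ == "__main__":
    labelled = [("a", "A"), ("b", "B"), ("c", "x")]
    # str.upper gets the first two cases right and misses the third.
    print(pass_rate(str.upper, labelled))
```

If you cannot write something this simple for your task, the fitness signal is probably not measurable yet.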
## Prerequisites
- Python ≥3.11
- `git`, `uv` (or `pip`)
- One of: `OPENROUTER_API_KEY`, `ANTHROPIC_API_KEY`, or `OPENAI_API_KEY`
The skill ships a small `parrot_openrouter.py` driver that uses `OPENROUTER_API_KEY`
via the OpenAI SDK, so any model on OpenRouter works. The upstream CLI itself
hardcodes Anthropic and needs `ANTHROPIC_API_KEY`.
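
A small preflight check along these lines can tell you which of the two entry points is usable before you spend time on a run. This is only a sketch: the function names are hypothetical, and it relies solely on the environment variable names listed above.

```python
import os
from typing import Mapping, Optional

# Environment variable names taken from the prerequisites list.
PROVIDER_KEYS = ("OPENROUTER_API_KEY", "ANTHROPIC_API_KEY", "OPENAI_API_KEY")


def available_keys(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the provider key names that are set and non-empty."""
    env = os.environ if env is None else env
    return [name for name in PROVIDER_KEYS if env.get(name)]


def can_run_upstream_cli(env: Optional[Mapping[str, str]] = None) -> bool:
    """The upstream CLI needs ANTHROPIC_API_KEY."""
    return "ANTHROPIC_API_KEY" in available_keys(env)


def can_run_openrouter_driver(env: Optional[Mapping[str, str]] = None) -> bool:
    """The shipped parrot_openrouter.py driver needs OPENROUTER_API_KEY."""
    return "OPENROUTER_API_KEY" in available_keys(env)


if __name__ == "__main__":
    print("keys found:", available_keys())
    print("upstream CLI usable:", can_run_upstream_cli())
    print("OpenRouter driver usable:", can_run_openrouter_driver())
```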
## Install (One-Time)
Run via the `terminal` tool:
```bash
mkdir -p ~/.hermes/cache/darwinian-evolver && cd ~/.hermes/cache/darwinian-evolver
[ -d darwinian_evolver ] || git clone --depth 1 https://github.com/imbue-ai/darwinian_evolver.git
cd darwinian_evolver && uv sync
```
Verify:
```bash
cd ~/.hermes/cache/darwinian-evolver/darwinian_evolver \
&& uv run darwinian_evolver --help | head -5
```
## Quick Start — The Built-In Parrot Example
Tiny smoke test (requires `ANTHROPIC_API_KEY`):
```bash
cd ~/.hermes/cache/darwinian-evolver/darwinian_evolver
uv run darwinian_evolver parrot \
--num_iterations 2 \
--num_parents_per_iteration 2 \
--mutator_concurrency 2 --evaluator_concurrency 2 \
--output_dir /tmp/parrot_demo
```
Outputs:
- `/tmp/parrot_demo/snapshots/iteration_N.pkl` — pickled population per iteration
- `/tmp/parrot_demo/<jsonl>` — per-iteration JSON log (path printed at end)
Open `~/.hermes/cache/darwinian-evolver/darwinian_evolver/darwinian_evolver/lineage_visualizer.html`
in a browser and load the JSON log to see the evolutionary tree.
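
To pick up the most recent snapshot programmatically, something like the following could work. It assumes only the `snapshots/iteration_N.pkl` naming shown above; `latest_snapshot` is a hypothetical helper, not an upstream function. It sorts numerically so that `iteration_10.pkl` ranks after `iteration_2.pkl`.

```python
import re
from pathlib import Path
from typing import Optional, Union


def latest_snapshot(output_dir: Union[str, Path]) -> Optional[Path]:
    """Return the snapshot with the highest iteration number, or None."""
    snapshots_dir = Path(output_dir) / "snapshots"
    if not snapshots_dir.is_dir():
        return None
    best: Optional[tuple[int, Path]] = None
    for path in snapshots_dir.iterdir():
        match = re.fullmatch(r"iteration_(\d+)\.pkl", path.name)
        if match is None:
            continue  # skip anything that is not a snapshot file
        iteration = int(match.group(1))
        if best is None or iteration > best[0]:
            best = (iteration, path)
    return None if best is None else best[1]


if __name__ == "__main__":
    print(latest_snapshot("/tmp/parrot_demo"))
```

The helper only locates the file. Unpickling it needs the upstream classes to be importable, so do that inside the upstream `uv` environment, and only for snapshots you produced yourself.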
## Quick Start — OpenRouter Driver (No Anthropic Key)
The skill ships `scripts/parrot_openrouter.py` — same parrot problem, but the
LLM call goes through OpenRouter so any provider works.
```bash
# From wherever the skill is installed:
SKILL_DIR=~/.hermes/skills/research/darwinian-evolver
DE_DIR=~/.hermes/cache/darwinian-evolver/darwinian_evolver
cd "$DE_DIR" && \
EVOLVER_MODEL='openai/gpt-4o-mini' \
uv run --with openai python "$SKILL_DIR/scripts/parrot_openrouter.py" \
--num_iterations 3 --num_parents_per_iteration 2 \
--output_dir /tmp/parrot_or
```
Inspect the result with `scripts/show_snapshot.py`:
```bash
uv run --with openai python "$SKILL_DIR/scripts/show_snapshot.py" \
/tmp/parrot_or/snapshots/iteration_3.pkl
```
Expected output: 7 evolved prompt templates ranked by score, with the best
landing around 0.6–0.8 (the seed `Say {{ phrase }}` scored 0.000).
## Defining a Custom Problem
The skill ships `templates/custom_problem_template.py` — copy, edit, run.
Three things you must define:
1. **`Organism`** — a Pydantic `BaseModel` subclass holding the artifact being
evolved (`prompt_template: str`, `regex_pattern: str`, `sql_query: str`,
`code_block: str`, etc.). Add a `run(*args)` method that exercises it.
2. **`Evaluator`** — `.evaluate(organism) -> EvaluationResult(score=..., trainable_failure_cases=[...], holdout_failure_cases=[...], is_viable=True)`.
- **`score`** is in `[0, 1]`. Higher is better.
- **`trainable_failure_cases`** — what the mutator sees. Include enough
context (input, expected, actual) for the LLM to diagnose.
- **`holdout_failure_cases`** — kept out of the mutator's view. Use these
to detect overfitting.
- **`is_viable=True`** unless the organism is completely broken (raises,
returns None, etc.). A 0-score viable organism is fine — it just gets
down-weighted in parent selection.
3. **`Mutator`** — `.mutate(organism, failure_cases, learning_log_entries) -> list[Organism]`.
Typically: build an LLM prompt that includes the current organism + a
failure case + an ask to propose a fix; parse the LLM's response; return
a new `Organism`. Return `[]` on parse failure — the loop handles it.
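
The organism and evaluator contracts above can be sketched without touching the upstream package. In the example below, `RegexOrganism`, `RegexEvaluator`, and the local `EvaluationResult` dataclass are stand-ins that only mirror the field names listed above. The real classes come from `darwinian_evolver` (the organism being a Pydantic model), and their exact signatures may differ. Scoring over train plus holdout cases is an assumption made for this sketch.

```python
import re
from dataclasses import dataclass, field


@dataclass
class RegexOrganism:
    """Stand-in organism holding the artifact being evolved."""
    regex_pattern: str

    def run(self, text: str) -> bool:
        return re.fullmatch(self.regex_pattern, text) is not None


@dataclass
class EvaluationResult:
    """Local stand-in mirroring the fields described above."""
    score: float
    trainable_failure_cases: list = field(default_factory=list)
    holdout_failure_cases: list = field(default_factory=list)
    is_viable: bool = True


class RegexEvaluator:
    def __init__(self, train: list[tuple[str, bool]], holdout: list[tuple[str, bool]]):
        self.train = train
        self.holdout = holdout

    @staticmethod
    def _failures(organism: RegexOrganism, cases: list[tuple[str, bool]]) -> list[dict]:
        failures = []
        for text, expected in cases:
            actual = organism.run(text)
            if actual != expected:
                # Keep input, expected, and actual so a mutator can diagnose.
                failures.append({"input": text, "expected": expected, "actual": actual})
        return failures

    def evaluate(self, organism: RegexOrganism) -> EvaluationResult:
        try:
            re.compile(organism.regex_pattern)
        except re.error:
            # A pattern that does not compile is completely broken.
            return EvaluationResult(score=0.0, is_viable=False)
        train_fail = self._failures(organism, self.train)
        holdout_fail = self._failures(organism, self.holdout)
        total = len(self.train) + len(self.holdout)
        score = 1.0 - (len(train_fail) + len(holdout_fail)) / total
        return EvaluationResult(score, train_fail, holdout_fail, True)


if __name__ == "__main__":
    evaluator = RegexEvaluator(
        train=[("2024-01-31", True), ("hello", False)],
        holdout=[("1999-12-01", True), ("24-1-1", False)],
    )
    print(evaluator.evaluate(RegexOrganism(r"\d+")))
```

Note that the weak seed pattern stays viable with a low score, while only the uncompilable pattern is marked non-viable.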
Then write a driver script that wires `Problem(initial_organism, evaluator, [mutators])`
into `EvolveProblemLoop` and iterates over `loop.run(num_iterations=N)` — the
shipped `scripts/parrot_openrouter.py` is the reference.
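
Before wiring a mutator into a real driver, it can help to test its prompt building and parsing in isolation. The sketch below follows the `mutate(organism, failure_cases, learning_log_entries)` shape described in step 3, but everything else is an assumption: `PromptOrganism`, `PromptMutator`, the `<template>` tag convention, and the injected `llm` callable are hypothetical and not the upstream API.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PromptOrganism:
    """Stand-in organism; the real one subclasses the upstream base class."""
    prompt_template: str


def build_mutation_prompt(organism: PromptOrganism, failure_case: dict) -> str:
    """Show the LLM the current template plus one failure and ask for a fix."""
    return (
        "Current prompt template:\n"
        + organism.prompt_template
        + "\n\nIt failed on this case:\n"
        + f"input: {failure_case.get('input')}\n"
        + f"expected: {failure_case.get('expected')}\n"
        + f"actual: {failure_case.get('actual')}\n\n"
        + "Propose an improved template inside <template>...</template> tags."
    )


def parse_template(response: str) -> Optional[str]:
    """Extract the proposed template, or None if the reply is malformed."""
    match = re.search(r"<template>(.*?)</template>", response, re.DOTALL)
    if match is None:
        return None
    return match.group(1).strip() or None


class PromptMutator:
    def __init__(self, llm: Callable[[str], str]):
        # llm is any function mapping a prompt string to a reply string.
        self.llm = llm

    def mutate(self, organism, failure_cases, learning_log_entries) -> list:
        if not failure_cases:
            return []
        reply = self.llm(build_mutation_prompt(organism, failure_cases[0]))
        new_template = parse_template(reply)
        if new_template is None:
            return []  # parse failure: return nothing and let the loop carry on
        return [PromptOrganism(prompt_template=new_template)]


if __name__ == "__main__":
    fake_llm = lambda prompt: "<template>Repeat exactly: {{ phrase }}</template>"
    seed = PromptOrganism("Say {{ phrase }}")
    failure = {"input": "hi", "expected": "hi", "actual": "Hello!"}
    print(PromptMutator(fake_llm).mutate(seed, [failure], []))
```

Injecting the LLM as a plain callable keeps the parsing logic testable with a fake reply, so you spend no API calls while debugging the mutator.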
## Hyperparameters That Actually Matter
| flag | default | when to change |
|---|---|---|
| `--num_iterations` | 5 | bump to 10–20 once you trust the evaluator |
| `--num_parents_per_iteration` | 4 | drop to 2 for cheap exploration |
| `--mutator_concurrency` | 10 | drop to 2–4 to avoid rate limits |
| `--evaluator_concurrency` | 10 | same; evaluator hits the LLM too |
| `--batch_size` | 1 | raise to 3–5 once your mutator handles multiple failures |
| `--verify_mutations` | off | turn on once the mutator is wasteful; Imbue reports >10× cost savings on later runs |
| `--midpoint_score` | `p75` | leave alone unless scores cluster |
| `--sharpness` | 10 | leave alone |
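
One way to drive the upstream CLI with these flags while keeping it out of process is to assemble the command and hand it to `subprocess`. This is a sketch under stated assumptions: `build_command` and `run_evolver` are hypothetical helpers, the install path is the one used in the install step, and only value-taking flags are handled, because the exact syntax of on/off flags such as `--verify_mutations` is not shown here.

```python
import subprocess
from pathlib import Path

# Install location used in the install step; adjust if yours differs.
DE_DIR = Path.home() / ".hermes" / "cache" / "darwinian-evolver" / "darwinian_evolver"


def build_command(problem: str, output_dir: str, **flags) -> list[str]:
    """Assemble the argv for `uv run darwinian_evolver <problem> ...`.

    Each keyword becomes `--name value`; boolean switches are not handled.
    """
    cmd = ["uv", "run", "darwinian_evolver", problem]
    for name, value in flags.items():
        cmd += [f"--{name}", str(value)]
    cmd += ["--output_dir", output_dir]
    return cmd


def run_evolver(problem: str, output_dir: str, **flags) -> subprocess.CompletedProcess:
    """Run the CLI as a child process from the upstream checkout."""
    cmd = build_command(problem, output_dir, **flags)
    return subprocess.run(cmd, cwd=DE_DIR, check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so this costs no API calls.
    print(build_command(
        "parrot", "/tmp/parrot_demo",
        num_iterations=2, num_parents_per_iteration=2,
        mutator_concurrency=2, evaluator_concurrency=2,
    ))
```

Passing an argument list rather than a shell string avoids quoting problems in paths and values.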
## Pitfalls
1. **`Initial organism must be viable`** — set `is_viable=True` …