eval-run
The eval-run skill executes or supervises Mnemon harness evaluation runs within isolated HostAgent workspaces. Use it to run planned evaluation scenarios and suites, install required loop templates, collect artifacts and logs, and record failures as evidence rather than silent skips. It does not modify canonical scenarios, suites, or rubrics during a run, and it preserves the artifacts needed for report review.
git clone --depth 1 https://github.com/mnemon-dev/mnemon /tmp/eval-run && cp -r /tmp/eval-run/harness/loops/eval/skills/eval-run ~/.claude/skills/eval-run

SKILL.md
# Eval Run

Use this skill to execute or supervise a planned eval run.

## Procedure

1. Confirm the plan names a host, suite or scenario, and evidence targets.
2. Create or use an isolated workspace. Do not run scenario state in the developer's active workspace unless the eval explicitly requires it.
3. Install the requested loop templates with `harness/ops`.
4. For Codex app-server evals, use the project runner when available:

   ```bash
   python3 scripts/codex_app_server_eval.py --suite
   ```

   Use a specific suite option when the scenario requires it.
5. Collect artifacts and logs before cleanup.
6. Record timeouts, setup failures, and HostAgent readiness failures as eval evidence, not as silent skips.

## Boundaries

- Do not change canonical scenarios, suites, or rubrics while running an eval.
- Do not delete artifacts needed for report review.
- Do not treat an exploratory run as a regression result.
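Steps 2, 5, and 6 of the procedure above can be sketched as a small wrapper. This is an illustration under stated assumptions, not part of the Mnemon harness: the function name `run_with_evidence`, the file names `run.log` and `outcome.json`, and the status labels are all hypothetical. It runs a command in a throwaway workspace, copies the log out before cleanup, and writes an outcome record even when the run times out or fails to start.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def run_with_evidence(cmd, evidence_dir, timeout_s=600):
    """Run cmd in a fresh temporary workspace and always write an outcome record.

    Hypothetical helper: names, file layout, and status labels are assumptions.
    """
    evidence_dir = Path(evidence_dir)
    evidence_dir.mkdir(parents=True, exist_ok=True)
    record = {"cmd": list(cmd), "status": None, "returncode": None, "detail": ""}
    log_text = ""

    # Isolated workspace: scenario state never touches the caller's directory.
    with tempfile.TemporaryDirectory(prefix="eval-run-") as workspace:
        try:
            proc = subprocess.run(
                cmd, cwd=workspace, capture_output=True, text=True, timeout=timeout_s
            )
            record["returncode"] = proc.returncode
            record["status"] = "completed" if proc.returncode == 0 else "failed"
            log_text = proc.stdout + proc.stderr
        except subprocess.TimeoutExpired:
            # A timeout is recorded as evidence, not skipped.
            record["status"] = "timeout"
            record["detail"] = f"exceeded {timeout_s}s"
        except OSError as exc:
            # The command could not be started at all (for example, missing binary).
            record["status"] = "setup_failure"
            record["detail"] = str(exc)

        # Collect the log before the workspace is cleaned up.
        (evidence_dir / "run.log").write_text(log_text)

    (evidence_dir / "outcome.json").write_text(json.dumps(record, indent=2))
    return record
```

A call such as `run_with_evidence([sys.executable, "-c", "print('ok')"], "evidence/demo")` would leave both files under `evidence/demo`. The point of the design is that every branch, including the failure branches, ends in a written record, so a report reviewer never has to infer why a run is missing.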
Analyze Mnemon harness eval reports, classify outcomes, and extract improvement evidence.
Turn stable Mnemon harness eval findings into scoped project, loop, adapter, docs, or eval asset improvements.
Design a scenario-driven Mnemon harness eval with target, hypothesis, HostAgent, loop configuration, evidence, and rubric.
Manage project-scoped Mnemon goal state, evidence, verification, completion, blockers, and host goal links.
Recall long-term memory from Mnemon when GUIDE.md indicates that prior memory may help the current task.
Maintain prompt-facing working memory by editing MEMORY.md when GUIDE.md indicates that durable information should be kept.
Draft or revise high-quality SKILL.md content for approved or proposed Mnemon skill changes.
Start a low-frequency review of skill evidence and canonical skill lifecycle state.