ClaudeWave
Skill · 341 repo stars · updated 2d ago

eval-run

The eval-run skill executes or supervises Mnemon harness evaluation runs in isolated HostAgent workspaces. Use it to run planned evaluation scenarios and suites, install the required loop templates, collect artifacts and logs, and record failures as evidence rather than silently skipping them. The skill does not modify canonical scenarios, suites, or rubrics during a run, and it preserves the artifacts needed for report review.

Install in Claude Code
git clone --depth 1 https://github.com/mnemon-dev/mnemon /tmp/eval-run && cp -r /tmp/eval-run/harness/loops/eval/skills/eval-run ~/.claude/skills/eval-run
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# Eval Run

Use this skill to execute or supervise a planned eval run.

## Procedure

1. Confirm the plan names a host, suite or scenario, and evidence targets.
2. Create or use an isolated workspace. Do not run scenario state in the
   developer's active workspace unless the eval explicitly requires it.
3. Install the requested loop templates with `harness/ops`.
4. For Codex app-server evals, use the project runner when available:

   ```bash
   python3 scripts/codex_app_server_eval.py --suite
   ```

   Use a specific suite option when the scenario requires it.
5. Collect artifacts and logs before cleanup.
6. Record timeouts, setup failures, and HostAgent readiness failures as eval
   evidence, not as silent skips.
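
As a rough illustration of steps 1, 2, 5, and 6, the sketch below checks a plan, runs each command in a throwaway workspace, and writes one evidence record per step to a directory outside that workspace so the records survive cleanup. It is not part of the Mnemon harness: the plan field names, the function names, the status labels, and the JSON record layout are all assumptions made for this example.

```python
import json
import subprocess
import tempfile
from pathlib import Path

# Hypothetical plan keys; the real plan format is not specified here.
REQUIRED_PLAN_FIELDS = ("host", "target", "evidence_targets")


def missing_plan_fields(plan):
    """Return the required plan fields that are absent or empty (step 1)."""
    return [field for field in REQUIRED_PLAN_FIELDS if not plan.get(field)]


def run_step(name, cmd, workspace, evidence_dir, timeout_s=60.0):
    """Run one command and always write an evidence record, even on failure."""
    record = {"step": name, "cmd": cmd, "status": "passed", "detail": ""}
    try:
        proc = subprocess.run(
            cmd, cwd=workspace, capture_output=True, text=True, timeout=timeout_s
        )
        record["detail"] = proc.stdout + proc.stderr
        if proc.returncode != 0:
            record["status"] = "failed"
    except subprocess.TimeoutExpired:
        # A timeout is evidence, not a reason to drop the step (step 6).
        record["status"] = "timeout"
        record["detail"] = f"exceeded {timeout_s}s"
    except OSError as exc:
        # For example, the command could not be started at all.
        record["status"] = "setup_failure"
        record["detail"] = str(exc)
    out_path = Path(evidence_dir) / f"{name}.json"
    out_path.write_text(json.dumps(record, indent=2))
    return record


def run_in_isolated_workspace(steps, evidence_dir, timeout_s=60.0):
    """Run (name, cmd) pairs in a temporary workspace (step 2).

    Evidence lives in evidence_dir, outside the workspace, so it is
    still available after the workspace is removed (step 5).
    """
    evidence_dir = Path(evidence_dir)
    evidence_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory(prefix="eval-run-") as workspace:
        return [
            run_step(name, cmd, workspace, evidence_dir, timeout_s)
            for name, cmd in steps
        ]
```

Every step produces a record regardless of outcome, so a reviewer can tell a step that timed out from one that never ran.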

## Boundaries

- Do not change canonical scenarios, suites, or rubrics while running an eval.
- Do not delete artifacts needed for report review.
- Do not treat an exploratory run as a regression result.
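
One way to check the first boundary is to fingerprint the canonical files before the run and compare afterwards. The sketch below is an assumption-laden example rather than a harness feature: the helper names are hypothetical, and which directories hold canonical scenarios, suites, or rubrics is left to the caller.

```python
import hashlib
from pathlib import Path


def fingerprint(root):
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    root = Path(root)
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_paths(before, after):
    """Return paths that were added, removed, or modified between fingerprints."""
    all_paths = before.keys() | after.keys()
    return sorted(p for p in all_paths if before.get(p) != after.get(p))
```

Taking a fingerprint before the run and another before cleanup gives a list of changed paths. A non-empty list can then be recorded as evidence alongside the other run artifacts.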