Skip to main content
ClaudeWave
Skill341 estrellas del repoactualizado 2d ago

eval-run

The eval-run skill executes or supervises Mnemon harness evaluation runs within isolated HostAgent workspaces. Use it to run planned evaluation scenarios and suites, install required loop templates, collect artifacts and logs, and record failures as evidence rather than silent skips, while maintaining boundaries around canonical scenario modifications and artifact preservation.

Instalar en Claude Code
Copiar
git clone --depth 1 https://github.com/mnemon-dev/mnemon /tmp/eval-run && cp -r /tmp/eval-run/harness/loops/eval/skills/eval-run ~/.claude/skills/eval-run
Después abre una sesión nueva de Claude Code; el skill carga automáticamente.

SKILL.md

# Eval Run

Use this skill to execute or supervise a planned eval run.

## Procedure

1. Confirm the plan names a host, suite or scenario, and evidence targets.
2. Create or use an isolated workspace. Do not run scenario state in the
   developer's active workspace unless the eval explicitly requires it.
3. Install the requested loop templates with `harness/ops`.
4. For Codex app-server evals, use the project runner when available:

   ```bash
   python3 scripts/codex_app_server_eval.py --suite
   ```

   Use a specific suite option when the scenario requires it.
5. Collect artifacts and logs before cleanup.
6. Record timeouts, setup failures, and HostAgent readiness failures as eval
   evidence, not as silent skips.

## Boundaries

- Do not change canonical scenarios, suites, or rubrics while running an eval.
- Do not delete artifacts needed for report review.
- Do not treat an exploratory run as a regression result.