
evaluate-presets

The evaluate-presets skill provides shell scripts to systematically test Ralph's hat collection presets by running individual or all presets via direct CLI invocation, capturing metrics like iterations and events published, and generating evaluation reports. Use this skill when validating preset configurations after changes, auditing the preset library for quality issues, or testing newly added presets to ensure correct functionality before deployment.

Install in Claude Code
git clone --depth 1 https://github.com/mikeyobrien/ralph-orchestrator /tmp/evaluate-presets && cp -r /tmp/evaluate-presets/.claude/skills/evaluate-presets ~/.claude/skills/evaluate-presets
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# Evaluate Presets

## Overview

Systematically test all hat collection presets using shell scripts. The scripts invoke the Ralph CLI directly, with no meta-orchestration layer in between.

## When to Use

- Testing preset configurations after changes
- Auditing the preset library for quality
- Validating new presets work correctly
- After modifying hat routing logic

## Quick Start

**Evaluate a single preset:**
```bash
./tools/evaluate-preset.sh tdd-red-green claude
```

**Evaluate all presets:**
```bash
./tools/evaluate-all-presets.sh claude
```

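**Arguments (`evaluate-preset.sh`):**
- First arg: preset name (without `.yml` extension)
- Second arg: backend (`claude` or `kiro`, defaults to `claude`)

As the second example shows, `evaluate-all-presets.sh` takes the backend as its first argument.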
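
The argument convention can be sketched as a small validation helper. This is only an illustration under stated assumptions: `parse_eval_args` is a hypothetical function, not code from the repository, and tolerating a trailing `.yml` is an added convenience rather than documented behavior.

```bash
# Hypothetical helper: validate the arguments described above and print
# "<preset> <backend>". Not part of evaluate-preset.sh itself.
parse_eval_args() {
  local preset="${1:-}" backend="${2:-claude}"   # backend defaults to claude
  if [ -z "$preset" ]; then
    echo "usage: evaluate-preset.sh <preset> [claude|kiro]" >&2
    return 2
  fi
  preset="${preset%.yml}"                        # assumption: tolerate a stray .yml
  case "$backend" in
    claude|kiro) ;;
    *) echo "unknown backend: $backend" >&2; return 2 ;;
  esac
  echo "$preset $backend"
}
```

Calling `parse_eval_args tdd-red-green` prints the preset name followed by the default backend.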

## Bash Tool Configuration

**IMPORTANT:** When invoking these scripts via the Bash tool, use these settings:

- **Single preset evaluation:** Use `timeout: 600000` (10 minutes max) and `run_in_background: true`
- **All presets evaluation:** Use `timeout: 600000` (10 minutes max) and `run_in_background: true`

Since preset evaluations can run for hours (especially the full suite), **always run in background mode** and use the `TaskOutput` tool to check progress periodically.

**Example invocation pattern:**
```
Bash tool with:
  command: "./tools/evaluate-preset.sh tdd-red-green claude"
  timeout: 600000
  run_in_background: true
```

After launching, use `TaskOutput` with `block: false` to check status without waiting for completion.

## What the Scripts Do

### `evaluate-preset.sh`

1. Loads test task from `tools/preset-test-tasks.yml` (if `yq` available)
2. Creates merged config with evaluation settings
3. Runs Ralph with `--record-session` for metrics capture
4. Captures output logs, exit codes, and timing
5. Extracts metrics: iterations, hats activated, events published

**Output structure:**
```
.eval/
├── logs/<preset>/<timestamp>/
│   ├── output.log          # Full stdout/stderr
│   ├── session.jsonl       # Recorded session
│   ├── metrics.json        # Extracted metrics
│   ├── environment.json    # Runtime environment
│   └── merged-config.yml   # Config used
└── logs/<preset>/latest -> <timestamp>
```
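
Before digging into a run, it can help to confirm that every artifact in the layout above was written. The following is a minimal sketch: `check_run_artifacts` is a hypothetical helper, and the file names are taken from the listing above.

```bash
# Hypothetical helper: report which expected artifacts are missing from a
# run directory such as .eval/logs/tdd-red-green/latest.
check_run_artifacts() {
  local run_dir="$1" missing=0 f
  for f in output.log session.jsonl metrics.json environment.json merged-config.yml; do
    if [ ! -e "$run_dir/$f" ]; then
      echo "missing: $f"
      missing=1
    fi
  done
  return "$missing"   # 0 when everything is present, 1 otherwise
}
```

For example, `check_run_artifacts .eval/logs/tdd-red-green/latest` prints one `missing:` line per absent file and returns non-zero if any are absent.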

### `evaluate-all-presets.sh`

Runs all 12 presets sequentially and generates a summary:

```
.eval/results/<suite-id>/
├── SUMMARY.md              # Markdown report
├── <preset>.json           # Per-preset metrics
└── latest -> <suite-id>
```

## Presets Under Evaluation

| Preset | Test Task |
|--------|-----------|
| `tdd-red-green` | Add `is_palindrome()` function |
| `adversarial-review` | Review user input handler for security |
| `socratic-learning` | Understand `HatRegistry` |
| `spec-driven` | Specify and implement `StringUtils::truncate()` |
| `mob-programming` | Implement a `Stack` data structure |
| `scientific-method` | Debug failing mock test assertion |
| `code-archaeology` | Understand history of `config.rs` |
| `performance-optimization` | Profile hat matching |
| `api-design` | Design a `Cache` trait |
| `documentation-first` | Document `RateLimiter` |
| `incident-response` | Respond to "tests failing in CI" |
| `migration-safety` | Plan v1 to v2 config migration |

## Interpreting Results

**Exit codes from `evaluate-preset.sh`:**
- `0` — Success (LOOP_COMPLETE reached)
- `124` — Timeout (preset hung or took too long)
- Other — Failure (check `output.log`)
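
When scripting around the evaluator, the exit codes above can be mapped to labels. This is a sketch only; `classify_exit` is a hypothetical helper, not part of the repository.

```bash
# Hypothetical helper: translate an evaluate-preset.sh exit code into a label.
classify_exit() {
  case "$1" in
    0)   echo "success" ;;   # LOOP_COMPLETE reached
    124) echo "timeout" ;;   # preset hung or took too long
    *)   echo "failure" ;;   # anything else: inspect output.log
  esac
}

# Example usage (commented out so the block has no side effects):
#   ./tools/evaluate-preset.sh tdd-red-green claude
#   classify_exit "$?"
```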

**Metrics in `metrics.json`:**
- `iterations` — How many event loop cycles
- `hats_activated` — Which hats were triggered
- `events_published` — Total events emitted
- `completed` — Whether completion promise was reached
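
A one-line summary of these fields can be pulled out with `jq`. The sketch below assumes `iterations` and `events_published` are numbers, `hats_activated` is an array, and `completed` is a boolean. The field names come from the list above, but their types are an assumption, and `summarize_metrics` is a hypothetical helper.

```bash
# Hypothetical helper: print a one-line summary of a metrics.json file.
# Assumes hats_activated is a JSON array (its length is reported).
summarize_metrics() {
  jq -r '"\(.iterations) iterations, \(.events_published) events, \(.hats_activated | length) hats, completed=\(.completed)"' "$1"
}
```

For example: `summarize_metrics .eval/logs/tdd-red-green/latest/metrics.json`.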

## Hat Routing Performance

**Critical:** Validate that hats get fresh context per Tenet #1 ("Fresh Context Is Reliability").

### What Good Looks Like

Each hat should execute in its **own iteration**:
```
Iter 1: Ralph → publishes starting event → STOPS
Iter 2: Hat A → does work → publishes next event → STOPS
Iter 3: Hat B → does work → publishes next event → STOPS
Iter 4: Hat C → does work → LOOP_COMPLETE
```

### Red Flags (Same-Iteration Hat Switching)

**BAD:** Multiple hat personas in one iteration:
```
Iter 2: Ralph does Blue Team + Red Team + Fixer work
        ^^^ All in one bloated context!
```

### How to Check

**1. Count iterations (in `output.log`) vs events (in `session.jsonl`):**
```bash
# Count iterations
grep -cE "_meta\.loop_start|ITERATION" .eval/logs/<preset>/latest/output.log

# Count events published
grep -c "bus.publish" .eval/logs/<preset>/latest/session.jsonl
```

- **Expected:** iterations ≈ events published (one event per iteration)
- **Bad sign:** 2-3 iterations but 5+ events (all work in single iteration)

**2. Check for same-iteration hat switching in `output.log`:**
```bash
grep -E "ITERATION|Now I need to perform|Let me put on|I'll switch to" \
    .eval/logs/<preset>/latest/output.log
```

**Red flag:** Hat-switching phrases WITHOUT an ITERATION separator between them.
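
The grep output above still has to be read by eye. The check can be automated with a small `awk` pass that resets a counter at each `ITERATION` line and reports any second hat-switching phrase before the next separator. This is a sketch: `find_same_iteration_switches` is a hypothetical helper, and the phrase list is the one from the grep command above.

```bash
# Hypothetical helper: print "<line number>: <line>" for every hat-switching
# phrase that follows another one with no ITERATION separator in between.
# Exits 0 if any were found, 1 otherwise (mirroring grep).
find_same_iteration_switches() {
  awk -v phrases="Now I need to perform|Let me put on|I'll switch to" '
    /ITERATION/ { switches = 0; next }      # new iteration: reset the counter
    $0 ~ phrases {
      switches++
      if (switches > 1) { print NR ": " $0; found = 1 }
    }
    END { exit (found ? 0 : 1) }
  ' "$1"
}
```

For example: `find_same_iteration_switches .eval/logs/tdd-red-green/latest/output.log`.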

**3. Check event timestamps in `session.jsonl`:**
```bash
jq -r '.ts' .eval/logs/<preset>/latest/session.jsonl
```

**Red flag:** Multiple events with identical timestamps (published in same iteration).
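
Spotting repeats in a long list of timestamps is easier with `sort` and `uniq -d`, which prints only values that occur more than once. A minimal sketch, where `duplicate_timestamps` is a hypothetical helper that reads one timestamp per line on stdin:

```bash
# Hypothetical helper: print each timestamp that appears more than once.
duplicate_timestamps() {
  sort | uniq -d
}

# Example usage (commented out; requires jq and an existing run):
#   jq -r '.ts' .eval/logs/tdd-red-green/latest/session.jsonl | duplicate_timestamps
```

Empty output means no two events share a timestamp.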

### Routing Performance Triage

| Pattern | Diagnosis | Action |
|---------|-----------|--------|
| iterations ≈ events | ✅ Good | Hat routing working |
| iterations << events | ⚠️ Same-iteration switching | Check prompt has STOP instruction |
| iterations >> events | ⚠️ Recovery loops | Agent not publishing required events |
| 0 events | ❌ Broken | Events not being read from JSONL |
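
The triage table can be turned into a rough classifier. The table does not define how far apart the counts must be for "<<" or ">>", so the 1.5x threshold below is an assumption, chosen so that 3 iterations with 5 events is flagged. `triage_routing` is a hypothetical helper.

```bash
# Hypothetical helper: classify a run from its iteration and event counts.
# The 1.5x ratio is an assumed threshold, not one defined by the project.
triage_routing() {
  local iterations="$1" events="$2"
  if [ "$events" -eq 0 ]; then
    echo "broken: events not being read from JSONL"
  elif [ $((events * 2)) -ge $((iterations * 3)) ]; then
    echo "same-iteration switching: check prompt has STOP instruction"
  elif [ $((iterations * 2)) -ge $((events * 3)) ]; then
    echo "recovery loops: agent not publishing required events"
  else
    echo "good: hat routing working"
  fi
}
```

The two counts can come from the `grep -c` commands in "How to Check", for example `triage_routing 4 4`.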

### Root Cause Checklist

If hat routing is broken:

1. **Check workflow prompt** in `hatless_ralph.rs`:
   - Does it say "CRITICAL: STOP after publishing"?
   - Is the DELEGATE section clear about yielding control?

2. **Check hat instructions** propagation:
   - Does `HatInfo` include `instructions` field?
   - Are instructions rendered in the `## HATS` section?

3. **Check events context**:
   - Is `build_prompt(context)` using the context parameter?
   - Does prompt include `## PENDING EVENTS` section?

## Autonomous Fix Workflow

After evaluation, delegate fixes to subagents: