Skip to main content
ClaudeWave
Skill6.1k repo starsupdated today

cli-eval

This Claude Code skill provides a command-line interface for creating, managing, and executing evaluation suites that benchmark LLM model performance. Use it when you need to define automated test suites, run evaluations across different models or configurations, monitor live benchmark progress, generate performance scorecards, compare model outputs, and integrate evaluation workflows into CI/CD pipelines.

Install in Claude Code
Copy
git clone --depth 1 https://github.com/diegosouzapw/OmniRoute /tmp/cli-eval && cp -r /tmp/cli-eval/skills/cli-eval ~/.claude/skills/cli-eval
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

<!-- generated by src/lib/agentSkills/generator.ts; manual edits will be overwritten -->

## Overview

Create and run evaluation suites, watch live benchmark progress, view scorecards, compare model performance, and integrate eval runs with CI workflows from the CLI.

## Quick install

```bash
npm install -g omniroute   # or: npx omniroute
omniroute --version
```

## Subcommands

### `eval`

**Example:**

```bash
omniroute eval
```

### `eval suites`

**Example:**

```bash
omniroute eval suites
```

### `eval list`

**Example:**

```bash
omniroute eval list
```

### `eval get <suiteId>`

**Example:**

```bash
omniroute eval get <suiteId>
```

### `eval create`

**Flags:**

- `--file <path>`

**Example:**

```bash
omniroute eval create
```

### `eval run <suiteId>`

**Flags:**

- `-m, --model <id>`
- `--combo <name>`
- `--concurrency <n>`
- `--tag <tag>`
- `--watch`

**Example:**

```bash
omniroute eval run <suiteId>
```

### `eval list`

**Flags:**

- `--suite <id>`
- `--status <s>`
- `--since <ts>`
- `--limit <n>`

**Example:**

```bash
omniroute eval list
```

### `eval get <runId>`

**Example:**

```bash
omniroute eval get <runId>
```

### `eval results <runId>`

**Flags:**

- `--failed`

**Example:**

```bash
omniroute eval results <runId>
```

### `eval cancel <runId>`

**Flags:**

- `--yes`

**Example:**

```bash
omniroute eval cancel <runId>
```

### `eval scorecard <runId>`

**Example:**

```bash
omniroute eval scorecard <runId>
```

### `simulate [prompt]`

**Flags:**

- `--file <path>`
- `-m, --model <id>`
- `--combo <name>`
- `--reasoning-effort <level>`
- `--thinking-budget <n>`
- `--explain`

**Example:**

```bash
omniroute simulate [prompt]
```

<!-- skill:custom-start -->
<!-- Migrated from skills/omniroute-cli-eval/SKILL.md (preserved curated content) -->

# OmniRoute — CLI Evals

Requires the `omniroute` CLI. See [CLI entry-point skill](https://raw.githubusercontent.com/diegosouzapw/OmniRoute/main/skills/omniroute-cli/SKILL.md) for install + global flags.

## What are evals?

Evals are automated test suites that score LLM outputs against expected answers or rubrics. OmniRoute stores suites and run results in its local database.

## Eval suites

```bash
omniroute eval suites list                       # List all eval suites
omniroute eval suites list --json                # JSON output

omniroute eval suites get <suiteId>              # Full suite definition
```

### Create a suite

```bash
omniroute eval suites create \
  --name "code-quality" \
  --rubric "exact-match" \
  --samples-file ./samples.jsonl                 # JSONL: {input, expected_output}
```

Rubric options: `exact-match`, `contains`, `llm-judge`, `regex`.

`--samples-file` format (one JSON object per line):

```jsonl
{"input": "What is 2+2?", "expected_output": "4"}
{"input": "Translate 'hello' to Spanish", "expected_output": "hola"}
```

## Run an eval

```bash
omniroute eval suites run <suiteId> \
  --model claude-sonnet-4-6                      # Run suite against a specific model

omniroute eval suites run <suiteId> \
  --model gpt-4o \
  --watch                                        # Live TUI progress (EvalWatch)
```

The run is asynchronous. Use `--watch` for a live terminal dashboard or poll manually:

```bash
RUN_ID=$(omniroute eval suites run <suiteId> --model claude-sonnet-4-6 --output json | jq -r '.id')
omniroute eval get $RUN_ID
```

## Manage runs

```bash
omniroute eval list                              # List all eval runs
omniroute eval list --json

omniroute eval get <runId>                       # Run details (status, model, score)
omniroute eval results <runId>                   # Per-sample results
omniroute eval scorecard <runId>                 # Full scorecard with pass/fail per sample
omniroute eval cancel <runId>                    # Cancel a running eval
```

## Scorecard output

```bash
omniroute eval scorecard <runId> --output json
```

Response fields per sample:

```json
{
  "id": "sample-1",
  "score": 0.95,
  "passed": true,
  "input": "What is 2+2?",
  "output": "4",
  "expected": "4"
}
```

## Comparing models

Run the same suite against multiple models and compare:

```bash
for MODEL in claude-sonnet-4-6 gpt-4o gemini-2.0-flash; do
  omniroute eval suites run $SUITE_ID --model $MODEL --output json | jq '{model: .model, score: .score}'
done
```

## CI integration

```bash
# Run and fail CI if score drops below threshold
SCORE=$(omniroute eval suites run $SUITE_ID --model claude-sonnet-4-6 --output json | jq -r '.score')
python3 -c "import sys; score=float('$SCORE'); sys.exit(0 if score >= 0.90 else 1)"
```

## Errors

- `suites create` fails with `invalid rubric` → use one of: `exact-match`, `contains`, `llm-judge`, `regex`
- `suites run` returns `model not found` → verify model ID with `omniroute models --search <name>`
- `eval get` shows `status: failed` → check `omniroute logs --search eval` for error details
- `scorecard` returns empty results → the run may still be `running`; poll `omniroute eval get <runId>` until `status` is `completed`
<!-- skill:custom-end -->