cli-eval
This Claude Code skill provides a command-line interface for creating, managing, and running evaluation suites that benchmark LLM performance. Use it to define automated test suites, run evaluations across models or configurations, monitor live benchmark progress, generate scorecards, compare model outputs, and integrate evaluation workflows into CI/CD pipelines.
git clone --depth 1 https://github.com/diegosouzapw/OmniRoute /tmp/cli-eval && cp -r /tmp/cli-eval/skills/cli-eval ~/.claude/skills/cli-eval
SKILL.md
<!-- generated by src/lib/agentSkills/generator.ts; manual edits will be overwritten -->
## Overview
Create and run evaluation suites, watch live benchmark progress, view scorecards, compare model performance, and integrate eval runs with CI workflows from the CLI.
## Quick install
```bash
npm install -g omniroute # or: npx omniroute
omniroute --version
```
## Subcommands
### `eval`
**Example:**
```bash
omniroute eval
```
### `eval suites`
**Example:**
```bash
omniroute eval suites
```
### `eval suites list`
**Example:**
```bash
omniroute eval suites list
```
### `eval suites get <suiteId>`
**Example:**
```bash
omniroute eval suites get <suiteId>
```
### `eval suites create`
**Flags:**
- `--file <path>`
**Example:**
```bash
omniroute eval suites create
```
### `eval suites run <suiteId>`
**Flags:**
- `-m, --model <id>`
- `--combo <name>`
- `--concurrency <n>`
- `--tag <tag>`
- `--watch`
**Example:**
```bash
omniroute eval suites run <suiteId>
```
### `eval list`
**Flags:**
- `--suite <id>`
- `--status <s>`
- `--since <ts>`
- `--limit <n>`
**Example:**
```bash
omniroute eval list
```
### `eval get <runId>`
**Example:**
```bash
omniroute eval get <runId>
```
### `eval results <runId>`
**Flags:**
- `--failed`
**Example:**
```bash
omniroute eval results <runId>
```
### `eval cancel <runId>`
**Flags:**
- `--yes`
**Example:**
```bash
omniroute eval cancel <runId>
```
### `eval scorecard <runId>`
**Example:**
```bash
omniroute eval scorecard <runId>
```
### `simulate [prompt]`
**Flags:**
- `--file <path>`
- `-m, --model <id>`
- `--combo <name>`
- `--reasoning-effort <level>`
- `--thinking-budget <n>`
- `--explain`
**Example:**
```bash
omniroute simulate [prompt]
```
<!-- skill:custom-start -->
<!-- Migrated from skills/omniroute-cli-eval/SKILL.md (preserved curated content) -->
# OmniRoute — CLI Evals
Requires the `omniroute` CLI. See [CLI entry-point skill](https://raw.githubusercontent.com/diegosouzapw/OmniRoute/main/skills/omniroute-cli/SKILL.md) for install + global flags.
## What are evals?
Evals are automated test suites that score LLM outputs against expected answers or rubrics. OmniRoute stores suites and run results in its local database.
## Eval suites
```bash
omniroute eval suites list # List all eval suites
omniroute eval suites list --json # JSON output
omniroute eval suites get <suiteId> # Full suite definition
```
### Create a suite
```bash
omniroute eval suites create \
  --name "code-quality" \
  --rubric "exact-match" \
  --samples-file ./samples.jsonl # JSONL: {input, expected_output}
```
Rubric options: `exact-match`, `contains`, `llm-judge`, `regex`.
`--samples-file` format (one JSON object per line):
```jsonl
{"input": "What is 2+2?", "expected_output": "4"}
{"input": "Translate 'hello' to Spanish", "expected_output": "hola"}
```
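Before creating a suite, it can help to sanity-check the samples file. The sketch below is a rough line-level lint, not a JSON parser: it only assumes that every non-empty line should be an object containing the `input` and `expected_output` keys shown above. The name `check_samples_file` is hypothetical and is not an `omniroute` command.

```bash
# Hypothetical helper, not part of the omniroute CLI.
# Reports every non-empty line that does not look like a JSON object or
# that lacks one of the two documented keys. It does not validate JSON syntax.
check_samples_file() {
  awk '
    BEGIN { bad = 0 }
    /^[ \t]*$/ { next }   # skip blank lines
    !/^[ \t]*[{].*[}][ \t]*$/ || !/"input"[ \t]*:/ || !/"expected_output"[ \t]*:/ {
      printf "line %d: not an object with input and expected_output\n", NR
      bad = 1
    }
    END { exit bad }
  ' "$1"
}
```

A possible use is `check_samples_file ./samples.jsonl && omniroute eval suites create ...`, so that a malformed file is caught before the suite is created.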
## Run an eval
```bash
omniroute eval suites run <suiteId> \
  --model claude-sonnet-4-6 # Run suite against a specific model
omniroute eval suites run <suiteId> \
  --model gpt-4o \
  --watch # Live TUI progress (EvalWatch)
```
The run is asynchronous. Use `--watch` for a live terminal dashboard or poll manually:
```bash
RUN_ID=$(omniroute eval suites run <suiteId> --model claude-sonnet-4-6 --output json | jq -r '.id')
omniroute eval get "$RUN_ID"
```
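Manual polling can be wrapped in a small loop. This is a minimal sketch under stated assumptions: that `eval get` accepts `--output json` (the document only shows that flag on `suites run` and `scorecard`), that the JSON has a top-level `status` field, and that `completed` and `failed` are the terminal values, as the Errors section suggests. The function names `run_status` and `wait_for_run` are hypothetical.

```bash
# Hypothetical helpers, not part of the omniroute CLI.

# Print the status of a run. Assumes `eval get` accepts `--output json`
# and that the JSON has a top-level "status" field.
run_status() {
  omniroute eval get "$1" --output json | jq -r '.status'
}

# Poll until the run completes (exit 0), fails (exit 1), or the poll
# budget is exhausted (exit 2). Any other status counts as "in progress".
wait_for_run() {
  local run_id="$1"
  local interval="${2:-5}"     # seconds between polls
  local max_polls="${3:-120}"  # give up after this many polls
  local polls=0 status

  while [ "$polls" -lt "$max_polls" ]; do
    status=$(run_status "$run_id")
    case "$status" in
      completed) return 0 ;;
      failed) echo "run $run_id failed" >&2; return 1 ;;
    esac
    polls=$((polls + 1))
    sleep "$interval"
  done
  echo "run $run_id still '$status' after $max_polls polls" >&2
  return 2
}
```

With these defined, `wait_for_run "$RUN_ID" && omniroute eval scorecard "$RUN_ID"` would fetch the scorecard only after the run has finished. The poll budget keeps a stuck or cancelled run from blocking a script indefinitely.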
## Manage runs
```bash
omniroute eval list # List all eval runs
omniroute eval list --json
omniroute eval get <runId> # Run details (status, model, score)
omniroute eval results <runId> # Per-sample results
omniroute eval scorecard <runId> # Full scorecard with pass/fail per sample
omniroute eval cancel <runId> # Cancel a running eval
```
## Scorecard output
```bash
omniroute eval scorecard <runId> --output json
```
Response fields per sample:
```json
{
"id": "sample-1",
"score": 0.95,
"passed": true,
"input": "What is 2+2?",
"output": "4",
"expected": "4"
}
```
## Comparing models
Run the same suite against multiple models and compare:
```bash
for MODEL in claude-sonnet-4-6 gpt-4o gemini-2.0-flash; do
  omniroute eval suites run "$SUITE_ID" --model "$MODEL" --output json | jq '{model: .model, score: .score}'
done
```
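To turn the per-model output into a ranking, the scores can be sorted once collected. The sketch below assumes the results are available as plain `model score` pairs, one per line; the helper name `rank_scores` is hypothetical. Pairs whose score is not numeric (for example `null` from a run that has not finished) are dropped rather than ranked.

```bash
# Hypothetical helper, not part of the omniroute CLI.
# Reads "model score" pairs on stdin and prints them from highest to
# lowest score, discarding pairs with a non-numeric score.
rank_scores() {
  awk '$2 ~ /^[0-9]+(\.[0-9]+)?$/ { print $1, $2 }' | LC_ALL=C sort -k2,2nr
}
```

One way to produce the pairs is to end each run's pipeline with `jq -r '"\(.model) \(.score)"'` and pipe the loop's combined output into `rank_scores`.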
## CI integration
```bash
# Run and fail CI if score drops below threshold
SCORE=$(omniroute eval suites run "$SUITE_ID" --model claude-sonnet-4-6 --output json | jq -r '.score')
python3 -c "import sys; sys.exit(0 if float(sys.argv[1]) >= 0.90 else 1)" "$SCORE"
```
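The threshold check can also be done without Python. The following is a sketch of one way to gate on the score in plain shell and `awk`; the function name `score_gate` and its exit codes are assumptions for illustration. It treats an empty or non-numeric score (such as `null`) as a distinct error, so a missing score is reported separately from a low one.

```bash
# Hypothetical helper, not part of the omniroute CLI.
# Exit 0 if score >= threshold, 1 if below, 2 if the score is not numeric.
score_gate() {
  local score="$1"
  local threshold="${2:-0.90}"
  case "$score" in
    ''|*[!0-9.]*)
      echo "score_gate: no numeric score (got '$score')" >&2
      return 2
      ;;
  esac
  # awk does the floating-point comparison; the shell itself cannot.
  awk -v s="$score" -v t="$threshold" 'BEGIN { exit !(s + 0 >= t + 0) }'
}
```

In a CI step this would be called as `score_gate "$SCORE" 0.90`, and the job fails on any non-zero exit code.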
## Errors
- `suites create` fails with `invalid rubric` → use one of: `exact-match`, `contains`, `llm-judge`, `regex`
- `suites run` returns `model not found` → verify model ID with `omniroute models --search <name>`
- `eval get` shows `status: failed` → check `omniroute logs --search eval` for error details
- `scorecard` returns empty results → the run may still be `running`; poll `omniroute eval get <runId>` until `status` is `completed`
<!-- skill:custom-end -->