Skill · 584 repo stars · updated 5d ago

bkit-evals

bkit-evals runs quality evaluation suites for Claude Code skills by wrapping the evals runner with input validation, timeout enforcement, and structured result persistence. Use the `run` command to execute evaluations for a specific skill and the `list` command to view all available skills with eval definitions organized by type.

Repository: bkit-claude-code

Install in Claude Code

git clone --depth 1 https://github.com/popup-studio-ai/bkit-claude-code /tmp/bkit-evals && cp -r /tmp/bkit-evals/skills/bkit-evals ~/.claude/skills/bkit-evals

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# bkit Evals — Skill Quality Evaluation Runner

> v2.1.11 Sprint β FR-β2. Wraps `evals/runner.js` with input validation,
> result persistence, and structured reporting. Replaces the bare `node
> evals/runner.js <skill>` invocation that previously required users to
> remember argv structure and ignored timeout / sandbox concerns.

## Arguments

| Argument | Description | Example |
|----------|-------------|---------|
| `run <skill>` | Execute the eval suite for one skill | `/bkit-evals run gap-detector` |
| `list` | List all skills that have an `eval.yaml` definition | `/bkit-evals list` |

If no argument is provided, render the same output as `list`.

## Behavior

### `run <skill>`

1. Validate `skill` against `/^[a-z][a-z0-9-]{0,63}$/`. Reject anything else
   (no shell metacharacters, no slashes, no spaces) — see Security below.
2. Spawn `node evals/runner.js --skill <skill>` via `child_process.spawnSync`
   (argv form, no shell). Default timeout 30 s, max 120 s. The `--skill` flag
   form is mandated by the runner CLI and locked by L3 contract test.
3. Capture stdout / stderr. Parse the trailing JSON block; the fallback is a
   string-aware balanced-brace scan (braces inside string literals are ignored).
4. Apply fail-closed defense: if `parsed === null` and stdout includes
   `Usage:`, return `reason: 'argv_format_mismatch'`; if `parsed === null`
   otherwise, return `reason: 'parsed_null'`. Exit code 0 alone NEVER
   implies success — the parsed JSON must be present.
5. Persist the structured result to
   `.bkit/runtime/evals-{skill}-{ISO timestamp}.json` with stdout/stderr
   tails (2000 chars each), `parsed` payload, and `reason` field.
6. Render a one-line summary in the chat:
   - exit code
   - parsed pass/fail counts (if available)
   - path of the persisted result file
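
Steps 3 and 4 can be sketched as follows. This is a minimal illustration, not the code in `lib/evals/runner-wrapper.js`: the helper names `scanBalanced`, `extractTrailingJson`, and `classifyRun` are hypothetical, and the `ok` field and the `reason: null` success value are assumptions made for the example.

```javascript
// Return the index of the '}' that closes the '{' at `start`, or -1.
// Braces inside double-quoted strings are ignored (string-aware).
function scanBalanced(text, start) {
  let depth = 0;
  let inString = false;
  let escaped = false;
  for (let i = start; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === '\\') escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === '{') depth++;
    else if (ch === '}' && --depth === 0) return i;
  }
  return -1;
}

// Find a JSON object that ends exactly at the end of stdout (ignoring
// trailing whitespace). Candidate '{' positions are tried right to left.
function extractTrailingJson(stdout) {
  const text = String(stdout).trimEnd();
  for (let start = text.length - 1; start >= 0; start--) {
    if (text[start] !== '{') continue;
    if (scanBalanced(text, start) !== text.length - 1) continue;
    try {
      return JSON.parse(text.slice(start));
    } catch (err) {
      // Not valid JSON from this brace; try an earlier one.
    }
  }
  return null;
}

// Fail-closed classification: a missing JSON payload is always a failure,
// whatever the exit code was.
function classifyRun(exitCode, stdout) {
  const parsed = extractTrailingJson(stdout);
  if (parsed === null) {
    const reason = String(stdout).includes('Usage:')
      ? 'argv_format_mismatch'
      : 'parsed_null';
    return { ok: false, parsed, reason };
  }
  // Assumption: success additionally requires exit code 0.
  return { ok: exitCode === 0, parsed, reason: null };
}
```

Scanning candidates from the right keeps earlier log lines that happen to contain braces from being mistaken for the result payload.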

### `list`

1. Read `evals/config.json` to enumerate skill classifications.
2. For each classification (`workflow`, `capability`, `hybrid`),
   list skills that have `evals/{classification}/{skill}/eval.yaml`.
3. Render a category-grouped table with skill name + a one-line note from
   the eval YAML (`description` field if present).
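
The directory walk in step 2 might look like the sketch below. It hardcodes the three classifications instead of reading `evals/config.json`, because the layout of that file is not shown here, and it skips the `description` lookup, which would need a YAML parser. The function name `listEvalSkills` is hypothetical.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Classifications named in the steps above; the real list comes from
// evals/config.json, whose structure is not assumed here.
const CLASSIFICATIONS = ['workflow', 'capability', 'hybrid'];
const SKILL_NAME_RE = /^[a-z][a-z0-9-]{0,63}$/;

// Return { classification: [skill, ...] } for every skill directory that
// has a valid name and contains an eval.yaml file.
function listEvalSkills(evalsDir) {
  const grouped = {};
  for (const classification of CLASSIFICATIONS) {
    const dir = path.join(evalsDir, classification);
    let entries = [];
    try {
      entries = fs.readdirSync(dir, { withFileTypes: true });
    } catch (err) {
      // A missing classification directory simply yields no skills.
    }
    grouped[classification] = entries
      .filter((entry) => entry.isDirectory())
      .filter((entry) => SKILL_NAME_RE.test(entry.name))
      .filter((entry) => fs.existsSync(path.join(dir, entry.name, 'eval.yaml')))
      .map((entry) => entry.name)
      .sort();
  }
  return grouped;
}
```

Calling `listEvalSkills('evals')` from the repository root would then give the grouped names from which the table is rendered.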

## Security

- Skill name regex prevents argument injection. Anything that does not match
  `/^[a-z][a-z0-9-]{0,63}$/` is rejected with `reason: 'invalid_skill_name'`.
- argv-array spawn (no shell). No template-string concatenation into
  command lines.
- Result file path is composed from a hardcoded base + sanitized skill
  name + timestamp; no traversal possible.
- Subprocess timeout enforced (default 30 s, hard cap 120 s) so a buggy
  eval cannot block the session indefinitely.
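
A minimal sketch of the first, second, and fourth points, assuming an options field named `timeoutMs` (hypothetical, as are `clampTimeout` and `runEval`). Only `isValidSkillName` is a name taken from this document, and the body shown is an illustration rather than the shipped implementation.

```javascript
const { spawnSync } = require('node:child_process');

const SKILL_NAME_RE = /^[a-z][a-z0-9-]{0,63}$/;
const DEFAULT_TIMEOUT_MS = 30000; // default 30 s
const MAX_TIMEOUT_MS = 120000;    // hard cap 120 s

function isValidSkillName(name) {
  return typeof name === 'string' && SKILL_NAME_RE.test(name);
}

// Fall back to the default for missing or nonsensical values, and never
// exceed the hard cap.
function clampTimeout(requestedMs) {
  if (!Number.isFinite(requestedMs) || requestedMs <= 0) return DEFAULT_TIMEOUT_MS;
  return Math.min(requestedMs, MAX_TIMEOUT_MS);
}

function runEval(skill, opts = {}) {
  if (!isValidSkillName(skill)) {
    return { ok: false, reason: 'invalid_skill_name' };
  }
  // Arguments are passed as an array and no shell is involved, so the skill
  // name is never interpreted as shell syntax.
  const result = spawnSync('node', ['evals/runner.js', '--skill', skill], {
    encoding: 'utf8',
    shell: false,
    timeout: clampTimeout(opts.timeoutMs),
  });
  // spawnSync reports a timeout through result.error with code ETIMEDOUT.
  const timedOut = Boolean(result.error && result.error.code === 'ETIMEDOUT');
  return {
    exitCode: result.status,
    timedOut,
    stdout: result.stdout || '',
    stderr: result.stderr || '',
  };
}
```

Because validation happens before the spawn, a rejected name such as `../etc` or `a; rm -rf /` never reaches the subprocess.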

## Module Dependencies

| Module | Function | Usage |
|--------|----------|-------|
| `lib/evals/runner-wrapper.js` | `invokeEvals(skill, opts)` | Validate + spawn + persist |
| `lib/evals/runner-wrapper.js` | `isValidSkillName(name)` | Regex pre-check shared with `list` |
| `evals/runner.js` | (subprocess) | Existing eval execution engine |

## Result Schema

`.bkit/runtime/evals-{skill}-{timestamp}.json`:

```json
{
  "skill": "gap-detector",
  "invokedAt": "<ISO 8601>",
  "exitCode": 0,
  "timedOut": false,
  "stdoutTail": "...",
  "stderrTail": "...",
  "parsed": { /* whatever runner.js prints as JSON, or null */ },
  "reason": "<failure reason such as parsed_null, if any>"
}
```
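
Building and writing such a record could be sketched as below. The helper names `tail`, `buildRecord`, and `persistRecord` are hypothetical. Replacing `:` and `.` in the timestamp is an assumption made so the file name is valid on every platform; the actual naming may differ.

```javascript
const fs = require('node:fs');
const path = require('node:path');

const TAIL_CHARS = 2000;

// Keep only the last TAIL_CHARS characters of a stream capture.
const tail = (text) => String(text ?? '').slice(-TAIL_CHARS);

// `run` is assumed to carry exitCode, timedOut, stdout, and stderr.
function buildRecord(skill, run, parsed, reason, now = new Date()) {
  return {
    skill,
    invokedAt: now.toISOString(),
    exitCode: run.exitCode,
    timedOut: run.timedOut,
    stdoutTail: tail(run.stdout),
    stderrTail: tail(run.stderr),
    parsed,
    reason,
  };
}

// Write the record under baseDir. The skill name must already have passed
// validation, so it cannot contain path separators.
function persistRecord(baseDir, record) {
  const stamp = record.invokedAt.replace(/[:.]/g, '-'); // assumption, see above
  const file = path.join(baseDir, `evals-${record.skill}-${stamp}.json`);
  fs.mkdirSync(baseDir, { recursive: true });
  fs.writeFileSync(file, JSON.stringify(record, null, 2));
  return file;
}
```

With `.bkit/runtime` as `baseDir`, the returned path is what the one-line chat summary would report.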

## Examples

```bash
# Single eval
/bkit-evals run gap-detector

# Discovery
/bkit-evals list
```

## Related

- `/control trust` — eval results contribute to trust score
- `/code-review` — uses eval data when assessing skills
- `/bkit explore` (FR-β1) — explore evals as a category
