llm-judge

The llm-judge skill compares multiple code repositories against a specification document and ranks the implementations across five weighted dimensions: functionality, security, tests, overengineering, and dead code. Use it when selecting the best solution among competing codebases, evaluating alternative approaches to the same problem, or choosing between pull requests that address the same requirements.

Repository: beagle

Install in Claude Code

git clone --depth 1 https://github.com/existential-birds/beagle /tmp/llm-judge && cp -r /tmp/llm-judge/plugins/beagle-analysis/skills/llm-judge ~/.claude/skills/llm-judge

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# LLM Judge

Compare code implementations across multiple repositories using structured evaluation.

## Usage

```text
llm-judge <spec> <repo1> <repo2> [repo3...] [--labels=...] [--weights=...] [--branch=...]
```

## Arguments

| Argument | Required | Description |
|----------|----------|-------------|
| `spec` | Yes | Path to spec/requirements document |
| `repos` | Yes | 2+ paths to repositories to compare |
| `--labels` | No | Comma-separated labels (default: directory names) |
| `--weights` | No | Override weights, e.g. `functionality:40,security:30` |
| `--branch` | No | Branch to compare against main (default: `main`) |

## Workflow

1. Parse `$ARGUMENTS` into `spec_path`, `repo_paths`, `labels`, `weights`, and `branch`.
2. Validate the spec file, each repo path, and the minimum repo count.
3. Read the spec document into memory.
4. Load this skill and the supporting reference files.
5. Gather facts per repository (one Phase 1 unit per repo) — facts only, no scoring.
6. Validate the repo-agent JSON results before proceeding.
7. Score each dimension (one Phase 2 unit per dimension).
8. Aggregate scores, compute weighted totals, rank repos, and write the report.
9. Display the markdown summary and verify the JSON report.

## Hard gates

Sequenced workflow: **do not start the next phase until the current gate passes.** Each pass condition must be checkable (file on disk, non-empty content, or `json.load` succeeds)—not “I reviewed internally.”

| Gate | Pass condition | Unblocks |
|------|----------------|----------|
| **A — Inputs** | `spec_path` is a readable file and non-empty; `len(repo_paths) ≥ 2`; each path contains `.git`. | Phase 1 repo agents |
| **B — Phase 1 facts** | For **each** repo agent output: the returned text parses as JSON; required keys/shape match `references/fact-schema.md`. | Phase 2 judge agents |
| **C — Phase 2 scores** | **Five** judge outputs (one per dimension) each parse as JSON; each includes a score (and justification) for **every** repo label. | Aggregation |
| **D — Report file** | `.beagle/llm-judge-report.json` exists; `python3 -c "import json; json.load(open('.beagle/llm-judge-report.json'))"` exits 0. | Markdown summary to the user |
| **E — Consistency** | Summary table and verdict use the same labels, weights, and per-dimension scores as the JSON report. | Mark task complete |

Parallelism is allowed **within** a phase (all Phase 1 tasks together; all Phase 2 tasks together), but Phase 2 must not start until Gate B passes, and the user-visible summary must not precede Gate D.
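
Gate C lends itself to a mechanical check. The sketch below is one way to write it, assuming each judge returns JSON shaped like `{"scores": {"<label>": {"score": ..., "justification": ...}}}`, which is the shape the aggregation step indexes into. The function name `check_gate_c` and the `justification` key name are illustrative assumptions, not part of the skill.

```python
import json

DIMENSIONS = ["functionality", "security", "tests", "overengineering", "dead_code"]


def check_gate_c(raw_outputs, labels):
    """Return a list of problems; an empty list means Gate C passes.

    raw_outputs maps dimension name -> raw JSON text returned by that judge.
    The "scores" / "score" / "justification" keys are assumed, not confirmed.
    """
    problems = []
    for dim in DIMENSIONS:
        if dim not in raw_outputs:
            problems.append(f"{dim}: missing judge output")
            continue
        try:
            parsed = json.loads(raw_outputs[dim])
        except json.JSONDecodeError as exc:
            problems.append(f"{dim}: invalid JSON ({exc.msg})")
            continue
        scores = parsed.get("scores") if isinstance(parsed, dict) else None
        if not isinstance(scores, dict):
            problems.append(f"{dim}: missing 'scores' object")
            continue
        for label in labels:
            entry = scores.get(label)
            if not isinstance(entry, dict) or "score" not in entry:
                problems.append(f"{dim}: no score for {label}")
            elif not entry.get("justification"):
                problems.append(f"{dim}: no justification for {label}")
    return problems
```

Returning a list of problems rather than raising on the first one lets the caller report every failing judge in a single pass before deciding whether to re-run them.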

## Command Workflow

### Step 1: Parse Arguments

Parse `$ARGUMENTS` to extract:
- `spec_path`: first positional argument
- `repo_paths`: remaining positional arguments (must be 2+)
- `labels`: from `--labels` or derived from directory names
- `weights`: from `--weights` or defaults
- `branch`: from `--branch` or `main`

**Default Weights:**

```json
{
  "functionality": 30,
  "security": 25,
  "tests": 20,
  "overengineering": 15,
  "dead_code": 10
}
```
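
A `--weights` override such as `functionality:40,security:30` has to be reconciled with these defaults. The sketch below shows one possible policy: overrides replace the matching defaults and unlisted dimensions keep their default values. Whether the skill merges, replaces, or renormalizes is not stated here, so `parse_weights` and its behavior are hypothetical.

```python
DEFAULT_WEIGHTS = {
    "functionality": 30,
    "security": 25,
    "tests": 20,
    "overengineering": 15,
    "dead_code": 10,
}


def parse_weights(arg=None):
    """Merge a --weights value like "functionality:40,security:30" into the defaults.

    Assumed policy: unknown dimensions are rejected; unlisted ones keep their default.
    """
    weights = dict(DEFAULT_WEIGHTS)
    if not arg:
        return weights
    for pair in arg.split(","):
        name, sep, value = pair.strip().partition(":")
        if not sep or name not in DEFAULT_WEIGHTS:
            raise ValueError(f"bad weight entry: {pair!r}")
        weights[name] = float(value)  # raises ValueError on non-numeric input
    return weights


weights = parse_weights("functionality:40,security:30")
print(sum(weights.values()))  # 115.0
```

Under this policy the example override makes the weights sum to 115 instead of 100, which changes the scale of the weighted total. A caller might renormalize or reject such input, depending on what the skill intends.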

### Step 2: Validate Inputs

```bash
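# Error messages go to stderr so they never mix with captured output.
[ -f "$SPEC_PATH" ] || { echo "Error: Spec file not found: $SPEC_PATH" >&2; exit 1; }

for repo in "${REPO_PATHS[@]}"; do
  # -e rather than -d: in worktrees and submodules, .git is a file.
  [ -e "$repo/.git" ] || { echo "Error: Not a git repository: $repo" >&2; exit 1; }
done

[ "${#REPO_PATHS[@]}" -ge 2 ] || { echo "Error: Need at least 2 repositories to compare" >&2; exit 1; }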
```

### Step 3: Read Spec Document

```bash
SPEC_CONTENT=$(cat "$SPEC_PATH") || { echo "Error: Failed to read spec file: $SPEC_PATH" >&2; exit 1; }
# Use `[ -n ... ] ||` so the line exits 0 when the spec is non-empty.
[ -n "$SPEC_CONTENT" ] || { echo "Error: Spec file is empty: $SPEC_PATH" >&2; exit 1; }
```

### Step 4: Load the Skill

Load this **llm-judge** skill and its reference files into context.

### Step 5: Phase 1 - Gather Facts Per Repo

**If the agent supports subagents**, dispatch one Phase 1 repo agent per repository in parallel; **otherwise** run the same fact-gathering steps sequentially, one repo at a time — the output is identical either way. Give each unit this brief:

```text
You are a Phase 1 Repo Agent for the LLM Judge evaluation.

**Your Repo:** $LABEL at $REPO_PATH

**Spec Document:**
$SPEC_CONTENT

**Instructions:**
1. Load the **llm-judge** skill's references/repo-agent.md for detailed instructions
2. Follow references/fact-schema.md for the output format
3. Load the **llm-artifacts-detection** skill ([../../../beagle-core/skills/llm-artifacts-detection/SKILL.md](../../../beagle-core/skills/llm-artifacts-detection/SKILL.md), if available) for dead-code/overengineering analysis

Explore the repository and gather facts. Return ONLY valid JSON following the fact schema.

Do NOT score or judge. Only gather facts.
```

Collect all repo outputs into `ALL_FACTS`.

### Step 6: Validate Phase 1 Results

```bash
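# printf avoids echo's shell-dependent backslash handling, which can corrupt JSON escapes.
printf '%s\n' "$FACTS" | python3 -c "import json,sys; json.load(sys.stdin)" 2>/dev/null || { echo "Error: Invalid JSON from $LABEL" >&2; exit 1; }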
```
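
The check above covers a single `$FACTS` value and only confirms that it parses, while Gate B also requires the shape to match `references/fact-schema.md`. The sketch below validates every repo's output in one pass. The required keys are passed in by the caller because the schema is not reproduced here; `validate_facts` is a hypothetical helper, and the `files` key in the usage line is a placeholder, not a real schema field.

```python
import json


def validate_facts(raw_by_label, required_keys):
    """Parse each repo agent's raw output; return (facts, errors).

    required_keys should come from references/fact-schema.md. It is a
    parameter because this sketch does not know the real schema.
    """
    facts, errors = {}, []
    for label, raw in raw_by_label.items():
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            errors.append(f"{label}: invalid JSON at line {exc.lineno}")
            continue
        if not isinstance(data, dict):
            errors.append(f"{label}: expected a JSON object")
            continue
        missing = [key for key in required_keys if key not in data]
        if missing:
            errors.append(f"{label}: missing keys {missing}")
            continue
        facts[label] = data
    return facts, errors


# Placeholder input: "files" stands in for whatever the real schema requires.
facts, errors = validate_facts({"repo-a": '{"files": []}'}, required_keys=["files"])
```

Phase 2 would start only when `errors` is empty, and `facts` could then serve as the collected `ALL_FACTS`.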

### Step 7: Phase 2 - Score Per Dimension

**If the agent supports subagents**, dispatch one judge agent per dimension (five total) in parallel; **otherwise** score each dimension sequentially — identical output. Give each unit this brief:

```text
You are the $DIMENSION Judge for the LLM Judge evaluation.

**Spec Document:**
$SPEC_CONTENT

**Facts from all repos:**
$ALL_FACTS_JSON

**Instructions:**
1. Load the **llm-judge** skill's references/judge-agents.md for detailed instructions
2. Follow references/scoring-rubrics.md for the $DIMENSION rubric

Score each repo on $DIMENSION. Return ONLY valid JSON with scores and justifications.
```
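
The `$DIMENSION`, `$SPEC_CONTENT`, and `$ALL_FACTS_JSON` placeholders happen to match the syntax of Python's `string.Template`, so the five briefs can be generated from one template. This is a sketch of that idea, not a description of how the skill fills the brief: `build_judge_briefs` is a hypothetical helper, and the template text is abbreviated.

```python
import json
from string import Template

# Abbreviated version of the judge brief; the instruction list is omitted.
JUDGE_BRIEF = Template(
    "You are the $DIMENSION Judge for the LLM Judge evaluation.\n\n"
    "**Spec Document:**\n$SPEC_CONTENT\n\n"
    "**Facts from all repos:**\n$ALL_FACTS_JSON\n\n"
    "Score each repo on $DIMENSION. "
    "Return ONLY valid JSON with scores and justifications.\n"
)

DIMENSIONS = ["functionality", "security", "tests", "overengineering", "dead_code"]


def build_judge_briefs(spec_content, all_facts):
    """Return {dimension: brief text}, one brief per scoring dimension."""
    facts_json = json.dumps(all_facts, indent=2)
    return {
        dim: JUDGE_BRIEF.substitute(
            DIMENSION=dim, SPEC_CONTENT=spec_content, ALL_FACTS_JSON=facts_json
        )
        for dim in DIMENSIONS
    }
```

`substitute` does not rescan the values it inserts, so a spec that itself contains `$` characters is passed through verbatim. It also raises `KeyError` if a placeholder is left unfilled, which surfaces a malformed brief before any judge runs.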

### Step 8: Aggregate Scores

```python
scores = {}  # repo label -> per-dimension results plus weighted_total

for repo_label in labels:
    scores[repo_label] = {}
    for dimension in dimensions:
        scores[repo_label][dimension] = judge_outputs[dimension]['scores'][repo_label]

    weighted_total = sum(
        scores[repo_label][dim]['score'] * weights[dim] / 100
        for dim in dimensions
    )
    scores[repo_label]['weighted_total'] = round(weighted_total, 2)

ranking = sorted(labels, key=lambda l: scores[l]['weighted_total'],