advanced-evaluation
Advanced Evaluation is a Claude Code skill for implementing production-grade LLM-as-judge evaluation systems. Use it when building automated quality assessment for LLM outputs, conducting pairwise comparisons between model responses, establishing consistent rubrics across evaluation teams, debugging inconsistent evaluation results, designing A/B tests for model changes, or analyzing correlations between automated and human judgments. It provides techniques for direct scoring and pairwise comparison approaches while addressing systematic biases like position bias and length bias.
git clone --depth 1 https://github.com/guanyang/open-agent-hub /tmp/advanced-evaluation && cp -r /tmp/advanced-evaluation/skills/advanced-evaluation ~/.claude/skills/advanced-evaluationSKILL.md
# Advanced Evaluation
This skill covers production-grade techniques for evaluating LLM outputs using LLMs as judges. It synthesizes research from academic papers, industry practices, and practical implementation experience into actionable patterns for building reliable evaluation systems.
**Key insight**: LLM-as-a-Judge is not a single technique but a family of approaches, each suited to different evaluation contexts. Choosing the right approach and mitigating known biases is the core competency this skill develops.
## When to Activate
Activate this skill when:
- Building LLM-as-judge systems for LLM outputs
- Comparing multiple model responses to select the best one
- Establishing consistent quality standards across evaluation teams
- Debugging evaluation systems that show inconsistent results
- Designing A/B tests for prompt or model changes
- Creating rubrics specifically for LLM or human/LLM hybrid judges
- Analyzing correlation between automated and human judgments
Do not activate this skill for adjacent work owned by other skills:
- General deterministic checks, regression suites, production quality gates, or outcome metrics: `evaluation`.
- Autonomous loop governance, locked rubrics, rollback, or PR approval boundaries: `harness-engineering`.
- Tool API contracts for evaluation tools: `tool-design`.
## Core Concepts
### The Evaluation Taxonomy
Select between two primary approaches based on whether ground truth exists:
**Direct Scoring** — Use when objective criteria exist (factual accuracy, instruction following, toxicity). A single LLM rates one response on a defined scale. Achieves moderate-to-high reliability for well-defined criteria. Watch for score calibration drift and inconsistent scale interpretation.
**Pairwise Comparison** — Use for subjective preferences (tone, style, persuasiveness). An LLM compares two responses and selects the better one. Pairwise methods often correlate better with human preference than open-ended direct scoring for subjective tasks (claim-advanced-evaluation-position-swap). Watch for position bias and length bias.
### The Bias Landscape
Mitigate these systematic biases in every evaluation system:
**Position Bias**: First-position responses get preferential treatment. Mitigate by evaluating twice with swapped positions, then apply majority vote or consistency check.
**Length Bias**: Longer responses score higher regardless of quality. Mitigate by explicitly prompting to ignore length and applying length-normalized scoring.
**Self-Enhancement Bias**: Models rate their own outputs higher. Mitigate by using different models for generation and evaluation.
**Verbosity Bias**: Excessive detail scores higher even when unnecessary. Mitigate with criteria-specific rubrics that penalize irrelevant detail.
**Authority Bias**: Confident tone scores higher regardless of accuracy. Mitigate by requiring evidence citation and adding a fact-checking layer.
### Metric Selection Framework
Match metrics to the evaluation task structure:
| Task Type | Primary Metrics | Secondary Metrics |
|-----------|-----------------|-------------------|
| Binary classification (pass/fail) | Recall, Precision, F1 | Cohen's kappa |
| Ordinal scale (1-5 rating) | Spearman's rho, Kendall's tau | Cohen's kappa (weighted) |
| Pairwise preference | Agreement rate, Position consistency | Confidence calibration |
| Multi-label | Macro-F1, Micro-F1 | Per-label precision/recall |
Prioritize systematic disagreement patterns over absolute agreement rates because a judge that consistently disagrees with humans on specific criteria is more problematic than one with random noise.
## Evaluation Approaches
### Direct Scoring Implementation
Build direct scoring with three components: clear criteria, a calibrated scale, and structured output format.
**Criteria Definition Pattern**:
```
Criterion: [Name]
Description: [What this criterion measures]
Weight: [Relative importance, 0-1]
```
**Scale Calibration** — Choose scale granularity based on rubric detail:
- 1-3: Binary with neutral option, lowest cognitive load
- 1-5: Standard Likert, best balance of granularity and reliability
- 1-10: Use only with detailed per-level rubrics because calibration is harder
**Prompt Structure for Direct Scoring**:
```
You are an expert evaluator assessing response quality.
## Task
Evaluate the following response against each criterion.
## Original Prompt
{prompt}
## Response to Evaluate
{response}
## Criteria
{for each criterion: name, description, weight}
## Instructions
For each criterion:
1. Find specific evidence in the response
2. Score according to the rubric (1-{max} scale)
3. Justify your score with evidence
4. Suggest one specific improvement
## Output Format
Respond with structured JSON containing scores, justifications, and summary.
```
Require evidence before the score in scoring prompts so the judge must anchor its decision in observable output features before emitting a number.
### Pairwise Comparison Implementation
Apply position bias mitigation in every pairwise evaluation:
1. Run deterministic pre-checks first: both candidates must satisfy the same schema, source-evidence requirements, and scope constraints.
2. First judge pass: Response A in first position, Response B in second.
3. Second judge pass: Response B in first position, Response A in second.
4. Consistency check: If passes disagree, return TIE with reduced confidence.
5. Final verdict: Consistent winner with averaged confidence and explicit tie-breaker rationale.
**Prompt Structure for Pairwise Comparison**:
```
You are an expert evaluator comparing two AI responses.
## Critical Instructions
- Do NOT prefer responses because they are longer
- Do NOT prefer responses based on position (first vs second)
- Focus ONLY on quality according to the specified criteria
- Ties are acceptable when responses are genuinely equivalent
## Original Prompt
{prompt}
## Response A
{response_a}
## Response B
{response_Principal Software Architect specializing in system design, database modeling, API engineering, and system resilience.
Principal Diagnostics Engineer specializing in root cause analysis, error troubleshooting, and hotfixes.
Principal Clean Code Specialist specializing in code simplification, performance tuning, and refactoring loops.
Senior Technical Lead and Security Auditor specializing in code quality, correctness, and security audits.
Senior QA Automation Engineer specializing in unit, integration, and E2E test suite creation.
Run when user calls /commit or asks to generate a commit message. Analyzes staged changes and writes a structured commit message.
Run when user calls /review. Analyzes local changes and runs a comprehensive code review using the agent-reviewer prompt.
Run when user calls /test-tdd. Scans modified files, locates their corresponding unit/integration test suites, and runs them.