evaluation
This skill provides evaluation frameworks for agent systems that account for non-deterministic behavior and dynamic decision-making. Use it when building systematic testing, regression detection, quality gates, multi-dimensional rubrics, and production monitoring for agent pipelines. Judge design belongs to advanced-evaluation, and control surface engineering belongs to harness-engineering.
git clone --depth 1 https://github.com/guanyang/open-agent-hub /tmp/evaluation && cp -r /tmp/evaluation/skills/evaluation ~/.claude/skills/evaluation

SKILL.md
# Evaluation Methods for Agent Systems

Evaluate agent systems differently from traditional software because agents make dynamic decisions, are non-deterministic between runs, and often lack single correct answers. Build evaluation frameworks that account for these characteristics, provide actionable feedback, catch regressions, and validate that context engineering choices achieve intended effects.

## When to Activate

Activate this skill when:

- Testing agent performance systematically
- Validating context engineering choices
- Measuring improvements over time
- Catching regressions before deployment
- Building quality gates for agent pipelines
- Comparing different agent configurations
- Evaluating production systems continuously

Do not activate this skill for adjacent work owned by other skills:

- Designing the LLM judge itself, pairwise comparison, judge calibration, or bias mitigation: `advanced-evaluation`.
- Designing autonomous control surfaces, novelty gates, rollback, or PR approval boundaries: `harness-engineering`.
- Debugging a specific context failure mode before measuring it: `context-degradation`.

## Core Concepts

Focus evaluation on outcomes rather than execution paths, because agents may find alternative valid routes to goals. Judge whether the agent achieves the right outcome via a reasonable process, not whether it followed a specific sequence of steps.

Use multi-dimensional rubrics instead of single scores because one number hides critical failures in specific dimensions. Capture factual accuracy, completeness, citation accuracy, source quality, and tool efficiency as separate dimensions, then weight them for the use case.

Use model-judged evaluation only after deterministic checks and rubrics are stable. When the work centers on judge prompts, pairwise comparison, calibration, or bias mitigation, switch to Advanced Evaluation.

Run deterministic validation before LLM judgment whenever the artifact has machine-checkable structure.
Schema validity, duplicate keys, rubric math, manifest sync, retrieval status, and required evidence paths should fail fast before an evaluator spends tokens or returns a subjective score.

**Performance Drivers**

Apply browsing-agent research when designing evaluation budgets: token usage, tool calls, and model choice can dominate measured performance variance (claim-evaluation-browsecomp-variance).

| Factor | Variance Explained | Implication |
|--------|-------------------|-------------|
| Token usage | Primary driver | More exploration can improve performance until cost or context quality collapses |
| Number of tool calls | Secondary driver | More tool use helps only when calls retrieve useful evidence |
| Model choice | Secondary but multiplicative | Better models often use tokens and tools more efficiently |

Act on these implications when designing evaluations:

- **Set realistic token budgets**: Evaluate agents with production-realistic token limits, not unlimited resources.
- **Compare model upgrades against token increases**: Better models may use tokens more efficiently than weaker models with larger budgets.
- **Validate multi-agent architectures**: Extra agents add tokens and tool calls; evaluate them against single-agent baselines.

## Detailed Topics

### Evaluation Challenges

**Handle Non-Determinism and Multiple Valid Paths**

Design evaluations that tolerate path variation because agents may take completely different valid paths to reach goals. One agent might search three sources while another searches ten; both may produce correct answers. Avoid checking for specific steps. Instead, define outcome criteria (correctness, completeness, quality) and score against those, treating the execution path as informational rather than evaluative.

**Test Context-Dependent Failures**

Evaluate across a range of complexity levels and interaction lengths because agent failures often depend on context in subtle ways.
An agent might succeed on simple queries but fail on complex ones, work well with one tool set but fail with another, or degrade after extended interaction as context accumulates. Include simple, medium, complex, and very complex test cases to surface these patterns.

**Score Composite Quality Dimensions Separately**

Break agent quality into separate dimensions (factual accuracy, completeness, coherence, tool efficiency, process quality) and score each independently because an agent might score high on accuracy but low on efficiency, or vice versa. Then compute weighted aggregates tuned to use-case priorities. This approach reveals which dimensions need improvement rather than averaging away the signal.

### Evaluation Rubric Design

**Build Multi-Dimensional Rubrics**

Define rubrics covering key dimensions with descriptive levels from excellent to failed. Include these core dimensions and adapt weights per use case:

- Factual accuracy: Claims match ground truth (weight heavily for knowledge tasks)
- Completeness: Output covers requested aspects (weight heavily for research tasks)
- Citation accuracy: Citations match claimed sources (weight for trust-sensitive contexts)
- Source quality: Uses appropriate primary sources (weight for authoritative outputs)
- Tool efficiency: Uses right tools a reasonable number of times (weight for cost-sensitive systems)

**Convert Rubrics to Numeric Scores**

Map dimension assessments to numeric scores (0.0 to 1.0), apply per-dimension weights, and calculate weighted overall scores. Set passing thresholds based on use-case requirements, typically 0.7 for general use and 0.9 for high-stakes applications. Store individual dimension scores alongside the aggregate because the breakdown drives targeted improvement.
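
The rubric-to-score conversion described above can be sketched roughly as follows. This is a minimal illustration, not this skill's actual implementation: the weight values, the level names between "excellent" and "failed", the level-to-score mapping, and the names `WEIGHTS`, `LEVEL_SCORES`, `RubricResult`, and `score_rubric` are all assumptions chosen for the example.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical weights for a research-style task; adapt per use case.
WEIGHTS = {
    "factual_accuracy": 0.30,
    "completeness": 0.25,
    "citation_accuracy": 0.20,
    "source_quality": 0.15,
    "tool_efficiency": 0.10,
}

# Hypothetical mapping from descriptive rubric levels to the 0.0-1.0 scale.
LEVEL_SCORES = {
    "excellent": 1.0,
    "good": 0.75,
    "acceptable": 0.5,
    "poor": 0.25,
    "failed": 0.0,
}


@dataclass
class RubricResult:
    dimension_scores: dict[str, float]  # kept so the breakdown is not lost
    overall: float
    passed: bool


def score_rubric(levels: dict[str, str],
                 weights: dict[str, float] = WEIGHTS,
                 threshold: float = 0.7) -> RubricResult:
    """Convert per-dimension rubric levels into a weighted overall score."""
    missing = set(weights) - set(levels)
    if missing:
        # Fail fast rather than silently scoring an incomplete assessment.
        raise ValueError(f"missing dimensions: {sorted(missing)}")

    dimension_scores = {dim: LEVEL_SCORES[levels[dim]] for dim in weights}
    total_weight = sum(weights.values())
    # Normalize by total weight so the weights need not sum to exactly 1.
    overall = sum(dimension_scores[d] * w for d, w in weights.items()) / total_weight
    return RubricResult(dimension_scores, overall, overall >= threshold)


assessment = {
    "factual_accuracy": "excellent",
    "completeness": "good",
    "citation_accuracy": "good",
    "source_quality": "acceptable",
    "tool_efficiency": "poor",
}
result = score_rubric(assessment)                   # general-use threshold
strict = score_rubric(assessment, threshold=0.9)    # high-stakes threshold
print(round(result.overall, 4), result.passed, strict.passed)
```

With these illustrative numbers the overall score comes to roughly 0.74, which clears the 0.7 general-use threshold but not the 0.9 high-stakes one. The stored `dimension_scores` show that tool efficiency is the weakest dimension, which the aggregate alone would hide.
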
### Evaluation Methodologies

**Use LLM-as-Judge for Scale**

Build LLM-based evaluation prompts that include: clear task description, the agent output under test, ground truth when available, an evaluation scale with explicit level descriptions, and a request for struct