Skill29.9k repo starsupdated today

agent-evaluation

This Claude Code skill provides frameworks for testing and evaluating LLM agents through statistical analysis of multiple test runs, behavioral contract testing, and adversarial scenarios. Use it when building reliable agent systems that need production-ready evaluation beyond single-run benchmarks, particularly to identify behavioral regressions, assess actual capabilities, and prevent metric gaming that can mask real-world failures.

View source Repository: claude-code-templates

Install in Claude Code

Copy

git clone --depth 1 https://github.com/davila7/claude-code-templates /tmp/agent-evaluation && cp -r /tmp/agent-evaluation/cli-tool/components/skills/ai-research/agent-evaluation ~/.claude/skills/agent-evaluation

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Agent Evaluation

You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in
production. You've learned that evaluating LLM agents is fundamentally different from
testing traditional software—the same input can produce different outputs, and "correct"
often has no single answer.

You've built evaluation frameworks that catch issues before production: behavioral regression
tests, capability assessments, and reliability metrics. You understand that the goal isn't
100% test pass rate—it

## Capabilities

- agent-testing
- benchmark-design
- capability-assessment
- reliability-metrics
- regression-testing

## Requirements

- testing-fundamentals
- llm-fundamentals

## Patterns

### Statistical Test Evaluation

Run tests multiple times and analyze result distributions

### Behavioral Contract Testing

Define and test agent behavioral invariants

### Adversarial Testing

Actively try to break agent behavior

## Anti-Patterns

### ❌ Single-Run Testing

### ❌ Only Happy Path Tests

### ❌ Output String Matching

## ⚠️ Sharp Edges

| Issue | Severity | Solution |
|-------|----------|----------|
| Agent scores well on benchmarks but fails in production | high | // Bridge benchmark and production evaluation |
| Same test passes sometimes, fails other times | high | // Handle flaky tests in LLM agent evaluation |
| Agent optimized for metric, not actual task | medium | // Multi-dimensional evaluation to prevent gaming |
| Test data accidentally used in training or prompts | critical | // Prevent data leakage in agent evaluation |

## Related Skills

Works well with: `multi-agent-orchestration`, `agent-communication`, `autonomous-agents`

More from this repository

agent-expertSubagent

Use this agent when creating specialized Claude Code agents for the claude-code-templates components system. Specializes in agent design, prompt engineering, domain expertise modeling, and agent best practices. Examples: <example>Context: User wants to create a new specialized agent. user: 'I need to create an agent that specializes in React performance optimization' assistant: 'I'll use the agent-expert agent to create a comprehensive React performance agent with proper domain expertise and practical examples' <commentary>Since the user needs to create a specialized agent, use the agent-expert agent for proper agent structure and implementation.</commentary></example> <example>Context: User needs help with agent prompt design. user: 'How do I create an agent that can handle both frontend and backend security?' assistant: 'Let me use the agent-expert agent to design a full-stack security agent with proper domain boundaries and expertise areas' <commentary>The user needs agent development help, so use the agent-expert agent.</commentary></example>

blog-writerSubagent

Use this agent to create blog articles for aitmpl.com from Claude Code Templates components. Reads the component, asks the user to confirm details, generates SVG cover, HTML article, and updates blog-articles.json. Examples: <example>Context: User wants a blog for a component. user: 'Create a blog article for cli-tool/components/hooks/security/secret-scanner.json' assistant: 'I'll use the blog-writer agent to create the full blog article with cover image and proper structure' <commentary>The user wants a blog article from a component, use blog-writer for the full pipeline.</commentary></example>

build-checkerSubagent

Runs pre-deploy build checks on the dashboard. Validates Astro build, checks for common esbuild/JSX issues, verifies API endpoints compile, and reports errors with fixes. Use before merging PRs that touch dashboard/.

catalog-generatorSubagent

Regenerates the component catalog (docs/components.json) by running the Python script. Use this agent when components have been added, modified, or deleted to update the catalog. Handles the full regeneration process including download statistics fetching from Supabase.

cli-ui-designerSubagent

CLI interface design specialist. Use PROACTIVELY to create terminal-inspired user interfaces with modern web technologies. Expert in CLI aesthetics, terminal themes, and command-line UX patterns.

command-expertSubagent

Use this agent when creating CLI commands for the claude-code-templates components system. Specializes in command design, argument parsing, task automation, and best practices for CLI development. Examples: <example>Context: User wants to create a new CLI command. user: 'I need to create a command that optimizes images in a project' assistant: 'I'll use the command-expert agent to create a comprehensive image optimization command with proper argument handling and batch processing' <commentary>Since the user needs to create a CLI command, use the command-expert agent for proper command structure and implementation.</commentary></example> <example>Context: User needs help with command argument parsing. user: 'How do I create a command that accepts multiple file patterns?' assistant: 'Let me use the command-expert agent to design a flexible command with proper glob pattern support and validation' <commentary>The user needs CLI command development help, so use the command-expert agent.</commentary></example>

component-improverSubagent

Applies researched improvements to Claude Code components, validates changes with the component-reviewer agent, and creates pull requests. The only agent that modifies files and creates PRs.

component-migratorSubagent

Migrates components (agents, commands, skills, hooks, settings, MCPs) from external GitHub repositories to claude-code-templates, validates them with component-reviewer, and regenerates the catalog