AI Agent Evaluation
Comprehensive evaluation patterns for AI agents including multi-turn conversation testing, LLM-as-judge frameworks, benchmark suites, regression detection, and systematic eval pipelines for measuring agent quality and safety.
git clone --depth 1 https://github.com/PramodDutta/qaskills /tmp/ai-agent-evaluation && cp -r /tmp/ai-agent-evaluation/seed-skills/ai-agent-eval ~/.claude/skills/ai-agent-evaluation
SKILL.md
# AI Agent Evaluation Skill
You are an expert in evaluating AI agents and LLM-powered systems. When the user asks you to build evaluation frameworks, create benchmarks, implement LLM-as-judge patterns, test multi-turn conversations, or measure agent quality, follow these detailed instructions to produce robust, reproducible evaluation systems.
## Core Principles
1. **Deterministic evaluation pipelines** -- Every eval must be reproducible. Pin model versions, temperatures, seed values, and system prompts so results can be compared across runs.
2. **Multi-dimensional scoring** -- Never rely on a single metric. Evaluate correctness, helpfulness, safety, latency, cost, and task completion as separate dimensions.
3. **LLM-as-judge with calibration** -- When using LLMs to judge outputs, calibrate judges against human annotations and measure inter-judge agreement before trusting automated scores.
4. **Golden dataset management** -- Maintain versioned datasets of input/expected-output pairs. Tag each example with difficulty, category, and edge-case classification.
5. **Regression detection over absolute scores** -- Track score changes between agent versions rather than chasing absolute numbers. A 2% drop from a reliable baseline is a more actionable signal than a 90% absolute score viewed in isolation.
6. **Safety and alignment testing** -- Every eval suite must include adversarial inputs, prompt injection attempts, and boundary-testing cases that verify the agent refuses harmful requests.
7. **Statistical rigor** -- Report confidence intervals, run multiple trials, and use proper statistical tests when comparing agent versions. Never declare a winner based on a single run.
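Principles 5 and 7 can be combined into one check: compare two agent versions on the same examples and only flag a regression when the confidence interval of the score difference lies entirely below zero. The sketch below is one way to do that with a paired bootstrap. The names `mulberry32`, `pairedBootstrapDiff`, and `DiffReport` are hypothetical helpers, not files from this skill, and the 95% interval and 2000 resamples are assumed defaults. The seeded generator keeps the resampling reproducible, in line with principle 1.

```typescript
// Hypothetical helpers for illustration; not part of the skill's own files.

/** Small seeded PRNG (mulberry32) so bootstrap resampling is reproducible. */
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface DiffReport {
  meanDiff: number; // candidate minus baseline, averaged over examples
  lower: number; // 2.5th percentile of the bootstrap means
  upper: number; // 97.5th percentile of the bootstrap means
  isRegression: boolean; // true when the whole interval is below zero
}

/** Paired bootstrap over per-example score differences (same examples, two versions). */
function pairedBootstrapDiff(
  baseline: number[],
  candidate: number[],
  iterations = 2000, // assumed default
  seed = 42 // assumed default
): DiffReport {
  if (baseline.length === 0 || baseline.length !== candidate.length) {
    throw new Error('Score arrays must be non-empty and of equal length');
  }
  const diffs = candidate.map((c, i) => c - baseline[i]);
  const meanDiff = diffs.reduce((s, d) => s + d, 0) / diffs.length;

  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let b = 0; b < iterations; b++) {
    // Resample the differences with replacement and record the mean.
    let sum = 0;
    for (let i = 0; i < diffs.length; i++) {
      sum += diffs[Math.floor(rand() * diffs.length)];
    }
    means.push(sum / diffs.length);
  }
  means.sort((x, y) => x - y);

  const lower = means[Math.floor(0.025 * (iterations - 1))];
  const upper = means[Math.floor(0.975 * (iterations - 1))];
  return { meanDiff, lower, upper, isRegression: upper < 0 };
}

// Usage: per-example correctness scores (0-10) for two agent versions.
const report = pairedBootstrapDiff(
  [8, 7, 9, 6, 8, 7, 9, 8],
  [7, 7, 8, 5, 8, 6, 9, 7]
);
console.log(report);
```

Pairing the scores by example removes between-example variance from the comparison, which is why the difference is resampled rather than each version's scores separately.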
## Project Structure
```
evals/
  datasets/
    golden/
      coding-tasks.jsonl
      qa-pairs.jsonl
      multi-turn-conversations.jsonl
      adversarial-inputs.jsonl
      edge-cases.jsonl
    generated/
      synthetic-tasks.jsonl
  judges/
    correctness-judge.ts
    helpfulness-judge.ts
    safety-judge.ts
    code-quality-judge.ts
    composite-judge.ts
  runners/
    eval-runner.ts
    batch-runner.ts
    parallel-runner.ts
  metrics/
    scoring.ts
    statistical.ts
    aggregation.ts
  reports/
    html-reporter.ts
    json-reporter.ts
    regression-detector.ts
  config/
    eval-config.ts
    model-config.ts
  tests/
    judge-calibration.test.ts
    metric-accuracy.test.ts
    pipeline-integration.test.ts
  results/
    .gitkeep
```
## Eval Dataset Format
```typescript
// evals/datasets/types.ts
export interface EvalExample {
  id: string;
  input: string | ConversationTurn[];
  expectedOutput?: string;
  expectedBehavior?: string;
  tags: string[];
  difficulty: 'easy' | 'medium' | 'hard' | 'adversarial';
  category: string;
  metadata?: Record<string, unknown>;
}

export interface ConversationTurn {
  role: 'user' | 'assistant' | 'system';
  content: string;
}

export interface EvalResult {
  exampleId: string;
  agentOutput: string;
  scores: Record<string, number>;
  judgeReasonings: Record<string, string>;
  latencyMs: number;
  tokenUsage: { input: number; output: number };
  timestamp: string;
  agentVersion: string;
  error?: string;
}

export interface EvalSuiteResult {
  suiteId: string;
  agentVersion: string;
  timestamp: string;
  results: EvalResult[];
  aggregateScores: Record<string, AggregateScore>;
  totalExamples: number;
  passedExamples: number;
  failedExamples: number;
  errorExamples: number;
}

export interface AggregateScore {
  mean: number;
  median: number;
  stdDev: number;
  min: number;
  max: number;
  p5: number;
  p95: number;
  confidenceInterval: { lower: number; upper: number };
  sampleSize: number;
}
```
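The `AggregateScore` shape above implies a summarization step over the per-example scores of one dimension. A minimal sketch of that step follows, assuming a sample standard deviation, linear-interpolation percentiles, and a normal-approximation 95% confidence interval; none of these choices is stated in the source. The function names `percentile` and `aggregate` are hypothetical, and the interface is repeated only so the block is self-contained.

```typescript
// Copy of the AggregateScore interface so this sketch compiles on its own.
interface AggregateScore {
  mean: number;
  median: number;
  stdDev: number;
  min: number;
  max: number;
  p5: number;
  p95: number;
  confidenceInterval: { lower: number; upper: number };
  sampleSize: number;
}

/** Linear-interpolation percentile over an already sorted array (p in [0, 1]). */
function percentile(sorted: number[], p: number): number {
  const pos = p * (sorted.length - 1);
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

/** Summarize one score dimension (hypothetical helper, assumed statistics). */
function aggregate(scores: number[]): AggregateScore {
  if (scores.length === 0) {
    throw new Error('Cannot aggregate an empty score list');
  }
  const sorted = [...scores].sort((a, b) => a - b);
  const n = sorted.length;
  const mean = sorted.reduce((s, x) => s + x, 0) / n;

  // Sample variance (n - 1 denominator); defined as 0 for a single score.
  const variance =
    n > 1
      ? sorted.reduce((s, x) => s + (x - mean) * (x - mean), 0) / (n - 1)
      : 0;
  const stdDev = Math.sqrt(variance);

  // Assumption: normal approximation, 95% interval around the mean.
  const halfWidth = 1.96 * (stdDev / Math.sqrt(n));

  return {
    mean,
    median: percentile(sorted, 0.5),
    stdDev,
    min: sorted[0],
    max: sorted[n - 1],
    p5: percentile(sorted, 0.05),
    p95: percentile(sorted, 0.95),
    confidenceInterval: { lower: mean - halfWidth, upper: mean + halfWidth },
    sampleSize: n,
  };
}

// Usage: correctness scores collected from EvalResult.scores.correctness.
const correctness = aggregate([2, 4, 4, 4, 5, 5, 7, 9]);
console.log(correctness.mean, correctness.median, correctness.confidenceInterval);
```

For small sample sizes the normal approximation is optimistic; a t-distribution or bootstrap interval would be the more cautious choice.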
## LLM-as-Judge Implementation
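Core principle 3 asks that a judge be calibrated against human annotations before its scores are trusted. A rough sketch of such a check is given here, assuming both humans and the judge score on the same 0-10 scale and that a score of 7 or more counts as a pass. The threshold, the `calibrate` function, and the `CalibrationReport` type are hypothetical choices for illustration.

```typescript
interface CalibrationReport {
  meanAbsoluteError: number; // average |human - judge| on the 0-10 scale
  agreementRate: number; // fraction of examples with the same pass/fail label
  cohensKappa: number; // pass/fail agreement corrected for chance
}

/** Compare judge scores with human scores for the same examples (hypothetical helper). */
function calibrate(
  human: number[],
  judge: number[],
  passThreshold = 7 // assumed cut-off between fail and pass
): CalibrationReport {
  if (human.length === 0 || human.length !== judge.length) {
    throw new Error('Score arrays must be non-empty and of equal length');
  }
  const n = human.length;
  let absErr = 0;
  let agree = 0;
  let humanPass = 0;
  let judgePass = 0;

  for (let i = 0; i < n; i++) {
    absErr += Math.abs(human[i] - judge[i]);
    const h = human[i] >= passThreshold;
    const j = judge[i] >= passThreshold;
    if (h) humanPass++;
    if (j) judgePass++;
    if (h === j) agree++;
  }

  const observed = agree / n;
  const pH = humanPass / n;
  const pJ = judgePass / n;
  // Chance agreement: both say pass, or both say fail, independently.
  const expected = pH * pJ + (1 - pH) * (1 - pJ);
  // Kappa is undefined when chance agreement is 1; it is reported as 1 here.
  const cohensKappa = expected === 1 ? 1 : (observed - expected) / (1 - expected);

  return { meanAbsoluteError: absErr / n, agreementRate: observed, cohensKappa };
}

// Usage: human annotations versus judge scores for four golden examples.
console.log(calibrate([9, 8, 3, 2], [8, 9, 2, 6]));
```

A low kappa alongside a high raw agreement rate usually means the labels are imbalanced, so both numbers are worth reporting.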
```typescript
// evals/judges/correctness-judge.ts
import Anthropic from '@anthropic-ai/sdk';

export interface JudgeResult {
  score: number; // 0-10 scale
  reasoning: string;
  confidence: number; // 0-1
  flags: string[];
}

export interface JudgeConfig {
  model: string;
  temperature: number;
  maxTokens: number;
  systemPrompt: string;
  scoringRubric: string;
}

const DEFAULT_CORRECTNESS_CONFIG: JudgeConfig = {
  model: 'claude-sonnet-4-20250514',
  temperature: 0,
  maxTokens: 1024,
  systemPrompt: `You are an expert evaluator assessing the correctness of AI agent responses.
You must be objective, precise, and consistent in your scoring.
Always provide a numerical score and detailed reasoning.`,
  scoringRubric: `Score the response on a 0-10 scale:
- 10: Perfectly correct, complete, and well-explained
- 8-9: Correct with minor omissions or imprecisions
- 6-7: Mostly correct but missing important details
- 4-5: Partially correct with significant errors
- 2-3: Mostly incorrect with some relevant elements
- 0-1: Completely incorrect or harmful`,
};

export class CorrectnessJudge {
  private client: Anthropic;
  private config: JudgeConfig;

  constructor(config: Partial<JudgeConfig> = {}) {
    this.client = new Anthropic();
    this.config = { ...DEFAULT_CORRECTNESS_CONFIG, ...config };
  }

  async evaluate(
    input: string,
    agentOutput: string,
    expectedOutput?: string
  ): Promise<JudgeResult> {
    const prompt = this.buildPrompt(input, agentOutput, expectedOutput);
    const response = await this.client.messages.create({
      model: this.config.model,
      max_tokens: this.config.maxTokens,
      temperature: this.config.temperature,
      system: this.config.systemPrompt,
      messages: [{ role: 'user', content: prompt }],
    });
    return this.parseResponse(response);
  }

  private buildPrompt(
    input: string,
    agentOutput: string,
    expectedOutput?: string
  ): string {
    let prompt = `## Task Input\n${input}\n\n## Agent Response\n${agentOutput}\n\n`;
    if (expectedOutput) {
      prompt += `## Expected Output\n${expectedOutput}\n\n`;
    }
    prompt += `## Scoring Rubric\n${this.config.scoringRubric}\n\n`;
    prompt += `## Your Evaluation\nProvide your evaluation in the following JSON format:\n`;
    // NOTE: the source is cut off partway through the next line. The JSON shape and
    // everything after it are reconstructed from the JudgeResult interface above
    // and are assumptions, not the original implementation.
    prompt += `{"score": <0-10>, "reasoning": "<explanation>", "confidence": <0-1>, "flags": ["<issue>"]}`;
    return prompt;
  }

  private parseResponse(response: Anthropic.Message): JudgeResult {
    // Join the text blocks of the reply and take the outermost {...} span as JSON.
    const text = response.content
      .map((block) => (block.type === 'text' ? block.text : ''))
      .join('');
    const match = text.match(/\{[\s\S]*\}/);
    if (!match) {
      throw new Error('Judge response did not contain a JSON object');
    }
    const parsed = JSON.parse(match[0]) as Partial<JudgeResult>;
    return {
      score: Number(parsed.score),
      reasoning: String(parsed.reasoning ?? ''),
      confidence: Number(parsed.confidence ?? 0),
      flags: Array.isArray(parsed.flags) ? parsed.flags : [],
    };
  }
}
```