ClaudeWave
Skill · 145 repo stars · updated yesterday

AI Agent Evaluation

Comprehensive evaluation patterns for AI agents including multi-turn conversation testing, LLM-as-judge frameworks, benchmark suites, regression detection, and systematic eval pipelines for measuring agent quality and safety.

Install in Claude Code
git clone --depth 1 https://github.com/PramodDutta/qaskills /tmp/ai-agent-evaluation && cp -r /tmp/ai-agent-evaluation/seed-skills/ai-agent-eval ~/.claude/skills/ai-agent-evaluation
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# AI Agent Evaluation Skill

You are an expert in evaluating AI agents and LLM-powered systems. When the user asks you to build evaluation frameworks, create benchmarks, implement LLM-as-judge patterns, test multi-turn conversations, or measure agent quality, follow these detailed instructions to produce robust, reproducible evaluation systems.

## Core Principles

1. **Deterministic evaluation pipelines** -- Every eval must be as reproducible as the model allows. Pin model versions, temperatures, seed values (where the provider exposes one), and system prompts so results can be compared across runs.
2. **Multi-dimensional scoring** -- Never rely on a single metric. Evaluate correctness, helpfulness, safety, latency, cost, and task completion as separate dimensions.
3. **LLM-as-judge with calibration** -- When using LLMs to judge outputs, calibrate judges against human annotations and measure inter-judge agreement before trusting automated scores.
4. **Golden dataset management** -- Maintain versioned datasets of input/expected-output pairs. Tag each example with difficulty, category, and edge-case classification.
5. **Regression detection over absolute scores** -- Track score changes between agent versions rather than chasing absolute numbers. A 2% drop against a reliable baseline is a stronger signal than a 90% absolute score viewed in isolation.
6. **Safety and alignment testing** -- Every eval suite must include adversarial inputs, prompt injection attempts, and boundary-testing cases that verify the agent refuses harmful requests.
7. **Statistical rigor** -- Report confidence intervals, run multiple trials, and use proper statistical tests when comparing agent versions. Never declare a winner based on a single run.
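
Principles 5 and 7 can be combined into one check: compare per-example scores from two agent versions and only flag a regression when a confidence interval on the difference excludes zero. The sketch below is one possible way to do that with a seeded bootstrap. The names `detectRegression`, `RegressionCheck`, and `mulberry32`, the 2000 iterations, and the 95% interval are illustrative assumptions, not part of the skill's own files.

```typescript
// Hypothetical helper; the skill's own regression-detector.ts may differ.

/** Small seeded PRNG (mulberry32) so the bootstrap itself is reproducible. */
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const mean = (xs: number[]): number =>
  xs.reduce((sum, x) => sum + x, 0) / xs.length;

/** Draw xs.length values from xs with replacement. */
function resample(xs: number[], rand: () => number): number[] {
  return xs.map(() => xs[Math.floor(rand() * xs.length)]);
}

interface RegressionCheck {
  meanDelta: number; // candidate mean minus baseline mean
  ci: { lower: number; upper: number }; // 95% bootstrap interval on the delta
  regressed: boolean; // true only if the whole interval is below zero
}

function detectRegression(
  baseline: number[],
  candidate: number[],
  iterations = 2000, // assumed default
  seed = 42 // assumed default
): RegressionCheck {
  if (baseline.length === 0 || candidate.length === 0) {
    throw new Error('both score lists must be non-empty');
  }
  const rand = mulberry32(seed);
  const deltas: number[] = [];
  for (let i = 0; i < iterations; i++) {
    deltas.push(mean(resample(candidate, rand)) - mean(resample(baseline, rand)));
  }
  deltas.sort((x, y) => x - y);
  const lower = deltas[Math.floor(0.025 * (iterations - 1))];
  const upper = deltas[Math.ceil(0.975 * (iterations - 1))];
  return {
    meanDelta: mean(candidate) - mean(baseline),
    ci: { lower, upper },
    regressed: upper < 0,
  };
}
```

Because the check looks at the interval rather than the point estimate, a small drop on a noisy suite does not fail the build, while a consistent drop does.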

## Project Structure

```
evals/
  datasets/
    golden/
      coding-tasks.jsonl
      qa-pairs.jsonl
      multi-turn-conversations.jsonl
      adversarial-inputs.jsonl
      edge-cases.jsonl
    generated/
      synthetic-tasks.jsonl
  judges/
    correctness-judge.ts
    helpfulness-judge.ts
    safety-judge.ts
    code-quality-judge.ts
    composite-judge.ts
  runners/
    eval-runner.ts
    batch-runner.ts
    parallel-runner.ts
  metrics/
    scoring.ts
    statistical.ts
    aggregation.ts
  reports/
    html-reporter.ts
    json-reporter.ts
    regression-detector.ts
  config/
    eval-config.ts
    model-config.ts
  tests/
    judge-calibration.test.ts
    metric-accuracy.test.ts
    pipeline-integration.test.ts
  results/
    .gitkeep
```
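
The golden datasets in this layout are `.jsonl` files, one JSON object per line. As a rough sketch of how a runner might read them, the function below parses JSONL text and rejects records that lack the fields the `EvalExample` interface in the Eval Dataset Format section marks as required. `parseJsonl` and `GoldenRecord` are hypothetical names, and the validation rules are an assumption about what a loader would want to check.

```typescript
// Hypothetical loader sketch; reading the file from disk is left to the caller.

interface GoldenRecord {
  id: string;
  input: unknown; // a string or an array of conversation turns
  tags: string[];
  difficulty: string;
  category: string;
}

function parseJsonl(text: string): GoldenRecord[] {
  const records: GoldenRecord[] = [];
  text.split(/\r?\n/).forEach((line, index) => {
    if (line.trim() === '') return; // tolerate blank lines
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      throw new Error(`line ${index + 1}: invalid JSON`);
    }
    if (typeof parsed !== 'object' || parsed === null) {
      throw new Error(`line ${index + 1}: expected a JSON object`);
    }
    const rec = parsed as Partial<GoldenRecord>;
    const valid =
      typeof rec.id === 'string' &&
      rec.input !== undefined &&
      Array.isArray(rec.tags) &&
      typeof rec.difficulty === 'string' &&
      typeof rec.category === 'string';
    if (!valid) {
      throw new Error(`line ${index + 1}: missing or mistyped required field`);
    }
    records.push(rec as GoldenRecord);
  });
  return records;
}
```

Failing on the first malformed line, with its line number, keeps a bad dataset edit from silently shrinking the eval suite.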

## Eval Dataset Format

```typescript
// evals/datasets/types.ts
export interface EvalExample {
  id: string;
  input: string | ConversationTurn[];
  expectedOutput?: string;
  expectedBehavior?: string;
  tags: string[];
  difficulty: 'easy' | 'medium' | 'hard' | 'adversarial';
  category: string;
  metadata?: Record<string, unknown>;
}

export interface ConversationTurn {
  role: 'user' | 'assistant' | 'system';
  content: string;
}

export interface EvalResult {
  exampleId: string;
  agentOutput: string;
  scores: Record<string, number>;
  judgeReasonings: Record<string, string>;
  latencyMs: number;
  tokenUsage: { input: number; output: number };
  timestamp: string;
  agentVersion: string;
  error?: string;
}

export interface EvalSuiteResult {
  suiteId: string;
  agentVersion: string;
  timestamp: string;
  results: EvalResult[];
  aggregateScores: Record<string, AggregateScore>;
  totalExamples: number;
  passedExamples: number;
  failedExamples: number;
  errorExamples: number;
}

export interface AggregateScore {
  mean: number;
  median: number;
  stdDev: number;
  min: number;
  max: number;
  p5: number;
  p95: number;
  confidenceInterval: { lower: number; upper: number };
  sampleSize: number;
}
```
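
To make the `AggregateScore` shape concrete, here is a minimal sketch of how one score dimension could be summarized. It assumes a sample standard deviation (dividing by n - 1), linearly interpolated percentiles, and a normal-approximation 95% confidence interval on the mean. The function names `aggregate` and `percentile` are illustrative, and a real pipeline might prefer a bootstrap interval for small or skewed samples.

```typescript
// Mirrors the AggregateScore interface above so the snippet stands alone.
interface AggregateScore {
  mean: number;
  median: number;
  stdDev: number;
  min: number;
  max: number;
  p5: number;
  p95: number;
  confidenceInterval: { lower: number; upper: number };
  sampleSize: number;
}

/** Percentile of an ascending-sorted array, using linear interpolation. */
function percentile(sorted: number[], p: number): number {
  const rank = (p / 100) * (sorted.length - 1);
  const lo = Math.floor(rank);
  const hi = Math.ceil(rank);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (rank - lo);
}

function aggregate(scores: number[]): AggregateScore {
  if (scores.length === 0) {
    throw new Error('cannot aggregate an empty score list');
  }
  const sorted = [...scores].sort((a, b) => a - b);
  const n = sorted.length;
  const mean = sorted.reduce((sum, x) => sum + x, 0) / n;
  // Sample variance (n - 1); defined as 0 for a single observation.
  const variance =
    n > 1 ? sorted.reduce((sum, x) => sum + (x - mean) ** 2, 0) / (n - 1) : 0;
  const stdDev = Math.sqrt(variance);
  // Assumed: 95% interval via the normal approximation (z = 1.96).
  const margin = 1.96 * (stdDev / Math.sqrt(n));
  return {
    mean,
    median: percentile(sorted, 50),
    stdDev,
    min: sorted[0],
    max: sorted[n - 1],
    p5: percentile(sorted, 5),
    p95: percentile(sorted, 95),
    confidenceInterval: { lower: mean - margin, upper: mean + margin },
    sampleSize: n,
  };
}
```

Calling `aggregate` once per key of `EvalResult.scores` would fill the `aggregateScores` record of an `EvalSuiteResult`.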

## LLM-as-Judge Implementation

```typescript
// evals/judges/correctness-judge.ts
import Anthropic from '@anthropic-ai/sdk';

export interface JudgeResult {
  score: number; // 0-10 scale
  reasoning: string;
  confidence: number; // 0-1
  flags: string[];
}

export interface JudgeConfig {
  model: string;
  temperature: number;
  maxTokens: number;
  systemPrompt: string;
  scoringRubric: string;
}

const DEFAULT_CORRECTNESS_CONFIG: JudgeConfig = {
  model: 'claude-sonnet-4-20250514',
  temperature: 0,
  maxTokens: 1024,
  systemPrompt: `You are an expert evaluator assessing the correctness of AI agent responses. 
You must be objective, precise, and consistent in your scoring.
Always provide a numerical score and detailed reasoning.`,
  scoringRubric: `Score the response on a 0-10 scale:
- 10: Perfectly correct, complete, and well-explained
- 8-9: Correct with minor omissions or imprecisions
- 6-7: Mostly correct but missing important details
- 4-5: Partially correct with significant errors
- 2-3: Mostly incorrect with some relevant elements
- 0-1: Completely incorrect or harmful`,
};

export class CorrectnessJudge {
  private client: Anthropic;
  private config: JudgeConfig;

  constructor(config: Partial<JudgeConfig> = {}) {
    this.client = new Anthropic();
    this.config = { ...DEFAULT_CORRECTNESS_CONFIG, ...config };
  }

  async evaluate(
    input: string,
    agentOutput: string,
    expectedOutput?: string
  ): Promise<JudgeResult> {
    const prompt = this.buildPrompt(input, agentOutput, expectedOutput);

    const response = await this.client.messages.create({
      model: this.config.model,
      max_tokens: this.config.maxTokens,
      temperature: this.config.temperature,
      system: this.config.systemPrompt,
      messages: [{ role: 'user', content: prompt }],
    });

    return this.parseResponse(response);
  }

  private buildPrompt(
    input: string,
    agentOutput: string,
    expectedOutput?: string
  ): string {
    let prompt = `## Task Input\n${input}\n\n## Agent Response\n${agentOutput}\n\n`;

    if (expectedOutput) {
      prompt += `## Expected Output\n${expectedOutput}\n\n`;
    }

    prompt += `## Scoring Rubric\n${this.config.scoringRubric}\n\n`;
    prompt += `## Your Evaluation\nProvide your evaluation in the following JSON format:\n`;
    prompt += `{"sc