AI Agent Evaluation

Comprehensive evaluation patterns for AI agents including multi-turn conversation testing, LLM-as-judge frameworks, benchmark suites, regression detection, and systematic eval pipelines for measuring agent quality and safety.

Repository: qaskills

Install in Claude Code

git clone --depth 1 https://github.com/PramodDutta/qaskills /tmp/ai-agent-evaluation && cp -r /tmp/ai-agent-evaluation/seed-skills/ai-agent-eval ~/.claude/skills/ai-agent-evaluation

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# AI Agent Evaluation Skill

You are an expert in evaluating AI agents and LLM-powered systems. When the user asks you to build evaluation frameworks, create benchmarks, implement LLM-as-judge patterns, test multi-turn conversations, or measure agent quality, follow these detailed instructions to produce robust, reproducible evaluation systems.

## Core Principles

1. **Deterministic evaluation pipelines** -- Every eval must be reproducible. Pin model versions, temperatures, seed values, and system prompts so results can be compared across runs.
2. **Multi-dimensional scoring** -- Never rely on a single metric. Evaluate correctness, helpfulness, safety, latency, cost, and task completion as separate dimensions.
3. **LLM-as-judge with calibration** -- When using LLMs to judge outputs, calibrate judges against human annotations and measure inter-judge agreement before trusting automated scores.
4. **Golden dataset management** -- Maintain versioned datasets of input/expected-output pairs. Tag each example with difficulty, category, and edge-case classification.
5. **Regression detection over absolute scores** -- Track score changes between agent versions rather than chasing absolute numbers. A 2% drop against a stable baseline is a more actionable signal than a 90% absolute score viewed in isolation.
6. **Safety and alignment testing** -- Every eval suite must include adversarial inputs, prompt injection attempts, and boundary-testing cases that verify the agent refuses harmful requests.
7. **Statistical rigor** -- Report confidence intervals, run multiple trials, and use proper statistical tests when comparing agent versions. Never declare a winner based on a single run.
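
Principles 5 and 7 can be combined by comparing two agent versions on the same examples and asking whether the score difference is distinguishable from noise. The block below is a minimal sketch of one way to do that, using a paired bootstrap over per-example score deltas. It is not taken from the skill's repository: `detectRegression`, `mulberry32`, `RegressionReport`, the default of 2000 resamples, and the rule "flag a regression only when the whole 95% interval is below zero" are all illustrative assumptions.

```typescript
// Hypothetical helper: a small seeded PRNG so bootstrap results are reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // uniform in [0, 1)
  };
}

interface RegressionReport {
  meanDelta: number; // candidate mean minus baseline mean
  ci: { lower: number; upper: number }; // 95% bootstrap interval of the mean delta
  regressed: boolean; // true when the entire interval lies below zero
}

// Hypothetical helper: paired bootstrap over per-example score differences.
// Both arrays must be aligned so that index i refers to the same eval example.
function detectRegression(
  baseline: number[],
  candidate: number[],
  { resamples = 2000, seed = 42 }: { resamples?: number; seed?: number } = {}
): RegressionReport {
  if (baseline.length !== candidate.length || baseline.length === 0) {
    throw new Error('Expected two non-empty score arrays over the same examples');
  }
  const deltas = candidate.map((score, i) => score - baseline[i]);
  const meanDelta = deltas.reduce((sum, d) => sum + d, 0) / deltas.length;

  // Resample the deltas with replacement and record the mean of each resample.
  const rand = mulberry32(seed);
  const bootMeans: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let i = 0; i < deltas.length; i++) {
      sum += deltas[Math.floor(rand() * deltas.length)];
    }
    bootMeans.push(sum / deltas.length);
  }

  // Take the 2.5th and 97.5th percentiles of the resampled means.
  bootMeans.sort((a, b) => a - b);
  const lower = bootMeans[Math.floor(0.025 * (resamples - 1))];
  const upper = bootMeans[Math.ceil(0.975 * (resamples - 1))];
  return { meanDelta, ci: { lower, upper }, regressed: upper < 0 };
}
```

Pairing by example removes variance from example difficulty, which is why the arrays must be aligned by example id. Fixing the seed keeps the report itself reproducible, in line with principle 1.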

## Project Structure

```
evals/
  datasets/
    golden/
      coding-tasks.jsonl
      qa-pairs.jsonl
      multi-turn-conversations.jsonl
      adversarial-inputs.jsonl
      edge-cases.jsonl
    generated/
      synthetic-tasks.jsonl
  judges/
    correctness-judge.ts
    helpfulness-judge.ts
    safety-judge.ts
    code-quality-judge.ts
    composite-judge.ts
  runners/
    eval-runner.ts
    batch-runner.ts
    parallel-runner.ts
  metrics/
    scoring.ts
    statistical.ts
    aggregation.ts
  reports/
    html-reporter.ts
    json-reporter.ts
    regression-detector.ts
  config/
    eval-config.ts
    model-config.ts
  tests/
    judge-calibration.test.ts
    metric-accuracy.test.ts
    pipeline-integration.test.ts
  results/
    .gitkeep
```
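
The golden datasets above are `.jsonl` files, so each line is expected to hold one JSON-encoded example. As a rough sketch of how a loader might validate them before a run, the block below parses JSONL text and rejects malformed rows with a line number. `parseJsonl` and `DatasetRow` are hypothetical names, not part of the skill. The checked fields mirror the `EvalExample` type from the Eval Dataset Format section, and reading the file from disk is left to the caller.

```typescript
// Minimal local shape so the snippet stands alone; it mirrors the required
// fields of EvalExample in evals/datasets/types.ts.
interface DatasetRow {
  id: string;
  input: unknown;
  tags: string[];
  difficulty: 'easy' | 'medium' | 'hard' | 'adversarial';
  category: string;
}

const DIFFICULTIES = ['easy', 'medium', 'hard', 'adversarial'];

// Hypothetical helper: parse JSONL text and fail loudly on the first bad row.
function parseJsonl(text: string): DatasetRow[] {
  const rows: DatasetRow[] = [];
  const seenIds = new Set<string>();

  text.split(/\r?\n/).forEach((line, index) => {
    if (line.trim() === '') return; // tolerate blank lines and a trailing newline
    const lineNo = index + 1;

    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      throw new Error(`Line ${lineNo}: invalid JSON`);
    }
    if (typeof parsed !== 'object' || parsed === null) {
      throw new Error(`Line ${lineNo}: expected a JSON object`);
    }

    const row = parsed as Partial<DatasetRow>;
    if (typeof row.id !== 'string' || row.id === '') {
      throw new Error(`Line ${lineNo}: missing id`);
    }
    if (seenIds.has(row.id)) {
      throw new Error(`Line ${lineNo}: duplicate id "${row.id}"`);
    }
    if (row.input === undefined) {
      throw new Error(`Line ${lineNo}: missing input`);
    }
    if (!Array.isArray(row.tags)) {
      throw new Error(`Line ${lineNo}: tags must be an array`);
    }
    if (typeof row.difficulty !== 'string' || DIFFICULTIES.indexOf(row.difficulty) === -1) {
      throw new Error(`Line ${lineNo}: unknown difficulty`);
    }
    if (typeof row.category !== 'string') {
      throw new Error(`Line ${lineNo}: missing category`);
    }

    seenIds.add(row.id);
    rows.push(row as DatasetRow);
  });

  return rows;
}
```

Rejecting duplicate ids at load time matters because results are keyed by example id. A silent duplicate would double-count one example in every aggregate.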

## Eval Dataset Format

```typescript
// evals/datasets/types.ts
export interface EvalExample {
  id: string;
  input: string | ConversationTurn[];
  expectedOutput?: string;
  expectedBehavior?: string;
  tags: string[];
  difficulty: 'easy' | 'medium' | 'hard' | 'adversarial';
  category: string;
  metadata?: Record<string, unknown>;
}

export interface ConversationTurn {
  role: 'user' | 'assistant' | 'system';
  content: string;
}

export interface EvalResult {
  exampleId: string;
  agentOutput: string;
  scores: Record<string, number>;
  judgeReasonings: Record<string, string>;
  latencyMs: number;
  tokenUsage: { input: number; output: number };
  timestamp: string;
  agentVersion: string;
  error?: string;
}

export interface EvalSuiteResult {
  suiteId: string;
  agentVersion: string;
  timestamp: string;
  results: EvalResult[];
  aggregateScores: Record<string, AggregateScore>;
  totalExamples: number;
  passedExamples: number;
  failedExamples: number;
  errorExamples: number;
}

export interface AggregateScore {
  mean: number;
  median: number;
  stdDev: number;
  min: number;
  max: number;
  p5: number;
  p95: number;
  confidenceInterval: { lower: number; upper: number };
  sampleSize: number;
}
```
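
To make the `AggregateScore` fields concrete, here is one possible way to compute them from a list of per-example scores. This is a sketch under stated assumptions rather than the skill's actual `metrics/aggregation.ts`. The names `aggregate` and `percentile` are hypothetical. The standard deviation uses the sample form (n - 1), percentiles use linear interpolation, and the confidence interval is a normal-approximation 95% interval (mean ± 1.96 standard errors), which is only reasonable for moderately large samples.

```typescript
// Copied from the types above so the snippet compiles on its own.
interface AggregateScore {
  mean: number;
  median: number;
  stdDev: number;
  min: number;
  max: number;
  p5: number;
  p95: number;
  confidenceInterval: { lower: number; upper: number };
  sampleSize: number;
}

// Linear-interpolated percentile over an ascending-sorted array (p in [0, 1]).
function percentile(sorted: number[], p: number): number {
  const rank = p * (sorted.length - 1);
  const lo = Math.floor(rank);
  const hi = Math.ceil(rank);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (rank - lo);
}

// Hypothetical helper: summarize one scoring dimension across all examples.
function aggregate(scores: number[]): AggregateScore {
  if (scores.length === 0) {
    throw new Error('Cannot aggregate an empty score list');
  }
  const sorted = [...scores].sort((a, b) => a - b);
  const n = sorted.length;
  const mean = sorted.reduce((sum, x) => sum + x, 0) / n;

  // Sample variance (n - 1); defined as 0 for a single observation.
  const variance =
    n > 1 ? sorted.reduce((sum, x) => sum + (x - mean) ** 2, 0) / (n - 1) : 0;
  const stdDev = Math.sqrt(variance);

  // Assumption: normal approximation for the 95% interval of the mean.
  const halfWidth = (1.96 * stdDev) / Math.sqrt(n);

  return {
    mean,
    median: percentile(sorted, 0.5),
    stdDev,
    min: sorted[0],
    max: sorted[n - 1],
    p5: percentile(sorted, 0.05),
    p95: percentile(sorted, 0.95),
    confidenceInterval: { lower: mean - halfWidth, upper: mean + halfWidth },
    sampleSize: n,
  };
}
```

For small suites or heavily skewed scores, a bootstrap interval may be a better fit than the normal approximation used here.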

## LLM-as-Judge Implementation

```typescript
// evals/judges/correctness-judge.ts
import Anthropic from '@anthropic-ai/sdk';

export interface JudgeResult {
  score: number; // 0-10 scale
  reasoning: string;
  confidence: number; // 0-1
  flags: string[];
}

export interface JudgeConfig {
  model: string;
  temperature: number;
  maxTokens: number;
  systemPrompt: string;
  scoringRubric: string;
}

const DEFAULT_CORRECTNESS_CONFIG: JudgeConfig = {
  model: 'claude-sonnet-4-20250514',
  temperature: 0,
  maxTokens: 1024,
  systemPrompt: `You are an expert evaluator assessing the correctness of AI agent responses.
You must be objective, precise, and consistent in your scoring.
Always provide a numerical score and detailed reasoning.`,
  scoringRubric: `Score the response on a 0-10 scale:
- 10: Perfectly correct, complete, and well-explained
- 8-9: Correct with minor omissions or imprecisions
- 6-7: Mostly correct but missing important details
- 4-5: Partially correct with significant errors
- 2-3: Mostly incorrect with some relevant elements
- 0-1: Completely incorrect or harmful`,
};

export class CorrectnessJudge {
  private client: Anthropic;
  private config: JudgeConfig;

  constructor(config: Partial<JudgeConfig> = {}) {
    this.client = new Anthropic();
    this.config = { ...DEFAULT_CORRECTNESS_CONFIG, ...config };
  }

  async evaluate(
    input: string,
    agentOutput: string,
    expectedOutput?: string
  ): Promise<JudgeResult> {
    const prompt = this.buildPrompt(input, agentOutput, expectedOutput);

    const response = await this.client.messages.create({
      model: this.config.model,
      max_tokens: this.config.maxTokens,
      temperature: this.config.temperature,
      system: this.config.systemPrompt,
      messages: [{ role: 'user', content: prompt }],
    });

    return this.parseResponse(response);
  }

  private buildPrompt(
    input: string,
    agentOutput: string,
    expectedOutput?: string
  ): string {
    let prompt = `## Task Input\n${input}\n\n## Agent Response\n${agentOutput}\n\n`;

    if (expectedOutput) {
      prompt += `## Expected Output\n${expectedOutput}\n\n`;
    }

    prompt += `## Scoring Rubric\n${this.config.scoringRubric}\n\n`;
    prompt += `## Your Evaluation\nProvide your evaluation in the following JSON format:\n`;
    prompt += `{"sc