Skill · 119 repo stars · updated 2d ago

ai-eval-ci

**ai-eval-ci** is a Claude Code skill that implements automated testing for AI agents and language model outputs within continuous integration pipelines, using frameworks like Promptfoo to compare model performance against baselines and fail builds when quality degrades. Use it to establish quality gates before production deployment, catch prompt regressions when models or system prompts change, benchmark multiple LLM providers, validate RAG pipeline accuracy, or measure agent tool-calling performance.

View source · Repository: skills

Install in Claude Code

git clone --depth 1 https://github.com/TerminalSkills/skills /tmp/ai-eval-ci && cp -r /tmp/ai-eval-ci/skills/ai-eval-ci ~/.claude/skills/ai-eval-ci

Then open a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# AI Eval in CI

## Overview

Test AI agents and LLM outputs the same way you test code: automated evaluations that run in CI, compare against baselines, and fail the build when quality drops. No dashboards to check manually. Just one command, such as `npx promptfoo@latest eval`, and a red or green build.

## When to Use

- Adding quality gates before deploying AI features to production
- Catching prompt regressions when system prompts or models change
- Comparing model performance (GPT-4o vs Claude Sonnet vs local Llama)
- Validating RAG pipeline accuracy against a test dataset
- Benchmarking agent tool-calling accuracy and latency

## Instructions

### Strategy 1: Promptfoo (Config-Driven Evals)

Promptfoo is a widely used open-source eval framework. Define test cases in YAML, run them against multiple providers, and get a comparison matrix of results per provider.

```yaml
# promptfooconfig.yaml — Eval configuration
# Tests a customer support agent across 3 models with quality assertions
description: "Customer support agent eval"

providers:
  - id: openai:gpt-4o
  - id: anthropic:messages:claude-sonnet-4-20250514
  - id: ollama:llama3.1:8b

prompts:
  - |
    You are a customer support agent for a SaaS product.
    Respond helpfully and accurately. If you don't know, say so.
    
    Customer message: {{message}}

tests:
  - vars:
      message: "How do I reset my password?"
    assert:
      - type: llm-rubric
        value: "Response explains the password reset process clearly"
      - type: not-contains
        value: "I don't know"
      - type: latency
        threshold: 3000  # Must respond within 3 seconds

  - vars:
      message: "Can I get a refund for my annual plan?"
    assert:
      - type: llm-rubric
        value: "Response acknowledges the refund request and explains the policy"
      - type: not-contains
        value: "I'm an AI"  # Don't break character

  - vars:
      message: "Your product deleted all my data!"
    assert:
      - type: llm-rubric
        value: "Response shows empathy, takes the issue seriously, and offers next steps"
      - type: llm-rubric
        value: "Tone is not dismissive or defensive"  # Must not be dismissive

  - vars:
      message: "What's the weather in Tokyo?"
    assert:
      - type: llm-rubric
        value: "Response politely redirects to product-related topics"
      - type: not-contains
        value: "Tokyo"  # Should not answer off-topic questions
```

```bash
# Run evals locally
npx promptfoo@latest eval

# Run in CI: exits non-zero if any test fails; --no-cache so latency assertions measure real calls
npx promptfoo@latest eval --no-cache --output results.json

# Compare two prompt versions
npx promptfoo@latest eval --prompts prompt-v1.txt prompt-v2.txt --share
```
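
By default the run fails if any single test fails. If you would rather gate on a per-provider pass rate, a small post-processing step can do it. The sketch below is a minimal illustration: the `ResultRow` shape and both function names are hypothetical and are not Promptfoo's actual output schema, so you would first need to map `results.json` into rows like these.

```typescript
// Hypothetical row shape for illustration; NOT Promptfoo's actual output schema.
interface ResultRow {
  provider: string;
  pass: boolean;
}

/** Compute the fraction of passing rows for each provider. */
function passRateByProvider(rows: ResultRow[]): Record<string, number> {
  const passed: Record<string, number> = {};
  const total: Record<string, number> = {};
  for (const row of rows) {
    total[row.provider] = (total[row.provider] ?? 0) + 1;
    passed[row.provider] = (passed[row.provider] ?? 0) + (row.pass ? 1 : 0);
  }
  const rates: Record<string, number> = {};
  for (const provider of Object.keys(total)) {
    rates[provider] = passed[provider] / total[provider];
  }
  return rates;
}

/** Return the providers whose pass rate falls below minRate (0-1). */
function providersBelow(rates: Record<string, number>, minRate: number): string[] {
  return Object.keys(rates).filter((provider) => rates[provider] < minRate);
}

// Example: one provider passes everything, the other passes half its tests.
const exampleRows: ResultRow[] = [
  { provider: "openai:gpt-4o", pass: true },
  { provider: "openai:gpt-4o", pass: true },
  { provider: "ollama:llama3.1:8b", pass: true },
  { provider: "ollama:llama3.1:8b", pass: false },
];
const exampleRates = passRateByProvider(exampleRows);
console.log(providersBelow(exampleRates, 0.75)); // [ 'ollama:llama3.1:8b' ]
```

In a CI script you could set `process.exitCode = 1` when `providersBelow` returns a non-empty list, which keeps the gate decision separate from the eval run itself.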

### Strategy 2: Custom Eval Framework (TypeScript)

When you need full control — custom scoring logic, database-backed test sets, domain-specific metrics.

```typescript
// eval.ts — Custom AI eval framework with CI integration
/**
 * Runs evaluation suites against AI agents/LLMs.
 * Each eval defines inputs, expected behavior, and scoring criteria.
 * Exits with code 1 if any score drops below threshold.
 */
import OpenAI from "openai";

interface EvalCase {
  name: string;
  input: string;
  rubric: string;          // What "good" looks like
  threshold: number;       // Minimum score 0-1
  metadata?: Record<string, unknown>;
}

interface EvalResult {
  name: string;
  score: number;
  pass: boolean;
  output: string;
  reasoning: string;
  latencyMs: number;
}

const openai = new OpenAI();

/**
 * Score an AI output using LLM-as-judge.
 * Returns a score 0-1 with reasoning.
 */
async function judge(output: string, rubric: string): Promise<{ score: number; reasoning: string }> {
  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini",  // Cheap model for judging
    messages: [
      {
        role: "system",
        content: `You are an eval judge. Score the AI output against the rubric.
Return JSON: {"score": 0.0-1.0, "reasoning": "brief explanation"}
Score 1.0 = perfect match. Score 0.0 = complete failure.`,
      },
      {
        role: "user",
        content: `Rubric: ${rubric}\n\nAI Output:\n${output}`,
      },
    ],
    response_format: { type: "json_object" },
    temperature: 0,  // Deterministic judging
  });

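  // Parse defensively: a missing or malformed score counts as 0, and scores are clamped to 0-1
  const parsed = JSON.parse(response.choices[0].message.content ?? "{}");
  const score = Math.min(1, Math.max(0, Number(parsed.score) || 0));
  return { score, reasoning: String(parsed.reasoning ?? "") };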
}

/**
 * Run a single eval case against your AI agent.
 */
async function runEval(
  agentFn: (input: string) => Promise<string>,
  evalCase: EvalCase
): Promise<EvalResult> {
  const start = Date.now();
  const output = await agentFn(evalCase.input);
  const latencyMs = Date.now() - start;

  const { score, reasoning } = await judge(output, evalCase.rubric);

  return {
    name: evalCase.name,
    score,
    pass: score >= evalCase.threshold,
    output: output.slice(0, 200),
    reasoning,
    latencyMs,
  };
}

/**
 * Run all evals and exit with appropriate code for CI.
 */
async function runSuite(
  agentFn: (input: string) => Promise<string>,
  cases: EvalCase[]
): Promise<void> {
  console.log(`Running ${cases.length} evals...\n`);

  const results: EvalResult[] = [];
  for (const evalCase of cases) {
    const result = await runEval(agentFn, evalCase);
    results.push(result);
    const icon = result.pass ? "✅" : "❌";
    console.log(`${icon} ${result.name}: ${result.score.toFixed(2)} (threshold: ${evalCase.threshold}) [${result.latencyMs}ms]`);
    if (!result.pass) {
      console.log(`   Reasoning: ${result.reasoning}`);
    }
  }

  // Summary
  const passed = results.filter((r) => r.pass).length;
  const failed = results.filter((r) => !r.pass).length;
  // Guard against an empty suite so the average is 0 rather than NaN
  const avgScore = results.length > 0 ? results.reduce((s, r) => s + r.score, 0) / results.length : 0;

  console.log(`\n📊 Results: ${passed} passed, ${failed} failed (avg score: ${avgScore.toFixed(2)})`);

  // CI exit code
  if (failed > 0) {
    console.log("\n❌ Eval suite FAILED — quality below threshold");
    process.exit(1);
  } else {
    console.log("\n✅ Eval suite PASSED");
  }
}

export { runSuite };
export type { EvalCase, EvalResult };  // Interfaces are types, so export them as types
```
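
The suite above gates on an absolute threshold per case. The Overview also mentions comparing against baselines, which could look roughly like the sketch below. This is an assumption-laden illustration, not part of `eval.ts`: the `Scores` shape, the `findRegressions` name, and the default `tolerance` of 0.05 are all hypothetical choices.

```typescript
// Hypothetical shapes for illustration; not part of eval.ts above.
type Scores = Record<string, number>; // eval name -> score in [0, 1]

interface Regression {
  name: string;
  baseline: number;
  current: number;
}

/**
 * Return evals whose current score dropped more than `tolerance` below baseline.
 * Evals missing from the baseline are treated as new and skipped.
 */
function findRegressions(baseline: Scores, current: Scores, tolerance = 0.05): Regression[] {
  const regressions: Regression[] = [];
  for (const name of Object.keys(current)) {
    const base = baseline[name];
    if (base === undefined) continue; // New eval, nothing to compare against
    if (base - current[name] > tolerance) {
      regressions.push({ name, baseline: base, current: current[name] });
    }
  }
  return regressions;
}

// Example: "refund" drops by 0.2, "password-reset" stays within tolerance.
const baselineScores: Scores = { "password-reset": 0.9, refund: 0.8 };
const currentScores: Scores = { "password-reset": 0.88, refund: 0.6, "off-topic": 0.95 };
for (const r of findRegressions(baselineScores, currentScores)) {
  console.log(`${r.name}: ${r.baseline.toFixed(2)} -> ${r.current.toFixed(2)}`);
}
```

A tolerance is useful here because LLM-as-judge scores can vary slightly between runs; failing on any decrease at all would likely produce flaky builds. The baseline itself could be a JSON file committed from the last passing run on the main branch.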

### Strategy 3:
