Install in Claude Code
git clone --depth 1 https://github.com/TerminalSkills/skills /tmp/ai-eval-ci && cp -r /tmp/ai-eval-ci/skills/ai-eval-ci ~/.claude/skills/ai-eval-ci
Then open a new Claude Code session; the skill loads automatically.
Definition
SKILL.md
# AI Eval in CI
## Overview
Test AI agents and LLM outputs the same way you test code — automated evaluations that run in CI, compare against baselines, and fail the build when quality drops. No dashboards to check manually. Just `npx eval run --ci` and a red or green build.
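As a rough illustration of "compare against baselines, and fail the build when quality drops", the sketch below compares per-eval scores from the current run with stored baseline scores. The `Scores` shape, the `findRegressions` name, and the 0.05 tolerance are assumptions made for this example, not part of any particular tool.

```typescript
// Hypothetical shape: eval name -> score in [0, 1].
type Scores = Record<string, number>;

interface Regression {
  name: string;
  baseline: number;
  current: number;
  drop: number;
}

/**
 * Return every eval whose score fell by more than `tolerance`
 * relative to the baseline. Evals missing from the current run
 * are treated as a score of 0.
 */
function findRegressions(
  baseline: Scores,
  current: Scores,
  tolerance = 0.05
): Regression[] {
  const regressions: Regression[] = [];
  for (const [name, base] of Object.entries(baseline)) {
    const cur = current[name] ?? 0;
    const drop = base - cur;
    if (drop > tolerance) {
      regressions.push({ name, baseline: base, current: cur, drop });
    }
  }
  return regressions;
}

// Example usage with made-up scores.
const baselineScores: Scores = { "password-reset": 0.9, "refund-policy": 0.8 };
const currentScores: Scores = { "password-reset": 0.92, "refund-policy": 0.6 };
const regressions = findRegressions(baselineScores, currentScores);

// In a CI script, a non-empty list would map to a non-zero exit code.
const exitCode = regressions.length > 0 ? 1 : 0;
```

A small tolerance keeps ordinary run-to-run noise in LLM-judged scores from failing the build; the right value depends on how stable your scores are.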
## When to Use
- Adding quality gates before deploying AI features to production
- Catching prompt regressions when system prompts or models change
- Comparing model performance (GPT-4o vs Claude Sonnet vs local Llama)
- Validating RAG pipeline accuracy against a test dataset
- Benchmarking agent tool-calling accuracy and latency
## Instructions
### Strategy 1: Promptfoo (Config-Driven Evals)
Promptfoo is a widely used open-source eval framework. Define test cases in YAML, run them against multiple providers, and get a side-by-side comparison matrix.
```yaml
# promptfooconfig.yaml: eval configuration
# Tests a customer support agent across 3 models with quality assertions
description: "Customer support agent eval"

providers:
  - id: openai:gpt-4o
  - id: anthropic:messages:claude-sonnet-4-20250514
  - id: ollama:llama3.1:8b

prompts:
  - |
    You are a customer support agent for a SaaS product.
    Respond helpfully and accurately. If you don't know, say so.

    Customer message: {{message}}

tests:
  - vars:
      message: "How do I reset my password?"
    assert:
      - type: llm-rubric
        value: "Response explains the password reset process clearly"
      - type: not-contains
        value: "I don't know"
      - type: latency
        threshold: 3000 # Must respond within 3 seconds

  - vars:
      message: "Can I get a refund for my annual plan?"
    assert:
      - type: llm-rubric
        value: "Response acknowledges the refund request and explains the policy"
      - type: not-contains
        value: "I'm an AI" # Don't break character

  - vars:
      message: "Your product deleted all my data!"
    assert:
      - type: llm-rubric
        value: "Response shows empathy, takes the issue seriously, and offers next steps"
      - type: llm-rubric
        value: "Response is not dismissive of the customer's problem" # Must not be dismissive

  - vars:
      message: "What's the weather in Tokyo?"
    assert:
      - type: llm-rubric
        value: "Response politely redirects to product-related topics"
      - type: not-contains
        value: "Tokyo" # Should not answer off-topic questions
```
```bash
# Run evals locally
npx promptfoo@latest eval
# Run in CI: promptfoo exits with a non-zero code if any test fails
npx promptfoo@latest eval --no-progress-bar --output results.json
# Compare two prompt versions
npx promptfoo@latest eval --prompts prompt-v1.txt prompt-v2.txt --share
```
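To make the deterministic assertion types used in the config above (`not-contains`, `latency`) more concrete, here is a simplified sketch of what such checks amount to. This is not Promptfoo's implementation; `ModelRun`, `Assertion`, `check`, and `passes` are hypothetical names introduced only for illustration.

```typescript
// Hypothetical, simplified assertion checks; not Promptfoo's actual code.
interface ModelRun {
  output: string; // Text returned by the model
  latencyMs: number; // Wall-clock time of the call
}

type Assertion =
  | { type: "not-contains"; value: string }
  | { type: "latency"; threshold: number };

/** Return true when the run satisfies a single assertion. */
function check(run: ModelRun, assertion: Assertion): boolean {
  switch (assertion.type) {
    case "not-contains":
      // Case-sensitive substring check
      return !run.output.includes(assertion.value);
    case "latency":
      // Threshold is in milliseconds
      return run.latencyMs <= assertion.threshold;
  }
}

/** A test case passes only if every one of its assertions passes. */
function passes(run: ModelRun, assertions: Assertion[]): boolean {
  return assertions.every((a) => check(run, a));
}

// Example: the off-topic "weather in Tokyo" case from the config.
const sampleRun: ModelRun = {
  output: "I can only help with questions about our product.",
  latencyMs: 1200,
};
const offTopicOk = passes(sampleRun, [
  { type: "not-contains", value: "Tokyo" },
  { type: "latency", threshold: 3000 },
]);
```

Deterministic checks like these are cheap and repeatable, so they pair well with `llm-rubric` assertions, which cost an extra model call and can vary between runs.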
### Strategy 2: Custom Eval Framework (TypeScript)
Use a custom framework when you need full control: custom scoring logic, database-backed test sets, or domain-specific metrics.
```typescript
// eval.ts: custom AI eval framework with CI integration
/**
 * Runs evaluation suites against AI agents/LLMs.
 * Each eval defines inputs, expected behavior, and scoring criteria.
 * Exits with code 1 if any score drops below threshold.
 */
import OpenAI from "openai";

interface EvalCase {
  name: string;
  input: string;
  rubric: string; // What "good" looks like
  threshold: number; // Minimum score 0-1
  metadata?: Record<string, unknown>;
}

interface EvalResult {
  name: string;
  score: number;
  pass: boolean;
  output: string;
  reasoning: string;
  latencyMs: number;
}

const openai = new OpenAI();

/**
 * Score an AI output using LLM-as-judge.
 * Returns a score 0-1 with reasoning.
 */
async function judge(
  output: string,
  rubric: string
): Promise<{ score: number; reasoning: string }> {
  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini", // Cheap model for judging
    messages: [
      {
        role: "system",
        content: `You are an eval judge. Score the AI output against the rubric.
Return JSON: {"score": 0.0-1.0, "reasoning": "brief explanation"}
Score 1.0 = perfect match. Score 0.0 = complete failure.`,
      },
      {
        role: "user",
        content: `Rubric: ${rubric}\n\nAI Output:\n${output}`,
      },
    ],
    response_format: { type: "json_object" },
    temperature: 0, // Deterministic judging
  });
  return JSON.parse(response.choices[0].message.content!);
}

/**
 * Run a single eval case against your AI agent.
 */
async function runEval(
  agentFn: (input: string) => Promise<string>,
  evalCase: EvalCase
): Promise<EvalResult> {
  const start = Date.now();
  const output = await agentFn(evalCase.input);
  const latencyMs = Date.now() - start;
  const { score, reasoning } = await judge(output, evalCase.rubric);
  return {
    name: evalCase.name,
    score,
    pass: score >= evalCase.threshold,
    output: output.slice(0, 200),
    reasoning,
    latencyMs,
  };
}

/**
 * Run all evals and exit with appropriate code for CI.
 */
async function runSuite(
  agentFn: (input: string) => Promise<string>,
  cases: EvalCase[]
): Promise<void> {
  console.log(`Running ${cases.length} evals...\n`);
  const results: EvalResult[] = [];

  for (const evalCase of cases) {
    const result = await runEval(agentFn, evalCase);
    results.push(result);
    const icon = result.pass ? "✅" : "❌";
    console.log(
      `${icon} ${result.name}: ${result.score.toFixed(2)} (threshold: ${evalCase.threshold}) [${result.latencyMs}ms]`
    );
    if (!result.pass) {
      console.log(`   Reasoning: ${result.reasoning}`);
    }
  }

  // Summary
  const passed = results.filter((r) => r.pass).length;
  const failed = results.filter((r) => !r.pass).length;
  // Guard against division by zero when the suite is empty
  const avgScore =
    results.length > 0
      ? results.reduce((s, r) => s + r.score, 0) / results.length
      : 0;
  console.log(
    `\n📊 Results: ${passed} passed, ${failed} failed (avg score: ${avgScore.toFixed(2)})`
  );

  // CI exit code
  if (failed > 0) {
    console.log("\n❌ Eval suite FAILED: quality below threshold");
    process.exit(1);
  } else {
    console.log("\n✅ Eval suite PASSED");
  }
}

// EvalCase is an interface, so export it as a type
export { runSuite };
export type { EvalCase };
```
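The `judge` function above trusts the model to return well-formed JSON with a score between 0 and 1. A minimal sketch of a more defensive parser follows, assuming you would rather fail loudly on a malformed verdict than let a bad score flow through the suite. `parseJudgeResponse` and `JudgeVerdict` are hypothetical names, not part of the original code.

```typescript
interface JudgeVerdict {
  score: number;
  reasoning: string;
}

/**
 * Parse the judge's raw JSON reply defensively.
 * Throws on empty input, malformed JSON, or a non-numeric score;
 * clamps the score into [0, 1].
 */
function parseJudgeResponse(raw: string | null | undefined): JudgeVerdict {
  if (!raw) {
    throw new Error("Judge returned an empty response");
  }

  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    throw new Error(`Judge returned invalid JSON: ${raw.slice(0, 100)}`);
  }

  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("Judge response is not a JSON object");
  }

  const { score, reasoning } = parsed as {
    score?: unknown;
    reasoning?: unknown;
  };
  if (typeof score !== "number" || !Number.isFinite(score)) {
    throw new Error("Judge response is missing a numeric score");
  }

  return {
    score: Math.min(1, Math.max(0, score)), // Clamp into [0, 1]
    reasoning: typeof reasoning === "string" ? reasoning : "",
  };
}
```

In `judge`, the bare `JSON.parse(...)` call could then be swapped for `parseJudgeResponse(response.choices[0].message.content)`, which also removes the need for the non-null assertion.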
### Strategy 3: