agent-evaluation
This Claude Code skill provides frameworks and tools for testing and benchmarking language model agents across behavioral, capability, and reliability dimensions. Use it to evaluate agent performance with statistical testing, set up benchmark comparisons on platforms such as AgentBench or Tau-bench, detect risky behavior in agent tool use, or establish production monitoring baselines for deployed agents.
git clone --depth 1 https://github.com/sickn33/antigravity-awesome-skills /tmp/agent-evaluation && cp -r /tmp/agent-evaluation/plugins/antigravity-awesome-skills-claude/skills/agent-evaluation ~/.claude/skills/agent-evaluation
# Agent Evaluation
Testing and benchmarking LLM agents, including behavioral testing, capability assessment, reliability metrics, and production monitoring. Even top agents achieve less than 50% on real-world benchmarks.
## Capabilities
- agent-testing
- benchmark-design
- capability-assessment
- reliability-metrics
- regression-testing
## Prerequisites
- Knowledge: Testing methodologies, Statistical analysis basics, LLM behavior patterns
- Skills_recommended: autonomous-agents, multi-agent-orchestration
- Required skills: testing-fundamentals, llm-fundamentals
## Scope
- Does_not_cover: Model training evaluation (loss, perplexity), Fairness and bias testing, User experience testing
- Boundaries: Focus is agent capability and reliability, Covers functional and behavioral testing
## Ecosystem
### Primary_tools
- AgentBench - Multi-environment benchmark for LLM agents (ICLR 2024)
- τ-bench (Tau-bench) - Sierra's real-world agent benchmark
- ToolEmu - Risky behavior detection for agent tool use
- LangSmith - LLM tracing and evaluation platform
### Alternatives
- Braintrust - LLM evaluation and monitoring. When: Need production monitoring integration
- PromptFoo - Prompt testing framework. When: Focus on prompt-level evaluation
### Deprecated
- Manual testing only
## Patterns
### Statistical Test Evaluation
Run tests multiple times and analyze result distributions
**When to use**: Evaluating stochastic agent behavior
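With only a handful of runs per test, the pass rate is a noisy estimate, so it helps to report an interval alongside it. The evaluator in this section uses a normal-approximation interval, which collapses to zero width when every run passes or every run fails. The sketch below shows one common alternative, the Wilson score interval. The function name `wilsonInterval` and the choice of this interval are assumptions for illustration, not something the skill prescribes.

```typescript
// Wilson score interval for a binomial proportion.
// z = 1.96 corresponds to roughly 95% coverage.
// Hypothetical helper: not part of the original skill.
function wilsonInterval(
  passes: number,
  runs: number,
  z: number = 1.96
): [number, number] {
  if (runs <= 0) return [0, 1]; // no data: the interval is uninformative

  const p = passes / runs;
  const z2 = z * z;
  const denom = 1 + z2 / runs;
  const center = (p + z2 / (2 * runs)) / denom;
  const halfWidth =
    (z / denom) * Math.sqrt((p * (1 - p)) / runs + z2 / (4 * runs * runs));

  return [Math.max(0, center - halfWidth), Math.min(1, center + halfWidth)];
}

// 10 passes out of 10 runs: the lower bound stays well below 1,
// reflecting how little 10 runs can actually prove.
const [lower, upper] = wilsonInterval(10, 10);
console.log(lower.toFixed(3), upper.toFixed(3));
```

For 10 passes in 10 runs this gives a lower bound of about 0.72, whereas the normal approximation would report [1, 1]. That difference matters when the pass rate is compared against a fixed threshold.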
    interface TestResult {
      testId: string;
      runId: string;
      passed: boolean;
      score: number; // 0-1 for partial credit
      latencyMs: number;
      tokensUsed: number;
      output: string;
      expectedBehaviors: string[];
      actualBehaviors: string[];
    }

    interface StatisticalAnalysis {
      passRate: number;
      confidence95: [number, number];
      meanScore: number;
      stdDevScore: number;
      meanLatency: number;
      p95Latency: number;
      behaviorConsistency: number;
    }
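The evaluator class calls `mean`, `stdDev`, and `percentile`, which this section does not define. A minimal sketch of what they might look like follows, written as free functions (in the class they would be private methods). The population standard deviation, the nearest-rank percentile, and the choice to return 0 for empty input are all assumptions, not the skill's confirmed behavior.

```typescript
// Hypothetical helpers: illustrative stand-ins for the undefined
// this.mean / this.stdDev / this.percentile methods.

// Arithmetic mean; returns 0 for an empty array (assumed convention).
function mean(values: number[]): number {
  if (values.length === 0) return 0;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Population standard deviation (divides by n, not n - 1).
function stdDev(values: number[]): number {
  if (values.length === 0) return 0;
  const m = mean(values);
  const variance = mean(values.map(v => (v - m) * (v - m)));
  return Math.sqrt(variance);
}

// Nearest-rank percentile: the smallest value such that at least
// p percent of the data is less than or equal to it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b); // copy, then numeric sort
  const rank = Math.ceil((p / 100) * sorted.length);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index];
}

const latencies = [120, 95, 310, 150, 101, 99, 480, 130, 115, 140];
console.log(mean(latencies), percentile(latencies, 95));
```

With only 10 runs per test, the nearest-rank p95 is simply the slowest run, so `p95Latency` is best read as a worst-case indicator until more runs are collected.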
    class StatisticalEvaluator {
      private readonly minRuns = 10;
      private readonly confidenceLevel = 0.95;

      async evaluateAgent(
        agent: Agent,
        testSuite: TestCase[]
      ): Promise<EvaluationReport> {
        const results: TestResult[] = [];

        // Run each test multiple times
        for (const test of testSuite) {
          for (let run = 0; run < this.minRuns; run++) {
            const result = await this.runTest(agent, test, run);
            results.push(result);
          }
        }

        // Analyze by test
        const byTest = this.groupByTest(results);
        const testAnalyses = new Map<string, StatisticalAnalysis>();
        for (const [testId, testResults] of byTest) {
          testAnalyses.set(testId, this.analyzeResults(testResults));
        }

        // Overall analysis
        const overall = this.analyzeResults(results);

        return {
          overall,
          byTest: testAnalyses,
          concerns: this.identifyConcerns(testAnalyses),
          recommendations: this.generateRecommendations(testAnalyses)
        };
      }

      private analyzeResults(results: TestResult[]): StatisticalAnalysis {
        const passes = results.filter(r => r.passed);
        const passRate = passes.length / results.length;

        // Calculate confidence interval for pass rate
        const z = 1.96; // 95% confidence
        const se = Math.sqrt((passRate * (1 - passRate)) / results.length);
        const confidence95: [number, number] = [
          Math.max(0, passRate - z * se),
          Math.min(1, passRate + z * se)
        ];

        const scores = results.map(r => r.score);
        const latencies = results.map(r => r.latencyMs);

        return {
          passRate,
          confidence95,
          meanScore: this.mean(scores),
          stdDevScore: this.stdDev(scores),
          meanLatency: this.mean(latencies),
          p95Latency: this.percentile(latencies, 95),
          behaviorConsistency: this.calculateConsistency(results)
        };
      }

      private calculateConsistency(results: TestResult[]): number {
        // How consistent are the behaviors across runs?
        // Computed as the mean pairwise Jaccard similarity of behavior sets.
        if (results.length < 2) return 1;

        const behaviorSets = results.map(r => new Set(r.actualBehaviors));
        let consistencySum = 0;
        let comparisons = 0;

        for (let i = 0; i < behaviorSets.length; i++) {
          for (let j = i + 1; j < behaviorSets.length; j++) {
            const intersection = new Set(
              [...behaviorSets[i]].filter(x => behaviorSets[j].has(x))
            );
            const union = new Set([...behaviorSets[i], ...behaviorSets[j]]);
            // Two empty behavior sets are identical; avoid 0 / 0 = NaN
            consistencySum += union.size === 0 ? 1 : intersection.size / union.size;
            comparisons++;
          }
        }

        return consistencySum / comparisons;
      }

      private identifyConcerns(analyses: Map<string, StatisticalAnalysis>): Concern[] {
        const concerns: Concern[] = [];

        for (const [testId, analysis] of analyses) {
          if (analysis.passRate < 0.8) {
            concerns.push({
              testId,
              type: 'low_pass_rate',
              severity: analysis.passRate < 0.5 ? 'critical' : 'high',
              message: `Pass rate ${(analysis.passRate * 100).toFixed(1)}% below threshold`
            });
          }

          if (analysis.behaviorConsistency < 0.7) {
            concerns.push({
              testId,
              type: 'inconsistent_behavior',
              severity: 'high',
              message: `Behavior consistency ${(analysis.behaviorConsistency * 100).toFixed(1)}% indicates unstable agent`
            });
          }

          if (analysis.stdDevScore > 0.3) {
            concerns.push({
              testId,
              type: 'high_variance',
              severity: 'medium',
              message: 'High score variance suggests unpr