eval
The eval command manages eval-driven development workflows by enabling users to define, execute, and track capability and regression tests for features. Use this command to create structured evaluation criteria in `.claude/evals/` files, run checks against those criteria, generate comprehensive reports with pass rates and metrics, and maintain an overview of all active feature evaluations across a project.
mkdir -p ~/.claude/commands && curl -fsSL https://raw.githubusercontent.com/sangrokjung/claude-forge/HEAD/commands/eval.md -o ~/.claude/commands/eval.md

eval.md
# Eval Command

Manage eval-driven development workflow.

## Usage

`/eval [define|check|report|list] [feature-name]`

## Define Evals

`/eval define feature-name`

Create a new eval definition:

1. Create `.claude/evals/feature-name.md` with template:

```markdown
## EVAL: feature-name
Created: $(date)

### Capability Evals
- [ ] [Description of capability 1]
- [ ] [Description of capability 2]

### Regression Evals
- [ ] [Existing behavior 1 still works]
- [ ] [Existing behavior 2 still works]

### Success Criteria
- pass@3 > 90% for capability evals
- pass^3 = 100% for regression evals
```

2. Prompt user to fill in specific criteria

## Check Evals

`/eval check feature-name`

Run evals for a feature:

1. Read eval definition from `.claude/evals/feature-name.md`
2. For each capability eval:
   - Attempt to verify criterion
   - Record PASS/FAIL
   - Log attempt in `.claude/evals/feature-name.log`
3. For each regression eval:
   - Run relevant tests
   - Compare against baseline
   - Record PASS/FAIL
4. Report current status:

```
EVAL CHECK: feature-name
========================
Capability: X/Y passing
Regression: X/Y passing
Status: IN PROGRESS / READY
```

## Report Evals

`/eval report feature-name`

Generate comprehensive eval report:

```
EVAL REPORT: feature-name
=========================
Generated: $(date)

CAPABILITY EVALS
----------------
[eval-1]: PASS (pass@1)
[eval-2]: PASS (pass@2) - required retry
[eval-3]: FAIL - see notes

REGRESSION EVALS
----------------
[test-1]: PASS
[test-2]: PASS
[test-3]: PASS

METRICS
-------
Capability pass@1: 67%
Capability pass@3: 100%
Regression pass^3: 100%

NOTES
-----
[Any issues, edge cases, or observations]

RECOMMENDATION
--------------
[SHIP / NEEDS WORK / BLOCKED]
```

## List Evals

`/eval list`

Show all eval definitions:

```
EVAL DEFINITIONS
================
feature-auth      [3/5 passing] IN PROGRESS
feature-search    [5/5 passing] READY
feature-export    [0/4 passing] NOT STARTED
```

## Arguments

$ARGUMENTS:
- `define <name>` - Create new eval definition
- `check <name>` - Run and check evals
- `report <name>` - Generate full report
- `list` - Show all evals
- `clean` - Remove old eval logs (keeps last 10 runs)
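The success criteria in the template rely on `pass@3` and `pass^3` without defining them. The sketch below shows one plausible reading, assuming `pass@k` means "at least one of the first k attempts passed" and `pass^k` means "all k attempts passed". The command file does not specify the log format, so the attempt data, the function names, and the `rate` helper here are hypothetical and only illustrate the arithmetic.

```python
from typing import Callable, Dict, List


def pass_at_k(attempts: List[bool], k: int) -> bool:
    """Assumed meaning of pass@k: at least one of the first k attempts passed."""
    return any(attempts[:k])


def pass_all_k(attempts: List[bool], k: int) -> bool:
    """Assumed meaning of pass^k: k attempts were made and every one passed."""
    return len(attempts) >= k and all(attempts[:k])


def rate(results: Dict[str, List[bool]], k: int,
         metric: Callable[[List[bool], int], bool]) -> float:
    """Percentage of evals that satisfy `metric` at the given k."""
    if not results:
        return 0.0
    hits = sum(1 for attempts in results.values() if metric(attempts, k))
    return 100.0 * hits / len(results)


# Hypothetical attempt history: one list of PASS (True) / FAIL (False) per eval.
capability = {
    "eval-1": [True],                 # passed on the first attempt
    "eval-2": [False, True],          # passed after one retry
    "eval-3": [False, False, False],  # never passed
}
regression = {
    "test-1": [True, True, True],
    "test-2": [True, True, True],
}

cap_pass_1 = rate(capability, 1, pass_at_k)
cap_pass_3 = rate(capability, 3, pass_at_k)
reg_pass_3 = rate(regression, 3, pass_all_k)

# Thresholds taken from the template's Success Criteria section.
meets_criteria = cap_pass_3 > 90 and reg_pass_3 == 100

print(f"Capability pass@1: {cap_pass_1:.0f}%")
print(f"Capability pass@3: {cap_pass_3:.0f}%")
print(f"Regression pass^3: {reg_pass_3:.0f}%")
print("Criteria met" if meets_criteria else "Criteria not met")
```

Under this reading, a regression eval that passes twice and fails once still counts as a failure for `pass^3`, which is why the template demands 100% there while tolerating retries for capability evals.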
Software architecture specialist for system design, scalability, and technical decision-making. Use PROACTIVELY when planning new features, refactoring large systems, or making architectural decisions.
Build and TypeScript error resolution specialist. Use PROACTIVELY when build fails or type errors occur. Fixes build/type errors only with minimal diffs, no architectural edits. Focuses on getting the build green quickly.
Expert code review specialist. Proactively reviews code for quality, security, and maintainability. Use immediately after writing or modifying code. MUST BE USED for all code changes.
PostgreSQL database specialist for query optimization, schema design, security, and performance. Use PROACTIVELY when writing SQL, creating migrations, designing schemas, or troubleshooting database performance. Incorporates Supabase best practices.
Documentation and codemap specialist. Use PROACTIVELY for updating codemaps and documentation. Runs /update-codemaps and /update-docs, generates docs/CODEMAPS/*, updates READMEs and guides.
End-to-end testing specialist using Vercel Agent Browser (preferred) with Playwright fallback. Use PROACTIVELY for generating, maintaining, and running E2E tests. Manages test journeys, quarantines flaky tests, uploads artifacts (screenshots, videos, traces), and ensures critical user flows work.
Expert planning specialist for complex features and refactoring. Use PROACTIVELY when users request feature implementation, architectural changes, or complex refactoring. Automatically activated for planning tasks.
Dead code cleanup and consolidation specialist. Use PROACTIVELY for removing unused code, duplicates, and refactoring. Runs analysis tools (knip, depcheck, ts-prune) to identify dead code and safely removes it.