Install in Claude Code
`git clone --depth 1 https://github.com/Evol-ai/SkillCompass /tmp/commands && cp -r /tmp/commands/commands/eval- ~/.claude/skills/commands`
Then open a new Claude Code session; the skill loads automatically.
Definition
eval-skill.md
# /eval-skill — Six-Dimension Evaluation
**🚀 Enhanced with Local Validators**: This command now uses local JavaScript validators for D1, D2, and D3 dimensions to significantly reduce token consumption while maintaining evaluation quality. Complex reasoning tasks (D4, D5, D6) continue to use LLM evaluation with local pre-analysis.
## Prerequisites
- **Recommended model: Claude Opus 4.6** (`claude-opus-4-6`). The 6-dimension rubric requires complex multi-dimensional reasoning, nuanced security analysis, and consistent scoring across dimensions. Sonnet and Haiku may produce inconsistent dimension scores, miss subtle security findings in D3, and generate unreliable D5 comparative assessments. If not using an Opus-class model, treat results as approximate.
## Arguments
- `<path>` (required): Path to the SKILL.md file to evaluate.
- `--scope [gate|target|full]` (optional, default: `full`): Evaluation scope.
  - `gate`: D1 + D3 only (~8K tokens). Outputs `"partial": true`.
  - `target --dimension D{N}`: specified dimension + D3 gate (~12K tokens). Outputs `"partial": true`.
  - `full`: all 6 dimensions (~40K tokens). Default behavior.
- `--dimension D{N}` (optional): Used with `--scope target` to specify which dimension.
- `--format [json|md|all]` (optional, default: `json`): Output format.
- `--feedback <path>` (optional): Path to a feedback signal JSON file.
- `--ci` (optional): CI-friendly mode. Suppresses interactive prompts, outputs JSON only, sets exit code (0=all PASS, 1=CAUTION, 2=FAIL).
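The `--ci` exit-code contract above can be sketched as a small mapping. This is only an illustration: the helper name `exitCodeForVerdicts` is hypothetical, and the "worst verdict wins" rule for mixed results is an assumption, since the source only lists the three codes.

```javascript
// Hypothetical helper: maps a list of verdicts to the exit codes listed for --ci.
// Assumption: the worst verdict wins (any FAIL -> 2, else any CAUTION -> 1).
function exitCodeForVerdicts(verdicts) {
  if (verdicts.includes('FAIL')) return 2;
  if (verdicts.includes('CAUTION')) return 1;
  return 0; // all PASS
}

console.log(exitCodeForVerdicts(['PASS', 'PASS']));    // 0
console.log(exitCodeForVerdicts(['PASS', 'CAUTION'])); // 1
console.log(exitCodeForVerdicts(['CAUTION', 'FAIL'])); // 2
```

A CI wrapper could pass the returned value to `process.exit()` once the JSON output has been written.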
## Error Handling
- **File not found**: Stop immediately. Output `"Error: File not found: {path}"` (translate at display time).
- **Not a SKILL.md**: Warn `"Warning: filename is not SKILL.md — continuing with evaluation."` if applicable.
- **YAML malformed**: Warn `"Warning: YAML frontmatter is malformed."`, set D1 frontmatter_sub = 0, continue with remaining checks.
## Steps
### Step 1: Load Target
Parse arguments. Check current model — if not an Opus-class model, output this warning (translate to the session locale at display time):
```
⚠ Warning: Current model is {model_name}. For reliable 6D evaluation, Claude Opus 4.6 is recommended. Results may be less consistent with other models.
```
Continue with evaluation regardless.
Use the **Read** tool to load the target SKILL.md file. Parse YAML frontmatter.
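The frontmatter step can be illustrated with a minimal splitter. This is a sketch under stated assumptions, not the project's parser: `splitFrontmatter` is a hypothetical name, it only understands flat `key: value` lines, and a real YAML library would be needed for nested or quoted values. The `malformed` flag is where the "YAML malformed" warning from Error Handling would be raised.

```javascript
// Hypothetical helper: splits a SKILL.md string into frontmatter fields and body.
// It only handles flat `key: value` lines; it is not a YAML parser.
function splitFrontmatter(text) {
  const match = /^---\r?\n([\s\S]*?)\r?\n---\r?\n?([\s\S]*)$/.exec(text);
  if (!match) {
    // No delimited block found: report it so the caller can warn and continue.
    return { malformed: true, fields: {}, body: text };
  }
  const fields = {};
  for (const line of match[1].split(/\r?\n/)) {
    if (line.trim() === '') continue;
    const idx = line.indexOf(':');
    if (idx === -1) return { malformed: true, fields: {}, body: match[2] };
    fields[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  return { malformed: false, fields, body: match[2] };
}

// Sample input invented for illustration.
const sample = '---\nname: demo-skill\ndescription: Example only\n---\n# Body\n';
console.log(splitFrontmatter(sample));
```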
### Step 2: Pre-Processing Analysis
**Local Optimization**: Run basic analysis to inform evaluation strategy and reduce token consumption:
1. Execute `node -e "const {BasicValidator} = require('./lib/basic-validator.js'); const basic = new BasicValidator().validateBasics('{skillPath}'); console.log(JSON.stringify(basic, null, 2));"` using the **Bash** tool
2. Extract skill type (`atom`/`composite`/`meta`), trigger type, complexity, and quality indicators
3. Use results to optimize subsequent evaluation steps: simple skills with clear issues can use local validation only
### Step 3: Detect Types
Determine the skill type and trigger type from the Step 2 pre-processing results. If those results are unavailable, fall back to parsing the frontmatter and applying the detection rules.
### Step 4: Load Config
Use the **Read** tool to load `.skill-compass/config.json` if it exists. Extract `user_locale`. If file doesn't exist, use defaults (`user_locale: null`).
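The load-or-default behavior can be sketched in Node.js as follows. The function name `loadConfig` and the `rootDir` parameter are hypothetical, and merging missing keys from the defaults is an assumption; the source only states that `user_locale` defaults to `null` when the file is absent.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: reads .skill-compass/config.json under `rootDir`.
// Returns the defaults when the file does not exist.
function loadConfig(rootDir) {
  const defaults = { user_locale: null };
  const configPath = path.join(rootDir, '.skill-compass', 'config.json');
  if (!fs.existsSync(configPath)) return defaults;
  const parsed = JSON.parse(fs.readFileSync(configPath, 'utf8'));
  // Assumption: keys missing from the file fall back to the defaults.
  return { ...defaults, ...parsed };
}

console.log(loadConfig(process.cwd()).user_locale);
```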
### Step 5: Load Scoring Rules
Use the **Read** tool to load `{baseDir}/shared/scoring.md`. This provides dimension names, weights, formula, verdict rules, and security gate.
### Step 6: Determine Evaluation Scope
Based on `--scope`:
- **gate**: evaluate only D1 (Step 7) and D3 (Step 8). Skip Steps 9-12.
- **target**: evaluate D3 (Step 8), the dimension given by `--dimension`, and D4, which is always included because of its 30% weight. Skip the other dimensions.
- **full**: evaluate all dimensions (Steps 7-12). Default.
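The scope rules above can be expressed as a small selection function. This is a minimal sketch: `dimensionsForScope` and its error messages are hypothetical and not taken from SkillCompass.

```javascript
const ALL_DIMENSIONS = ['D1', 'D2', 'D3', 'D4', 'D5', 'D6'];

// Hypothetical helper: returns the dimensions to evaluate for a given scope.
function dimensionsForScope(scope = 'full', dimension) {
  switch (scope) {
    case 'gate':
      return ['D1', 'D3'];
    case 'target': {
      if (!ALL_DIMENSIONS.includes(dimension)) {
        throw new Error('--scope target requires --dimension D1..D6');
      }
      // D3 (gate) and D4 are always included; the Set removes duplicates.
      const selected = new Set(['D3', 'D4', dimension]);
      return ALL_DIMENSIONS.filter((d) => selected.has(d));
    }
    case 'full':
      return [...ALL_DIMENSIONS];
    default:
      throw new Error(`Unknown scope: ${scope}`);
  }
}

console.log(dimensionsForScope('gate'));         // [ 'D1', 'D3' ]
console.log(dimensionsForScope('target', 'D5')); // [ 'D3', 'D4', 'D5' ]
console.log(dimensionsForScope('target', 'D4')); // [ 'D3', 'D4' ]
```

Filtering `ALL_DIMENSIONS` through the set keeps the result in dimension order regardless of which dimension was requested.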
### Step 7: Evaluate D1 (Structure)
*Scope: gate, full, or target when dimension=D1.*
**Enhanced Local Processing**: First run local validation to reduce token consumption:
1. Execute `node -e "const {StructureValidator} = require('./lib/structure-validator.js'); const result = new StructureValidator().validate('{skillPath}'); console.log(JSON.stringify(result, null, 2));"` using the **Bash** tool
2. If local validation finds errors, use those results directly
3. For borderline cases (score 5-7), supplement with LLM evaluation using `{baseDir}/prompts/d1-structure.md`
4. Record combined JSON result with `"tools_used": ["local", "llm"]` or `["local"]`
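The decision in steps 2 to 4 can be sketched as below. The shape of the local validator result is an assumption: this sketch supposes it exposes an `errors` array and a numeric `score`, and the helper name `planD1Evaluation` is hypothetical.

```javascript
// Hypothetical helper: decides which evaluators contribute to the D1 result.
// Assumes the local result has `errors` (array) and `score` (number).
function planD1Evaluation(localResult) {
  const hasErrors =
    Array.isArray(localResult.errors) && localResult.errors.length > 0;
  // Local errors are used directly, without an LLM pass.
  if (hasErrors) return { tools_used: ['local'], needsLlm: false };
  // Borderline scores (5-7) are supplemented with LLM evaluation.
  const borderline = localResult.score >= 5 && localResult.score <= 7;
  return borderline
    ? { tools_used: ['local', 'llm'], needsLlm: true }
    : { tools_used: ['local'], needsLlm: false };
}

console.log(planD1Evaluation({ errors: [], score: 6 }));
console.log(planD1Evaluation({ errors: ['missing name'], score: 6 }));
console.log(planD1Evaluation({ errors: [], score: 9 }));
```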
### Step 8: Evaluate D3 (Security — Gate)
*Scope: always evaluated (all scopes).*
**Enhanced Local Processing**: Run comprehensive local security validation:
1. Execute `node -e "const {SecurityValidator} = require('./lib/security-validator.js'); const result = new SecurityValidator().validate('{skillPath}'); console.log(JSON.stringify(result, null, 2));"` using the **Bash** tool
2. Run pre-evaluation scan: `node "{baseDir}/hooks/scripts/pre-eval-scan.js" "{skillPath}"` using the **Bash** tool
3. If local validation detects Critical findings, set `gate_failed = true` and use local results
4. For L1/L2 supplementation: use the **Read** tool to load `{baseDir}/shared/tool-instructions.md` and follow detection procedures only if local validation passes
5. Merge findings with `"tools_used": ["local", "pre-eval-scan", ...]` and prioritize Critical findings from any source
**Post-LLM Score Override**: The final D3 score is computed mechanically from the merged findings list, not from the LLM's subjective assessment. After merging all findings (local + LLM):
1. Apply the **D3 Findings-to-Score Mapping** from `shared/scoring.md` — compute score from finding severities
2. If any finding is critical: `score = 0, pass = false` (gate fail)
3. If the mapped score differs from the LLM's score, **override** and log: `"score_llm_raw": {original}, "score_findings_mapped": {mapped}, "score_overridden": true`
This prevents the known failure mode where the LLM sees low-severity findings but assigns a disproportionately low score.
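The override logic can be sketched as follows. The actual D3 Findings-to-Score Mapping lives in `shared/scoring.md` and is not reproduced here, so this sketch takes it as a parameter (`mapFindingsToScore`). The function name `finalizeD3`, the `severity` field, and the demo mapping are all hypothetical.

```javascript
// Hypothetical sketch of the post-LLM override. `mapFindingsToScore` stands in
// for the D3 Findings-to-Score Mapping defined in shared/scoring.md.
function finalizeD3(findings, llmScore, mapFindingsToScore) {
  // Assumption: each finding carries a lowercase `severity` string.
  const hasCritical = findings.some((f) => f.severity === 'critical');
  const mapped = hasCritical ? 0 : mapFindingsToScore(findings);
  const result = { score: mapped, gate_failed: hasCritical };
  if (hasCritical) result.pass = false; // gate fail
  if (mapped !== llmScore) {
    // Log the override so the raw LLM score stays auditable.
    result.score_llm_raw = llmScore;
    result.score_findings_mapped = mapped;
    result.score_overridden = true;
  }
  return result;
}

// Placeholder mapping for demonstration only; the real values are in scoring.md.
const demoMapping = (findings) => Math.max(0, 10 - findings.length);

console.log(finalizeD3([{ severity: 'low' }], 3, demoMapping));
console.log(finalizeD3([{ severity: 'critical' }], 6, demoMapping));
```

Passing the mapping in as a function keeps the sketch independent of the scoring table, which may change between versions.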