skill-judge
Skill Judge evaluates Agent Skill design quality by assessing SKILL.md files against official specifications and established patterns from 17+ examples. Use this when auditing, reviewing, or improving skill packages to ensure they provide genuine expert knowledge delta rather than explaining concepts Claude already understands. It delivers multi-dimensional scoring and actionable recommendations for enhancement.
git clone --depth 1 https://github.com/softaworks/agent-toolkit /tmp/skill-judge && cp -r /tmp/skill-judge/dist/plugins/skill-judge/skills/skill-judge ~/.claude/skills/skill-judge
SKILL.md
# Skill Judge
Evaluate Agent Skills against official specifications and patterns derived from 17+ official examples.
---
## Core Philosophy
### What is a Skill?
A Skill is NOT a tutorial. A Skill is a **knowledge externalization mechanism**.
Traditionally, an AI model's knowledge is locked in its parameters. Teaching it new capabilities means retraining:
```
Traditional: Collect data → GPU cluster → Train → Deploy new version
Cost: $10,000 - $1,000,000+
Timeline: Weeks to months
```
Skills change this:
```
Skill: Edit SKILL.md → Save → Takes effect on next invocation
Cost: $0
Timeline: Instant
```
This is the paradigm shift from "training AI" to "educating AI" — like a hot-swappable LoRA adapter that requires no training. You edit a Markdown file in natural language, and the model's behavior changes.
### The Core Formula
> **Good Skill = Expert-only Knowledge − What Claude Already Knows**
A Skill's value is measured by its **knowledge delta** — the gap between what it provides and what the model already knows.
- **Expert-only knowledge**: Decision trees, trade-offs, edge cases, anti-patterns, domain-specific thinking frameworks — things that take years of experience to accumulate
- **What Claude already knows**: Basic concepts, standard library usage, common programming patterns, general best practices
When a Skill explains "what is PDF" or "how to write a for-loop", it is repeating knowledge Claude already has. This is **token waste**: the context window is a shared resource that system prompts, conversation history, other Skills, and user requests all compete for.
### Tool vs Skill
| Concept | Essence | Function | Example |
|---------|---------|----------|---------|
| **Tool** | What model CAN do | Execute actions | bash, read_file, write_file, WebSearch |
| **Skill** | What model KNOWS how to do | Guide decisions | PDF processing, MCP building, frontend design |
Tools define capability boundaries: without the bash tool, the model can't execute commands.
Skills inject knowledge: without the frontend-design Skill, the model produces generic UI.
**The equation**:
```
General Agent + Excellent Skill = Domain Expert Agent
```
The same Claude model, with different Skills loaded, becomes a different expert.
### Three Types of Knowledge in Skills
When evaluating, categorize each section:
| Type | Definition | Treatment |
|------|------------|-----------|
| **Expert** | Claude genuinely doesn't know this | Must keep — this is the Skill's value |
| **Activation** | Claude knows but may not think of | Keep if brief — serves as reminder |
| **Redundant** | Claude definitely knows this | Should delete — wastes tokens |
The art of Skill design is maximizing Expert content, using Activation sparingly, and eliminating Redundant ruthlessly.
---
## Evaluation Dimensions (120 points total)
### D1: Knowledge Delta (20 points) — THE CORE DIMENSION
The most important dimension. Does the Skill add genuine expert knowledge?
| Score | Criteria |
|-------|----------|
| 0-5 | Explains basics Claude knows (what is X, how to write code, standard library tutorials) |
| 6-10 | Mixed: some expert knowledge diluted by obvious content |
| 11-15 | Mostly expert knowledge with minimal redundancy |
| 16-20 | Pure knowledge delta — every paragraph earns its tokens |
**Red flags** (instant score ≤5):
- "What is [basic concept]" sections
- Step-by-step tutorials for standard operations
- Explaining how to use common libraries
- Generic best practices ("write clean code", "handle errors")
- Definitions of industry-standard terms
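The D1 bands and the red-flag rule above can be read as a small scoring helper. The following is a minimal Python sketch and not part of the skill itself: the names `apply_red_flag_cap` and `d1_band` are hypothetical, and treating any single red flag as a hard cap of 5 is one assumed reading of "instant score ≤5".

```python
# Hypothetical helper names; the band boundaries and the cap value are taken
# from the D1 table and the red-flag rule above.
D1_BANDS = [
    (0, 5, "Explains basics Claude knows"),
    (6, 10, "Mixed: some expert knowledge diluted by obvious content"),
    (11, 15, "Mostly expert knowledge with minimal redundancy"),
    (16, 20, "Pure knowledge delta"),
]
RED_FLAG_CAP = 5


def apply_red_flag_cap(raw_score: int, red_flags: list[str]) -> int:
    """Cap the D1 score at 5 when at least one red flag was found (assumed reading)."""
    if not 0 <= raw_score <= 20:
        raise ValueError("D1 score must be between 0 and 20")
    return min(raw_score, RED_FLAG_CAP) if red_flags else raw_score


def d1_band(score: int) -> str:
    """Return the criteria text of the band that contains the score."""
    for low, high, criteria in D1_BANDS:
        if low <= score <= high:
            return criteria
    raise ValueError("D1 score must be between 0 and 20")


# A Skill that would otherwise score 14 but contains one red flag.
score = apply_red_flag_cap(14, ["'What is a PDF' section"])
print(score, "->", d1_band(score))
```

Keeping the cap separate from the band lookup makes it easy to see how much a single redundant section costs: the same Skill without the red flag would stay in the 11-15 band.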
**Green flags** (indicators of high knowledge delta):
- Decision trees for non-obvious choices ("when X fails, try Y because Z")
- Trade-offs only an expert would know ("A is faster but B handles edge case C")
- Edge cases from real-world experience
- "NEVER do X because [non-obvious reason]"
- Domain-specific thinking frameworks
**Evaluation questions**:
1. For each section, ask: "Does Claude already know this?"
2. If explaining something, ask: "Is this explaining TO Claude or FOR Claude?"
3. Count paragraphs that are Expert vs Activation vs Redundant
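Question 3 amounts to a tally. As a rough Python sketch: the three labels come from the knowledge-type table earlier in this document, while the `tally_knowledge_types` helper and the sample labels are illustrative assumptions. Assigning a label to each paragraph remains a judgment call that this code does not attempt.

```python
from collections import Counter

KNOWLEDGE_TYPES = ("Expert", "Activation", "Redundant")


def tally_knowledge_types(labels):
    """Count paragraph labels and return each type's count and share of the total."""
    unknown = set(labels) - set(KNOWLEDGE_TYPES)
    if unknown:
        raise ValueError(f"Unknown knowledge type(s): {sorted(unknown)}")
    counts = Counter(labels)
    total = len(labels)
    return {
        kind: {
            "count": counts[kind],
            "share": counts[kind] / total if total else 0.0,
        }
        for kind in KNOWLEDGE_TYPES
    }


# Made-up labels for a five-paragraph Skill, one label per paragraph.
labels = ["Expert", "Expert", "Activation", "Redundant", "Expert"]
for kind, stats in tally_knowledge_types(labels).items():
    print(f"{kind}: {stats['count']} ({stats['share']:.0%})")
```

A high Redundant share points toward the lower D1 bands, but the mapping from shares to a score is left to the evaluator.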
---
### D2: Mindset + Appropriate Procedures (15 points)
Does the Skill transfer expert **thinking patterns** along with **necessary domain-specific procedures**?
The difference between experts and novices isn't "knowing how to operate" — it's "how to think about the problem." But thinking patterns alone aren't enough when Claude lacks domain-specific procedural knowledge.
**Key distinction**:
| Type | Example | Value |
|------|---------|-------|
| **Thinking patterns** | "Before designing, ask: What makes this memorable?" | High — shapes decision-making |
| **Domain-specific procedures** | "OOXML workflow: unpack → edit XML → validate → pack" | High — Claude may not know this |
| **Generic procedures** | "Step 1: Open file, Step 2: Edit, Step 3: Save" | Low — Claude already knows |
| Score | Criteria |
|-------|----------|
| 0-3 | Only generic procedures Claude already knows |
| 4-7 | Has domain procedures but lacks thinking frameworks |
| 8-11 | Good balance: thinking patterns + domain-specific workflows |
| 12-15 | Expert-level: shapes thinking AND provides procedures Claude wouldn't know |
**What counts as valuable procedures**:
- Workflows Claude hasn't been trained on (new tools, proprietary systems)
- Correct ordering that's non-obvious (e.g., "validate BEFORE packing, not after")
- Critical steps that are easy to miss (e.g., "MUST recalculate formulas after editing")
- Domain-specific sequences (e.g., MCP server's 4-phase development process)
**What counts as redundant procedures**:
- Generic file operations (open, read, write, save)
- Standard programming patterns (loops, conditionals, error handling)
- Common library usage that's well-documented
**Expert thinking patterns look like**:
```markdown
Before [action], ask yourself:
- **Purpose**: What problem does this solve? Who uses it?
- **Constraints**: What are the hidden requirements?
- **Differentiation**: What makes
```