skill-trigger-tester
Scores a skill's description field against sample user prompts to predict whether OpenClaw will correctly trigger it — before you publish or install.
git clone --depth 1 https://github.com/ArchieIndian/openclaw-superpowers /tmp/skill-trigger-tester && cp -r /tmp/skill-trigger-tester/skills/core/skill-trigger-tester ~/.claude/skills/skill-trigger-tester
# Skill Trigger Tester
## What it does
OpenClaw maps user intent to skills by matching the user's message against each skill's `description:` field. A weak description means your skill silently never fires. A description that's too broad means it fires when it shouldn't.
Skill Trigger Tester helps you validate the trigger quality of a skill's description before publishing. You give it:
- The description string you're testing
- A set of "should fire" prompts (true positives)
- A set of "should not fire" prompts (true negatives)
It reports precision and recall, assigns an overall trigger quality grade (A–F), and gives actionable suggestions.
## When to invoke
- Before publishing any new skill to ClawHub
- When a skill you expect to trigger isn't firing
- When a skill keeps firing on irrelevant prompts
- Inside the `create-skill` workflow (Step 5: validation)
## Scoring model
The tool uses a keyword + semantic overlap heuristic against the description field:
| Metric | Meaning |
|---|---|
| **Recall** | % of "should fire" prompts that would match |
| **Precision** | % of matched prompts that come from the "should fire" set |
| **F1** | Harmonic mean of recall and precision |
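As an illustration of how these three metrics relate, here is a minimal sketch that derives them from per-prompt match results. The function name `trigger_metrics` and the list-of-booleans input are assumptions made for this example, not the tool's actual internals.

```python
def trigger_metrics(fire_matches, no_fire_matches):
    """Return (precision, recall, f1) from per-prompt match results.

    fire_matches: list of bools, one per "should fire" prompt.
    no_fire_matches: list of bools, one per "should not fire" prompt.
    (Hypothetical helper; the real scorer may structure this differently.)
    """
    true_pos = sum(fire_matches)                # correct triggers
    false_neg = len(fire_matches) - true_pos    # missed triggers
    false_pos = sum(no_fire_matches)            # unwanted triggers

    matched = true_pos + false_pos
    precision = true_pos / matched if matched else 0.0
    recall = true_pos / (true_pos + false_neg) if fire_matches else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# 4 of 5 "should fire" prompts match; 1 of 5 "should not fire" prompts
# matches by mistake.
precision, recall, f1 = trigger_metrics(
    [True, True, True, True, False],
    [True, False, False, False, False],
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With these inputs all three values come out at 0.80, since the four true positives are divided by five in both ratios.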
Grade thresholds:
| Grade | F1 |
|---|---|
| A | ≥ 0.85 |
| B | ≥ 0.70 |
| C | ≥ 0.55 |
| D | ≥ 0.40 |
| F | < 0.40 |
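The threshold table can be read as a simple lookup from highest grade to lowest. The sketch below shows one way to express it; `grade_for_f1` is a hypothetical name used only for illustration.

```python
def grade_for_f1(f1):
    """Map an F1 score in [0, 1] to a letter grade using the table's cut-offs."""
    # Checked from the highest threshold down; the first one met wins.
    thresholds = [(0.85, "A"), (0.70, "B"), (0.55, "C"), (0.40, "D")]
    for minimum, grade in thresholds:
        if f1 >= minimum:
            return grade
    return "F"


for score in (0.90, 0.80, 0.55, 0.39):
    print(score, grade_for_f1(score))
```

Each threshold is inclusive, so an F1 of exactly 0.55 lands in grade C rather than D.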
## How to use
```bash
python3 test.py --description "Diagnoses skill discovery failures" \
--should-fire "why isn't my skill loading" \
"my skill disappeared from the registry" \
"check if my skills are healthy" \
--should-not-fire "write a skill" \
"install superpowers" \
"review my code"
python3 test.py --file skill-spec.yaml # Load test cases from YAML file
python3 test.py --format json # Machine-readable output
```
## Test spec file format
```yaml
description: "Diagnoses skill discovery failures — YAML parse errors, path violations"
should_fire:
- "why isn't my skill loading"
- "my skill disappeared"
- "check skill health"
should_not_fire:
- "write a new skill"
- "install openclaw"
```
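Before running the scorer on a spec file, it can help to check the file's shape. The sketch below assumes the YAML has already been parsed into a dict (for example with PyYAML's `yaml.safe_load`) and that all three keys are required; both the `validate_spec` helper and that requirement are assumptions for illustration, not documented behavior of `test.py`.

```python
def validate_spec(spec):
    """Return a list of problems found in a parsed test spec (empty if none).

    Hypothetical helper: assumes description, should_fire, and
    should_not_fire are all required.
    """
    problems = []
    description = spec.get("description")
    if not isinstance(description, str) or not description.strip():
        problems.append("description must be a non-empty string")
    for key in ("should_fire", "should_not_fire"):
        prompts = spec.get(key)
        if not isinstance(prompts, list) or not prompts:
            problems.append(f"{key} must be a non-empty list")
        elif not all(isinstance(p, str) and p.strip() for p in prompts):
            problems.append(f"{key} must contain only non-empty strings")
    return problems


spec = {
    "description": "Diagnoses skill discovery failures",
    "should_fire": ["why isn't my skill loading", "my skill disappeared"],
    "should_not_fire": ["write a new skill", "install openclaw"],
}
print(validate_spec(spec))                  # well-formed spec
print(validate_spec({"description": ""}))   # every field missing or empty
```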
## Procedure
**Step 1 — Write your test cases**
For each skill you're testing, list 3–5 prompts that should trigger it and 3–5 that should not. Include borderline prompts on both sides, not only the easy cases.
**Step 2 — Run the scorer**
```bash
python3 test.py --description "<your description>" \
--should-fire "..." --should-not-fire "..."
```
**Step 3 — Interpret results**
- Grade A/B: description is well-calibrated. Publish.
- Grade C: borderline — add more specific keywords to the description, or narrow the wording.
- Grade D/F: description is too vague or uses jargon the user won't say. Rewrite and retest.
**Step 4 — Iterate**
Try alternative descriptions and compare scores side-by-side using `--compare`.
**Step 5 — Add test file to the skill directory**
Commit the `trigger-tests.yaml` spec alongside the skill. Future contributors can run it to verify trigger quality hasn't regressed.
## Common mistakes
- **Too generic**: `"Helps with skills"` — will either never fire or fire on everything
- **Technical jargon**: `"Validates SKILL.md frontmatter schema coherence"` — users don't say this
- **Action + object only**: `"Creates skills"` — add when/why context
- **Missing synonyms**: If users might say "check" or "verify" or "test", the description needs to capture the semantic range
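To see why missing synonyms hurt, consider a plain keyword-overlap check. This is a simplified stand-in that covers only the keyword half of the heuristic and is not the tool's actual matcher; the stopword list, the `keywords` and `overlaps` helpers, and the one-shared-word threshold are all assumptions.

```python
import re

# Small illustrative stopword list (an assumption, not the tool's list).
STOPWORDS = {"a", "an", "the", "my", "is", "if", "are", "of", "to", "from", "why", "with"}


def keywords(text):
    """Lowercase word tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOPWORDS}


def overlaps(description, prompt, min_shared=1):
    """True if the prompt shares at least min_shared keywords with the description."""
    return len(keywords(description) & keywords(prompt)) >= min_shared


prompt = "check if my skills are healthy"
narrow = "Validates skill health"
broad = "Checks, verifies, or tests whether skills are healthy"

print(overlaps(narrow, prompt))  # no shared exact words
print(overlaps(broad, prompt))   # shares "skills" and "healthy"
```

Because this sketch compares exact words, "skill" does not match "skills" and "health" does not match "healthy", so the narrow description misses a prompt that a user would plausibly type. Wording the description with the forms users actually say closes that gap.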
## Output example
```
Skill Trigger Quality — skill-doctor
─────────────────────────────────────────────
Description: "Diagnoses silent skill discovery failures..."
Should fire (5 prompts): 4 / 5 matched recall = 0.80
Should not fire (5 prompts): 1 / 5 matched precision = 0.80
F1 score: 0.80 Grade: B
⚠ 1 false negative: "check if skills are healthy"
Suggestion: Add "healthy", "health", or "check" to the description.
⚠ 1 false positive: "check my code"
Suggestion: Narrow description to avoid generic "check" overlap.
```