Skip to main content
ClaudeWave
Skill440 repo starsupdated yesterday

skill-monitor

skill-monitor analyzes Claude Code skill effectiveness over time by parsing session metrics from metrics.jsonl, computing per-skill aggregates like action rate and error frequency, and identifying degrading skills. Use this skill to monitor which skills are working well across sessions, deep-dive into specific skill performance, generate improvement recommendations, or adjust the time window for trend analysis from the default seven days to thirty days or longer.

Install in Claude Code
Copy
git clone --depth 1 https://github.com/oliver-kriska/claude-elixir-phoenix /tmp/skill-monitor && cp -r /tmp/skill-monitor/.claude/skills/skill-monitor ~/.claude/skills/skill-monitor
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# Skill Monitor

Closed-loop skill effectiveness monitoring. Reads session metrics,
computes per-skill signals, identifies what's working and what needs
improvement.

Inspired by the deploy-monitor-evaluate-improve feedback loop:
skills get better over time instead of staying static.

## Requirements

Requires `.claude/session-metrics/metrics.jsonl` from `/session-scan`.
If no data: suggest running `/session-scan` first.

## Usage

```
/skill-monitor                     # Dashboard: all skills
/skill-monitor --skill review      # Deep-dive on one skill
/skill-monitor --improve           # Generate improvement recommendations
/skill-monitor --window 30d        # Change comparison window (default: 7d)
```

## What Main Context Does

### Step 1: Parse Arguments

Extract from `$ARGUMENTS`:

- **`--skill NAME`**: Focus on one skill (e.g., `review`, `plan`, `investigate`)
- **`--improve`**: Spawn analysis agent for improvement recommendations
- **`--window PERIOD`**: Comparison window (`7d`, `30d`, `all`; default: `7d`)

### Step 2: Load Metrics

Read `.claude/session-metrics/metrics.jsonl`. For each entry, extract
the `skill_effectiveness` field (added by compute-metrics.py v2).

Filter by window period. Count sessions with and without skill usage.

If no `skill_effectiveness` data exists in metrics: "Metrics were
computed before skill tracking was added. Run `/session-scan --rescan`
to recompute."

**OTel `invocation_trigger` (CC v2.1.126+)**: when `compute-metrics.py`
ingests `claude_code.skill_activated` events, each invocation carries
an `invocation_trigger` of `"user-slash"`, `"claude-proactive"`, or
`"nested-skill"`. If absent (older sessions), default to
`"unknown"` — do NOT assume `"user-slash"`.

### Step 3: Compute Per-Skill Aggregates

For each skill found across all sessions, aggregate:

```
| Metric                  | Computation                                    |
|-------------------------|------------------------------------------------|
| Total invocations       | Sum of invocation_count across sessions        |
| Sessions used in        | Count of sessions containing this skill        |
| Action rate             | Weighted avg of per-session action_rate         |
| Avg post-errors         | Weighted avg of avg_post_errors                |
| Avg post-corrections    | Weighted avg of avg_post_corrections           |
| Outcome distribution    | Count of effective/friction/no_action/mixed    |
| Effectiveness score     | action_rate - (0.3 * avg_post_corrections)     |
| Adjusted score          | For analysis/check skills, use lower thresholds |
| Trigger distribution    | Counts of user-slash / claude-proactive / nested-skill / unknown |
| Proactive trigger rate  | claude-proactive / (user-slash + claude-proactive + nested-skill) |
| Auto-load gap           | Skills with 0 claude-proactive invocations across window |
```

**Auto-load gap detection (CC v2.1.126+)**: Skills with `auto-loaded`
behavior in their description (i.e., not `disable-model-invocation: true`)
are EXPECTED to fire as `claude-proactive`. A skill that is ONLY ever
invoked via `user-slash` is failing its description's routing intent.
Flag any auto-loadable skill where `proactive_trigger_rate == 0` over
the window. This is the structural answer to the "zero skill
auto-loading" gap from the 137-session analysis (see MEMORY.md).
**Confidence floor**: only flag if total invocations >= 5 in window.

**Skill type weighting**: Analysis and check skills (verify, triage,
perf, boundaries, pr-review, audit) have low action rates BY DESIGN —
their success is "found issues" or "confirmed things pass". Apply
adjusted thresholds:

| Skill Type | Flag Threshold | Expected Action Rate |
|------------|---------------|---------------------|
| Execution (work, quick, full) | < 0.5 | > 0.7 |
| Analysis (perf, boundaries, audit, pr-review) | < 0.3 | 0.3-0.5 |
| Check (verify, triage) | < 0.1 | 0.0-0.3 |
| Knowledge (compound, learn, brief) | < 0.5 | > 0.5 |

Also compute **baseline friction** (avg friction of sessions WITHOUT
any skill usage) vs **skill friction** (avg friction of sessions
WITH skill usage). Delta = skill_friction - baseline_friction.
Negative delta = skills reduce friction (good).

### Step 4: Display Dashboard

**Dashboard mode** (no `--skill`):

```
## Skill Effectiveness Dashboard (last {window})

Baseline friction (no skills): 0.32 | With skills: 0.18 | Delta: -0.14

| Skill           | Uses | Sessions | Slash/Proactive/Nested | Action% | Errors | Corr | Outcome   | Score |
|-----------------|------|----------|------------------------|---------|--------|------|-----------|-------|
| /phx:review     | 12   | 8        |    8 /  3 /  1         | 92%     | 0.5    | 0.1  | effective | 0.89  |
| /phx:plan       | 9    | 7        |    9 /  0 /  0         | 100%    | 0.2    | 0.0  | effective | 1.00  |
| /phx:investigate| 5    | 5        |    5 /  0 /  0         | 80%     | 1.2    | 0.4  | mixed     | 0.68  |

Skills needing attention:
- /phx:investigate (high post-errors)
- /phx:plan (auto-load gap — 0/9 proactive; description not routing)
```

Flag skills using type-adjusted thresholds (see weighting table above).
Also flag if avg_post_corrections > 1 or outcome is predominantly "friction".
**Also flag auto-load gap**: auto-loadable skills (without
`disable-model-invocation: true`) with proactive_trigger_rate == 0 and
total invocations >= 5. This is a description/routing problem — the skill
exists but Claude isn't loading it on its own.

When displaying flagged skills, note if the flag is "expected" for the
skill type (e.g., verify at 0.24 is normal for a check skill).

**Skill deep-dive** (`--skill NAME`):

Show per-session breakdown for that skill, including session IDs,
dates, individual outcome signals, AND `invocation_trigger` per
invocation. If a skill is dominated by `user-slash` triggers, surface
which 1-3 description keywords might unlock proactive routing —
cross-reference against the skill's current de