Skill66 repo starsupdated 29d ago

agent-health

Reads production/traces/agent-metrics.jsonl and displays a per-agent performance summary table for the current or a specified session. Highlights agents with high error rates or OPEN circuit breaker state.

View source Repository: software_development_department

Install in Claude Code

Copy

git clone --depth 1 https://github.com/tranhieutt/software_development_department /tmp/agent-health && cp -r /tmp/agent-health/.claude/skills/agent-health ~/.claude/skills/agent-health

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Agent Health

Display a performance summary table from `production/traces/agent-metrics.jsonl`,
cross-referenced with `production/session-state/circuit-state.json` for live
circuit breaker states.

## Steps

### 1. Parse arguments

| Flag | Default | Description |
| :--- | :--- | :--- |
| `--session <branch>` | current branch | Filter entries by `session` field |
| `--agent <name>` | all | Show only this agent |
| `--since <date>` | no limit | Only entries with `date >= YYYY-MM-DD` |
| `--log` | false | If set, append a fresh metrics snapshot to `agent-metrics.jsonl` |

Get current branch: `git branch --show-current`.

### 2. Read data sources

Read both files in parallel:

- `production/traces/agent-metrics.jsonl` — historical metrics per agent per session
- `production/session-state/circuit-state.json` — live circuit breaker states

If `agent-metrics.jsonl` contains only the schema header line (no actual entries):

```text
📭 No agent metrics recorded yet for this session.
   Metrics are written when agents use /agent-health --log
   or at the end of a session via /save-state.

Circuit breaker states (live):
[show table from circuit-state.json only]
```

### 3. Aggregate metrics

For each agent, compute across the filtered entries:

- `total_tasks` = `tasks_completed` + `tasks_failed` + `tasks_blocked`
- `success_rate` = `tasks_completed / total_tasks * 100` (0 if no tasks)
- `error_rate` = latest `error_rate` field value
- `circuit_state` = from `circuit-state.json` (live, not from log)

### 4. Render health table

```text
🏥 Agent Health Report — session: <branch> · <date range>
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Agent                  Tasks  ✅ Done  ❌ Failed  ⛔ Blocked  Success%  Circuit
──────────────────────────────────────────────────────────────────────────────
backend-developer          8       7          1          0      87.5%   🟢 CLOSED
frontend-developer         5       5          0          0     100.0%   🟢 CLOSED
qa-engineer                  6       4          2          0      66.7%   🟡 HALF-OPEN
data-engineer              2       2          0          0     100.0%   🟢 CLOSED
diagnostics                1       0          1          0       0.0%   🔴 OPEN
──────────────────────────────────────────────────────────────────────────────
TOTAL                     22      18          4          0      81.8%

⚠️  Agents needing attention:
  🔴 diagnostics      — Circuit OPEN · fallback: surface to user
  🟡 qa-engineer        — Circuit HALF-OPEN · 2 failures this session
```

Circuit state icons:
- `🟢 CLOSED` — healthy
- `🟡 HALF-OPEN` — recovering, monitor closely
- `🔴 OPEN` — bypassed, routed to fallback

Flag agents as needing attention if:
- `circuit_state` is `OPEN` or `HALF-OPEN`
- `success_rate` < 70%
- `tasks_failed` >= 2

### 5. Log snapshot (if --log)

If `--log` flag was passed, append one entry per active agent to
`production/traces/agent-metrics.jsonl`:

```jsonl
{"date":"<YYYY-MM-DD>","session":"<branch>","agent":"<agent>","tasks_completed":<N>,"tasks_failed":<N>,"tasks_blocked":<N>,"avg_tokens_est":<N>,"error_rate":<0.0-1.0>,"circuit_state":"CLOSED|OPEN|HALF-OPEN","notes":"<optional>"}
```

Get `circuit_state` from `circuit-state.json`. Estimate `avg_tokens_est` from
decision ledger entry count × 800 tokens (rough estimate per entry) if no exact
token data is available. Note this is an estimate and mark with `_est` suffix.

Print after logging:

```text
✅ Metrics snapshot logged → production/traces/agent-metrics.jsonl
   [N] agents recorded · <date>
```

### 6. Suggest actions

After the table, if any agents need attention:

```text
💡 Suggested actions:
  • /resume-from <task_id>        — recover failed task checkpoint
  • /trace-history --risk High    — audit high-risk decisions
  • Check circuit-state.json      — update OPEN agents once issue resolved
```

---

## How metrics get into the file

Agents append entries in two ways:

1. **Manual:** Run `/agent-health --log` at end of session
2. **Via `/save-state`:** When saving state with a `task_id`, metrics for the
   active agent are appended automatically

The file grows one JSON line per agent per session. Use `--since` to filter
to recent sessions and avoid reading stale data from weeks ago.

---

## Quick examples

```bash
# Summary for current session
/agent-health

# Check one agent across all time
/agent-health --agent qa-engineer

# Log a fresh snapshot and view it
/agent-health --log

# Review last 7 days
/agent-health --since 2026-04-09
```