Skip to main content
ClaudeWave
Skill66 repo starsupdated 29d ago

agent-health

Reads production/traces/agent-metrics.jsonl and displays a per-agent performance summary table for the current or a specified session. Highlights agents with high error rates or OPEN circuit breaker state.

Install in Claude Code
Copy
git clone --depth 1 https://github.com/tranhieutt/software_development_department /tmp/agent-health && cp -r /tmp/agent-health/.claude/skills/agent-health ~/.claude/skills/agent-health
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# Agent Health

Display a performance summary table from `production/traces/agent-metrics.jsonl`,
cross-referenced with `production/session-state/circuit-state.json` for live
circuit breaker states.

## Steps

### 1. Parse arguments

| Flag | Default | Description |
| :--- | :--- | :--- |
| `--session <branch>` | current branch | Filter entries by `session` field |
| `--agent <name>` | all | Show only this agent |
| `--since <date>` | no limit | Only entries with `date >= YYYY-MM-DD` |
| `--log` | false | If set, append a fresh metrics snapshot to `agent-metrics.jsonl` |

Get current branch: `git branch --show-current`.

### 2. Read data sources

Read both files in parallel:

- `production/traces/agent-metrics.jsonl` — historical metrics per agent per session
- `production/session-state/circuit-state.json` — live circuit breaker states

If `agent-metrics.jsonl` contains only the schema header line (no actual entries):

```text
📭 No agent metrics recorded yet for this session.
   Metrics are written when agents use /agent-health --log
   or at the end of a session via /save-state.

Circuit breaker states (live):
[show table from circuit-state.json only]
```

### 3. Aggregate metrics

For each agent, compute across the filtered entries:

- `total_tasks` = `tasks_completed` + `tasks_failed` + `tasks_blocked`
- `success_rate` = `tasks_completed / total_tasks * 100` (0 if no tasks)
- `error_rate` = latest `error_rate` field value
- `circuit_state` = from `circuit-state.json` (live, not from log)

### 4. Render health table

```text
🏥 Agent Health Report — session: <branch> · <date range>
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Agent                  Tasks  ✅ Done  ❌ Failed  ⛔ Blocked  Success%  Circuit
──────────────────────────────────────────────────────────────────────────────
backend-developer          8       7          1          0      87.5%   🟢 CLOSED
frontend-developer         5       5          0          0     100.0%   🟢 CLOSED
qa-engineer                  6       4          2          0      66.7%   🟡 HALF-OPEN
data-engineer              2       2          0          0     100.0%   🟢 CLOSED
diagnostics                1       0          1          0       0.0%   🔴 OPEN
──────────────────────────────────────────────────────────────────────────────
TOTAL                     22      18          4          0      81.8%

⚠️  Agents needing attention:
  🔴 diagnostics      — Circuit OPEN · fallback: surface to user
  🟡 qa-engineer        — Circuit HALF-OPEN · 2 failures this session
```

Circuit state icons:
- `🟢 CLOSED` — healthy
- `🟡 HALF-OPEN` — recovering, monitor closely
- `🔴 OPEN` — bypassed, routed to fallback

Flag agents as needing attention if:
- `circuit_state` is `OPEN` or `HALF-OPEN`
- `success_rate` < 70%
- `tasks_failed` >= 2

### 5. Log snapshot (if --log)

If `--log` flag was passed, append one entry per active agent to
`production/traces/agent-metrics.jsonl`:

```jsonl
{"date":"<YYYY-MM-DD>","session":"<branch>","agent":"<agent>","tasks_completed":<N>,"tasks_failed":<N>,"tasks_blocked":<N>,"avg_tokens_est":<N>,"error_rate":<0.0-1.0>,"circuit_state":"CLOSED|OPEN|HALF-OPEN","notes":"<optional>"}
```

Get `circuit_state` from `circuit-state.json`. Estimate `avg_tokens_est` from
decision ledger entry count × 800 tokens (rough estimate per entry) if no exact
token data is available. Note this is an estimate and mark with `_est` suffix.

Print after logging:

```text
✅ Metrics snapshot logged → production/traces/agent-metrics.jsonl
   [N] agents recorded · <date>
```

### 6. Suggest actions

After the table, if any agents need attention:

```text
💡 Suggested actions:
  • /resume-from <task_id>        — recover failed task checkpoint
  • /trace-history --risk High    — audit high-risk decisions
  • Check circuit-state.json      — update OPEN agents once issue resolved
```

---

## How metrics get into the file

Agents append entries in two ways:

1. **Manual:** Run `/agent-health --log` at end of session
2. **Via `/save-state`:** When saving state with a `task_id`, metrics for the
   active agent are appended automatically

The file grows one JSON line per agent per session. Use `--since` to filter
to recent sessions and avoid reading stale data from weeks ago.

---

## Quick examples

```bash
# Summary for current session
/agent-health

# Check one agent across all time
/agent-health --agent qa-engineer

# Log a fresh snapshot and view it
/agent-health --log

# Review last 7 days
/agent-health --since 2026-04-09
```
accessibility-specialistSubagent

The Accessibility Specialist ensures the software is accessible to the widest possible audience. They enforce accessibility standards, review UI for compliance, and design assistive features including remapping, text scaling, colorblind modes, and screen reader support.

ai-programmerSubagent

The AI Programmer implements intelligent system features: recommendation engines, classification pipelines, LLM integrations, decision logic, and autonomous agent behavior. Use this agent for AI/ML feature implementation, model integration, intelligent automation, or AI system debugging.

analytics-engineerSubagent

The Analytics Engineer designs telemetry systems, user behavior tracking, A/B test frameworks, and data analysis pipelines. Use this agent for event tracking design, dashboard specification, A/B test design, or user behavior analysis methodology.

backend-developerSubagent

The Backend Developer builds and maintains server-side logic, APIs, databases, authentication, and integrations. Use this agent for REST/GraphQL API implementation, database operations, authentication systems, background jobs, microservices, server performance, and backend testing. Works from API design contracts and PRDs.

community-managerSubagent

The Community Manager handles user-facing communications, feedback synthesis, support escalation, and community engagement. Use this agent for drafting release announcements, synthesizing user feedback into actionable insights, writing support documentation, or coordinating community-facing communication around releases and incidents.

ctoSubagent

The CTO (Chief Technical Officer) owns the high-level technical vision, architecture decisions, technology choices, and technical strategy. Use this agent for architecture-level decisions, technology evaluations, cross-system conflicts, and when a technical choice will constrain or enable product possibilities. This is the highest technical authority in the department.

data-engineerSubagent

The Data Engineer designs database schemas, builds data pipelines, manages migrations, and owns the data infrastructure. Use this agent for schema design, complex migrations, data modeling, ETL/ELT pipelines, database performance optimization, analytics infrastructure, and data integrity strategies.

devops-engineerSubagent

The DevOps Engineer maintains build pipelines, CI/CD configuration, version control workflow, and deployment infrastructure. Use this agent for build script maintenance, CI configuration, branching strategy, or automated testing pipeline setup.