Skill510 repo starsupdated today

Fleet Control

Fleet Control monitors a fleet of Aeon instances registered in a JSON registry, performing health checks, status summaries, and skill dispatch operations. It verifies GitHub authentication and rate limits before each run, retrieves repository metadata and recent workflow runs for each instance, tracks state changes across runs, and outputs decision-ready verdicts with per-instance action recommendations. Use this skill to maintain visibility into managed Aeon deployments and remotely trigger skills on healthy or degraded instances.

View source Repository: aeon

Install in Claude Code

Copy

git clone --depth 1 https://github.com/aaronjmars/aeon /tmp/fleet-control && cp -r /tmp/fleet-control/skills/fleet-control ~/.claude/skills/fleet-control

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

<!-- autoresearch: variation B — sharper output: verdict line + delta vs prior + per-instance action column + state-change-gated notify -->

> **${var}** — Command. Empty (or unrecognized) → Health Check (default). `status` → full Status Mode. `dispatch <instance|*> <skill> [var=<value>]` → trigger a skill on one child or all healthy/degraded children.

Today is ${today}. Operate the fleet of Aeon instances registered in `memory/instances.json`. Output is **decision-ready**: every run leads with a verdict, then a delta vs prior check, then per-instance lines that name the next concrete action.

## Pre-flight (every mode)

1. **Verify gh auth** — `gh auth status` must succeed. If not, log `FLEET_NO_AUTH` to `memory/logs/${today}.md` and notify `Fleet Control: gh auth missing — check GITHUB_TOKEN secret.` Stop.

2. **Check rate limit** — `REMAINING=$(gh api rate_limit --jq '.resources.core.remaining')`. If `REMAINING < 50`, log `FLEET_RATE_LIMITED:remaining=${REMAINING}` and notify a one-line warning, then stop.

3. **Load the registry** — read `memory/instances.json`. If the file is missing, write `{"instances": []}` to bootstrap. If `.instances` is absent or `[]`:
   - Log `FLEET_EMPTY: no managed instances` to `memory/logs/${today}.md`.
   - **Stop. Do NOT notify.**

4. **Load prior state** — read `memory/state/fleet-control-state.json` (create the directory and file with `{"instances": {}, "last_full_summary_date": ""}` if missing). Shape:
   ```json
   {
     "instances": {
       "<name>": { "health": "<status>", "last_checked": "<ISO>", "consecutive_unreachable": 0 }
     },
     "last_full_summary_date": "YYYY-MM-DD"
   }
   ```

5. **Parse var → mode**:
   - empty / unrecognized → **Health Check Mode**
   - exactly `status` → **Status Mode**
   - starts with `dispatch ` → **Dispatch Mode**

---

## Health Check Mode (default)

For each registered instance, skip rows with `archived: true` from per-instance work (count them separately). Run the three calls per instance in parallel using `&` + `wait` and write each to `/tmp/fleet/${SAFE}.{repo,runs,cron}.json`:

a. **Repo metadata**:
   ```bash
   gh api "repos/${REPO}" \
     --jq '{full_name, pushed_at, archived, default_branch, open_issues_count}' \
     > "/tmp/fleet/${SAFE}.repo.json" 2>"/tmp/fleet/${SAFE}.repo.err" &
   ```

b. **Workflow runs in last 24h** (precise window, not "last 5"):
   ```bash
   SINCE=$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)
   gh api "repos/${REPO}/actions/runs?created=>${SINCE}&per_page=100&exclude_pull_requests=true" \
     --jq '{total_count, runs:[.workflow_runs[]|{name,status,conclusion,created_at,html_url}]}' \
     > "/tmp/fleet/${SAFE}.runs.json" 2>"/tmp/fleet/${SAFE}.runs.err" &
   ```

c. **Cron-state from child**:
   ```bash
   gh api "repos/${REPO}/contents/memory/cron-state.json" --jq '.content' 2>"/tmp/fleet/${SAFE}.cron.err" \
     | base64 -d > "/tmp/fleet/${SAFE}.cron.json" &
   ```

`wait` after launching all three for an instance (or batch across all instances if you trust your parallelism — keep ≤16 concurrent calls to stay under rate limit).

**Classify each instance** with precise thresholds:
- **unreachable** — repo metadata call returned non-zero (404/403/etc.)
- **archived** — repo metadata returns `archived: true`
- **pending_secrets** — `runs.total_count == 0` for the 24h window AND repo `pushed_at` ≥ 7 days old (newly-spawned instances under 7 days stay unclassified-but-tracked)
- **stale** — `runs.total_count == 0` AND `pushed_at` > 7 days old AND not `archived`
- **degraded** — ≥1 cron-state skill with `consecutive_failures ≥ 3` OR (24h failure_count / total_count) ≥ 0.5 with total_count ≥ 2
- **warning** — 24h failure_count ≥ 1 but ratio < 0.5
- **healthy** — has runs in last 24h, all conclusions `success` or `in_progress`/`queued`, no degraded cron-state skills

For each instance compute a **next_action** (one short imperative phrase):
- `pending_secrets` → `add ANTHROPIC_API_KEY at https://github.com/${REPO}/settings/secrets/actions`
- `degraded` → `investigate <skill_name> (<consecutive_failures>× in a row, last_error: <signature, ≤60 chars>)`
- `warning` → `monitor — <N>/<Total> runs failed in 24h`
- `stale` → `confirm intent: no runs in 24h, last push <relative_date>; archive or re-enable`
- `unreachable` → `verify access: <reason from repo.err>`
- `healthy` → `none`
- `archived` → `none (archived)`

**Compute delta** vs prior state (per-instance `prior.health` vs `current.health`):
- **NEW** — instance not in prior state
- **DEGRADED** — was healthy/warning, now degraded/unreachable/stale/pending_secrets
- **RECOVERED** — was degraded/unreachable/stale/pending_secrets, now healthy/warning
- **DROPPED** — was in prior state, no longer in registry
- (no change → no delta line)

**Update the registry** — write back `health`, `last_checked` (ISO UTC), and `next_action` per instance to `memory/instances.json`. Preserve all other fields (`purpose`, `parent`, `created`, `skills_enabled`, etc.).

**Update the state file** — write the current per-instance health snapshot to `memory/state/fleet-control-state.json`. Update `last_full_summary_date` to today **only when this run notifies**. Increment `consecutive_unreachable` for unreachable instances; reset to 0 otherwise.

**Log** to `memory/logs/${today}.md`:
```
## fleet-control (health check)
- Verdict: [FLEET_OK | NEEDS_ATTENTION:N]
- Sizes: total=N, healthy=N, warning=N, degraded=N, stale=N, pending=N, unreachable=N, archived=N
- Deltas: [list NEW/DEGRADED/RECOVERED/DROPPED, or "none"]
- Sources: gh=ok, rate_remaining=N
```

**Notification gate** — send the notification if **any** of:
- `len(deltas) > 0`
- today != prior `last_full_summary_date` (first check of UTC day → daily rollup)
- any current instance is `degraded` or `unreachable`

Otherwise skip notify (silent no-op when nothing changed mid-day — operator isn't trained to ignore).

**Notification body** (when sent):
```
*Fleet Control — ${today}*
Verdict: <