Skill975 repo starsupdated today

agent-desktop

agent-desktop is a command-line tool that enables AI agents to observe and control macOS desktop applications by exposing their accessibility trees as structured JSON with reference-based element identifiers. Use it when building autonomous agents that need to interact with native applications through programmatic UI observation and control, rather than building agents to use the tool directly.

View source Repository: agent-desktop

Install in Claude Code

Copy

git clone --depth 1 https://github.com/lahfir/agent-desktop /tmp/agent-desktop && cp -r /tmp/agent-desktop/skills/agent-desktop ~/.claude/skills/agent-desktop

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# agent-desktop

CLI tool enabling AI agents to observe and control desktop applications via native OS accessibility trees.

**Core principle:** agent-desktop is NOT an AI agent. It is a tool that AI agents invoke. It outputs structured JSON with ref-based element identifiers. The observation-action loop lives in the calling agent.

## Installation

```bash
npm install -g agent-desktop
# or
bun install -g --trust agent-desktop
```

Requires macOS 12+ with Accessibility permission granted to your terminal. Screen Recording permission is also required for screenshots.

## Reference Files

Detailed documentation is split into focused reference files. Read them as needed:

| Reference | Contents |
|-----------|----------|
| `references/commands-observation.md` | snapshot, find, get, is, screenshot, list-surfaces — all flags, output examples |
| `references/commands-interaction.md` | click, type, set-value, select, toggle, scroll, drag, keyboard, mouse — choosing the right command |
| `references/commands-system.md` | launch, close, windows, clipboard, wait, batch, status, permissions, version |
| `references/workflows.md` | 12 common patterns: forms, menus, dialogs, scroll-find, drag-drop, async wait, anti-patterns |
| `references/macos.md` | macOS permissions/TCC, AX API internals, smart activation chain, surfaces, Notification Center, troubleshooting |

## The Observe-Act Loop (Progressive Skeleton Traversal)

Use **progressive skeleton traversal** as the default approach. It reduces token consumption 78-96% for dense apps by exploring the UI in two phases: a shallow skeleton overview, then targeted drill-downs into regions of interest.

```
1. SKELETON → agent-desktop snapshot --skeleton --app "App" -i --compact
Parse the overview. Identify the region containing your target.
Regions show children_count (e.g., "Sidebar" with children_count: 42).
Named containers at truncation boundary have refs for drill-down.
Keep the returned snapshot_id.

2. DRILL → agent-desktop snapshot --root @e3 --snapshot <snapshot_id> -i --compact
Expand the target region. Now you see its interactive elements.

3. ACT → agent-desktop click @e12 --snapshot <snapshot_id> (or type, select, toggle...)

4. VERIFY → agent-desktop snapshot --root @e3 --snapshot <snapshot_id> -i --compact
Re-drill the same region to confirm the state change.
Scoped invalidation: only @e3's subtree refs are replaced.

5. REPEAT → Continue drilling other regions or acting as needed.
```

**When to skip skeleton and use full snapshot instead:**
- Simple apps with few elements (Finder, Calculator, TextEdit)
- You already know the exact element name — use `find` instead
- Surface snapshots (menus, sheets, alerts) — these are already focused

**When skeleton shines:**
- Dense Electron apps (Slack, VS Code, Discord, Notion)
- Any app where full snapshot exceeds ~50 refs
- Multi-region workflows (sidebar + main content + toolbar)

## Ref System

- Refs assigned depth-first: `@e1`, `@e2`, `@e3`...
- An element gets a ref when it is addressable for an action: an interactive role (button, textfield, checkbox, link, menuitem, tab, slider, combobox, treeitem, cell, radiobutton, switch, ...) **or** any element advertising an action — so `scrollarea` (Scroll) and `disclosure` (Expand/Collapse) are ref-able and `scroll`/`expand`/`collapse` can target them
- A `SetFocus`-only affordance does not earn a ref on its own
- In skeleton mode, named/described containers at truncation boundary also get refs (drill-down targets with empty `available_actions`)
- Static text and non-actionable groups/containers remain in tree for context but have no ref
- Refs are deterministic within a snapshot but NOT stable across snapshots if UI changed
- Every snapshot returns `snapshot_id`; ref-consuming commands accept `--snapshot <snapshot_id>`, and explicit snapshot IDs do not require also passing `--session`
- `last_refmap.json` is only a latest-snapshot inspection artifact. The command path uses snapshot-scoped storage.
- After any action that changes UI, re-drill the affected region or re-snapshot
- **Scoped invalidation:** re-drilling `--root @e3` only replaces refs from @e3's previous drill — refs from other regions and the skeleton itself are preserved
- **Strict resolution:** stale refs return `STALE_REF`; duplicate plausible targets return `AMBIGUOUS_TARGET` instead of choosing arbitrarily.
- **Actionability:** ref actions check live visibility, stability, enabled state, supported action, policy, and editability before dispatch.
- **Headless vs headed:** ref actions are headless by default (AX-only, no cursor) and fail closed when only a physical gesture would work. `type` uses a focus-fallback base policy because typing needs focus but never moves the cursor. Pass the global `--headed` flag to permit cursor movement and focus stealing so physical fallbacks can complete; the AX path is still tried first, so `--headed` never regresses headless-capable elements. Raw cursor commands (`hover`, `drag`, `mouse-*`) are physical and require `--headed`; keyboard commands (`press`, `key-down`, `key-up`) are explicit low-level input.
- **Sessions:** use `--session <id>` for concurrent or multi-agent runs that share a latest snapshot pointer; batch entries may override with `"session": "id"`.
- **Trace:** use `--trace <path>` for JSONL diagnostics outside stdout; `--trace-strict` fails on trace setup and pre-action writes. Post-action success traces are best-effort because the desktop mutation already happened. Trace fields with sensitive key tokens such as `text`, `value`, `expected`, `name`, `username`, `description`, `label`, `query`, `secret`, `token`, `password`, `title`, `url`, `help`, or `placeholder` are redacted to `{ "redacted": true }` at every nesting depth. Token matching handles snake_case, kebab-case, and camelCase keys (for example, `source_window_title`, `api_token`, and `typedText` redact, while `filename` stays readable). To