ClaudeWave
Skill · 252 repo stars · updated today

cua-driver

The cua-driver skill orchestrates macOS application automation by snapshotting the accessibility tree, performing actions like clicks and typing by element index, and reverifying with fresh snapshots. Use it when automating GUI tasks in background macOS apps while keeping the user's foreground application unchanged, such as navigating Finder, filling forms, or clicking buttons without disrupting the user's active workflow.

Install in Claude Code
git clone --depth 1 https://github.com/melandlabs/openloomi /tmp/cua-driver && cp -r /tmp/cua-driver/skills/cua-driver ~/.claude/skills/cua-driver
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# cua-driver

Orchestrates macOS app automation via `cua-driver`. Whenever a user
asks to drive a native macOS app, follow the loop in this skill rather
than calling tools ad-hoc — the snapshot-before-action invariant is not
optional and silently breaks if you skip it.

## The no-foreground contract — read this first

**The user's frontmost app MUST NOT change.** This is the whole
reason cua-driver exists. Users pay for the right to keep typing in
their editor while an agent drives another app in the background.
Violate this rule and every other nice property the driver gives
you (no cursor warp, no Space switch, no window raise) stops
mattering — you just shipped the Accessibility Inspector with extra
steps.

Before running any shell command, ask: **"does this raise,
activate, foreground, or make-key any app?"** If yes, don't run it.
Every one of the commands below activates the target on macOS and
is therefore forbidden unless the user **explicitly** asked for
frontmost state:

- **Every form of the `open` CLI — `open -a <App>`, `open -b
  <bundle-id>`, `open <file>`, `open <path-to-App.app>`, `open
  <url>` — always activates.** macOS routes all forms through
  LaunchServices, which unhides and foregrounds the target
  regardless of whether you passed an app name, a bundle id, a
  document, a URL, or the bundle path itself. The activation
  happens even when the only intent was "start the process."
  **Never use `open` for any app launch.** This includes launching
  a just-built .app from a local build dir (e.g. `open
  build/Build/Products/Debug/MyApp.app`) — resolve the
  `CFBundleIdentifier` from `Info.plist` and use `launch_app`
  with that id. See "The narrow carve-out" below for why
  `launch_app` is safe even when the app internally calls
  `NSApp.activate`.
- `osascript -e 'tell application "X" to activate'` —
  activates by design. Same for `... to open <file>`,
  `... to launch`, and anything with `activate` in the tell block.
- `osascript -e 'tell application "System Events" to ... frontmost'`
  in a mutating form (setting `frontmost` rather than reading it).
- AppleScript files that invoke `activate`, `launch`, or `open`
  against the target app.
- `cliclick` (moves the user's real cursor to the target coords
  before clicking — a focus-steal-equivalent even if the app's
  window state is unchanged).
- `CGEventPost` with `cghidEventTap` targeting a coordinate over
  a different app's window (warps the cursor, possibly activates
  on hit).
- `AppleScriptTask`, `NSAppleScript`, `Process` wrapping `osascript`
  that contains any of the above.
- `NSRunningApplication.activate(options:)` called from your own
  helper binary — same class.
- Dock clicks, which activate the clicked app just as every form of
  `open` does (see the first bullet).
- **Keyboard shortcuts that semantically mean "focus here" —
  most notably Chrome / Safari / Arc's `⌘L` (focus omnibox) and
  Finder's `⌘⇧G` (Go to Folder).** These aren't pure key events —
  the receiving app interprets "user wants to type here" as
  activation intent and raises its window to be key. Even when
  delivered to a backgrounded pid via `hotkey`, the downstream app
  pulls focus. **For omnibox navigation specifically**, the correct
  path is `launch_app({bundle_id: "com.google.Chrome", urls:
  ["https://…"]})` — no omnibox dance, no `⌘L`, no focus-steal. Do
  NOT try `set_value` on the omnibox: Chrome's commit logic requires
  a "user-typed" signal that neither an AX value write nor
  `CGEvent.postToPid` keystrokes supply from a backgrounded pid —
  the URL lands in the field but Return fires as a no-op. See
  `WEB_APPS.md` → "Navigate to a URL" for the full pattern. The
  general principle: a shortcut that says "put my cursor inside this
  app" is a focus-steal; a shortcut that says "do this thing" (copy,
  save, quit) is fine.
- **Tab-switching shortcuts in browsers (`⌘1..⌘9`, `⌘]`, `⌘[`,
  `⌘⇧[`, `⌘⇧]`) are visibly disruptive even when delivered to a
  backgrounded pid.** The app's key handler processes the shortcut,
  the window re-renders the new tab's content, the user sees their
  tabs flipping. There is no AX-only workaround: page content (HTML,
  form state, `AXWebArea`) populates only for the focused tab;
  inspecting a background tab requires activating it, which is the
  visible flip. Observed with Dia; the same mechanic applies to every
  Chromium-family browser (Chrome, Arc, Brave, Edge).

  **Prefer the windows-over-tabs pattern**: for each URL you need to
  drive backgrounded, use `launch_app({bundle_id, urls: [url]})` —
  browsers open each URL in a new **window**. Each window has its own
  `window_id`, its own AX tree, and can be inspected / interacted with
  via `element_index` without activating or switching anything. Tabs
  are a UX grouping for humans; cua-driver workflows should default to
  windows. See `WEB_APPS.md` → "Tabs vs windows" for the full pattern.

  Tab-title enumeration (read-only) IS safe — walk a window's toolbar
  AX tree for `AXTab` / `AXRadioButton` children and read their
  `AXTitle`s. Tab switching (activating one) is not.
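
The `launch_app` guidance in the list above (resolve the `CFBundleIdentifier` instead of using `open`, and give each URL its own window) can be sketched as follows. This is a minimal illustration, not the driver's own code: the helper names `bundle_id_for` and `launch_calls` are hypothetical, and the sketch only builds the argument dictionaries. How they are handed to `launch_app` depends on your tool-calling setup, which is not shown here.

```python
from __future__ import annotations

import plistlib
from pathlib import Path


def bundle_id_for(app_path: str) -> str:
    """Read CFBundleIdentifier from <App>.app/Contents/Info.plist.

    Hypothetical helper: it only parses the plist and never launches anything.
    """
    info_plist = Path(app_path) / "Contents" / "Info.plist"
    with info_plist.open("rb") as fh:
        info = plistlib.load(fh)  # handles both XML and binary plists
    return info["CFBundleIdentifier"]


def launch_calls(bundle_id: str, urls: list[str]) -> list[dict]:
    """Build one launch_app argument dict per URL (windows over tabs).

    The keys `bundle_id` and `urls` mirror the call shape shown in this
    skill; everything else about the transport is an assumption.
    """
    return [{"bundle_id": bundle_id, "urls": [url]} for url in urls]


if __name__ == "__main__":
    # Example: two URLs become two separate launch_app calls.
    for call in launch_calls(
        "com.google.Chrome", ["https://example.com", "https://example.org"]
    ):
        print(call)
```

Building one call per URL, rather than one call with every URL, keeps each page addressable by its own `window_id`, which is what the windows-over-tabs pattern relies on.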

Reading frontmost state is fine (`osascript -e 'tell application
"System Events" to get name of first application process whose
frontmost is true'`). Mutating it is not.
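
The "ask before running any shell command" check can be approximated with a small pre-flight guard. The sketch below is a conservative heuristic under stated assumptions, not part of cua-driver: `steals_focus` is a hypothetical name, it inspects only a single simple command (no pipelines or `&&` chains), and it covers only the `open`, `cliclick`, and `osascript` cases from the list above.

```python
import re
import shlex

# Executables this skill lists as always activating or cursor-warping.
FORBIDDEN_EXECUTABLES = {"open", "cliclick"}

# AppleScript verbs that activate the target. Deliberately broad: any
# occurrence of the word counts, so false positives are possible.
ACTIVATING_VERBS = re.compile(r"\b(activate|launch|open)\b", re.IGNORECASE)

# Writing `frontmost` (forbidden), as opposed to reading it (fine).
SETS_FRONTMOST = re.compile(r"\bset\s+(the\s+)?frontmost\b", re.IGNORECASE)


def steals_focus(command: str) -> bool:
    """Return True if a single shell command looks like it would foreground an app."""
    tokens = shlex.split(command)
    if not tokens:
        return False
    exe = tokens[0].rsplit("/", 1)[-1]  # "/usr/bin/open" -> "open"
    if exe in FORBIDDEN_EXECUTABLES:
        return True
    if exe == "osascript":
        script = " ".join(tokens[1:])
        return bool(ACTIVATING_VERBS.search(script) or SETS_FRONTMOST.search(script))
    return False


if __name__ == "__main__":
    read_only = (
        "osascript -e 'tell application \"System Events\" to get name of "
        "first application process whose frontmost is true'"
    )
    print(steals_focus("open -a Safari"))
    print(steals_focus(read_only))
```

A guard like this errs toward refusing: an `osascript` call that merely mentions `open` inside a string is flagged as well, which seems a reasonable trade when the cost of a miss is stealing the user's focus.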

**Corollary — the AXMenuBar rule.** `AXMenuBarItem` + AXPick
dispatches at the AX layer regardless of which app is frontmost,
but macOS's on-screen menu bar always belongs to the frontmost
app. If you drive a *backgrounded* app's menu bar, the AX call
succeeds but the viewer sees the dispatch rendered over the
*frontmost* app's menu bar. That is confusing in any observed session and
routinely a silent no-op too, because action menu items go
`DISABLED` when their owning app does not own the key window. **So: only
use menu-bar navigation when the target is already frontmost.** For
backgrounded targets, read state via in-window AX (window title,
toolbar `AXStaticText`) and