Skill10.8k repo starsupdated 1mo ago

hive.browser-automation

hive.browser-automation teaches the coordinate and workflow conventions for ClaudeWave's real-browser CDP tools. Use it before calling any browser_* tool to understand fractional viewport coordinates (0..1), the screenshot-plus-click pattern for shadow-DOM elements, and production site quirks (rich-text editor button states, CSP headers, LinkedIn/Reddit/X failures). This skill covers Chrome via the Beeline extension and explains why proportional coordinates survive vision-model resizing better than pixel values.

View source Repository: hive

Install in Claude Code

Copy

git clone --depth 1 https://github.com/aden-hive/hive /tmp/hive.browser-automation && cp -r /tmp/hive.browser-automation/core/framework/skills/_preset_skills/browser-automation ~/.claude/skills/hive.browser-automation

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# GCU Browser Automation

All GCU browser tools drive a real Chrome instance through the Beeline extension and Chrome DevTools Protocol (CDP). That means clicks, keystrokes, and screenshots are processed by the actual browser's native hit testing, focus, and layout engines — **not** a synthetic event layer. Understanding this unlocks strategies that make hard sites easy.

## Coordinates

Every browser tool that takes or returns coordinates operates in **fractions of the viewport (0..1 for both axes)**. Read a target's proportional position off `browser_screenshot` — "this button is about 35% from the left and 20% from the top" → pass `(0.35, 0.20)`. Rect-returning tools (`browser_get_rect`, `browser_shadow_query`, and the `rect` inside `focused_element`) also return fractions. The tools convert to CSS pixels internally before dispatching to Chrome.

```
browser_screenshot()                  → image + cssWidth/cssHeight in meta
browser_click_coordinate(x, y)        → x, y are fractions 0..1
browser_hover_coordinate(x, y)        → fractions
browser_press_at(x, y, key)           → fractions
browser_get_rect(selector) → rect     → rect.cx / rect.cy are fractions
browser_shadow_query(...)  → rect     → same
```

**Why fractions:** every vision model (Claude ~1.15 MP target, GPT-4o 512-px tiles, Gemini, local VLMs) resizes or tiles images differently before the model sees the pixels. Proportions survive every such transform; pixel coordinates only "work" per-model and silently break when you swap backends. Four-decimal precision (`0.0001` ≈ 0.17 CSS px on a 1717-wide viewport) is more than enough for the tightest targets.

**Exception for zoomed elements:** pages that use `zoom` or `transform: scale()` on a container (LinkedIn's `#interop-outlet`, some embedded iframes) render in a scaled local coordinate space. `getBoundingClientRect` there may not match CDP's hit space. Prefer `browser_shadow_query` (which handles the math and returns fractions) or visually pick coordinates from a screenshot. Avoid raw `browser_evaluate` + `getBoundingClientRect()` for coord lookup — that returns CSS px and will be wrong when fed to click tools.

## Screenshot + coordinates is shadow-agnostic — prefer it on shadow-heavy sites

Start with `browser_snapshot` when you need to inspect the page structure or find ordinary controls. If the snapshot does not show the thing you need, shows stale or misleading refs, or cannot prove where a visible target is, take `browser_screenshot` and use the screenshot + coordinate path. This is especially useful on sites that use Shadow DOM heavily

Why:

- **CDP hit testing walks shadow roots natively.** `browser_click_coordinate(x, y)` routes through Chrome's native hit tester, which traverses open shadow roots automatically. You don't need to know the shadow structure.
- **Keyboard dispatch follows focus** into shadow roots. After a click focuses an input (even one three shadow levels deep), `browser_press(...)` with no selector dispatches keys to `document.activeElement`'s computed focus target.
- **Screenshots render the real layout** regardless of DOM implementation.

Whereas `wait_for_selector`, `browser_click(selector=...)`, `browser_type(selector=...)` all use `document.querySelector` under the hood, which **stops at shadow boundaries**. They cannot see elements inside shadow roots. For shadow-DOM inputs, use `browser_type_focused` after focusing via click-coordinate.

### Recommended workflow on shadow-heavy sites

1. `browser_screenshot()` → JPEG; meta includes `cssWidth`/`cssHeight` for reference.
2. Identify the target visually → estimate its proportional position `(fx, fy)` where each is in `0..1`.
3. `browser_click_coordinate(fx, fy)` → tool converts to CSS px and dispatches; CDP native hit testing focuses the element. **The response includes `focused_element: {tag, id, role, contenteditable, rect, inFrame?, ...}`** — use it to verify you actually focused what you intended. `rect` is in fractions (same space as your input). When focus is inside a same-origin iframe, the descriptor reports the inner element and adds `inFrame: [...]` breadcrumbs.
4. `browser_type_focused(text="...")` → inserts text into `document.activeElement` (traverses into same-origin iframes automatically). Shadow roots, iframes, Lexical, Draft.js, ProseMirror all just work. Use `browser_type(selector, text)` instead when you have a reliable CSS selector for a light-DOM element.
5. Verify via `browser_screenshot` OR `browser_get_attribute` on a known-reachable marker (e.g. check that the Send button's `aria-disabled` flipped to `false`).

### The click→type loop (canonical pattern)

1. Call `browser_click_coordinate(x, y)` to click the target element.
2. Check the `focused_element` field in the response — it tells you what actually received focus (tag, id, role, contenteditable, rect).
3. If the focused element is editable, call `browser_type_focused(text="...")` to insert text. Use tools to verify the text took effect — prefer checking the underlying `.value` / `innerText` via `browser_evaluate` or confirming the submit button enabled. A screenshot alone can mislead: narrow input boxes visually clip long text, so only a portion may appear on screen even though the full string was accepted.
4. If it is NOT editable, your click landed on the wrong thing — refine coordinates and retry. Do NOT reach for `browser_evaluate` + `execCommand('insertText')` or shadow-root traversals. The problem is the click target, not the typing method.

`browser_click` (selector-based) also returns `focused_element`, so the same check works whether you clicked by selector or coordinate.

### Empirically verified (2026-04-11)

Tested against `https://www.reddit.com/r/programming/` whose search input lives at:

```
document > reddit-search-large [shadow]
         > faceplate-search-input#search-input [shadow]
         > input[name="q"]
```

### Shadow-piercing selectors

When you DO want a selector-based approach and k