ClaudeWave
Skill · 10.1k repo stars · updated today

phoenix-skills-audit

The phoenix-skills-audit skill scans recent commits to the Arize Phoenix repository, detects changes to user-facing APIs in the tracing, CLI, and evals packages, and patches the corresponding skill definition files so that agent context stays synchronized with actual code behavior. Use it when Phoenix's Python clients, TypeScript clients, or CLI surfaces change, so that stale skill documentation does not teach agents an outdated API.

Install in Claude Code
git clone --depth 1 https://github.com/Arize-ai/phoenix /tmp/phoenix-skills-audit && cp -r /tmp/phoenix-skills-audit/.agents/skills/phoenix-skills-audit ~/.claude/skills/phoenix-skills-audit
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# Phoenix Skills Audit

Keep the three external-facing skills — `phoenix-tracing`, `phoenix-cli`, `phoenix-evals` —
truthful about what the Phoenix Python clients, TypeScript clients, CLI, and APIs actually
do today. The output is **patches applied to the skill files**, not a report. The skill
reads recent commits, identifies what changed in user-facing surfaces, and updates the
relevant `SKILL.md` and `references/*.md` files in place.

This is a sibling skill to `phoenix-docs-gap-audit`. The docs-gap-audit produces a *report*
about gaps in `docs/phoenix/`; this skill produces *edits* to `.agents/skills/`. Skills are
loaded into agent context every time the user asks a question that triggers them, so a
stale skill teaches every future agent the wrong API. That makes drift here strictly
worse than drift in human-facing docs — humans can sanity-check; agents can't.

## Targets — the three skills this audit owns

```
.agents/skills/phoenix-tracing/SKILL.md
.agents/skills/phoenix-tracing/references/*.md
.agents/skills/phoenix-cli/SKILL.md
.agents/skills/phoenix-evals/SKILL.md
.agents/skills/phoenix-evals/references/*.md
```

Do not touch any other skill directory. If a change clearly belongs in a skill outside
these three (e.g. `phoenix-server`, `phoenix-frontend`), note it in the run summary and
skip — those skills are internal and have their own owners.

## Source mapping — which code feeds which skill

| Source area | Skill |
|---|---|
| `packages/phoenix-otel/` (Python) | `phoenix-tracing` |
| `js/packages/phoenix-otel/` (TS) | `phoenix-tracing` |
| OpenInference semantic conventions, span attributes, instrumentation patterns | `phoenix-tracing` |
| `js/packages/phoenix-cli/` (commands, flags, output JSON shape) | `phoenix-cli` |
| `packages/phoenix-evals/` (Python) | `phoenix-evals` |
| `js/packages/phoenix-evals/` (TS) | `phoenix-evals` |
| New evaluators, eval templates, experiment APIs | `phoenix-evals` |
| `packages/phoenix-client/` (Python) | depends on what's exposed — see below |
| `js/packages/phoenix-client/` (TS) | depends on what's exposed — see below |
| Server REST/GraphQL (`src/phoenix/server/api/`) | depends on what's exposed — see below |

The generic clients and the server APIs are cross-cutting. Map them by the *feature* they
expose, not the file they live in:

- A client method that creates spans / sets attributes → `phoenix-tracing`
- A client method that runs evaluations or experiments → `phoenix-evals`
- A new REST/GraphQL endpoint that the CLI wraps (or should wrap) → `phoenix-cli`
- A new attribute on a span returned by the API → `phoenix-cli` (JSON shape doc) and
  potentially `phoenix-tracing` (if it's an OpenInference attribute)

A single feature can — and often does — span multiple skills. That's fine. Make all the
edits; cross-reference between skills only when the user genuinely needs to read both.
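
The path-based rows of the table can be sketched as a small lookup. This is a minimal sketch, not part of the skill itself: `skill_for_path` is a hypothetical helper name, `by-feature` marks the cross-cutting areas that must be mapped by reading the diff, and `unmapped` only means the path alone says nothing (it is not a reason to skip the commit).

```shell
# Hypothetical helper: map one changed path to the skill that owns it.
# Prints "by-feature" for cross-cutting areas and "unmapped" for any other path.
skill_for_path() {
  case "$1" in
    packages/phoenix-otel/*|js/packages/phoenix-otel/*)
      echo "phoenix-tracing" ;;
    js/packages/phoenix-cli/*)
      echo "phoenix-cli" ;;
    packages/phoenix-evals/*|js/packages/phoenix-evals/*)
      echo "phoenix-evals" ;;
    packages/phoenix-client/*|js/packages/phoenix-client/*|src/phoenix/server/api/*)
      echo "by-feature" ;;
    *)
      echo "unmapped" ;;
  esac
}
```

Feeding it the paths from `git log --name-status` gives a first-pass grouping; the feature-based rules above still decide the final mapping.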

## Workflow

### Phase 1: Gather commits

Default window is the last 7 days on `origin/main`. The user may override. Translate
their phrasing into a concrete range before running anything.

**Always audit `origin/main`, not the local `main` branch.** Local `main` is routinely
stale by dozens of commits — auditing the stale tip silently misses everything that
shipped after the last `git pull`.

```bash
git fetch origin main --quiet

# Default 7-day window
git log --since="7 days ago" origin/main --no-merges --pretty=format:"%h %s" --name-status

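# Tag range if the user specified one (substitute real tag names for the placeholders)
git log <prev-tag>..<current-tag> --no-merges --pretty=format:"%h %s" --name-status

# Sanity check: how many commits local main is behind origin/main
git rev-list --count main..origin/main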
```

Save the raw list. Note the audited SHA in the run summary so a reader can reproduce.
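
Saving the list and recording the audited SHA might look like the sketch below. The helper name `audit_header`, the summary wording, and the file names `commits-raw.txt` and `run-summary.txt` are all assumptions made for illustration; the skill does not prescribe them.

```shell
# Hypothetical helper: format the reproducibility line for the run summary.
# $1 = audited SHA, $2 = human-readable window (e.g. "7 days" or a tag range).
audit_header() {
  printf 'Audited origin/main at %s (window: %s)\n' "$1" "$2"
}

# Inside a Phoenix checkout, after `git fetch origin main --quiet`
# (left commented out so the sketch has no side effects):
#   sha=$(git rev-parse --short origin/main)
#   git log --since="7 days ago" origin/main --no-merges \
#     --pretty=format:"%h %s" --name-status > commits-raw.txt
#   audit_header "$sha" "7 days" > run-summary.txt
```

Resolving the SHA from `origin/main` rather than `main` keeps the recorded value consistent with the branch that was actually audited.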

### Phase 2: Triage to user-facing surfaces

Commit messages lie or under-report. Use them as an index, not a source of truth. Split
the list into three buckets:

- **Audit candidates** — anything that changes what the Python clients, TypeScript
  clients, CLI, REST/GraphQL APIs, or instrumentation packages do from the user's
  perspective. New methods, new flags, new commands, new attributes, new env vars,
  renamed parameters, behavior changes, deprecations, breaking changes.
- **Skip** — internal refactors with no public surface, dep bumps, test-only changes,
  CI/build changes, formatting, frontend-only changes (those belong to other skills),
  release-please bookkeeping (`chore(main): release …`), and **feature flags** (env
  vars named `*_DANGEROUSLY_*`, `*_EXPERIMENTAL_*`, intentionally undocumented escape
  hatches). Feature flags are deliberately kept out of public skills.
- **Unclear** — when you cannot tell from the message and changed paths. Default to
  reading the diff. Cheap, and catches features hidden behind `refactor:` prefixes.
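
Two of the **Skip** rules are mechanical enough to sketch as pattern checks. This is an illustrative sketch under stated assumptions: both function names are hypothetical, and name matching cannot detect the "intentionally undocumented escape hatches", which still require reading the diff.

```shell
# Hypothetical helper: succeed (exit 0) when an env var name looks like a feature flag.
is_feature_flag() {
  case "$1" in
    *_DANGEROUSLY_*|*_EXPERIMENTAL_*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: succeed when a commit subject is release-please bookkeeping.
is_release_bookkeeping() {
  case "$1" in
    "chore(main): release"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

A commit that fails both checks is not automatically an audit candidate; it only means these two skip rules do not apply.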

Group related commits that implement one logical feature across server + SDK + CLI. Audit
them as one unit.

**Breadth before depth.** Enumerate every audit candidate in a flat list before going
deep on any one. The goal is a *complete* sweep, not the single juiciest finding.

For each candidate, in the same one-liner, **tag the cross-cutting concept(s) it most
likely affects**, even if the commit's file path doesn't say so. Use one or more of these four tags:

- **tracing** — spans, attributes, instrumentation, OpenInference, otel
- **evals** — evaluators, experiments, datasets, prompts
- **cli** — CLI commands, flags, JSON output shape
- **none** — internal/refactor/dep/frontend-only/feature-flag

Once a candidate is tagged with `tracing` / `evals` / `cli`, you have committed to a
hypothesis: there is potentially a skill edit here. The only way to legitimately drop
the candidate later is to read the diff and find that the change is internal, or
already documented, or genuinely user-irrelevant. "I don't see the relevant package in
the path" is not a sufficient reason.

Only after this tagged list exists do you expand each entry into edits.
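
One possible shape for the tagged flat list is sketched below. The entry format and both helper names are assumptions, not something the skill mandates; the only constraint taken from the text is that each entry carries one or more of the four tags.

```shell
# Hypothetical helper: succeed only for one of the four tags named above.
valid_tag() {
  case "$1" in
    tracing|evals|cli|none) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: print one flat-list entry as "<sha> [tag,tag] <subject>".
# $1 = short SHA, $2 = commit subject, remaining args = tags.
candidate_line() {
  sha=$1
  subject=$2
  shift 2
  tags=""
  for t in "$@"; do
    valid_tag "$t" || { echo "unknown tag: $t" >&2; return 1; }
    tags="${tags:+$tags,}$t"   # join tags with commas
  done
  printf '%s [%s] %s\n' "$sha" "$tags" "$subject"
}
```

For example, `candidate_line abc1234 "feat: add span kind" tracing cli` prints `abc1234 [tracing,cli] feat: add span kind`, where the SHA and subject are made up for illustration.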

### Phase 3: Locate the real code

For every candidate, open the actual changed files. Com