
phoenix-pxi-playwright

phoenix-pxi-playwright is a Playwright test harness for authoring and debugging end-to-end tests for Phoenix's built-in PXI AI assistant. Use it when writing frontend specifications for PXI agent behavior, creating LLM-as-judge rubrics to evaluate assistant responses, asserting tool usage and backend spans, persisting test runs as Phoenix experiments, or troubleshooting PXI E2E test failures. The harness provides shared fixtures, constants, utilities, and abstractions like `pxi.askAndWait()` and `judge()` to standardize PXI test development.

Install in Claude Code
git clone --depth 1 https://github.com/Arize-ai/phoenix /tmp/phoenix-pxi-playwright && cp -r /tmp/phoenix-pxi-playwright/.agents/skills/phoenix-pxi-playwright ~/.claude/skills/phoenix-pxi-playwright
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# Phoenix PXI Playwright Tests

Use this skill when authoring or maintaining Playwright specs for PXI, Phoenix's built-in AI assistant. The concrete harness lives in `app/tests/pxi/`; this skill is the authoring guide for using and extending that harness.

## Start Here

- Read the existing example spec first: `app/tests/pxi/docs-smoke.spec.ts`.
- Reuse the shared fixture and driver from `app/tests/pxi/fixtures.ts`.
- Reuse shared constants from `app/tests/pxi/constants.ts` and shared types from `app/tests/pxi/types.ts`.
- Put pure parsing/API helpers in `app/tests/pxi/utils.ts` rather than in specs or fixture classes.
- Reuse the generic AI SDK judge from `app/tests/pxi/judge.ts`.
- Reuse experiment persistence from `app/tests/pxi/experimentPersistence.ts`.
- Add one entry to `PXI_EXPERIMENT_EXAMPLES` in `app/tests/pxi/experimentPersistence.ts` for every new PXI spec scenario. All specs share the same dataset, and the update upload treats that registry as the complete desired set of examples.
- Do not create a bespoke PXI driver, duplicate experiment client, or duplicate PXI tool schemas in a spec.
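
As a rough illustration of the registry rule above, the sketch below shows why a missing entry matters when the upload treats the registry as the complete set. Everything here is a stand-in: the real registry lives in `app/tests/pxi/experimentPersistence.ts`, only the `prompt` field is confirmed by the spec pattern, and `expectedOutput`, `ExampleSketch`, `EXAMPLES_SKETCH`, `toDatasetExamples`, the `docsSmoke` key, and the placeholder strings are assumptions made for the example.

```ts
// Hypothetical stand-in for one registry entry. `prompt` appears in the spec
// pattern; `expectedOutput` is an assumed field name.
type ExampleSketch = {
  prompt: string;
  expectedOutput: string;
};

// Hypothetical stand-in for PXI_EXPERIMENT_EXAMPLES: one entry per scenario.
const EXAMPLES_SKETCH: Record<string, ExampleSketch> = {
  docsSmoke: {
    prompt: "<placeholder prompt for an existing scenario>",
    expectedOutput: "<placeholder expected output>",
  },
  // A new spec scenario adds exactly one entry here.
  someScenario: {
    prompt: "<placeholder prompt for the new scenario>",
    expectedOutput: "<placeholder expected output>",
  },
};

type DatasetExampleSketch = {
  key: string;
  input: { prompt: string };
  output: { expected: string };
};

// If the upload treats the registry as the complete desired set, the dataset
// examples can be derived from the registry alone; a scenario with no entry
// simply has no example.
function toDatasetExamples(
  registry: Record<string, ExampleSketch>
): DatasetExampleSketch[] {
  return Object.entries(registry).map(([key, example]) => ({
    key,
    input: { prompt: example.prompt },
    output: { expected: example.expectedOutput },
  }));
}
```

In a real spec, the entry would be read back from the shared registry (as `PXI_EXPERIMENT_EXAMPLES.someScenario` is in the spec pattern) rather than redefined locally.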

## Current Harness

The current harness provides these abstractions:

- `test` and `expect` from `./fixtures`: PXI-aware Playwright fixture exports.
- `constants.ts`: default assistant and judge model/project constants.
- `types.ts`: shared PXI harness types such as `PxiTurn`.
- `utils.ts`: pure utilities for API response validation and span/tool parsing.
- `pxi.open()`: opens PXI for the test session.
- `pxi.acknowledgeConsent()`: accepts PXI consent for the test session.
- `pxi.askAndWait(prompt)`: sends a user prompt and waits for the assistant turn. It does not require a backend TOOL span; add explicit tool assertions in the spec.
- `pxi.expectNoAgentError()`: asserts the visible PXI session did not surface an agent error.
- `pxi.expectBackendToolSpanCalled(turn)`: asserts the PXI turn produced at least one persisted backend TOOL span and merges those backend tool names into `turn.calledTools`. Use this for server/MCP-backed tools such as docs tools, not for purely client-executed external tools.
- `pxi.expectDocsToolCalled(turn)`: asserts the PXI turn used runtime docs tooling via Phoenix-observed tool spans.
- `pxi.getMetadata()`: collects PXI metadata for persistence.
- `judge({ model, system, prompt, assistantText, rubric })`: evaluates an assistant answer with AI SDK `generateText` and structured `Output.object`.
- `evaluatePxiOutcome({ assertions, judgeInput })`: runs deterministic assertions and LLM judging while preserving failed post-turn outcomes for experiment persistence.
- `assertPxiOutcome(outcome)`: fails the Playwright test after persistence, preferring the original deterministic assertion failure when one exists.
- `persistPxiExperiment({ request, record })`: stores the PXI interaction, judge result, and metadata as a Phoenix experiment.
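
To make the `pxi.expectBackendToolSpanCalled(turn)` contract more concrete, here is a minimal sketch of the behavior described above: require at least one backend TOOL span, then merge the tool names into `turn.calledTools`. It is not the harness implementation. The `SpanSketch` shape (`name`, `spanKind`), the `TurnSketch` type, and both function names are assumptions for illustration; the real helper takes only the turn and reads persisted spans from Phoenix.

```ts
// Hypothetical span shape; the real harness parses spans fetched from Phoenix.
type SpanSketch = {
  name: string;
  spanKind: string;
};

// Hypothetical stand-in for the fields of a PXI turn used in this document.
type TurnSketch = {
  assistantText: string;
  calledTools: string[];
};

// Pure helper in the spirit of utils.ts: collect unique TOOL span names.
function getBackendToolNames(spans: SpanSketch[]): string[] {
  const names = spans
    .filter((span) => span.spanKind === "TOOL")
    .map((span) => span.name);
  return Array.from(new Set(names));
}

// Fails when no backend TOOL span exists; otherwise merges the backend tool
// names into turn.calledTools without duplicating existing entries.
function expectBackendToolSpanCalledSketch(
  turn: TurnSketch,
  spans: SpanSketch[]
): void {
  const toolNames = getBackendToolNames(spans);
  if (toolNames.length === 0) {
    throw new Error("Expected at least one backend TOOL span for this turn.");
  }
  turn.calledTools = Array.from(new Set(turn.calledTools.concat(toolNames)));
}
```

Keeping the parsing in a pure function mirrors the guidance to place span and tool parsing in `utils.ts` rather than in specs or fixture classes.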

## Authoring Workflow

1. Add the scenario prompt and expected output to `PXI_EXPERIMENT_EXAMPLES` so the shared PXI E2E dataset gets one example per test scenario.
2. Keep the PXI user instructions and judge rubric in the spec file so the test is readable top-to-bottom. Import the scenario prompt from `PXI_EXPERIMENT_EXAMPLES` rather than duplicating prompt strings.
3. Drive PXI through the real UI with `pxi.open`, `pxi.acknowledgeConsent`, and `pxi.askAndWait`.
4. Add deterministic assertions before judge assertions, such as expected text, no agent error, or expected tool use.
5. Put all post-turn deterministic assertions inside `evaluatePxiOutcome`, including `pxi.expectBackendToolSpanCalled(turn)`, so failures after PXI returns an answer still get persisted.
6. After `pxi.askAndWait` returns a turn, persist both passing and failing outcomes. Do not let deterministic assertion failures skip experiment persistence.
7. Use `evaluatePxiOutcome` instead of writing per-spec `try/catch` blocks. It runs `judge` even when deterministic Playwright assertions fail, then combines the judge explanation with a sanitized, truncated Playwright assertion message in the persisted failed evaluation.
8. Run the targeted spec with isolated ports before reporting success.
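
Steps 5 through 7 describe a control flow that can be sketched roughly as follows: capture the deterministic failure instead of throwing, run the judge anyway, combine both into one outcome that can be persisted, and only then fail the test. This is a simplified illustration, not the code in the harness. All names (`evaluateOutcomeSketch`, `assertOutcomeSketch`, `sanitizeAssertionMessage`, `runJudge`, the `OutcomeSketch` fields) and the 500-character limit are assumptions, and the sketch assumes assertion messages may contain ANSI color codes.

```ts
type JudgeResultSketch = { pass: boolean; explanation: string };

type OutcomeSketch = {
  passed: boolean;
  explanation: string;
  assertionError?: Error;
};

// Assumed limit; the harness may use a different value.
const MAX_ASSERTION_MESSAGE_LENGTH = 500;

// Strip ANSI color codes and truncate long assertion messages.
function sanitizeAssertionMessage(message: string): string {
  const withoutAnsi = message.replace(/\u001b\[[0-9;]*m/g, "");
  return withoutAnsi.length > MAX_ASSERTION_MESSAGE_LENGTH
    ? withoutAnsi.slice(0, MAX_ASSERTION_MESSAGE_LENGTH) + "..."
    : withoutAnsi;
}

// Runs the deterministic assertions, then the judge, without throwing, so the
// caller can persist the outcome before failing the test.
async function evaluateOutcomeSketch(options: {
  assertions: () => Promise<void>;
  runJudge: () => Promise<JudgeResultSketch>;
}): Promise<OutcomeSketch> {
  let assertionError: Error | undefined;
  try {
    await options.assertions();
  } catch (error) {
    assertionError = error instanceof Error ? error : new Error(String(error));
  }
  // The judge runs even when a deterministic assertion has already failed.
  const judgeResult = await options.runJudge();
  if (assertionError === undefined) {
    return { passed: judgeResult.pass, explanation: judgeResult.explanation };
  }
  return {
    passed: false,
    explanation:
      judgeResult.explanation +
      "\nAssertion failure: " +
      sanitizeAssertionMessage(assertionError.message),
    assertionError,
  };
}

// Called after persistence: prefer the original deterministic failure.
function assertOutcomeSketch(outcome: OutcomeSketch): void {
  if (outcome.assertionError !== undefined) {
    throw outcome.assertionError;
  }
  if (!outcome.passed) {
    throw new Error("Judge failed: " + outcome.explanation);
  }
}
```

The ordering is the point of the design: evaluate, persist, then assert. A per-spec `try/catch` tends to lose either the judge result or the original assertion error, which is why the workflow routes everything through `evaluatePxiOutcome`.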

## Spec Pattern

```ts
import {
  persistPxiExperiment,
  PXI_EXPERIMENT_EXAMPLES,
} from "./experimentPersistence";
import { expect, test } from "./fixtures";
import { getRequiredJudgeApiKeyEnv } from "./judge";
import { assertPxiOutcome, evaluatePxiOutcome } from "./outcome";

const EXPERIMENT_EXAMPLE = PXI_EXPERIMENT_EXAMPLES.someScenario;
const USER_PROMPT = EXPERIMENT_EXAMPLE.prompt;
const JUDGE_RUBRIC = [
  "The answer satisfies the user request.",
  "The answer is grounded in the expected Phoenix context.",
  "The answer does not invent unsupported facts.",
];
const JUDGE_API_KEY_ENV = getRequiredJudgeApiKeyEnv();

test.describe("PXI scenario", () => {
  test("handles the scenario", async ({
    browserName,
    page,
    pxi,
    request,
  }, testInfo) => {
    test.skip(
      browserName !== "chromium",
      "PXI real-LLM smoke runs once in chromium."
    );
    test.skip(
      process.env.PXI_E2E !== "true",
      "Set PXI_E2E=true to run PXI E2E tests."
    );
    test.skip(
      !process.env.OPENAI_API_KEY,
      "OPENAI_API_KEY is required for the PXI assistant."
    );
    test.skip(
      !process.env[JUDGE_API_KEY_ENV],
      `${JUDGE_API_KEY_ENV} is required for the PXI E2E judge.`
    );

    await pxi.open();
    await pxi.acknowledgeConsent();

    const turn = await pxi.askAndWait(USER_PROMPT);
    const outcome = await evaluatePxiOutcome({
      assertions: async () => {
        await pxi.expectNoAgentError();
        // For server/MCP-backed tools only. Client-executed external tools can
        // be asserted through visible tool UI or final app state instead.
        await pxi.expectBackendToolSpanCalled(turn);
        expect(turn.assistantText).toContain("deterministic expected text");
      },
      judgeInput: {
        system: "You are judgi…
```