ClaudeWave
Skill · 10.1k repo stars · updated today

pxi-eval-dataset

The pxi-eval-dataset skill generates minimal, targeted YAML test datasets for Phoenix's evaluation framework, each containing 10-50 synthetic examples that verify a specific tool, skill, or agent behavior through deterministic code evaluators. Use this skill when building regression test suites for PXI tools by first identifying the target behavior, surveying existing evaluators, designing examples that exercise distinct dimensions of that behavior, and structuring each example with required fields (input, expected tool calls/arguments, splits) so it runs through the experiment harness and scores deterministically.

Install in Claude Code
git clone --depth 1 https://github.com/Arize-ai/phoenix /tmp/pxi-eval-dataset && cp -r /tmp/pxi-eval-dataset/.agents/skills/pxi-eval-dataset ~/.claude/skills/pxi-eval-dataset
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# pxi-eval-dataset

Produce a small, well-targeted YAML dataset that drops into
`evals/pxi/datasets/<name>.yaml`, runs through
`evals/pxi/harness/run_experiment.py`, and is scored by deterministic
code evaluators under `evals/pxi/evaluators/`.

The aim is a **minimal but representative** set of synthetic examples —
think unit tests, not a benchmark. 10–50 examples, each covering a
distinct dimension. Add more only when a new example tests something no
existing example does.

Every example must be scorable by deterministic / heuristic / code logic.
Every example must include a non-empty `splits:` list. Use
`splits: [regression]` for small committed regression suites unless the
user explicitly asks for a `dev` or `val` dataset; keep `regression`,
`dev`, and `val` disjoint.
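
The split rules above can be checked mechanically before a dataset is committed. The sketch below is illustrative only: it assumes each example has already been loaded into a dict with a `splits` list, and it reads "disjoint" as "no example belongs to more than one of `regression`, `dev`, and `val`". The function name `check_splits` is hypothetical and not part of the harness.

```python
# Hypothetical helper; assumes each example is a dict with a "splits" list.
EXCLUSIVE_SPLITS = {"regression", "dev", "val"}


def check_splits(examples):
    """Return a list of (index, problem) tuples; an empty list means OK."""
    problems = []
    for i, example in enumerate(examples):
        splits = example.get("splits") or []
        if not splits:
            # Covers both a missing key and an empty list.
            problems.append((i, "missing or empty splits"))
            continue
        overlap = EXCLUSIVE_SPLITS.intersection(splits)
        if len(overlap) > 1:
            # The example sits in two or more mutually exclusive splits.
            problems.append(
                (i, "in more than one of: " + ", ".join(sorted(overlap)))
            )
    return problems
```

An empty return value means every example carries at least one split and none straddles two of the exclusive splits.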

---

## Workflow

### 1. Identify the target

Confirm with the user what's under test: a specific PXI tool name (e.g.
`set_time_range`), a skill, or a higher-level behavior. For tools:

- Read the tool definition at
  `src/phoenix/server/agents/toolsets/external/tools/<name>.py` — the
  `ToolDefinition`'s description and `parameters_json_schema` are what
  the LLM actually sees.
- Search the Phoenix source for any server-side implementation function
  that backs the tool. (External tools execute browser-side and may
  have none; that's fine, since the description and schema are the spec.)
- Read `src/phoenix/server/agents/toolsets/__init__.py::build_toolset`
  and `src/phoenix/server/agents/toolsets/external/__init__.py::build_external_toolset`
  to learn the availability conditions — which `ChatContext` must be
  present for the tool to be exposed (e.g. `set_spans_filter` only
  exists when `deps.contexts.project` is set).

### 2. Survey the evaluators that currently exist

List `evals/pxi/evaluators/` and read each module. For every
evaluator, note:

- the evaluator name and `@create_evaluator(...)` decorator,
- what fields it consumes from `expected` (e.g. `expected.tools.required`,
  `expected.tool_call_args[<tool>]`),
- its score / label semantics and any helpful failure metadata,
- the class of assertion it supports (tool selection, tool arguments,
  assistant text, multi-call sequencing, ...).

Also peek at `evals/pxi/tests/test_evaluators.py` for canonical input
shapes.

**Do NOT hard-code knowledge of which evaluators exist** — re-read the
directory every time. The set will grow.
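
One way to honor the "re-read the directory every time" rule is to enumerate the evaluator modules at runtime instead of keeping a remembered list. This is a minimal sketch using only the standard library; the helper name `list_evaluator_modules` is hypothetical, and it only lists module files. Reading each module's contents is still a manual step.

```python
from pathlib import Path


def list_evaluator_modules(evaluators_dir):
    """Return the sorted module names present in the directory right now."""
    root = Path(evaluators_dir)
    return sorted(
        path.stem
        for path in root.glob("*.py")
        if path.stem != "__init__"  # the package file is not an evaluator module
    )
```

From the repository root this would be called as `list_evaluator_modules("evals/pxi/evaluators")`.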

### 3. Match the target to the evaluators — pause if a gap exists

Decide whether the behavior under test can be scored by what's there
today.

- **Yes** → continue. The chosen evaluator names go in the dataset's
  `evaluators:` field (required, top-level). The runner uses ONLY the
  listed evaluators; unrelated ones are not invoked, so the dashboard
  stays free of vacuous passes from evaluators that don't apply to
  this dataset. Available names live in
  `evals/pxi/evaluators/__init__.py` (`EVALUATORS_BY_NAME`).
- **No** → stop and summarize the gap to the user. Propose the shape
  of a new evaluator:
  - name (snake_case),
  - the `expected.<field>` it would read,
  - score / label semantics,
  - whether it's tool-call-shaped, text-shaped, structural, etc.

  Then ask: "Should I add this evaluator before we generate examples?"
  If yes:
  - implement it under `evals/pxi/evaluators/<file>.py`,
  - add unit-test coverage to `evals/pxi/tests/test_evaluators.py`,
  - export it from `evals/pxi/evaluators/__init__.py`,
  - then continue.

  If no, scope the dataset down to assertions the existing evaluators
  can score, and tell the user what coverage that costs.
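
The yes/no decision above comes down to comparing the evaluator names the dataset needs against the names the registry exposes. The sketch below assumes the registry behaves like a mapping keyed by evaluator name, as `EVALUATORS_BY_NAME` suggests. The stand-in registry and the evaluator names in it are hypothetical, chosen only for illustration.

```python
def split_known_and_missing(requested, registry):
    """Partition requested evaluator names by whether the registry has them."""
    known = [name for name in requested if name in registry]
    missing = [name for name in requested if name not in registry]
    return known, missing


# Stand-in for the real registry; these evaluator names are hypothetical.
registry = {"tool_selection": object(), "tool_call_args": object()}

known, missing = split_known_and_missing(
    ["tool_selection", "assistant_text"], registry
)
# `known` can go straight into the dataset's `evaluators:` field;
# anything in `missing` is a gap to raise with the user.
```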

### 4. Enumerate coverage dimensions

Walk this checklist for the target. For each row, write down which
queries you'll add to cover it. Skip rows that don't apply (e.g.
booleans on a tool with no boolean field).

- **Parameter coverage** — every required field, every optional field,
  every meaningful field combination.
- **Value coverage** — every enum literal; for strings: empty,
  whitespace, special characters, very long; for numbers: zero,
  negative, boundary; for booleans: both polarities.
- **Combination coverage** — fields that interact (e.g. a filter
  condition paired with a scope toggle, two args that must agree).
- **Negative coverage** — queries where the tool should NOT be called.
- **Ambiguity coverage** — borderline queries that test correctness
  under uncertainty.

Polarity and difficulty targets are stated once in step 5 below; aim
for those across the whole dataset, not within each row.

If the checklist yields fewer than 10 distinct queries, the target is
probably too narrow; confirm scope with the user.
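
The value-coverage row can be seeded from the tool's parameter schema. The sketch below assumes each parameter is described by a JSON Schema property dict (the kind of entry found under `properties` in `parameters_json_schema`) and suggests edge-case values to write queries around. The helper name and the specific string samples are illustrative assumptions, not part of Phoenix.

```python
def candidate_values(prop_schema):
    """Suggest edge-case values for one JSON Schema property."""
    if "enum" in prop_schema:
        # Every enum literal deserves at least one query.
        return list(prop_schema["enum"])
    kind = prop_schema.get("type")
    if kind == "boolean":
        return [True, False]  # both polarities
    if kind == "string":
        # Empty, whitespace, special characters, very long.
        return ["", "   ", 'a/b "quoted" & <tag>', "x" * 500]
    if kind in ("integer", "number"):
        values = [0, -1]  # zero and a negative
        for bound in ("minimum", "maximum"):
            if bound in prop_schema:
                values.append(prop_schema[bound])  # declared boundaries
        return values
    return []  # other types need hand-picked cases
```

The output is a starting list, not a dataset: each suggested value still needs a realistic user query that would lead to it, and rows that do not apply are skipped as described above.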

### 5. Draft queries

For each coverage dimension, write one or more queries that feel like
they came from a real Phoenix user. Mix across:

- **Voice:** imperative ("show me LLM spans"), declarative ("I want
  to see only errors"), question ("what spans took over 5s?"),
  fragment ("LLM only").
- **Polish:** clean prose, terse fragments, casual typos,
  abbreviations, incomplete sentences.
- **Personas:** new user setting up Phoenix on a new project; engineer
  debugging an agent or RAG pipeline; PM exploring trace quality; AI
  engineer writing evals; annotator marking traces; researcher
  exploring trends. Sample across them — don't sound like one author.
- **Difficulty:** ~30% obvious, ~50% moderate, ~20% ambiguous /
  tricky.
- **Polarity:** at least 30% negative (tool should NOT be called).

Anti-patterns to avoid:

- Don't paraphrase the tool's docstring.
- Don't use technical jargon a real user wouldn't reach for.
- Don't write 20 minor variations of the same intent.
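
The difficulty and polarity targets are easy to drift from while drafting, so a quick tally can help. This sketch assumes the drafts are tracked as `(difficulty, is_negative)` tuples, a hypothetical planning structure that is not part of the dataset schema. It checks only the hard floor stated above (at least 30% negative) and reports the difficulty shares for comparison against the approximate targets.

```python
from collections import Counter


def mix_report(drafts):
    """Summarize difficulty and polarity shares for drafted queries."""
    total = len(drafts)
    if total == 0:
        return {"difficulty": {}, "negative_share": 0.0, "polarity_ok": False}
    counts = Counter(difficulty for difficulty, _ in drafts)
    negative = sum(1 for _, is_negative in drafts if is_negative)
    return {
        "difficulty": {label: counts[label] / total for label in counts},
        "negative_share": negative / total,
        # Integer comparison avoids floating-point edge cases at exactly 30%.
        "polarity_ok": negative * 10 >= 3 * total,
    }
```

For example, ten drafts with three negatives pass the polarity check, while ten drafts with two negatives do not.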

### 6. Annotate expected outputs — subprocess annotation

**Ground truth must be generated in a fresh subprocess, not inline.**
When annotation happens in the same context that drafted the queries, the
agent builds a prior from its own examples: it anchors on condition
patterns it established early, collapses ambiguous cases into overconfident
annotations, and fills in `tool_call_args` based on what earlier ex