phoenix-evals-new-metric

The phoenix-evals-new-metric skill guides developers through creating a built-in classification evaluator for the Phoenix ML observability platform. Use it when adding a new quality metric to Phoenix's evaluation system. The workflow is linear: requirements become a YAML configuration, which is compiled into Python and TypeScript evaluator classes, then benchmarked and documented for use in dataset experiments and model evaluation workflows.

Repository: phoenix

Install in Claude Code

git clone --depth 1 https://github.com/Arize-ai/phoenix /tmp/phoenix-evals-new-metric && cp -r /tmp/phoenix-evals-new-metric/.agents/skills/phoenix-evals-new-metric ~/.claude/skills/phoenix-evals-new-metric

Then start a new Claude Code session; the skill loads automatically.

Definition

SKILL.md

# Creating a New Built-in Classification Evaluator

A built-in evaluator is a YAML config (source of truth) that gets compiled into Python and TypeScript code, wrapped in evaluator classes, benchmarked, and documented. The whole pipeline is linear — follow these steps in order.

## Step 0: Gather Requirements

Before writing anything, clarify with the user:

1. **What does this evaluator measure?** Get a one-sentence description of the quality dimension.
2. **What input data is available?** This determines the template placeholders (e.g., `{{input}}`, `{{output}}`, `{{reference}}`, `{{tool_definitions}}`). If the user is vague, ask follow-up questions — the placeholders are the contract between the evaluator and the caller.
3. **What labels make sense?** Binary is most common (e.g., correct/incorrect, faithful/unfaithful), but some metrics use more. Labels map to scores.
4. **Should this appear in the dataset experiments UI?** If yes, it needs the `promoted_dataset_evaluator` label. Currently only correctness, tool_selection, and tool_invocation have this; most new evaluators don't need it.
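
To make the "placeholders are the contract" point concrete, here is a minimal sketch that derives the required input fields from a template and reports what a caller's record is missing. The template text, the record, and both helper names are hypothetical; nothing here is part of the Phoenix codebase.

```python
import re

# Matches Mustache-style variables such as {{input}} or {{ tool_definitions }}.
PLACEHOLDER_PATTERN = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def required_fields(template: str) -> set[str]:
    """Return the placeholder names a caller must supply for this template."""
    return set(PLACEHOLDER_PATTERN.findall(template))


def missing_fields(template: str, record: dict[str, str]) -> set[str]:
    """Return the placeholders that the caller's record does not provide."""
    return required_fields(template) - set(record)


# Hypothetical template and record, for illustration only.
template = "Question: {{input}}\nAnswer: {{output}}\nReference: {{reference}}"
record = {"input": "What is 2 + 2?", "output": "4"}

print(sorted(missing_fields(template, record)))  # ['reference']
```

Running a check like this against the data the user actually has is a quick way to find out whether a placeholder such as `{{reference}}` is realistic before it goes into the prompt.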

## Step 1: Create the YAML Config

Create `prompts/classification_evaluator_configs/{NAME}_CLASSIFICATION_EVALUATOR_CONFIG.yaml`.

Read an existing config to match the current schema. Start with `CORRECTNESS_CLASSIFICATION_EVALUATOR_CONFIG.yaml` for a simple example, or `TOOL_SELECTION_CLASSIFICATION_EVALUATOR_CONFIG.yaml` if your evaluator needs structured span data.

### Key Decision Points

**`choices`** — Maps label strings to numeric scores. For binary evaluators, use positive/negative labels (e.g., `correct: 1.0` / `incorrect: 0.0`). The labels you pick here flow through to the Python class, TS factory, and benchmarks.

**`optimization_direction`** — Use `maximize` when the positive label is the desired outcome (most evaluators). Use `minimize` only if the metric measures something undesirable (e.g., hallucination). This affects how Phoenix displays the metric in the UI.

**`labels`** — Optional list. Add `promoted_dataset_evaluator` only if this evaluator should appear in the dataset experiments UI sidebar.

**`substitutions`** — Only needed if the evaluator is a `promoted_dataset_evaluator` and works with structured span data (tool definitions, tool calls, message arrays). These reference formatter snippets defined in `prompts/formatters/server.yaml`. Read that file if you need substitutions — it defines what structured data formats are available. Most evaluators that only use simple text fields (input, output, reference) don't need substitutions.
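
The decision points above can be summarized as a few sanity checks. The sketch below mirrors the four fields in a plain Python dictionary; this layout is an assumption for illustration, and the real YAML schema may name or nest things differently, so an existing config remains the reference. The `check_config` helper is hypothetical.

```python
# Hypothetical in-memory mirror of the fields discussed above.
# The real YAML schema may differ; read an existing config to confirm.
EXAMPLE_CONFIG = {
    "choices": {"correct": 1.0, "incorrect": 0.0},
    "optimization_direction": "maximize",
    "labels": [],  # add "promoted_dataset_evaluator" only for the experiments UI
}

PROMOTED_LABEL = "promoted_dataset_evaluator"


def check_config(config: dict) -> list[str]:
    """Return human-readable problems; an empty list means these checks pass."""
    problems = []
    choices = config.get("choices", {})
    if len(choices) < 2:
        problems.append("choices should map at least two labels to scores")
    if not all(isinstance(score, (int, float)) for score in choices.values()):
        problems.append("every choice needs a numeric score")
    if config.get("optimization_direction") not in ("maximize", "minimize"):
        problems.append("optimization_direction must be maximize or minimize")
    # Per the text above, substitutions only apply to promoted evaluators.
    if config.get("substitutions") and PROMOTED_LABEL not in config.get("labels", []):
        problems.append("substitutions are only needed for promoted dataset evaluators")
    return problems


print(check_config(EXAMPLE_CONFIG))  # []
```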

### Prompt Writing Tips

- Be explicit about what makes each label correct — the LLM judge needs a clear rubric
- Separate concerns: if evaluating X, explicitly state you're NOT evaluating Y
- Wrap inputs in XML-style tags (e.g., `<context>`, `<output>`) for clear data formatting
- Tell the judge to reason before deciding — this improves accuracy
- Use `{{placeholder}}` (Mustache syntax) for template variables
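
As an illustration of these tips, the sketch below shows a prompt for a made-up "conciseness" metric: it states the rubric per label, says what is not being evaluated, wraps inputs in XML-style tags, and asks for reasoning first. The metric, its labels, and the `render` helper are all hypothetical; Phoenix uses its own template rendering, and this stand-in only handles the exact `{{name}}` form.

```python
# Hypothetical prompt for an imaginary "conciseness" evaluator.
PROMPT_TEMPLATE = """\
You are evaluating whether a response is concise.
You are NOT evaluating whether the response is factually correct.

<input>
{{input}}
</input>

<output>
{{output}}
</output>

Label the response "concise" if it answers the question without filler.
Label it "verbose" if it repeats itself or adds unrelated material.

First explain your reasoning, then give your final label.
"""


def render(template: str, variables: dict[str, str]) -> str:
    """Stand-in renderer: replace each {{name}} with its value."""
    rendered = template
    for name, value in variables.items():
        rendered = rendered.replace("{{" + name + "}}", value)
    return rendered


prompt = render(PROMPT_TEMPLATE, {"input": "What is 2 + 2?", "output": "4"})
print(prompt)
```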

## Step 2: Compile Prompts

```bash
make codegen-prompts
```

This generates code in three places:

- `packages/phoenix-evals/src/phoenix/evals/__generated__/classification_evaluator_configs/` (Python)
- `src/phoenix/__generated__/classification_evaluator_configs/` (Python, server copy)
- `js/packages/phoenix-evals/src/__generated__/default_templates/` (TypeScript)

Verify the generated files look correct before moving on.
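
One way to spot-check the output is to list the files in each of the three directories whose names mention the new evaluator. This is a rough sketch: the directory paths come from the list above, but matching on a name substring is an assumption about how generated files are named, and `find_generated` is a hypothetical helper.

```python
from pathlib import Path

# The three output locations listed above, relative to the repository root.
GENERATED_DIRS = [
    "packages/phoenix-evals/src/phoenix/evals/__generated__/classification_evaluator_configs",
    "src/phoenix/__generated__/classification_evaluator_configs",
    "js/packages/phoenix-evals/src/__generated__/default_templates",
]


def find_generated(repo_root: Path, name: str) -> dict[str, list[str]]:
    """Map each generated directory to the file names that mention `name`.

    Assumes generated file names contain the evaluator name (case-insensitive).
    A missing directory yields an empty list rather than an error.
    """
    found = {}
    for relative_dir in GENERATED_DIRS:
        directory = repo_root / relative_dir
        if directory.is_dir():
            found[relative_dir] = sorted(
                path.name
                for path in directory.iterdir()
                if name.lower() in path.name.lower()
            )
        else:
            found[relative_dir] = []
    return found


# Run from the repository root; "conciseness" is a placeholder evaluator name.
for directory, files in find_generated(Path("."), "conciseness").items():
    print(directory, files)
```

An empty list for any directory suggests the codegen step did not pick up the new YAML file.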

## Step 3: Create the Python Evaluator

Create `packages/phoenix-evals/src/phoenix/evals/metrics/{name}.py`.

**Read `correctness.py` in that directory** — it's the canonical example. Your evaluator follows the same pattern: subclass `ClassificationEvaluator`, pull constants from the generated config, define a Pydantic input schema with fields matching your template placeholders.

After creating the file, **add it to the exports** in `metrics/__init__.py` — both the import and the `__all__` list. Read the current `__init__.py` to see the existing pattern.
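
The shape of that pattern can be sketched without the library. Everything below is a stand-in: `StandInClassificationEvaluator` is not the real `ClassificationEvaluator`, `GeneratedConfig` is not the real generated config, and a dataclass takes the place of the Pydantic input schema. The constructor arguments are assumptions, so copy the real signature from `correctness.py` rather than from this sketch.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class GeneratedConfig:
    """Stand-in for a generated config object produced by codegen."""
    name: str
    template: str
    choices: dict
    direction: str


class StandInClassificationEvaluator:
    """Stand-in for the library base class, reduced to what this sketch needs."""

    def __init__(self, name, prompt_template, choices, direction, input_schema):
        self.name = name
        self.prompt_template = prompt_template
        self.choices = choices
        self.direction = direction
        self.input_schema = input_schema


# Hypothetical generated config for an imaginary "conciseness" evaluator.
CONCISENESS_CONFIG = GeneratedConfig(
    name="conciseness",
    template="<input>{{input}}</input>\n<output>{{output}}</output>",
    choices={"concise": 1.0, "verbose": 0.0},
    direction="maximize",
)


@dataclass
class ConcisenessInput:
    """Input schema: one field per template placeholder (Pydantic in real code)."""
    input: str
    output: str


class ConcisenessEvaluator(StandInClassificationEvaluator):
    # Constants are pulled from the generated config, not hard-coded here.
    NAME = CONCISENESS_CONFIG.name
    PROMPT = CONCISENESS_CONFIG.template
    CHOICES = CONCISENESS_CONFIG.choices
    DIRECTION = CONCISENESS_CONFIG.direction

    def __init__(self):
        super().__init__(
            name=self.NAME,
            prompt_template=self.PROMPT,
            choices=self.CHOICES,
            direction=self.DIRECTION,
            input_schema=ConcisenessInput,
        )


evaluator = ConcisenessEvaluator()
print(evaluator.name, [f.name for f in fields(evaluator.input_schema)])
```

The point of the pattern is that the YAML stays the single source of truth: the class only wires generated constants into the base class and declares which fields the caller must provide.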

## Step 4: Create the TypeScript Evaluator

Create `js/packages/phoenix-evals/src/llm/create{Name}Evaluator.ts`.

**Read `createCorrectnessEvaluator.ts`** — it's the canonical example. The pattern is a factory function that wraps `createClassificationEvaluator` with defaults from the generated config.

Then:

1. **Add the export** to `js/packages/phoenix-evals/src/llm/index.ts`
2. **Add a vitest test** — read `createFaithfulnessEvaluator.test.ts` for the test pattern

## Step 5: Build JS

```bash
cd js && pnpm build
```

Fix any TypeScript errors before proceeding.

## Step 6: Write the Benchmark

Create `js/benchmarks/evals-benchmarks/src/{name}_benchmark.ts`.

Read existing benchmarks in that directory to match the current patterns:

- `tool_invocation_benchmark.ts` — confusion matrix printing, multi-category analysis

### Benchmark Requirements

- **30-50 synthetic examples** organized by category
- **2-4 examples per category** covering: success cases, failure modes, and edge cases
- **Accuracy evaluator** that compares predicted vs expected labels
- **Failed examples printer** — this is critical for debugging. For each misclassified example, print: category, input, output (truncated), expected vs actual label, and the LLM judge's explanation
- **Per-category accuracy** breakdown in the output
- For binary evaluators, a **confusion matrix** is helpful
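
The benchmark itself is a TypeScript file, so the following Python sketch only illustrates the bookkeeping behind overall accuracy, the per-category breakdown, and a binary confusion matrix. The record fields (`category`, `expected`, `predicted`) and the sample data are hypothetical.

```python
from collections import Counter


def accuracy(results: list[dict]) -> float:
    """Fraction of examples where the predicted label equals the expected one."""
    return sum(r["expected"] == r["predicted"] for r in results) / len(results)


def per_category_accuracy(results: list[dict]) -> dict[str, float]:
    """Accuracy computed separately for each category."""
    totals, hits = Counter(), Counter()
    for r in results:
        totals[r["category"]] += 1
        hits[r["category"]] += r["expected"] == r["predicted"]
    return {category: hits[category] / totals[category] for category in totals}


def confusion_matrix(results: list[dict], positive: str) -> dict[str, int]:
    """Count TP/FP/FN/TN for a binary evaluator, given the positive label."""
    counts = Counter()
    for r in results:
        actual_positive = r["expected"] == positive
        predicted_positive = r["predicted"] == positive
        prefix = "T" if actual_positive == predicted_positive else "F"
        suffix = "P" if predicted_positive else "N"
        counts[prefix + suffix] += 1
    return {key: counts[key] for key in ("TP", "FP", "FN", "TN")}


# Hypothetical results, for illustration only.
results = [
    {"category": "success", "expected": "correct", "predicted": "correct"},
    {"category": "success", "expected": "correct", "predicted": "correct"},
    {"category": "failure", "expected": "incorrect", "predicted": "incorrect"},
    {"category": "failure", "expected": "incorrect", "predicted": "correct"},
]

print(accuracy(results))                     # 0.75
print(per_category_accuracy(results))        # {'success': 1.0, 'failure': 0.5}
print(confusion_matrix(results, "correct"))  # {'TP': 2, 'FP': 1, 'FN': 0, 'TN': 1}
```

The per-category view matters because a high overall accuracy can hide one category the judge consistently gets wrong, as the `failure` category does in this toy data.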

The task function must return `input` and `output` text in its result so the failed examples printer has access to them.
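
The reason for that requirement is visible in a sketch of the failed examples printer: it can only show `input` and `output` if each result carries them. The Python below illustrates the logic only, since the real benchmark is TypeScript; the field names, the 80-character truncation limit, and the sample rows are assumptions.

```python
def format_failures(results: list[dict], max_chars: int = 80) -> list[str]:
    """Build one printable report per misclassified example."""
    reports = []
    for r in results:
        if r["expected"] == r["predicted"]:
            continue  # only misclassified examples are reported
        output = r["output"]
        if len(output) > max_chars:
            output = output[:max_chars] + "..."  # truncate long outputs
        reports.append(
            f"[{r['category']}] expected={r['expected']} actual={r['predicted']}\n"
            f"  input: {r['input']}\n"
            f"  output: {output}\n"
            f"  explanation: {r['explanation']}"
        )
    return reports


# Hypothetical rows; `input` and `output` come from the task function's result.
rows = [
    {
        "category": "edge",
        "input": "What is 2 + 2?",
        "output": "5",
        "expected": "incorrect",
        "predicted": "correct",
        "explanation": "The answer appears to match the question.",
    },
    {
        "category": "success",
        "input": "What is 1 + 1?",
        "output": "2",
        "expected": "correct",
        "predicted": "correct",
        "explanation": "The answer is right.",
    },
]

for report in format_failures(rows):
    print(report)
```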

Consider using a **separate agent session** for synthetic dataset generation if the examples need realistic domain-specific content — this keeps the dataset creation focused and avoids context-switching.

## Step 7: Run the Benchmark

```bash
# Terminal 1: Start Phoenix
PHOENIX_WORKING_DIR=/tmp/phoenix-test phoenix serve

# Terminal 2: Run the benchmark
cd js/benchmarks/evals-benchmarks
pnpm tsx src/{name}_benchmark.ts
```

Target **>80% accuracy**. If accuracy is low, look at the failed exam
