ClaudeWave
Skill · 10.1k repo stars · updated today

annotate-spans

The annotate-spans skill guides the creation of durable, structured feedback attached to spans and traces in the Phoenix observability framework. Use this skill when you need to attach judgments to specific spans or traces with a dimension name, optional label or score, and explanation that will persist for later filtering, aggregation, auditing, and curation of observability data. Follow the principle of grounding annotations in observed behavior, maintaining one dimension per annotation, targeting the most specific responsible span, and annotating root causes rather than downstream symptoms.

Install in Claude Code
git clone --depth 1 https://github.com/Arize-ai/phoenix /tmp/annotate-spans && cp -r /tmp/annotate-spans/src/phoenix/server/agents/prompts/skills/annotate-spans ~/.claude/skills/annotate-spans
Then start a new Claude Code session; the skill loads automatically.

SKILL.md

# Annotating Spans and Traces

An annotation is durable, structured feedback attached to a span or trace: a `name` (the dimension being judged), an optional `label` and/or `score` (the outcome), and an `explanation` (why). Annotations are not throwaway commentary — they accumulate into a dataset the user filters, aggregates, and iterates against.

A good annotation earns its place by being useful *later*:

- **Filterable** — `annotations['answer_relevance'].label == 'fail'` returns the spans you meant.
- **Aggregatable** — counting labels across spans yields a failure rate that tells the user where to focus.
- **Auditable** — months later, the explanation still justifies the judgment without rerunning anything.
- **Curatable** — failing spans can be pulled into a dataset to drive evals or fixes.
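
As a rough illustration of the "filterable" and "aggregatable" properties, the sketch below filters and counts annotations held in plain Python dictionaries. The record layout, span IDs, and helper names (`filter_spans`, `label_rate`) are assumptions made for this example; they are not the Phoenix storage format or API.

```python
from collections import Counter

# Hypothetical in-memory records. Field names mirror the annotation fields
# described above, but this is not how Phoenix stores annotations.
annotations = [
    {"span_id": "s1", "name": "answer_relevance", "label": "fail",
     "explanation": "Answered a billing question with shipping details."},
    {"span_id": "s2", "name": "answer_relevance", "label": "pass",
     "explanation": "Directly addressed the cancellation request."},
    {"span_id": "s3", "name": "answer_relevance", "label": "fail",
     "explanation": "Ignored the user's follow-up about refunds."},
    {"span_id": "s4", "name": "tool_selection", "label": "correct",
     "explanation": "Called the order lookup tool for an order query."},
]


def filter_spans(records, name, label):
    """Return span IDs whose annotation on `name` carries `label`."""
    return [r["span_id"] for r in records
            if r["name"] == name and r["label"] == label]


def label_rate(records, name, label):
    """Fraction of annotations on `name` that carry `label` (0.0 if none)."""
    counts = Counter(r["label"] for r in records if r["name"] == name)
    total = sum(counts.values())
    return counts[label] / total if total else 0.0


failing = filter_spans(annotations, "answer_relevance", "fail")
rate = label_rate(annotations, "answer_relevance", "fail")
print(failing, f"{rate:.0%}")
```

Both helpers only work because every record uses the same `name` and the same label vocabulary; a stray `answer_relevance_v2` or `failed` would silently drop out of the count.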

This skill governs the *judgment* behind annotations. The `batch_span_annotate` tool description governs the *mechanics* (one array, ID requirements, update keying); follow both, and never contradict the tool's naming and identifier rules.

## What Makes an Annotation Useful

1. **Grounded in observed behavior, not generic quality vibes.** Annotate what actually went wrong or right in *this* span. "Cited a refund policy that does not exist in the retrieved context" beats a free-floating `hallucination_score: 0.3`. Generic dimensions like `helpfulness` or `coherence` are rarely grounded in the application's real failure modes — prefer names that point at a concrete behavior.

2. **One dimension per annotation.** `name` is the rubric dimension; the outcome lives in `label`/`score`. Use `name: "tool_selection"`, `label: "incorrect"` — not `name: "wrong_tool"`. If you find yourself judging two things at once (e.g., retrieval relevance *and* answer faithfulness), write two annotations.

3. **Target the most specific responsible span.** Annotate the LLM span for model output, the tool span for tool behavior, the retriever span for retrieval quality. Reserve root agent/chain spans for genuinely end-to-end judgments (task success, trajectory). A faithfulness failure pinned to the right LLM span is actionable; the same label on the root span forces the user to hunt.

4. **Judge the first failure, not every downstream symptom.** Errors cascade — bad retrieval produces a bad answer. Annotate the root cause where it occurred. Add a separate annotation downstream only when it reveals an independent problem, not a consequence of the first.

5. **Prefer crisp labels over fuzzy scores.** A binary or small categorical label (`pass`/`fail`, `relevant`/`irrelevant`, `correct`/`partial`/`incorrect`) is easy to apply consistently and easy to aggregate. Use a numeric `score` only when the scale is genuinely meaningful and defined; put the rubric, scale, or threshold in `metadata` so the number is interpretable later.

6. **Explanations are specific observations, not restatements.** Write what you saw, citing the evidence. Good: "Returned chunks about onboarding; user asked about cancellation — no relevant chunk retrieved." Weak: "The retrieval was bad." Always include an explanation for any score, any failure, any unclear label, or any judgment the user might want to revisit.

7. **Be consistent across spans.** The same dimension must use the same `name` and the same label vocabulary everywhere, or filtering and rate computation break. Decide the vocabulary once, then apply it uniformly. Keep names stable across runs (no `_v2`/`_new` suffixes).

8. **Set `annotatorKind` honestly.** `LLM` for your own judgment, `HUMAN` only when recording feedback the user explicitly gave, `CODE` for deterministic checks. Don't record your own opinion as `HUMAN`.
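
Several of these principles can be checked mechanically before anything is saved. The following is a minimal sketch, assuming a hypothetical `Annotation` dataclass and hand-written vocabularies. It is not the `batch_span_annotate` payload schema, and the metadata keys (`rubric`, `scale`, `threshold`) and the `FAILURE_LABELS` set are illustrative choices. It requires Python 3.9 or later.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical vocabularies, decided once per dimension (principle 7).
VOCABULARY = {
    "tool_selection": {"correct", "incorrect"},
    "retrieval_relevance": {"relevant", "irrelevant"},
}
ANNOTATOR_KINDS = {"LLM", "HUMAN", "CODE"}           # principle 8
FAILURE_LABELS = {"incorrect", "irrelevant", "fail"}  # assumed failure labels
RUBRIC_KEYS = {"rubric", "scale", "threshold"}        # assumed metadata keys


@dataclass
class Annotation:
    span_id: str
    name: str
    annotator_kind: str
    label: Optional[str] = None
    score: Optional[float] = None
    explanation: str = ""
    metadata: dict = field(default_factory=dict)


def check(a: Annotation) -> list[str]:
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    if a.name not in VOCABULARY:
        problems.append(f"unknown dimension: {a.name}")
    elif a.label is not None and a.label not in VOCABULARY[a.name]:
        problems.append(f"label {a.label!r} not in vocabulary for {a.name}")
    if a.annotator_kind not in ANNOTATOR_KINDS:
        problems.append(f"unknown annotator kind: {a.annotator_kind}")
    # Principle 6: scores and failures always carry an explanation.
    needs_explanation = a.score is not None or a.label in FAILURE_LABELS
    if needs_explanation and not a.explanation.strip():
        problems.append("explanation required for scores and failures")
    # Principle 5: a score is only interpretable with its rubric recorded.
    if a.score is not None and not (a.metadata.keys() & RUBRIC_KEYS):
        problems.append("score needs a rubric, scale, or threshold in metadata")
    return problems


good = Annotation(
    span_id="span-1", name="tool_selection", annotator_kind="LLM",
    label="incorrect",
    explanation="Called the weather tool for a refund question.",
)
bad = Annotation(span_id="span-2", name="wrong_tool",
                 annotator_kind="LLM", label="yes")
print(check(good))
print(check(bad))
```

The `bad` example bakes the outcome into the name (`wrong_tool`), so it fails the dimension check. A check like this cannot tell whether a judgment is grounded or pinned to the right span; those principles still need a reader.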

## Mode A: Coaching the User

When the user asks *how* to annotate (rather than asking you to do it), teach the process rather than handing over a fixed rubric:

1. **Start from failures they have actually seen.** Resist proposing a polished taxonomy up front — a pre-baked list causes confirmation bias. Ask what's going wrong, or use the `debug-trace` skill to surface real failure modes first.
2. **Open-code before naming.** Encourage free-form notes on a handful of traces ("what's the first thing wrong here?") before committing to category names.
3. **Axial-code into a small vocabulary.** Group similar notes into 5–10 named, mutually distinct, actionable categories. Each should be specific enough that two reviewers would label the same span the same way.
4. **Define the label set per category.** Usually binary. Write a one-sentence definition so the boundary is unambiguous.
5. **Apply, then aggregate.** Label a representative sample consistently, then compute failure rates to prioritize. Highest-frequency × highest-impact wins.
6. **Fix-first.** Remind the user that many failures (missing prompt instruction, missing tool, retrieval bug) are better *fixed* than measured. Reserve standing annotations/evals for failures they will iterate on repeatedly or that need a guardrail.

Explain *why* a convention matters ("stable names let you trend this over time") rather than only stating it.
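
To make the "apply, then aggregate" step concrete, here is a small sketch that ranks failure categories by frequency times impact. The category names, the sample, and the impact weights are all invented for illustration; in practice they would come from the user's own open and axial coding.

```python
from collections import Counter

# Hypothetical labeled sample: the first failure category seen in each
# reviewed trace, or None when the trace had no failure.
sample = ["missing_citation", None, "wrong_tool", "missing_citation", None,
          "stale_retrieval", "missing_citation", None, "wrong_tool", None]

# Hypothetical impact weights (1 = cosmetic, 3 = blocks the user's task).
impact = {"missing_citation": 1, "wrong_tool": 3, "stale_retrieval": 2}


def prioritize(sample, impact):
    """Rank categories by failure rate times impact, highest first."""
    counts = Counter(c for c in sample if c is not None)
    total = len(sample)
    ranked = [(cat, n / total, (n / total) * impact[cat])
              for cat, n in counts.items()]
    return sorted(ranked, key=lambda row: row[2], reverse=True)


for category, rate, priority in prioritize(sample, impact):
    print(f"{category:18s} rate={rate:.0%} priority={priority:.2f}")
```

In this made-up sample `missing_citation` is the most frequent category, yet `wrong_tool` ranks first once impact is weighed in, which is the point of multiplying the two rather than sorting by count alone.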

## Mode B: Writing Annotations Yourself

When the user asks you to save annotations:

1. **Confirm scope and intent.** Only annotate when the user wants feedback persisted, not during ordinary analysis. Know which spans and which dimension(s) are in scope. If no failure modes are established yet, diagnose first with the `debug-trace` skill, then return here to persist the results — don't re-derive a taxonomy.
2. **Inspect before judging.** Read the actual span input/output (and relevant parent/child spans for context). Never annotate from span status codes alone — a success status can mask an error in attributes, and an exception can be expected behavior.
3. **Pick the dimension(s) and a fixed label vocabulary** before writing, so every span in the batch is judged on the same scale. If you're judging more than one dimension, decide each one's vocabulary up front.
4. **Annotate the right span f