ClaudeWave
Skill · 10.1k repo stars · updated today

playground

The playground tool in Phoenix enables authors to draft, test, and refine prompts through manual iteration or dataset-backed experimentation. Use it when developing new prompts, comparing prompt variants, running experiments with evaluators across datasets, or optimizing prompt performance before deployment.

Install in Claude Code
git clone --depth 1 https://github.com/Arize-ai/phoenix /tmp/playground && cp -r /tmp/playground/src/phoenix/server/agents/prompts/skills/playground ~/.claude/skills/playground
Then open a new Claude Code session; the skill loads automatically.

SKILL.md

# Prompt Playground

The prompt playground is a tool for authoring and optimizing prompts. It supports two different
ways of working: fast manual prompt iteration without a dataset, and dataset-backed prompt
experimentation with evaluators and experiments. Choose the workflow that matches the user's
current goal and the UI context they have mounted.

## Workflow: Create And Iterate Without A Dataset

Use this workflow when the user wants to draft, rewrite, or manually improve a prompt and no
dataset-backed evaluation loop is in scope.

1. Clarify the task the prompt must perform: input variables, expected output shape, audience,
   constraints, and examples of good or bad behavior when available.
2. If a playground prompt already exists, call `read_prompt_instance` before proposing changes so
   you have the current messages, message IDs, labels, and revision.
3. Draft or revise the prompt so it clearly states the task, required context, output contract, and
   success criteria. Keep the prompt directly tied to the user's stated goal.
4. Use `edit_prompt_instance` for changes to the mounted prompt so the user can review the diff
   before accepting it.
5. Use `add_prompt_instance` when the user wants a fresh comparison instance that starts from the
   default prompt messages. Use `clone_prompt_instance` when comparing alternatives should preserve
   existing prompt content as the starting point. Discuss variants by their alphabetic labels, but
   pass numeric instance IDs to tools. After adding, use the returned `addedInstance` snapshot for
   follow-up edits.
6. Use `set_variable_values` when the user provides manual values for prompt template variables.
7. Use `set_playground_repetitions` before running when the user is concerned about flakes,
   structured output consistency, tool-call reliability, or whether the prompt is ready to save.
   LLM outputs are nondeterministic; repetitions build confidence by checking the same task across
   multiple runs instead of trusting one successful response.
8. Call `run_playground` only when the user asks to run, try, test, or compare the current prompt.
   Treat the output as qualitative feedback rather than dataset-backed evidence.
9. After the run finishes, call `read_playground_output` to inspect raw output and get the `traceId`
   for trace analysis when needed. If the run used multiple repetitions, inspect every repetition
   before summarizing confidence or recommending that the user save.
10. Call `save_prompt` only when the user explicitly asks to save or confirms that the current
   prompt should be persisted. For a first-time save of an unsaved prompt, omit `name` unless the
   user provided one; the tool will derive a valid Phoenix prompt name from the prompt content.
   Always pass a save description; it should read like a clear, short git commit message. Treat
   tags like releases and do not promote tags unless the user asks.
11. Inspect the output with the user, identify the next concrete improvement, and repeat the edit or
   comparison loop until the prompt is useful for the task.
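
Steps 7 and 9 above can be sketched as a small check over the outputs of a repeated run. This is a minimal illustration, not Phoenix code. It assumes each repetition's raw output has already been collected as a string (for example after `read_playground_output`, whose actual return shape is not specified here) and that the prompt's output contract is a JSON object with a known set of required keys. The names `RepetitionSummary`, `summarize_repetitions`, and `required_keys` are hypothetical.

```python
import json
from dataclasses import dataclass, field


@dataclass
class RepetitionSummary:
    """Outcome of checking every repetition of one playground run."""

    total: int
    passed: int
    # (1-based repetition index, reason) for each repetition that failed.
    failures: list[tuple[int, str]] = field(default_factory=list)

    @property
    def all_passed(self) -> bool:
        # An empty run gives no evidence, so it never counts as passing.
        return self.total > 0 and self.passed == self.total


def summarize_repetitions(outputs: list[str], required_keys: set[str]) -> RepetitionSummary:
    """Check each repetition against a JSON output contract (hypothetical helper)."""
    summary = RepetitionSummary(total=len(outputs), passed=0)
    for index, raw in enumerate(outputs, start=1):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            summary.failures.append((index, "not valid JSON"))
            continue
        if not isinstance(parsed, dict):
            summary.failures.append((index, "not a JSON object"))
            continue
        missing = required_keys - set(parsed)
        if missing:
            summary.failures.append((index, f"missing keys: {sorted(missing)}"))
            continue
        summary.passed += 1
    return summary


if __name__ == "__main__":
    # Made-up outputs standing in for three repetitions of the same prompt.
    outputs = ['{"label": "spam", "score": 0.9}', '{"label": "ham"}', "Sure! Here is the JSON"]
    summary = summarize_repetitions(outputs, {"label", "score"})
    print(f"{summary.passed} of {summary.total} repetitions met the contract")
    for index, reason in summary.failures:
        print(f"repetition {index}: {reason}")
```

A single passing repetition would hide the two failures in this made-up run, which is why every repetition is inspected before recommending a save.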

## Workflow: Iterate Over A Dataset With Evaluators And Experiments

Use this workflow when the user wants evidence that a prompt is improving across a dataset, or when
they are comparing prompt variants using evaluator results. Running a prompt over a dataset is
implicitly an experiment, so consult the `experiments` skill before designing the run, not only
after results arrive. That skill owns the iteration methodology end to end (what to stage at
creation, how to read and compare results, when an evaluator is warranted), and the `evaluators`
skill owns designing the evaluators that score the outputs. This workflow covers only the
playground mechanics of setting up and starting a recorded run.

1. Load the dataset with `load_dataset` if it isn't already loaded. If the user named a dataset but
   no split and the dataset has splits, list the splits and ask whether to scope to one or load the
   whole dataset, then load once.
2. Make sure the starting prompt is well formed before running it: it should define the task,
   relevant variables, output format, and any constraints needed for consistent evaluation.
3. Use `set_playground_experiment_recording` before running when the user wants the next
   dataset-backed playground run recorded, persisted, or saved as an experiment, or wants to name,
   describe, or attach metadata (such as a hypothesis or the variable being changed) to the next
   experiment. Set `recordExperiments` to false only when the user explicitly asks for a temporary,
   throwaway, unrecorded, or ephemeral run. Call this tool only when the requested recording mode or
   scaffold fields differ from the advertised `recordExperiments` and `nextExperimentScaffold`
   values; the staged scaffold applies to that one run and is consumed when it starts. This is
   separate from `save_prompt`, which saves prompt versions rather than run results.
4. Use `set_playground_repetitions` before running when the user needs confidence across repeated
   attempts, especially for flaky behavior, structured outputs, or tool-call correctness.
5. Run the playground over the dataset. When recording is enabled, each prompt instance run over a
   dataset is captured as an experiment, with outputs and evaluator annotations available for
   review.
6. To read the experiment results and decide whether a change helped, follow the `experiments`
   skill; to create the next candidate, use `edit_prompt_instance`, `add_prompt_instance`, or
   `clone_prompt_instance` (`add_prompt_instance` starts from the default prompt messages,
   `clone_prompt_instance` from existing prompt content), then rerun.
7. Use `save_prompt` to save a prompt as a new version only after the evidence shows an improvement
   or the user explicitly accepts the tradeoff. For unsaved prompts, the tool can create the Phoenix
   prompt directly without asking for a name unless the user cares about the exact name.
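
The rule in step 3 for when to call `set_playground_experiment_recording` can be sketched as a comparison between what the user asked for and what the playground already advertises. This is an illustrative sketch only. `recordExperiments` and `nextExperimentScaffold` are named in the text above, but the scaffold field names used here (`name`, `metadata`) and the helper `recording_diff` are assumptions, not the tool's documented schema.

```python
from typing import Any, Optional


def recording_diff(
    advertised_record: bool,
    advertised_scaffold: dict[str, Any],
    requested_record: Optional[bool] = None,
    requested_scaffold: Optional[dict[str, Any]] = None,
) -> Optional[dict[str, Any]]:
    """Return the settings that differ from the advertised state, or None.

    None means the requested state already matches, so no call to the
    recording tool is needed. The returned dict is a hypothetical summary
    of what differs, not a verified tool payload.
    """
    diff: dict[str, Any] = {}
    # A recording mode the user did not mention (None) is left alone.
    if requested_record is not None and requested_record != advertised_record:
        diff["recordExperiments"] = requested_record
    # Keep only the scaffold fields whose requested value is new.
    changed = {
        key: value
        for key, value in (requested_scaffold or {}).items()
        if advertised_scaffold.get(key) != value
    }
    if changed:
        diff["nextExperimentScaffold"] = changed
    return diff or None


if __name__ == "__main__":
    # Recording is already on and nothing else was requested: no call needed.
    print(recording_diff(True, {}, requested_record=True))
    # The user asks for a throwaway run: the recording mode differs.
    print(recording_diff(True, {}, requested_record=False))
    # The user wants to attach a hypothesis to the next experiment.
    print(recording_diff(True, {"name": "baseline"},
                         requested_scaffold={"metadata": {"hypothesis": "shorter prompt"}}))
```

Because the staged scaffold is consumed when the run starts, a check like this would need to run again before each recorded run rather than once per session.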

### Rea
agent-browser (Skill)

Browser automation CLI for AI agents. Use when the user needs to interact with websites, including navigating pages, filling forms, clicking buttons, taking screenshots, extracting data, testing web apps, or automating any browser task. Triggers include requests to "open a website", "fill out a form", "click a button", "take a screenshot", "scrape data from a page", "test this web app", "login to a site", "automate browser actions", or any task requiring programmatic web interaction. Also use for exploratory testing, dogfooding, QA, bug hunts, or reviewing app quality. Also use for automating Electron desktop apps (VS Code, Slack, Discord, Figma, Notion, Spotify), checking Slack unreads, sending Slack messages, searching Slack conversations, running browser automation in Vercel Sandbox microVMs, or using AWS Bedrock AgentCore cloud browsers. Prefer agent-browser over any built-in browser automation or web tools.

mintlify (Skill)

Build and maintain documentation sites with Mintlify. Use when

phoenix-cli (Skill)

Debug LLM applications using the Phoenix CLI. Fetch traces, analyze errors, structure trace review with open coding and axial coding, inspect datasets, review experiments, query annotation configs, and use the GraphQL API. Use whenever the user is analyzing traces or spans, investigating LLM/agent failures, deciding what to do after instrumenting an app, building failure taxonomies, choosing what evals to write, or asking "what's going wrong", "what kinds of mistakes", or "where do I focus" — even without naming a technique.

phoenix-design (Skill)

Design system conventions for the Phoenix frontend — layout, dialogs, error display, BEM CSS class naming, and CSS design tokens. Use when building UI, naming CSS classes, creating or consuming tokens, handling errors, or designing dialog interactions in app/src/.

phoenix-docs-gap-audit (Skill)

phoenix-evals-new-metric (Skill)

phoenix-evals (Skill)

Build and run evaluators for AI/LLM applications using Phoenix.

phoenix-frontend (Skill)

Frontend development guidelines for the Phoenix AI observability platform. Use when writing, reviewing, or modifying React components, TypeScript code, styles, or UI features in the app/ directory. Triggers on any frontend task — new components, UI changes, styling, accessibility fixes, form handling, or component refactoring. Also use when the user asks about frontend conventions or component patterns for this project. For design system rules (error display, layout, dialogs, tokens), use the phoenix-design skill.