Passmark: Regression Testing for AI Behavior with Playwright

Any team that has deployed a language model to production knows the real challenge isn't getting it to work on day one; it's knowing when it stops working well. A prompt change, an update to the underlying model, or a seemingly minor tweak to the context can degrade output without triggering any alarms in traditional monitoring systems. There's no stack trace, just responses that are somehow worse.

Passmark is an open-source library that tackles exactly that problem. It's built on top of Playwright, Microsoft's well-known browser automation framework, and appeared on Hacker News as a proposal for running regression tests on systems with integrated AI.

What it actually does

The core idea behind Passmark is to treat AI model outputs like UI elements: observable, comparable across versions, and capable of failing a test if they deviate from expectations. Using Playwright as the execution layer, Passmark lets you define scenarios where a user action triggers an AI-generated response, then evaluate whether that response meets specific criteria.

The approach is pragmatic: instead of trying to evaluate semantics with fuzzy metrics, it proposes writing tests that can run in a CI/CD pipeline just like any end-to-end test suite. If behavior changes enough to fail the defined criteria, the test fails. Simple.
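To make the idea of explicit pass/fail criteria concrete, here is a minimal sketch of a deterministic check on a generated response. It is not Passmark's API, which this article does not show; the `ResponseCriteria` shape and the `checkResponse` function are hypothetical names chosen for illustration.

```typescript
// Hypothetical criteria shape for illustration; not taken from Passmark.
interface ResponseCriteria {
  mustInclude?: string[];    // phrases that must appear (case-insensitive)
  mustNotInclude?: string[]; // phrases that must never appear (case-insensitive)
  maxLength?: number;        // upper bound on response length, in characters
}

// Returns human-readable failures; an empty array means the response passes.
function checkResponse(response: string, criteria: ResponseCriteria): string[] {
  const failures: string[] = [];
  const text = response.toLowerCase();

  for (const phrase of criteria.mustInclude ?? []) {
    if (!text.includes(phrase.toLowerCase())) {
      failures.push(`missing required phrase: "${phrase}"`);
    }
  }
  for (const phrase of criteria.mustNotInclude ?? []) {
    if (text.includes(phrase.toLowerCase())) {
      failures.push(`contains forbidden phrase: "${phrase}"`);
    }
  }
  if (criteria.maxLength !== undefined && response.length > criteria.maxLength) {
    failures.push(`response is ${response.length} chars, limit is ${criteria.maxLength}`);
  }
  return failures;
}
```

In a Playwright test, the response string could plausibly come from the page after the user action, for example via `await page.locator(selector).innerText()` with a selector specific to your app, and the test would then assert that `checkResponse(...)` returns an empty array.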

Why it makes sense now

Testing AI systems has long been the forgotten stepchild of QA. Model evaluation tools exist (evals, benchmarks, and frameworks such as LangSmith or RAGAS), but they generally target data scientists or ML teams, not software engineers who simply need to know whether their feature still works after a deploy.

Playwright, meanwhile, already lives in the workflows of many frontend and fullstack teams. Building on top of it lowers the barrier to entry: there's no new ecosystem to learn or platform to convince your team to adopt.

This connects to a problem we've seen grow as more products integrate LLM calls into critical flows: how do you ensure user experience doesn't degrade silently? Unit tests don't reach that far; integration tests don't capture the variability of a generative response.

Who it's useful for

Passmark seems especially relevant for three profiles:

Product teams that have integrated AI into specific user flows (a support assistant, a content generator, or a copilot within an app) and need to detect regressions before they reach production.
Agencies and consultancies that deliver projects with integrated AI and want demonstrable test coverage to show clients.
Solo developers maintaining AI-powered tools without budget for enterprise evaluation platforms.

It's not designed for researchers evaluating base models, nor for ML teams working with statistical performance metrics. The target is the software engineer already using Playwright who wants to extend their test suite to cover the parts of the application involving generation.

Project status

With 3 points and 1 comment on Hacker News at publication time, Passmark is at a very early stage of community visibility. The technical proposal is sound, but it remains to be seen how it handles the trickier details: the inherent non-determinism of LLMs, the cost of making real model calls in every CI cycle, and the difficulty of defining evaluation criteria that are neither too rigid nor so loose they detect nothing.
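One common way to cope with non-determinism, offered here as an assumption rather than as anything Passmark is known to do, is to sample the same scenario several times and require a minimum pass rate instead of a single pass. The sketch below uses a hypothetical `meetsPassRate` helper; the number of samples is where the cost trade-off shows up, since each one is a real model call.

```typescript
// Hypothetical helper, not part of Passmark or Playwright.
// Given several responses sampled for the same scenario, returns true when
// the fraction satisfying `passes` is at least `minPassRate` (between 0 and 1).
function meetsPassRate(
  responses: string[],
  passes: (response: string) => boolean,
  minPassRate: number,
): boolean {
  if (responses.length === 0) {
    throw new Error("need at least one sampled response");
  }
  const passed = responses.filter(passes).length;
  return passed / responses.length >= minPassRate;
}
```

A threshold of 1 reproduces the strict "every run must pass" behavior, while lower values tolerate occasional variation at the risk of masking a partial regression.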

That said, the direction is right. The industry needs testing tools that speak the same language as engineering teams, not just data teams. Someone building that on a solid foundation like Playwright is, at minimum, a reasonable bet.
