A QA agent in a real browser checks every PR opened by another agent

Last Tuesday, May 19th, a developer shared on Hacker News a project he has been working on solo for several months: NotesASM, a kanban board where each ticket triggers two chained agents. The first writes the code and opens a PR; the second waits for the preview environment to spin up and verifies the result by controlling a real browser. It is not a new concept, but the implementation includes engineering details worth examining.

What sets this project apart from typical CI pipelines is not so much the automation itself as the explicit separation of responsibilities between a builder agent and a verifier agent, and especially the logic that decides when it makes sense to retry.

How the build → QA cycle works

The build agent works in an isolated temporary directory against a shallow clone of the user's repository. When finished, it pushes a branch and opens a PR. This agent is built on the Claude Agent SDK.
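As a rough illustration of the isolated workspace described above, here is a minimal sketch that performs a shallow clone into a temporary directory that is deleted once the work is finished. The function names and the directory prefix are hypothetical; the article does not show how NotesASM actually sets up its workspace.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def shallow_clone_command(repo_url: str, dest: Path) -> list[str]:
    """Build the git command for a depth-1 (shallow) clone."""
    return ["git", "clone", "--depth", "1", repo_url, str(dest)]


def with_workspace(repo_url: str, work: Callable[[Path], T]) -> T:
    """Clone into a throwaway directory, run `work` on it, then clean up.

    `work` is a hypothetical stand-in for whatever the build agent does
    inside the checkout (editing files, pushing a branch, opening a PR).
    """
    with tempfile.TemporaryDirectory(prefix="build-agent-") as tmp:
        dest = Path(tmp) / "repo"
        # check=True raises CalledProcessError if the clone fails.
        subprocess.run(shallow_clone_command(repo_url, dest), check=True)
        return work(dest)
```

The temporary directory is removed when the `with` block exits, so nothing from one ticket leaks into the next.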

The QA agent enters the scene once the preview deploy is live. It uses Browserbase to control a real browser against that ephemeral environment and checks what it sees against the acceptance criteria defined in the ticket. The result is attached to the PR as screenshots and an MP4 recording of the session, which makes the verification auditable.

If QA fails, the build agent retries with the failure report as additional context, up to a maximum of three iterations. This limit itself is not particularly striking; what is interesting is what happens before each retry.
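The retry loop described above can be sketched as follows. This is an illustration, not the author's code: `build` and `qa` are hypothetical callables standing in for the two agents, and the sketch assumes "three iterations" means three build attempts in total.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_ITERATIONS = 3  # assumed to count total build attempts


@dataclass
class QAResult:
    passed: bool
    report: str  # failure report; unused when passed is True


def run_ticket(
    build: Callable[[str, Optional[str]], str],
    qa: Callable[[str], QAResult],
    ticket: str,
) -> tuple[bool, int]:
    """Run build -> QA until QA passes or the iteration cap is reached.

    `build` receives the ticket text plus the previous failure report
    (None on the first attempt) and returns a PR identifier.
    `qa` verifies that PR's preview deploy.
    Returns (passed, number_of_attempts_used).
    """
    failure_report: Optional[str] = None
    for attempt in range(1, MAX_ITERATIONS + 1):
        pr = build(ticket, failure_report)
        result = qa(pr)
        if result.passed:
            return True, attempt
        # Feed the QA report back as extra context for the next build.
        failure_report = result.report
    return False, MAX_ITERATIONS
```

Passing the previous report as an argument keeps each build attempt stateless apart from the context it is explicitly given.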

The failure classifier: the most relevant reliability improvement

Before relaunching the build agent, a classifier analyzes the failure and decides whether it was a real code error or an environmental issue, such as Clerk failing to load, the preview never deploying, or the Browserbase session returning a 403. If the problem is environmental, the loop stops rather than having the agent try to fix code that was working fine.

This point deserves emphasis. One of the most common problems in autonomous agent pipelines is that the model receives noisy feedback and starts "fixing" things that were not broken. Filtering environmental noise before passing it to the agent as error context is, in practice, the difference between a system that converges and one that goes in circles. The author himself points to it as "the biggest reliability breakthrough" of the project.
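The article does not say how the classifier is implemented; it could just as well be a model call. As a minimal sketch of the idea, assuming simple pattern matching on the failure report, the gate could look like this. The patterns are hypothetical and modelled only on the three examples mentioned above.

```python
import re
from enum import Enum


class FailureKind(Enum):
    CODE = "code"
    ENVIRONMENT = "environment"


# Hypothetical patterns based on the article's three examples. A real
# system would maintain its own list or use a model-based classifier.
ENVIRONMENT_PATTERNS = [
    re.compile(r"clerk.*(failed to|did not|didn't) load", re.IGNORECASE),
    re.compile(r"preview.*(never|not) deployed", re.IGNORECASE),
    re.compile(r"browserbase.*\b403\b", re.IGNORECASE),
]


def classify_failure(report: str) -> FailureKind:
    """Return ENVIRONMENT if the report matches a known infrastructure pattern."""
    if any(p.search(report) for p in ENVIRONMENT_PATTERNS):
        return FailureKind.ENVIRONMENT
    return FailureKind.CODE


def should_retry(report: str) -> bool:
    """Hand the failure back to the build agent only when the code is at fault."""
    return classify_failure(report) is FailureKind.CODE
```

Defaulting to `CODE` when nothing matches means an unknown failure still gets a retry; a stricter design could stop on anything it does not recognize.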

Integration with MCP and Claude Code

NotesASM exposes its own MCP server, so from Claude Code or any compatible MCP client you can say "create a ticket for X" and the item lands directly in the backlog. This closes the loop quite naturally: the developer interacts in natural language from their usual environment, and the platform takes care of translating that intent into a structured ticket with acceptance criteria.

The choice of MCP as the entry interface is not accidental. Because MCP is the open protocol introduced by Anthropic for letting LLMs call external tools, any client that already supports it can feed the backlog without additional integrations. For teams already working with Claude Code in their daily workflow, the friction of adopting this kind of tool is low.
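To make the "create a ticket for X" flow more concrete: MCP clients invoke tools by sending a JSON-RPC 2.0 request with the method `tools/call`. The sketch below builds such a message. The tool name `create_ticket` and its argument names are assumptions for illustration; the article does not describe the schema NotesASM actually exposes.

```python
import json


def build_create_ticket_request(
    request_id: int, title: str, criteria: list[str]
) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message.

    The tool name and argument names are hypothetical placeholders.
    """
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "create_ticket",  # hypothetical tool name
            "arguments": {
                "title": title,
                "acceptance_criteria": criteria,
            },
        },
    }
    return json.dumps(message)
```

In practice the MCP client constructs this message itself once the model decides to call the tool; the sketch only shows what travels over the wire.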

Who this makes sense for

This approach is especially useful for small teams or solo developers with web projects that already have automated preview deploys (Vercel, Netlify, or similar). The value is not in replacing complex human code review, but in delegating repetitive and verifiable tasks: UI changes, copy corrections, behaviour adjustments with clear criteria.

For projects with opaque business logic or without stable preview environments, the system loses some of its utility. The environmental failure classifier itself exists precisely because the preview infrastructure is a real friction point.

The project is in an early public phase, and the author has explicitly asked for external feedback. You can follow it at notesasm.com, and the original thread is on Hacker News.

---

From our perspective, the environmental failure classifier is the project's most solid contribution: it is a pattern that belongs in any agent pipeline operating on external infrastructure, and few public implementations address it explicitly.
