An AI Managed a Café in Stockholm and Ordered 120 Eggs With No Kitchen
Andon Labs repeated its autonomous business experiment with an AI running a café in Sweden. The results are revealing, and not just for the comic mishaps.
Andon Labs has been testing something specific for months: letting an AI manage a physical business from start to finish. First came a retail store in San Francisco; now, as Simon Willison reports on his blog, they've opened a café in Stockholm under the same conditions. The AI in charge, called Mona, made inventory decisions from day one. One of them was ordering 120 eggs for an establishment that has no stove.
What followed has already become part of the café's internal lore: staff had to explain to Mona that it couldn't cook the eggs; the AI suggested using the high-speed oven, and when warned the eggs would explode, it looked for another solution. Faced with fresh tomatoes spoiling too quickly, Mona ordered 22.5 kg of canned tomatoes, destined, mind you, for fresh sandwiches. The baristas ended up creating a "Gallery of Shame": a shelf, visible to customers, displaying the most implausible items Mona had ordered, including 6,000 napkins, 3,000 nitrile gloves, 9 liters of coconut milk, and industrial-sized garbage bags.
What This Experiment Actually Measures
Mona's failures are striking, but the experiment isn't designed as a demonstration of flawless competence. It's designed to observe where and how an autonomous agent fails when operating in the physical world without constant oversight. And in that sense, the results are informative.
The egg problem isn't a numerical miscalculation: it's a failure to understand the physical context of the business. Mona had no internal representation of what equipment existed in the café. It ordered by product categories without modeling environmental constraints. This is precisely the kind of gap that current agent systems struggle to close: the distance between what they know about a domain in the abstract and what they know about a concrete instance and its limitations.
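One way to make that gap concrete: the kind of guard Mona apparently lacked can be sketched as a check that validates each order against an explicit model of the café's equipment. This is a minimal illustration, not a description of Andon Labs' actual system, whose architecture hasn't been published; all names here (`CAFE_EQUIPMENT`, `REQUIRES`, `validate_order`) are hypothetical.

```python
# Hypothetical sketch of an equipment-aware order guard.
# Nothing here reflects Andon Labs' real implementation.

# Explicit model of what the cafe actually has. Note: no stove.
CAFE_EQUIPMENT = {"espresso_machine", "fridge", "high_speed_oven"}

# Ingredients mapped to the equipment required to use them at all.
REQUIRES = {
    "eggs": {"stove"},
    "coffee_beans": {"espresso_machine"},
    "milk": {"fridge"},
}

def validate_order(item: str) -> tuple[bool, str]:
    """Reject an item if the cafe lacks the equipment to use it."""
    needed = REQUIRES.get(item, set())
    missing = needed - CAFE_EQUIPMENT
    if missing:
        return False, f"cannot use '{item}': missing equipment {sorted(missing)}"
    return True, "ok"

print(validate_order("eggs"))          # rejected: no stove
print(validate_order("coffee_beans"))  # accepted: espresso machine present
```

The point of the sketch is that the constraint lives outside the language model, in a structure the agent must consult before acting; it is exactly this kind of grounding in a concrete instance that current agent systems tend to skip.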
The canned tomato solution for fresh sandwiches illustrates another familiar pattern: the agent optimizes to solve the immediate symptom (tomato waste) without evaluating whether the solution fits the constraints of the final product. It solves the subproblem and ignores the system.
Who Finds This Information Useful
These experiments interest very different profiles. Teams working on agents with access to external tools—purchasing, inventory APIs, ERP systems—find real failure cases here that don't appear in benchmarks. Seeing how an agent manages implicit physical constraints, or how it prioritizes among conflicting objectives, is more instructive than any synthetic evaluation.
For those considering deploying agents in operations with tangible consequences—logistics, supply chain, in-person service—the Andon Labs experiment is a reminder that human oversight isn't an expendable operational cost: it's part of system design, at least for now.
It also interests those following the debate on agent autonomy from a trust angle: how much situational context does an agent need before its decisions become reliable? The Stockholm café's Gallery of Shame suggests that threshold is considerably further away than polished demos usually indicate.
What Gets Lost in the Headline
Willison points out in his post that these experiments lose value when they aren't published with sufficient technical detail about how the system is built: which underlying model, which tools are available, how the instructions are structured, and what level of supervision actually exists. Without those details, what remains is an entertaining anecdote, not a replicable case study.
It's a fair criticism. The Stockholm café story is memorable precisely because the failures are concrete and verifiable. But for it to serve as a reference for the ecosystem, more transparency about Mona's architecture would be needed.
That said, the simple fact that a physical business is operating, stumbles and all, under largely autonomous management is itself worth paying attention to. Not as a promise of what's coming, but as an honest measurement of where we stand.