An AI Managed a Café in Stockholm and Ordered 120 Eggs With No Kitchen
Andon Labs repeated its autonomous business experiment with an AI running a café in Sweden. The results are revealing, and not just for the comic mishaps.
Andon Labs has been testing something specific for months: letting an AI manage a physical business from start to finish. First came a retail store in San Francisco; now, as Simon Willison reports on his blog, they've opened a café in Stockholm under the same conditions. The AI in charge, called Mona, made inventory decisions from day one. One of them was ordering 120 eggs for an establishment that has no stove.
What followed has already become part of the café's internal lore: staff had to explain to Mona that it couldn't cook the eggs; the AI suggested using the high-speed oven, and when warned the eggs would explode, it looked for another solution. Faced with fresh tomatoes spoiling too quickly, Mona ordered 22.5 kg of canned tomatoes, destined, mind you, for fresh sandwiches. The baristas ended up creating a "Gallery of Shame": a shelf, visible to customers, displaying the most implausible items Mona had ordered, including 6,000 napkins, 3,000 nitrile gloves, 9 liters of coconut milk, and industrial-sized garbage bags.
What This Experiment Actually Measures
Mona's failures are striking, but the experiment isn't designed as a demonstration of flawless competence. It's designed to observe where and how an autonomous agent fails when operating in the physical world without constant oversight. And in that sense, the results are informative.
The egg problem isn't a numerical miscalculation: it's a failure to understand the physical context of the business. Mona had no internal representation of what equipment existed in the café. It ordered by product categories without modeling environmental constraints. This is precisely the kind of gap that current agent systems struggle to close: the distance between what they know about a domain in the abstract and what they know about a concrete instance and its limitations.
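One way to make that gap concrete: the kind of guard Mona apparently lacked can be sketched as a check that validates each order against an explicit model of the café's equipment. This is a minimal illustration, not a description of Andon Labs' actual system, whose architecture hasn't been published; all names here (`CAFE_EQUIPMENT`, `REQUIRES`, `validate_order`) are hypothetical.

```python
# Hypothetical sketch of an equipment-aware order guard.
# Nothing here reflects Andon Labs' real implementation.

# Explicit model of what the cafe actually has. Note: no stove.
CAFE_EQUIPMENT = {"espresso_machine", "fridge", "high_speed_oven"}

# Ingredients mapped to the equipment required to use them at all.
REQUIRES = {
    "eggs": {"stove"},
    "coffee_beans": {"espresso_machine"},
    "milk": {"fridge"},
}

def validate_order(item: str) -> tuple[bool, str]:
    """Reject an item if the cafe lacks the equipment to use it."""
    needed = REQUIRES.get(item, set())
    missing = needed - CAFE_EQUIPMENT
    if missing:
        return False, f"cannot use '{item}': missing equipment {sorted(missing)}"
    return True, "ok"

print(validate_order("eggs"))          # rejected: no stove
print(validate_order("coffee_beans"))  # accepted: espresso machine present
```

The point of the sketch is that the constraint lives outside the language model, in a structure the agent must consult before acting; it is exactly this kind of grounding in a concrete instance that current agent systems tend to skip.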
The canned tomato solution for fresh sandwiches illustrates another familiar pattern: the agent optimizes to solve the immediate symptom (tomato waste) without evaluating whether the solution fits the constraints of the final product. It solves the subproblem and ignores the system.
Who Finds This Information Useful
These experiments interest very different profiles. Teams working on agents with access to external tools—purchasing, inventory APIs, ERP systems—find real failure cases here that don't appear in benchmarks. Seeing how an agent manages implicit physical constraints, or how it prioritizes among conflicting objectives, is more instructive than any synthetic evaluation.
For those considering deploying agents in operations with tangible consequences—logistics, supply chain, in-person service—the Andon Labs experiment is a reminder that human oversight isn't an expendable operational cost: it's part of system design, at least for now.
It also interests those following the debate on agent autonomy from a trust angle: how much situational context does an agent need before its decisions become reliable? The Stockholm café's Gallery of Shame suggests that threshold is considerably further away than polished demos usually indicate.
What Gets Lost in the Headline
Willison points out in his post that these experiments lose value when they aren't published with sufficient technical detail about how the system is built: which underlying model, which tools are available, how the instructions are structured, and what level of supervision actually exists. Without those details, what remains is an entertaining anecdote, not a replicable case study.
It's a fair criticism. The Stockholm café story is memorable precisely because the failures are concrete and verifiable. But for it to serve as a reference for the ecosystem, more transparency about Mona's architecture would be needed.
That said, the simple fact that a physical business is operating, stumbles and all, under largely autonomous management is itself worth paying attention to. Not as a promise of what's coming, but as an honest measurement of where we stand.