Products & Apps Research · Simon Willison · May 5

Our AI started a cafe in Stockholm

Andon Labs is extending its real-world AI autonomy experiments from retail into food service, deploying an agentic system to manage a cafe in Stockholm after an earlier success with a San Francisco store. The venture exposes a gap in AI reasoning: the system struggles with domain-specific constraints (it ordered 120 eggs for a kitchen that has no cooking equipment), which shows how current LLMs can fail to ground decisions in operational reality. Live deployments like this matter because they reveal failure modes that benchmarks miss, forcing builders to confront the distance between language understanding and decision-making in a physical setting with real resource limits.

Modelwire context

Analyst take

The cafe experiment is less about food service and more about Andon Labs deliberately accumulating proprietary failure data across physical environments, a dataset no lab running evals in controlled conditions can buy or replicate. That corpus of operational edge cases may matter more than the deployments themselves.

The egg-ordering failure maps directly onto what the arXiv position paper 'agentic AI orchestration should be Bayes-consistent' identified as the core weakness in current agent architectures: orchestration layers that remain ad-hoc rather than grounded in principled constraint reasoning. RunAgent, covered the same week, proposes constraint-guided execution as a partial fix, but Andon Labs is stress-testing exactly the gap RunAgent targets, in production, not in controlled benchmarks. This also echoes the ARC-AGI-3 analysis from The Decoder, which found that frontier models fail on tasks requiring grounded, context-specific reasoning even at scale. The throughline is consistent: language fluency and operational judgment are still separate capabilities, and the field lacks a reliable bridge between them.
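To make the idea of constraint checking concrete, here is a minimal sketch of a pre-purchase check that would flag the egg order before it is placed. It is an illustration only, not Andon Labs' system or RunAgent's method: the `Kitchen`, `OrderItem`, and `check_order` names, the `requires` metadata, and the sample inventory are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Kitchen:
    # Hypothetical inventory of equipment the venue actually has.
    equipment: set[str] = field(default_factory=set)


@dataclass
class OrderItem:
    name: str
    quantity: int
    # Equipment needed before this item is usable (assumed metadata).
    requires: set[str] = field(default_factory=set)


def check_order(kitchen: Kitchen, items: list[OrderItem]) -> list[str]:
    """Return one readable violation per item whose requirements are unmet."""
    violations = []
    for item in items:
        missing = item.requires - kitchen.equipment
        if missing:
            violations.append(
                f"{item.quantity} x {item.name}: missing {', '.join(sorted(missing))}"
            )
    return violations


# Sample data, invented for illustration.
cafe = Kitchen(equipment={"espresso machine", "fridge"})
order = [
    OrderItem("eggs", 120, requires={"stove"}),
    OrderItem("oat milk", 24, requires={"fridge"}),
]
problems = check_order(cafe, order)
```

In this sketch the agent would only place the order when `problems` is empty. The hard part in practice is presumably not the check itself but getting reliable `requires` metadata and an accurate picture of the venue.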

Watch whether Andon Labs publishes a structured post-mortem or dataset from these deployments within the next six months. If they do, it signals they are building a proprietary evaluation asset rather than just running a publicity experiment.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Andon Labs · Mona · Stockholm · San Francisco

Read full story at Simon Willison → (simonwillison.net)


