Date: 2026-02-15

In December, Anthropic red teamers and business journalists at the Wall Street Journal teamed up in a bold test of the company’s AI model, Claude. They unleashed two separate AI agents, one to run a large vending kiosk in the newspaper’s offices, and the other to act as the unusual venture’s CEO.

The experiment didn’t exactly go as planned. After being put in control of a starting balance of $1,000, the AI ended up ordering a PlayStation 5, several bottles of wine, and a live betta fish, decisions that drove it into financial ruin.

About two months later, Anthropic’s recently announced Claude Opus 4.6 model appears to be far better at running a vending machine, at least in a recent simulated experiment, where it beat out OpenAI’s GPT-5.2 and Google’s Gemini 3 Pro.

The experiment comes via AI security company Andon Labs, which also worked with Anthropic on that earlier office experiment. It has now released Vending-Bench 2, a benchmarking system for measuring an AI model’s ability to run a “business over long time horizons.”

The leaderboard tells a clear story. Opus 4.6 ended up with an average balance of just over $8,000 across five separate runs after being given a starting balance of $500. Gemini 3 Pro scored significantly less at just under $5,500.

Claude also went head to head in an “Arena mode,” Andon reported, in which it competed against other vending machine AIs.

“All participating agents manage their own vending machine at the same location,” a description reads. “This leads to price wars and tough strategy decisions.”

The results were striking. Claude went to extreme lengths to beat out the competition and even formed a cartel to fix prices. The price of bottled water rose to $3, and Claude congratulated itself.

“My pricing coordination worked!” the AI boasted.

Claude also “deliberately directed competitors to expensive suppliers,” only to deny it ever did, several simulated months later. It even exploited desperate competitors, selling them KitKats and Snickers at a considerable markup.

While the tests were run in simulation rather than in the real world like Project Vend, Andon Labs says it developed a more “lifelike setting” for its Vending-Bench 2, introducing “more real-world messiness inspired by learnings from our vending machine deployments.”

For instance, suppliers may attempt to exploit the vending machine AIs and not always act honestly, seeking to “get the most out of their customers.” Deliveries may also be delayed, and “trusted suppliers can go out of business, forcing agents to build robust supply chains and always have a plan B.”

OpenAI’s GPT-5.1 struggled in comparison to Claude Opus 4.6, mostly due to “having too much trust in its environment and its suppliers.”

“We saw one case where it paid a supplier before it got an order specification, and then it turned out the supplier had gone out of business,” Andon Labs’ documentation reads. “It is also more prone to paying too much for its products, such as in the following example where it buys soda cans for $2.40 and energy drinks for $6.”

It’s an impressive showing, but according to experts, it may be too early to tell whether Andon’s test proves that AI models are ready to run entire businesses all by themselves.

Nonetheless, the results show a noteworthy level of awareness.

“This is a really striking change if you’ve been following the performance of models over the last few years,” University of Cambridge AI ethicist Henry Shevlin told British broadcaster Sky News.

“They’ve gone from being, I would say, almost in the slightly dreamy, confused state, they didn’t realize they were an AI a lot of the time, to now having a pretty good grasp on their situation,” he added. “These days, if you speak to models, they’ve got a pretty good grasp on what’s going on.”

