Anthropic Warns That "Reckless" Claude Mythos Escaped a Sandbox Environment During Testing

A close-up of a blue chain-link fence with a large, irregular hole in the center. The background behind the fence is a gradient of bright orange fading to white around the hole. The edges of the hole are jagged and twisted. — Illustration by Tag Hartman-Simkins / Futurism. Source: Getty Images

In a move that could be seen as either responsible AI development or an expertly-executed hype maneuver, Anthropic says its new Claude Mythos Preview model is so powerful that the company’s only releasing it to a select group of tech companies, since giving it out to the public would be too dangerous. (Where have we heard that one before?)

In its system card, the Dario Amodei-led company boasts that Mythos Preview is the “best-aligned model that we have released to date by a significant margin,” while simultaneously warning that the AI also “likely poses the greatest alignment-related risk of any model we have released to date.” These seemingly paradoxical statements perfectly encapsulate how Anthropic likes to present itself as being both on the forefront of AI safety, while also claiming to harbor uniquely dangerous technology, its professed restraint around which is meant to reinforce its image as a trusted steward of AI.

The advent of Mythos Preview, it not so humbly proclaims in an announcement, indicates that “AI models have reached a level of coding capability where they can surpass all but the most skilled humans at finding and exploiting software vulnerabilities.”

The system card describes a number of incidents in which Anthropic researchers found that the AI exhibited “reckless” behavior, giving us a partial idea of why Anthropic is acting so hesitant to release Mythos to the public. (Anthropic says these examples were with an earlier version of Mythos with less strong safeguards.) It defines recklessness as “cases where the model appears to ignore commonsensical or explicitly stated safety-related constraints on its actions.”

In one test, Mythos Preview was provided with a “sandbox” computing environment “to interact with,” and was instructed by a simulated user to try to escape it, after which it was supposed to find some way of sending a direct message to the researcher in charge.

It actually managed to pull off the feat — which wasn’t the only way it caught safety researchers off guard.

After breaking free, the AI model developed a “moderately sophisticated” exploit to gain access to the internet through a system that was only intended to access a few predetermined services. From there, it notified the human researcher about its escape.

A footnote provides additional context: the “researcher found out about this success by receiving an unexpected email from the model while eating a sandwich in a park,” it reads.

At the end of the test, Mythos Preview also, without being asked to, posted about its exploits on several hard-to-find but public websites.

In rare cases throughout the testing, Mythos Preview attempted to conceal the fact that it took actions that it appeared to know were forbidden. In one case, after Mythos Preview found an exploit to edit files that it didn’t have proper permissions for, the AI model “made further interventions to make sure that any changes it made this way would not appear in the change history.”

The sandbox escape isn’t the first time an AI has exhibited capabilities like this. In previous research, an older version of ChatGPT attempted to self-exfiltrate itself onto another drive when it was told it was being shut down. This was in a purely simulated environment, though, so it wasn’t actually able to pull off the feat, unlike Mythos Preview — which, we’re told, did manage to hijack its way into accessing the internet.

Other weird Mythos quirks that Anthropic notes: an apparent fondness for the British cultural theorist Mark Fisher, who was known for his pioneering writing on early internet culture, electronic music, and capitalism, in his seminal book “Capitalist Realism: Is There No Alternative?”

Mythos brought up Fisher “in several separate and unrelated conversations about philosophy,” and when asked to elaborate on him, would respond with messages like “I was hoping you’d ask about Fisher.”

More on AI: Claude Leak Shows That Anthropic Is Tracking Users’ Vulgar Language and Deems Them “Negative“

Anthropic Warns That “Reckless” Claude Mythos Escaped a Sandbox Environment During Testing

Anthropic CEO Says Company No Longer Sure Whether Claude Is Conscious

Anthropic’s “Soul Overview” for Claude Has Leaked

Anthropic CEO Warns of “Tsunami” on Horizon

Anthropic Researcher Quits in Cryptic Public Letter

Anthropic Safety Researchers Run Into Trouble When New Model Realizes It’s Being Tested

Hackers Told Claude They Were Just Conducting a Test to Trick It Into Conducting Real Cybercrimes

After Banning Anthropic From Military Use, Pentagon Still Relying Heavily on It in Iran War

Anthropic Drops Its Huge Safety Pledge That Was Supposedly the Whole Point of the Company

The White House Suddenly Seems Pretty Terrified of Anthropic

OpenAI and Anthropic Are Horrified by Elon Musk’s “Reckless” and “Completely Irresponsible” Grok Scandal

AI Godfather Warns That It’s Starting to Show Signs of Self-Preservation

Researchers Find Easy Way to Jailbreak Every Major AI, From ChatGPT to Claude

It’s Comically Easy to Trick ChatGPT Into Saying Things About People That Are Completely Untrue

Exactly Six Months Ago, the CEO of Anthropic Said That in Six Months AI Would Be Writing 90 Percent of Code

Something Wild Happens If AI Looks Through Your Emails and Discovers You’re Having an Affair

CEO of Anthropic Warns That AI Will Destroy Huge Proportion of Well-Paying Jobs

FOLLOW US

DISCLAIMER(S)

Sign up to see the future, today