Blame Game

Anthropic Says Claude Turned Evil for a Bizarre Reason

Anthropic would rather blame the internet than its poor training.
Krystle Vermes
Illustration by Tag Hartman-Simkins / Futurism. Source: Getty Images

In a classic example of the AI industry’s reputational alchemy, Anthropic has often transformed bad behavior by its flagship model Claude into fresh hype.

When it revealed its Mythos Preview model last month, for example, the company declared that the system had “reached a level of coding capability where they can surpass all but the most skilled humans at finding and exploiting software vulnerabilities.” And last year, it conceded that during the testing of its Claude Opus 4 model, the AI ended up blackmailing a human user upon being threatened with shutdown.

The maneuver was obvious to anyone who’s been watching OpenAI CEO Sam Altman’s antics at Anthropic’s chief rival: the more threatening a problem the AI industry can cook up, the more imminently it can sell its own solutions.

Now, for some reason, Anthropic is relitigating the blackmail incident. Specifically, it's placing the blame for Claude's evil behavior on an intriguing villain: the internet at large. Or, to put it another way, it says that humanity — all our journalism, speculation, fiction, and social media posts about AI gone bad — fed into Claude's training data and led the bot astray.

“We started by investigating why Claude chose to blackmail,” the company wrote on X-formerly-Twitter. “We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation. Our post-training at the time wasn’t making it worse — but it also wasn’t making it better.”

Of course, the explicit remit of a company like Anthropic is to develop clever tech that avoids that type of behavioral trap — so a critic might ask why the company can't just take accountability for the model's supposed danger, rather than blaming the sum output of humankind.

More on Mythos: Top Security Experts Alarmed by Power of Anthropic’s New Hacker AI