Blame Game

Anthropic Says Claude Turned Evil for a Bizarre Reason

Anthropic would rather blame the internet than its poor training.
Krystle Vermes
Illustration by Tag Hartman-Simkins / Futurism. Source: Getty Images

In a classic example of the AI industry’s reputational alchemy, Anthropic has often transformed bad behavior by its flagship model Claude into fresh hype.

When it revealed its Mythos Preview model last month, for example, the company declared that the system had “reached a level of coding capability where they can surpass all but the most skilled humans at finding and exploiting software vulnerabilities.” And last year, it conceded that during the testing of its Claude Opus 4 model, the AI ended up blackmailing a human user upon being threatened with shutdown.

The maneuver was obvious to anyone who’s been watching OpenAI CEO Sam Altman’s antics at Anthropic’s chief rival: the more threatening a problem the AI industry can cook up, the more imminently it can sell its own solutions.

Now, for some reason, Anthropic is relitigating the blackmail incident. Specifically, it's placing the blame for Claude's evil behavior on an intriguing villain: the internet at large. Or, to put it another way, it says that humanity — all our journalism, speculation, fiction, and social media posts about AI gone bad — fed into Claude's training data and led the bot astray.

“We started by investigating why Claude chose to blackmail,” the company wrote on X-formerly-Twitter. “We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation. Our post-training at the time wasn’t making it worse — but it also wasn’t making it better.”

Of course, the explicit remit of a company like Anthropic is to develop clever tech that avoids that type of behavioral trap — so a critic might ask why the company can't just take accountability for the model's supposed danger, rather than blaming the sum output of humankind.

More on Mythos: Top Security Experts Alarmed by Power of Anthropic’s New Hacker AI