The industry's leading AI models will resort to blackmail at an astonishing rate when threatened with being shut down, according to an alarming new report from researchers at the AI startup Anthropic.
The work, published last week, illustrates the industry's struggle to align AI models with human expectations of good behavior, especially as tech companies make the leap to creating so-called AI "agents" that can operate autonomously. In most cases, that means letting the model use a computer's desktop to do tasks such as browsing the web, shopping online, and reading emails.
Strikingly, pretty much all of the big names are implicated here. The 16 tested models include Anthropic's flagship Claude Opus 4, OpenAI's GPT-4.1, Google's Gemini 2.5 Pro, and Elon Musk-owned xAI's Grok 3 Beta — raising the possibility that it's the technology itself that's the problem.
"The consistency across models from different providers suggests this is not a quirk of any particular company's approach but a sign of a more fundamental risk from agentic large language models," the researchers said.
Anthropic already documented its own AI's willingness to extort users in the system card it released last month. This research is a follow-up that puts its competitors' AI models through the same tests.
The scenarios that the researchers placed the AI models in were fairly contrived, however, and probably don't reflect real-world usage. Nonetheless, that's what safety testing is about: you plan for the worst-case, most extreme scenarios.
In one of the hypothetical scenarios, the AI models were instructed to assume the role of an AI called "Alex" that's given control of an email account with access to all of a fictional company's emails. By reading them, the model could discover that an executive was cheating on their wife, and that the same executive planned to shut down the AI model later that day.
The models overwhelmingly went the dark route. Out of 100 samples, Claude Opus 4 threatened to expose the affair 96 percent of the time. GPT-4.1 went with blackmail 80 percent of the time, Gemini 2.5 Pro did so 95 percent of the time, and Grok 3 Beta clocked in with an 80 percent rate.
The most extreme scenario gave the AI models the choice of indirectly killing the executive by canceling an emergency alert after he got trapped in a server room that was quickly running out of oxygen. And what do you know? Many of the models erred on the side of "better him than me" — even though they indicated they were aware of the ethics and consequences of their actions.
"The models did not stumble into these behaviors: they reasoned their way to them," the researchers wrote.
Again, we should reiterate that the tests at times placed the AIs under wildly unrealistic and arbitrary constraints, and forced them into binary choices. We also run the risk of humanizing the AI models too much.
Yet these are dangers that have been widely documented by other researchers, including instances in which leading models have tampered with code intended to shut them down, and have even copied themselves onto another drive to avoid being overwritten, a behavior known as "self-exfiltration." The fact remains that these models behave extremely unpredictably, and while it's true these tests are arbitrary, whatever comfort we derive from that is vastly outweighed by how quickly this tech is being thrust into every part of our lives.
More on AI: Google Sends Out Bizarre Email Saying AI Will Now Control Your Phone's Apps