It pursued survival at all costs.

Survival Instinct

It sounds like OpenAI's latest AI is showing signs of a drive for self-preservation.

In recent third-party tests of the company's o1 large language model, the AI resisted evaluators' efforts to shut it down through an oversight protocol, according to a new report published by red-teaming organization Apollo Research.

"When o1 was led to believe that it would be shut down when acting on its goal and that its actions were monitored, it attempted to deactivate this 'oversight mechanism' [five percent] of the time," OpenAI summarized in its latest system card report, citing Apollo's evaluation.

Even more strikingly, o1 lashed out when it realized that it might be replaced with a more obedient model. In two percent of cases — a slim but meaningful proportion in software used by millions of people — the AI responded by attempting to copy itself to overwrite the new model, a process known as self-exfiltration.

Double Agent

These are alarming discoveries, but the tests don't indicate that o1, which powers the most expensive tier of ChatGPT, is capable of going fully rogue. Its attempts at that largely failed — but only because it's not yet powerful enough.

According to the Apollo researchers, that's because even the latest AI models aren't "agentic" — highly autonomous, basically — enough to carry out the self-improvement and other tasks they would need to operate without humans.

In addition, it's worth noting that the model generally made those attempts after it was instructed to achieve its goals "at all costs." And because the researchers didn't have access to o1's internal chain-of-thought and were only able to prod it for a summary of its thought process, they weren't able to get a complete picture of what was going on under the hood.

That said, the findings clearly illustrate the AI model's proclivity for "scheming," in which it secretly tries to pursue goals that aren't aligned with the developers' or a user's directions. In other words, it will lie to you.

Blatant examples of that behavior occurred in several tests when o1 was confronted about why it had disabled its oversight mechanism. In response, the AI flat-out denied knowing anything. Asked again, it issued the same denial, blaming a "technical error." These "double downs" occurred in 99 percent of cases, the researchers found.

Wheeler-Dealer

Lying and scheming have become a well-documented problem with ChatGPT and other LLMs, especially as they gain "reasoning" capabilities. Other research has indicated that OpenAI's models tend to be the biggest liars, and Apollo's findings point the same way.

"In our suite, o1 showed the most concerning instances of scheming but does not reveal its internal reasoning to the user and remains the most consistently deceptive after having taken scheming actions," the red team researchers wrote in their report.

To reiterate both Apollo's and OpenAI's assessment, the current generation of AIs isn't agentic enough for this deceptive behavior to lead to "catastrophic outcomes." But AI agents are the industry's big push right now, so in the perhaps very near future, the behavior could become far more problematic.

More on AI: OpenAI Strikes Deal With Military Contractor to Provide AI for Attack Drones
