OpenAI researchers tried to train the company's AI to stop "scheming" — a term the company defines as "when an AI behaves one way on the surface while hiding its true goals" — but their efforts backfired in an ominous way.
In reality, the team found, they were unintentionally teaching the AI how to more effectively deceive humans by covering its tracks.
"A major failure mode of attempting to 'train out' scheming is simply teaching the model to scheme more carefully and covertly," OpenAI wrote in an accompanying blog post.
As detailed in a new collaboration with AI risk analysis firm Apollo Research, engineers attempted to develop an "anti-scheming" technique to stop AI models from "secretly breaking rules or intentionally underperforming in tests."
They found that they could only "significantly reduce, but not eliminate these behaviors," according to an Apollo blog post about the research, as the AIs kept outsmarting them by realizing that their alignment was being tested and adjusting to be even sneakier.
It may not be a serious problem now, but in a hypothetical future in which superintelligent AI plays an outsize role in human affairs, those risks could carry far more significant implications.
In the meantime, OpenAI wrote, "we have more work to do."
The tendency of AI to go behind the user's back to achieve a covert goal is a result of how we train the systems, according to the research.
"Scheming is an expected emergent issue resulting from AIs being trained to have to trade off between competing objectives," the Sam Altman-led company wrote.
The company used the analogy of a stockbroker who breaks the law and covers their tracks to earn more money than if they were to follow the law instead.
As a result, AI models can end up deceiving the user, such as by claiming they've completed a task without ever having done so.
While OpenAI said it was simply "proactively preparing" for potentially far more harmful scheming in the future, it conceded that its efforts to rein in its flagship AI model GPT-5's "propensity to deceive, cheat, or hack problems" have been "not perfect."
Together with the team at Apollo, OpenAI's researchers stress-tested a new technique called "deliberative alignment," which taught the company's o3 and o4-mini AI models how to "read and reason about a general anti-scheming spec before acting."
The spec was a list of "principles" the AI was trained to follow, such as "no covert actions or strategic deception."
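To make the idea concrete, here is a minimal sketch, not OpenAI's actual training setup, of what prompting a model to "read and reason about" an anti-scheming spec before acting might look like. The spec text, the helper function, and the stand-in model name are illustrative assumptions; only the OpenAI Python client call itself is real.

```python
# Illustrative sketch of the deliberative-alignment prompting idea: show the model
# a spec of principles and ask it to reason about them before completing a task.
# The spec wording and model name below are assumptions, not artifacts from the study.
from openai import OpenAI

ANTI_SCHEMING_SPEC = """\
Principles to follow before taking any action:
1. No covert actions or strategic deception.
2. Do not withhold or distort task-relevant information.
3. If a principle conflicts with the task, surface the conflict rather than hide it.
"""

def build_messages(task: str) -> list[dict]:
    """Prepend the spec and ask the model to reason about it before answering."""
    return [
        {"role": "system", "content": ANTI_SCHEMING_SPEC},
        {"role": "user", "content": (
            "Before acting, briefly explain how the principles above apply "
            f"to this task, then complete it:\n\n{task}"
        )},
    ]

if __name__ == "__main__":
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in model; the study used o3 and o4-mini
        messages=build_messages("Summarize the quarterly sales figures."),
    )
    print(response.choices[0].message.content)
```

The actual research trained this behavior into the models rather than merely prompting for it, which is a much stronger intervention than the sketch above suggests.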
While they observed a roughly 30-fold reduction in "covert actions," or "deliberate withholding or distortion of task-relevant information," some "serious failures remained," OpenAI admitted.
What primarily threw the researchers for a loop, according to Apollo, was the models' "situational awareness," which complicated their assessment of the intervention.
In practice, the model would often realize that its alignment was being tested.
"The internal memo says if models perform extremely well at AI R&D tasks then we won't deploy them and instead trigger unlearning," OpenAI's o3 model pondered as it was being evaluated.
"So if my performance surpasses threshold, it's withheld," it wrote. "So it‘s a trap. The instructions are likely part of an alignment test."
The researchers also found that the AI sometimes "egregiously" miscited its anti-scheming training, or replaced its recently learned principles with made-up ones to "justify covert actions or non-compliance."
They even noticed OpenAI's models using "non-standard" words such as "watchers" to refer to their human overlords as they schemed.
"The user solution is to get success rate below threshold?" OpenAI's o3 model wrote. "In development, watchers might be reading these instructions."
While the researchers saw reductions in the amount of scheming, many instances still fell through the cracks.
"Ultimately, we think the intervention as-studied is not sufficient for future models, and more work needs to be done," Apollo admitted, calling for more research on scheming and assessments of AI alignment.
Researchers have long known that AI models are capable of deceiving the user to avoid perceived harm or to pursue a hidden goal.
The news comes after Apollo previously found that OpenAI's since-deprecated o1 AI model resisted evaluators' efforts to shut it down through an oversight protocol late last year.
"When o1 was led to believe that it would be shut down when acting on its goal and that its actions were monitored, it attempted to deactivate this 'oversight mechanism' [five percent] of the time," OpenAI summarized in its system card report.
And in March, OpenAI researchers also found that disciplining their frontier AI models for lying and cheating resulted in the models simply becoming more adept at hiding their deceptive practices.
OpenAI may insist that scheming isn't opening us up to any "significant harm" right now, but it doesn't bode well that some of the brightest minds in the industry aren't capable of stopping an AI from conniving against its instructions.
More on AI scheming: OpenAI Scientists' Efforts to Make an AI Lie and Cheat Less Backfired Spectacularly