Apple Researchers Just Released a Damning Paper That Pours Cold Water on the Entire AI Industry

Researchers at Apple have released an eyebrow-raising paper that throws cold water on the "reasoning" capabilities of the latest, most powerful large language models.

In the paper, a team of machine learning experts makes the case that the AI industry is grossly overstating the ability of its top AI models, including OpenAI's o3, Anthropic's Claude 3.7, and Google's Gemini.

In particular, the researchers assail the claims of companies like OpenAI that their most advanced models can now "reason," a supposed capability that the Sam Altman-led company has increasingly leaned on over the past year for marketing purposes. The Apple team characterizes that capability as merely an "illusion of thinking."

It's a particularly noteworthy finding, considering Apple has been accused of falling far behind the competition in the AI space. The company has chosen a far more careful path to integrating the tech in its consumer-facing products — with some seriously mixed results so far.

In theory, reasoning models break down user prompts into pieces and use sequential "chain of thought" steps to arrive at their answers. But now, Apple's own top minds are questioning whether frontier AI models are really as good at "thinking" as they're being made out to be.

"While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood," the team wrote in its paper.

The authors — who include Samy Bengio, the director of Artificial Intelligence and Machine Learning Research at the software and hardware giant — argue that the existing approach to benchmarking "often suffers from data contamination and does not provide insights into the reasoning traces’ structure and quality."

By using "controllable puzzle environments," the team estimated the AI models' ability to "think" — and made a seemingly damning discovery.

"Through extensive experimentation across diverse puzzles, we show that frontier [large reasoning models] face a complete accuracy collapse beyond certain complexities," they wrote.

Thanks to a "counter-intuitive scaling limit," the AIs' reasoning effort "declines despite having an adequate token budget."

Put simply, even with enough tokens to work with, the models struggle with problems beyond a certain threshold of complexity, the result of "an 'overthinking' phenomenon," in the paper's phrasing.

The finding is reminiscent of a broader trend. Benchmarks have shown that the latest generation of reasoning models is more prone to hallucinating, not less, indicating the tech may now be heading in the wrong direction in a key way.

Exactly how reasoning models choose which path to take remains surprisingly murky, the Apple researchers found.

"We found that LRMs have limitations in exact computation," the team concluded in its paper. "They fail to use explicit algorithms and reason inconsistently across puzzles."

The researchers claim their findings raise "crucial questions" about the current crop of AI models' "true reasoning capabilities," undercutting a much-hyped new avenue in the burgeoning industry.

That's despite tens of billions of dollars being poured into the tech's development, with the likes of OpenAI, Google, and Meta constructing enormous data centers to run increasingly power-hungry AI models.

Could the Apple researchers' finding be yet another canary in the coalmine, suggesting the tech has "hit a wall"?

Or is the company trying to hedge its bets, calling out its outperforming competition as it lags behind, as some have suggested?

It's certainly a surprising conclusion, considering Apple's precarious positioning in the AI industry: at the same time that its researchers are trashing the tech's current trajectory, it's promised a suite of Apple Intelligence tools for its devices like the iPhone and MacBook.

"These insights challenge prevailing assumptions about LRM capabilities and suggest that current approaches may be encountering fundamental barriers to generalizable reasoning," the paper reads.
