OpenAI Research Finds That Even Its Best Models Give Wrong Answers a Wild Proportion of the Time

OpenAI has released a new benchmark dubbed "SimpleQA" to measure the accuracy of its AI models. The results are damning. — *Image: Getty / Futurism*

BS Generator

OpenAI has released a new benchmark, dubbed “SimpleQA,” that’s designed to measure the accuracy of the output of its own and competing artificial intelligence models.

In doing so, the AI company has revealed just how bad its latest models are at providing correct answers. In its own tests, its cutting-edge o1-preview model, which was released last month, scored an abysmal 42.7 percent success rate on the new benchmark.

In other words, even the cream of the crop of recently announced large language models (LLMs) is far more likely to provide an outright incorrect answer than a right one — a concerning indictment, especially as the tech is starting to pervade many aspects of our everyday lives.

Wrong Again

Competing models scored even lower on OpenAI’s SimpleQA benchmark, with Anthropic’s recently released Claude-3.5-sonnet model getting only 28.9 percent of questions right. However, that model was far more inclined to reveal its own uncertainty and decline to answer, which, given the damning results, is probably for the best.

Worse yet, OpenAI found that its own AI models tend to vastly overestimate their own abilities, a characteristic that can lead to them being highly confident in the falsehoods they concoct.

LLMs have long suffered from “hallucinations,” an elegant term AI companies have come up with to denote their models’ well-documented tendency to produce answers that are complete BS.

Despite the very high chance of ending up with complete fabrications, the world has embraced the tech with open arms, from students generating homework assignments to developers employed by tech giants generating huge swathes of code.

And the cracks are starting to show. Case in point, an AI model used by hospitals and built on OpenAI tech was caught this week introducing frequent hallucinations and inaccuracies while transcribing patient interactions.

Cops across the United States are also starting to embrace AI, a terrifying development that could lead to law enforcement falsely accusing the innocent or furthering troubling biases.

OpenAI’s latest findings are yet another worrying sign that current LLMs are woefully unable to reliably tell the truth.

It’s a development that should serve as a reminder to treat any output of any LLM out there with plenty of skepticism and a willingness to go over the generated text with a fine-toothed comb.

Whether it’s a problem that can be solved with even bigger training sets — something AI leaders are rushing to assure investors of — remains an open question.
