OpenAI Admits That Its New Model Still Hallucinates More Than a Third of the Time

If a partner or friend made stuff up a significant percentage of the time that you asked a question, it would be a huge problem for the relationship.

But apparently it's different for OpenAI's hot new model. Using SimpleQA, the company's in-house factuality benchmarking tool, OpenAI admitted in its release announcement that its new large language model (LLM) GPT-4.5 hallucinates — which is AI parlance for confidently spewing fabrications and presenting them as fact — 37 percent of the time.

Yes, you read that right: in tests, the latest AI model from a company that's worth hundreds of billions of dollars is telling lies in more than one out of every three answers it gives.

As if that wasn't bad enough, OpenAI is actually trying to spin GPT-4.5's bullshitting problem as a good thing because — get this — it doesn't hallucinate as much as the company's other LLMs.

The same graph that showed how often the new model spews nonsense also reports that GPT-4o, the company's older general-purpose model, hallucinates 61.8 percent of the time on the SimpleQA benchmark. OpenAI's o3-mini, a cheaper and smaller reasoning model, was found to hallucinate a whopping 80.3 percent of the time.

Of course, the problem isn't unique to OpenAI.

"At present, even the best models can generate hallucination-free text only about 35 percent of the time," explained Wenting Zhao, a Cornell doctoral student who co-wrote a paper last year about AI hallucination rates, in an interview about the research with TechCrunch. "The most important takeaway from our work is that we cannot yet fully trust the outputs of model generations."

Beyond the absurdity of a company getting hundreds of billions of dollars in investments for products that have such issues telling the truth, it says a lot about the AI industry at large that these are the things it's selling us: expensive, resource-consuming systems that are supposed to be approaching human-level intelligence but still can't get basic facts right.

As OpenAI's LLMs plateau in performance, the company is clearly grasping at straws to steer the hype ship back onto the course it seemed to chart when ChatGPT first dropped.

But to do that, we're probably going to need to see a real breakthrough, not more of the same.

