AI Model Used By Hospitals Caught Making Up Details About Patients, Inventing Nonexistent Medications and Sexual Acts

Generative AI and medical documents. What could go wrong?

Health Scare

In a new investigation from The Associated Press, dozens of experts have found that Whisper, an AI-powered transcription tool made by OpenAI, is plagued with frequent hallucinations and inaccuracies, with the AI model often inventing completely unrelated text.

What's even more concerning, though, is who's relying on the tech, according to the AP: despite OpenAI warning that its model shouldn't be used in "high-risk domains," over 30,000 medical workers and 40 health systems are using Nabla, a tool built on Whisper, to transcribe and summarize patient interactions — almost certainly with inaccurate results.

In a medical environment, this could have "really grave consequences," Alondra Nelson, a professor at the Institute for Advanced Study, told the AP.

"Nobody wants a misdiagnosis," Nelson said. "There should be a higher bar."

Whisper Campaign

Nabla chief technology officer Martin Raison told the AP that the tool was fine-tuned on medical language. Even so, it can't escape the inherent unreliability of its underlying model.

One machine learning engineer who spoke to the AP said he discovered hallucinations in half of the over 100 hours of Whisper transcriptions he looked at. Another who examined 26,000 transcripts said he found hallucinations in almost all of them.

Whisper performed poorly even with well-recorded, short audio samples, according to a recent study cited by the AP. Over millions of recordings, there could be tens of thousands of hallucinations, researchers warned.

Another team of researchers revealed just how egregious these errors can be. Whisper would inexplicably add racial commentary, they found, such as making up a person's race without instruction, and also invent nonexistent medications. In other cases, the AI would describe violent and sexual acts that had no basis in the original speech. They even found baffling instances of YouTuber lingo, such as a "like and subscribe," being dropped into the transcript.

Overall, nearly 40 percent of these errors were harmful or concerning, the team concluded, because they could easily misrepresent what the speaker had actually said.

We noticed in 2023 that, even when an audio file had ended, Whisper had a habit of hallucinating additional sentences that were never spoken. And, re-running Whisper on the same file yielded different hallucinations - see below example (hallucinations in red) (1/14) pic.twitter.com/uXrI6P58gj
— Allison Koenecke (@allisonkoe) June 3, 2024

Ground Truth

The scope of the damage could be immense. According to Nabla, its tool has been used to transcribe an estimated seven million medical visits, the paperwork for all of which could now have pernicious inaccuracies somewhere in the mix.

And worryingly, there's no way to verify if the AI transcriptions are accurate, because the tool deletes the original audio recordings "for data safety reasons," according to Raison. Unless the medical workers themselves kept a copy of the recording, any hallucinations will stand as part of the official record.

"You can't catch errors if you take away the ground truth," William Saunders, a research engineer who quit OpenAI in protest, told the AP.

Nabla officials said they are aware that Whisper can hallucinate and are addressing the problem, per the AP. Being "aware" of the problem, however, seemingly didn't stop the company from pushing experimental and as yet extremely unreliable tech onto the medical industry in the first place.

More on AI: After Teen's Suicide, Character.AI Is Still Hosting Dozens of Suicide-Themed Chatbots

Share This Article