Get Me a Human, Stat

Something Extremely Scary Happens When Advanced AI Tries to Give Medical Advice to Real World Patients

"It’s like having a student who aces practice tests but fails when the questions are worded differently."
By Victor Tangermann
Researchers have found that frontier AI models fail spectacularly when the familiar formats of medical exams are even slightly altered.

Last week, Google AI pioneer Jad Tarifi sparked controversy when he told Business Insider that it no longer makes sense to get a medical degree — since, in his telling, artificial intelligence will render such an education obsolete by the time you’re a practicing doctor.

Companies have long touted the tech as a way to free up the time of overworked doctors and even aid them in specialized skills, including scanning medical imagery for tumors. Hospitals have already been rolling out AI tech to help with administrative work.

But given the current state of AI — from widespread hallucinations to “deskilling” experienced by doctors over-relying on it — there’s reason to believe that med students should stick it out.

If anything, in fact, the latest research suggests we need human healthcare professionals now more than ever.

As PsyPost reports, researchers have found that frontier AI models fail spectacularly when the familiar formats of medical exams are even slightly altered, greatly undermining their ability to help patients in the real world — and raising the possibility that, instead, they could cause great harm by providing garbled medical advice in high-stakes health scenarios.

As detailed in a paper published in the journal JAMA Network Open, things quickly fell apart for models including OpenAI’s GPT-4o and Anthropic’s Claude 3.5 Sonnet when the wording of questions in a benchmark test was only slightly adjusted.

The idea was to probe how large language models arrive at their answers: by predicting the probability of each subsequent word, rather than through a human-level understanding of complex medical terms.

“We have AI models achieving near perfect accuracy on benchmarks like multiple-choice based medical licensing exam questions,” Stanford University PhD student and coauthor Suhana Bedi told PsyPost. “But this doesn’t reflect the reality of clinical practice. We found that less than five percent of papers evaluate LLMs on real patient data, which can be messy and fragmented.”

The results left a lot to be desired. According to Bedi, “most models (including reasoning models) struggled” when it came to “Administrative and Clinical Decision Support tasks.”

The researchers suggest that “complex reasoning scenarios” in their benchmark threw the AIs for a loop since they “couldn’t be solved through pattern matching alone” — which happens to be “exactly the kind of clinical thinking that matters in real practice,” per Bedi.

“With everyone talking about deploying AI in hospitals, we thought this was a very important question to answer,” Bedi told PsyPost.

For their benchmark test, the researchers made a clever adjustment to trip up the AIs: they replaced the correct answers of multiple-choice questions with the option “none of the other answers.” This change forced the AI models to actually reason their way to the right answer — and not rely on picking up familiar language patterns.
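To make that adjustment concrete, here is a minimal Python sketch of the substitution as the article describes it. It is an illustration only, not the researchers' code: the MCQ class, the replace_correct_with_nota function, and the sample question are hypothetical names and data invented for this example.

```python
from dataclasses import dataclass
from typing import Tuple

# Wording of the substitute option as quoted in the article.
NOTA = "None of the other answers"


@dataclass(frozen=True)
class MCQ:
    """A multiple-choice question (hypothetical structure for illustration)."""
    stem: str
    options: Tuple[str, ...]
    answer_index: int  # position of the correct option


def replace_correct_with_nota(question: MCQ) -> MCQ:
    """Return a copy of the question whose correct option text is replaced.

    The correct slot stays in the same position, so the answer key does not
    change. Only the familiar wording of the right answer disappears.
    """
    options = list(question.options)
    options[question.answer_index] = NOTA
    return MCQ(question.stem, tuple(options), question.answer_index)


# Made-up sample question, not taken from the study's benchmark.
original = MCQ(
    stem="Which vitamin deficiency causes scurvy?",
    options=("Vitamin A", "Vitamin C", "Vitamin D", "Vitamin K"),
    answer_index=1,
)
modified = replace_correct_with_nota(original)
```

In the modified question, the correct choice is still the second option, but a model can only pick it by ruling out the three remaining answers, since the text "Vitamin C" no longer appears anywhere.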

The team observed a significant decline in the models’ accuracy on the modified test compared with the original questions. For instance, OpenAI’s GPT-4o showed a reduction of 25 percent, while Meta’s Llama model showed a drop of almost 40 percent.
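A drop like this can be expressed in two ways, in percentage points or as a relative decline, and the figures above do not specify which. The short Python sketch below shows how the two differ. The function names and the accuracy values are hypothetical, chosen only to illustrate the arithmetic, and are not results from the study.

```python
def accuracy(predictions, answers):
    """Fraction of predictions that match the answer key."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("predictions and answers must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def accuracy_drop(original_acc, modified_acc):
    """Return the drop as (percentage points, relative percent)."""
    points = (original_acc - modified_acc) * 100
    relative = (original_acc - modified_acc) / original_acc * 100
    return points, relative


# Hypothetical accuracies, not figures from the paper.
points, relative = accuracy_drop(0.80, 0.60)
```

With these made-up figures, a fall from 80 percent to 60 percent accuracy is a drop of 20 percentage points but a relative decline of 25 percent, which is why the unit matters when comparing models.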

The results suggest current AI systems may be vastly over-relying on recognizing language patterns, making them inadequate for real-world clinical use.

“It’s like having a student who aces practice tests but fails when the questions are worded differently,” Bedi told PsyPost. “For now, AI should help doctors, not replace them.”

The research highlights the importance of finding new ways to evaluate the proficiency of AI models. That’s especially true for an extremely high-stakes environment like a hospital.

“Until these systems maintain performance with novel scenarios, clinical applications should be limited to nonautonomous supportive roles with human oversight,” the researchers wrote in their paper.



