Image by Getty / Futurism

You might feel okay using ChatGPT to write a tedious email or two, but would you trust it to be your doctor? New research suggests that you probably shouldn't.

After being presented with 150 medical cases, the AI chatbot gave a correct diagnosis less than half of the time, as detailed in a new study published in the journal PLOS One.

The findings show that in its current form, ChatGPT "is not accurate as a diagnostic tool," the researchers wrote, calling into question efforts by companies like Google to test chatbots in hospitals. And as AI models built specifically for medical purposes are released, the authors worry that the public will overestimate the technology's capabilities.

"If people are scared, confused, or just unable to access care, they may be reliant on a tool that seems to deliver medical advice that's 'tailor-made' for them," study co-author and Western University assistant professor Amrit Kirpalani told Live Science.

"I think as a medical community (and among the larger scientific community) we need to be proactive about educating the general population about the limitations of these tools in this respect. They should not replace your doctor yet."

In their experiment, the researchers fed GPT-3.5, the large language model behind ChatGPT, a variety of medical cases from Medscape, an online resource for medical professionals, that had already been accurately diagnosed. They only chose cases from after August 2021, to ensure that they weren't included in ChatGPT's training data.

To make things fair, ChatGPT also got to look at patient history, any findings from physical examinations, and lab and imaging results — all things your average human doc would have access to.

With each case, the bot had to choose from four different multiple-choice answers, with only one being correct. It also had to explain its reasoning behind the diagnosis, and in some cases, provide citations.

If ChatGPT were a med student, it would've gotten a flat-out F: it made the correct diagnosis only 49 percent of the time, and gave "complete and relevant" answers just 52 percent of the time.

Overall accuracy, however, was a lot better. This metric measured how well ChatGPT could rule out the wrong choices across all the multiple-choice options, and there it scored 74 percent, meaning the bot was surprisingly good at recognizing what was incorrect.

That's impressive, but the chatbot still struggled to pick out the final, correct diagnosis. Its biggest deficiencies compared to human doctors were difficulty interpreting numerical values and an inability to read medical images. The researchers found that it would occasionally hallucinate, too, and would sometimes ignore key information.

That being said, the researchers suggest that AI could be useful for teaching trainee doctors and even assisting fully-fledged ones, so long as the final call is left to human healthcare providers.

"I think with AI and chatbots specifically, the medical community will ultimately find that there's a huge potential to augment clinical decision-making, streamline administrative tasks, and enhance patient engagement," Kirpalani told LiveScience.

More on AI: Doctors Using AI to Automatically Generate Clinical Notes on Patients

