ChatGPT Can Pass Medical Tests, But Its Actual Medical Advice Is a Lot More Dubious


A few months ago, OpenAI CEO Sam Altman posited that AIs like ChatGPT could serve as a "medical advisor" for poor people without healthcare.

It sounded like a dumb idea then, and it sounds like a dumb idea now. In fact, according to new research from medical experts at Stanford University, even though OpenAI's ChatGPT has passed all sorts of tests, including the US Medical Licensing Exam, the chatbot is worryingly unreliable at responding to real-life medical scenarios.

STAT News reports that the research, while not yet fully released and still pending peer review, found that nearly 60 percent of ChatGPT's answers to actual medical situations either disagreed with a human expert's opinion or weren't relevant enough to be helpful. Not exactly an A+.

In their testing, the Stanford researchers asked the chatbot 64 real-life medical questions, recorded its responses, and had twelve clinical experts evaluate them.

When the chatbot ran on GPT-4, the latest and most powerful large language model behind it, more than 90 percent of its answers were deemed "safe" enough to not be harmful (though not necessarily completely accurate). Could be worse!

Still, only a considerably lower 41 percent of its answers "agreed" with the answers of medical experts, while some 29 percent were simply too vague or irrelevant to be assessed. That seems too unreliable for a potential medical assistant, at least for now.
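For readers who want to check how the 41 and 29 percent figures square with the "nearly 60 percent" headline number, here is a rough back-of-the-envelope sketch. It assumes the three outcomes (agreed, disagreed, too vague or irrelevant) are mutually exclusive and cover every answer, which the reporting implies but does not state outright. The variable names are ours, not the researchers'.

```python
# Percentages reported for GPT-4's answers.
agreed = 41
vague_or_irrelevant = 29

# Assumption: the categories don't overlap and together account for all answers.
disagreed = 100 - agreed - vague_or_irrelevant

# Answers that either disagreed with experts or were too vague to assess.
not_helpful = disagreed + vague_or_irrelevant

print(f"Implied disagreement with experts: {disagreed} percent")
print(f"Disagreed or too vague to assess: {not_helpful} percent")
```

Under that assumption, the implied share of answers that flatly disagreed with the experts is about 30 percent, and the combined figure lines up with the "nearly 60 percent" cited by STAT.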

Some have walked back claims of AI's usefulness in this regard, framing it instead as a helpful tool to handle tedious medical paperwork or to provide patients with instructions. But Mark Sendak, a clinical data scientist at Duke University, says this is a slippery slope.

"We shouldn't feel reassured by claims that these tools are only intended to help physicians [with administrative tasks]," Sendak told STAT, adding that he's doubtful that proper evaluations on AI's ability to handle medical "back of house" tasks will be done consistently.

To be fair, the humans involved in the testing had an advantage: access to patients' health records, which ChatGPT is obviously not privy to. But this in turn highlights an inherent flaw in how the AI is tested, the researchers say: it is evaluated only on its textbook knowledge and not on its ability to actually help doctors, which echoes Sendak's doubts that these AIs will be tested properly.

"We're evaluating these technologies the wrong way," Nigam Shah, a professor of medicine at Stanford who led the research, told STAT. "What we should be asking and evaluating is the hybrid construct of the human plus this technology."

Shah adds that he was still "blown away" by GPT-4's improvement over its predecessor, which only agreed with medical experts 20 percent of the time.

Next, he envisions testing ChatGPT's ability to help consult on a tumor board, comparing the results of a board using the AI to one that's entirely human.
