
Could a chatbot be your doctor someday? It's looking more likely than you'd think.

In a recently published study, 50 doctors were asked to diagnose medical conditions by examining case reports, with some of them randomly assigned to use ChatGPT to help with their decision-making.

During the experiment, the participating doctors were graded not just on the correctness of their final diagnosis, but also on how well they could explain their thought process.

Based on those criteria, the doctors who worked on their own scored 74 percent on average, while those who collaborated with the AI chatbot to reach their diagnosis scored 76 percent.

But both groups were vastly outperformed by something that never went to med school: ChatGPT, acting on its own, blew the human docs out of the water with an average score of 90 percent.

The research, published in the journal JAMA Network Open, was small in scope: the 50 doctors examined only six case studies. Nonetheless, it has striking implications for the role of AI in the medical field, and perhaps for the biases held by human doctors.

"I was shocked at the results," study coauthor Adam Rodman, an internal medicine specialist at Beth Israel Deaconess Medical Center in Boston, said in an interview on the New York Times podcast Hard Fork. "My hypothesis going in was that people using [ChatGPT] would be the best. So I am surprised by this."

These cases, based on real medical patients, were intentionally challenging. Nevertheless, ChatGPT overwhelmingly prevailed. According to Rodman, that could be as much a testament to the AI model's capabilities as to human doctors' stubbornness.

The MDs using ChatGPT, for example, may have resisted the chatbot's second opinion, dismissing it as wrong and doubling down on their first guess as a result.

Another factor that could explain why the doctors lagged behind the technology is that they simply weren't familiar with using it.

But Rodman pushed back against the takeaway that ChatGPT is more competent than your average human doc. "The difference is that the people who put the case together, like, the information, if you want to think about the prompts, were expert clinicians," he said on the podcast. "We organize it in such a way."

In other words, human medical professionals did all the hard work of accurately gathering and presenting medical information in the final case reports — something an AI can't do, at least yet. To declare that the AI obviously trumps the doctors is sort of like a chef taking all the credit for a tasty meal when they used someone else's recipe.

The study was also designed primarily to test how effectively a chatbot could help doctors (the answer, it turned out: not very much), not to demonstrate that AI is superior. In fact, similar studies have found ChatGPT to be terrible at diagnosing cases.

What's striking, though, is that the experiment was conducted a year ago using an older version of ChatGPT. The results could be even more impressive now.

"Maybe AI models are better at making diagnoses than human doctors. But I don't think that's the case with GPT-4 Turbo, which was the model that was used here," Rodman said. "But it's going to be true at some point, and we're quickly approaching that."

More on AI: Elon Musk Recommends Feeding Your Medical Scans Into His Grok AI

