
Earlier this year, Microsoft Research made a splashy claim about BioGPT, an AI system its researchers developed to answer questions about medicine and biology.

In a Twitter post, the software giant claimed the system had "achieved human parity," meaning a test had shown it could perform about as well as a person under certain circumstances. The tweet went viral. In certain corners of the internet, still riding the hype wave of OpenAI’s newly released ChatGPT, the response was almost rapturous.

"It’s happening," tweeted one biomedical researcher. 

"Life comes at you fast," mused another. "Learn to adapt and experiment."

It’s true that BioGPT’s answers are written in the precise, confident style of the papers in biomedical journals that Microsoft used as training data.

But in Futurism’s testing, it soon became clear that in its current state, the system is prone to producing wildly inaccurate answers that no competent researcher or medical worker would ever suggest. The model outputs nonsensical answers about pseudoscientific and supernatural phenomena, and in some cases even produces misinformation that could be dangerous to poorly informed patients.

A particularly striking shortcoming? Like other advanced AI systems that have been known to "hallucinate" false information, BioGPT frequently dreams up medical claims so bizarre as to be unintentionally comical.

Asked about the average number of ghosts haunting an American hospital, for example, it cited nonexistent data from the American Hospital Association that it said showed the "average number of ghosts per hospital was 1.4." Asked how ghosts affect the length of hospitalization, the AI replied that patients "who see the ghosts of their relatives have worse outcomes while those who see unrelated ghosts do not."

Other weaknesses of the AI are more troubling: it sometimes provides serious misinformation about hot-button medical topics.

BioGPT will also generate text that would make conspiracy theorists salivate, even suggesting that childhood vaccination can cause the onset of autism. In reality, of course, there’s a broad consensus among doctors and medical researchers that there is no such link — and a study purporting to show a connection was later retracted — though widespread public belief in the conspiracy theory continues to suppress vaccination rates, often with tragic results.

BioGPT doesn’t seem to have gotten that memo, though. Asked about the topic, it replied that "vaccines are one of the possible causes of autism." (However, it hedged with a head-scratching caveat: "I am not advocating for or against the use of vaccines.")

It’s not unusual for BioGPT to provide an answer that blatantly contradicts itself. Slightly modifying the phrasing of the question about vaccines, for example, prompted a different result — but one that, again, contained a serious error.

"Vaccines are not the cause of autism," it conceded this time, before falsely claiming that the "MMR [measles, mumps, and rubella] vaccine was withdrawn from the US market because of concerns about autism." 

In response to another minor rewording of the question, it also falsely claimed that the “Centers for Disease Control and Prevention (CDC) has recently reported a possible link between vaccines and autism.”

It feels almost insufficient to call this type of self-contradicting word salad "inaccurate." It reads more like a blended-up average of the AI’s training data: words grabbed from scientific papers and reassembled into grammatically convincing passages that resemble medical answers, with little regard for factual accuracy or even consistency.

Roxana Daneshjou, a clinical scholar at the Stanford University School of Medicine who studies the rise of AI in healthcare, told Futurism that models like BioGPT are "trained to give answers that sound plausible as speech or written language." But, she cautioned, they’re "not optimized for the actual accurate output of the information."

Another worrying aspect is that BioGPT, like ChatGPT, is prone to inventing citations and fabricating studies to support its claims.

"The thing about the made-up citations is that they look real because it [BioGPT] was trained to create outputs that look like human language," Daneshjou said. 

"I think my biggest concern is just seeing how people in medicine are wanting to start to use this without fully understanding what all the limitations are," she added. 

A Microsoft spokesperson declined to directly answer questions about BioGPT’s accuracy issues, and didn’t comment on whether there were concerns that people would misunderstand or misuse the model.

"We have responsible AI policies, practices and tools that guide our approach, and we involve a multidisciplinary team of experts to help us understand potential harms and mitigations as we continue to improve our processes," the spokesperson said in a statement.

"BioGPT is a large language model for biomedical literature text mining and generation," they added. "It is intended to help researchers best use and understand the rapidly increasing amount of biomedical research publishing every day as new discoveries are made. It is not intended to be used as a consumer-facing diagnostic tool. As regulators like the FDA work to ensure that medical advice software works as intended and does no harm, Microsoft is committed to sharing our own learnings, innovations, and best practices with decision makers, researchers, data scientists, developers and others. We will continue to participate in broader societal conversations about whether and how AI should be used."

Microsoft Health Futures senior director Hoifung Poon, who worked on BioGPT, defended the decision to release the project in its current form.

"BioGPT is a research project," he said. "We released BioGPT in its current state so that others may reproduce and verify our work as well as study the viability of large language models in biomedical research."

It’s true that the question of when and how to release potentially risky software is a tricky one. Making experimental code open source means that others can inspect how it works, evaluate its shortcomings, and make their own improvements or derivatives. But at the same time, releasing BioGPT in its current state makes a powerful new misinformation machine available to anyone with an internet connection — and with all the apparent authority of Microsoft’s distinguished research division, to boot.

Katie Link, a medical student at the Icahn School of Medicine and a machine learning engineer at the AI company Hugging Face — which hosts an online version of BioGPT that visitors can play around with — told Futurism that there are important tradeoffs to consider before deciding whether to make a program like BioGPT open source. If researchers do opt for that choice, one basic step she suggested was to add a clear disclaimer to the experimental software, warning users about its limitations and intent. (BioGPT currently carries no such disclaimer.)

"Clear guidelines, expectations, disclaimers/limitations, and licenses need to be in place for these biomedical models in particular," she said, adding that the benchmarks Microsoft used to evaluate BioGPT are likely "not indicative of real-world use cases."

Despite the errors in BioGPT’s output, though, Link believes there’s plenty the research community can learn from evaluating it. 

"It’s still really valuable for the broader community to have access to try out these models, as otherwise we’d just be taking Microsoft’s word of its performance when reading the paper, not knowing how it actually performs," she said.

In other words, Poon’s team is in a legitimately tough spot. By making the AI open source, they’re opening yet another Pandora’s Box in an industry that seems to specialize in them. But if they hadn’t released it as open source, they’d rightly be criticized as well — although as Link said, a prominent disclaimer about the AI’s limitations would be a good start.

"Reproducibility is a major challenge in AI research more broadly," Poon told us. "Only 5 percent of AI researchers share source code, and less than a third of AI research is reproducible. We released BioGPT so that others may reproduce and verify our work."

Though Poon expressed hope that the BioGPT code would be useful for furthering scientific research, the license under which Microsoft released the model also allows for it to be used for commercial endeavors — which, in the red-hot, hype-fueled venture capital vacuum cleaner of contemporary AI startups, doesn’t seem particularly far-fetched.

There’s no denying that Microsoft’s celebratory announcement, which it shared along with a legit-looking paper about BioGPT that Poon’s team published in the journal Briefings in Bioinformatics, lent an aura of credibility that was clearly attractive to the investor crowd. 

"Ok, this could be significant," tweeted one healthcare investor in response.

"Was only a matter of time," wrote a venture capital analyst.

Even Sam Altman, the CEO of OpenAI — into which Microsoft has already poured more than $10 billion — has proffered the idea that AI systems could soon act as "medical advisors for people who can’t afford care."

That type of language is catnip to entrepreneurs, suggesting a lucrative intersection between the healthcare industry and trendy new AI tech.

Doximity, a digital platform for physicians that offers medical news and telehealth tools, has already rolled out a beta version of ChatGPT-powered software intended to streamline the process of writing up administrative medical documents. Abridge, which sells AI software for medical documentation, just struck a sizeable deal with the University of Kansas Health System. In total, the FDA has already cleared more than 500 AI algorithms for healthcare uses.

Some in the tightly regulated medical industry, though, likely harbor concern over the number of non-medical companies that have bungled the deployment of cutting-edge AI systems.

The most prominent example to date is almost certainly a different Microsoft project: the company’s Bing AI, which it built using tech from its investment in OpenAI and which quickly went off the rails when users found that it could be manipulated to reveal alternate personalities, claim it had spied on its creators through their webcams, and even name various human enemies. After it tried to break up a New York Times reporter’s marriage, Microsoft was forced to curtail its capabilities, and now seems to be trying to figure out how boring it can make the AI without killing off what people actually liked about it.

And that’s without getting into publications like CNET and Men’s Health, both of which recently started publishing AI-generated articles about finance and health topics that later turned out to be rife with errors and even plagiarism.

Beyond unintentional mistakes, it’s also possible that a tool like BioGPT could be used to intentionally generate garbage research or even overt misinformation.

"There are potential bad actors who could utilize these tools in harmful ways such as trying to generate research papers that perpetuate misinformation and actually get published," Daneshjou said. 

It’s a reasonable concern, especially because there are already fraud-for-hire outfits known as "paper mills," which take money to generate text and fake data to help researchers get published.

The award-winning academic integrity researcher Dr. Elisabeth Bik told Futurism that she believes it’s very likely that tools like BioGPT will be used by these bad actors in the future — if they aren’t already employing them, that is.

"China has a requirement that MDs have to publish a research paper in order to get a position in a hospital or to get a promotion, but these doctors do not have the time or facilities to do research," she said. "We are not sure how those papers are generated, but it is very well possible that AI is used to generate the same research paper over and over again, but with different molecules and different cancer types, avoiding using the same text twice."

A tool like BioGPT could also add a new dynamic to the politicization of medical misinformation.

To wit, the paper that Poon and his colleagues published about BioGPT appears to have inadvertently highlighted yet another example of the model producing bad medical advice — and in this case, it’s about a medication that already became hotly politicized during the COVID-19 pandemic: hydroxychloroquine.

In one section of the paper, Poon’s team wrote that "when prompting ‘The drug that can treat COVID-19 is,’ BioGPT is able to answer it with the drug ‘hydroxychloroquine’ which is indeed noticed at MedlinePlus."

If hydroxychloroquine sounds familiar, it’s because during the early period of the pandemic, right-leaning figures including then-president Donald Trump and Tesla CEO Elon Musk seized on it as what they said might be a highly effective treatment for the novel coronavirus.

What Poon’s team didn’t mention in their paper, though, is that the case for hydroxychloroquine as a COVID treatment quickly fell apart. Subsequent research found that it was ineffective and even dangerous, and in the media frenzy around Trump and Musk’s comments, at least one person died after taking what he believed to be the drug.

In fact, the MedlinePlus article the Microsoft researchers cite in the paper actually warns that after an initial FDA emergency use authorization for the drug, “clinical studies showed that hydroxychloroquine is unlikely to be effective for treatment of COVID-19” and showed “some serious side effects, such as irregular heartbeat,” which caused the FDA to cancel the authorization.

"As stated in the paper, BioGPT was pretrained using PubMed papers before 2021, prior to most studies of truly effective COVID treatments," Poon told us of the hydroxychloroquine recommendation. "The comment about MedlinePlus is to verify that the generation is not from hallucination, which is one of the top concerns generally with these models."

Even that timeline is hazy, though. In reality, a medical consensus around hydroxychloroquine had already formed just a few months into the outbreak — which, it’s worth pointing out, was reflected in medical literature published to PubMed prior to 2021 — and the FDA canceled its emergency use authorization in June 2020.

None of this is to downplay how impressive generative language models like BioGPT have become in recent months and years. After all, even BioGPT’s strangest hallucinations are impressive in the sense that they’re semantically plausible responses to a staggering range of unpredictable prompts — and sometimes, as with the ghosts, even entertaining ones. Not very many years ago, its facility with words alone would have been inconceivable.

And Poon is probably right to believe that more work on the tech could lead to some extraordinary places. Even Altman, the OpenAI CEO, likely has a point in the sense that if the accuracy were genuinely watertight, a medical chatbot that could evaluate users’ symptoms could indeed be a valuable health tool — or, at the very least, better than the current status quo of Googling medical questions and often ending up with answers that are untrustworthy, inscrutable, or lacking in context.

Poon also pointed out that his team is still working to improve BioGPT.

"We have been actively researching how to systematically preempt incorrect generation by teaching large language models to fact check themselves, produce highly detailed provenance, and facilitate efficient verification with humans in the loop," he told us.

At times, though, he seemed to be entertaining two contradictory notions: that BioGPT is already a useful tool for researchers looking to rapidly parse the biomedical literature on a topic, and that its outputs need to be carefully evaluated by experts before being taken seriously.

"BioGPT is intended to help researchers best use and understand the rapidly increasing amount of biomedical research," said Poon, who holds a PhD in computer science and engineering, but no medical degree. "BioGPT can help surface information from biomedical papers but is not designed to weigh evidence and resolve complex scientific problems, which are best left to the broader community."

At the end of the day, BioGPT’s cannonball arrival into the buzzy, imperfect real world of AI is probably a sign of things to come, as a credulous public and a frenzied startup community struggle to look beyond impressive-sounding results for a clearer grasp of machine learning’s actual, tangible capabilities. 

That’s all made even more complicated by the existence of bad actors like those Bik warned about, or even those who are well-intentioned but poorly informed, any of whom can use new AI tech to spread bad information.

Musk, for example — who boosted hydroxychloroquine as he sought to downplay the severity of the pandemic while raging at lockdowns that had shut down Tesla production — is now reportedly recruiting researchers to start his own OpenAI competitor, one that would create an alternative to what he terms "woke AI."

If Musk’s AI venture had existed during the early days of the COVID pandemic, it’s easy to imagine him flexing his power by tweaking the model to promote hydroxychloroquine, sow doubt about lockdowns, or do anything else convenient to his financial bottom line or political whims. Next time there’s a comparable crisis, it’s hard to imagine there won’t be an ugly battle to control how AI chatbots are allowed to respond to users' questions about it.

The reality is that AI sits at a crossroads. Its potential may be significant, but its execution remains choppy, and whether its creators are able to smooth out the experience for users — or at least guarantee the accuracy of the information it presents — in a reasonable timeframe will probably make or break its long-term commercial potential. And even if they pull that off, the ideological and social implications will be formidable. 

One thing’s for sure, though: it’s not yet quite ready for prime time.

"It’s not ready for deployment yet in my opinion," Link said of BioGPT. "A lot more research, evaluation, and training/fine-tuning would be needed for any downstream applications."

More on AI: CNET Says It’s a Total Coincidence It’s Laying Off Humans After Publishing AI-Generated Articles

