ChatGPT Actually Gets Half of Programming Questions Wrong

Researchers found that ChatGPT got just over half of 517 Stack Overflow prompts wrong, indicating it's far from reliable at spitting out correct code. (*Image: Getty / Futurism*)

Failing Grade

Not long after it was released to the public, programmers started to notice a striking feature of OpenAI’s ChatGPT: it could quickly spit out code in response to simple prompts.

But should software engineers really trust its output?

In a yet-to-be-peer-reviewed study, researchers at Purdue University found that the uber-popular AI tool got just over half of 517 software engineering prompts from the question-and-answer platform Stack Overflow wrong. It's a sobering reality check that should have programmers think twice before deploying ChatGPT’s answers in anything important.

Pathological Liar

The research goes further, though, and also looks at how well humans can spot those mistakes. The researchers asked a group of 12 participants with varying levels of programming expertise to analyze ChatGPT’s answers. While the participants tended to rate Stack Overflow’s answers higher across categories including correctness, comprehensiveness, conciseness, and usefulness, they weren’t great at identifying the answers ChatGPT got wrong, failing to identify incorrect answers 39.34 percent of the time.

In other words, ChatGPT is a very convincing liar — a reality we’ve become all too familiar with.

“Users overlook incorrect information in ChatGPT answers (39.34 percent of the time) due to the comprehensive, well-articulated, and humanoid insights in ChatGPT answers,” the paper reads.

So how worried should we really be? For one, there are many ways to arrive at the same “correct” answer in software. A lot of human programmers also say they verify ChatGPT’s output, suggesting they understand the tool’s limitations. But whether that’ll continue to be the case remains to be seen.

Lack of Reason

The researchers argue that a lot of work still needs to be done to address these shortcomings.

“Although existing work focus on removing hallucinations from [large language models], those are only applicable to fixing factual errors,” they write. “Since the root of conceptual error is not hallucinations, but rather a lack of understanding and reasoning, the existing fixes for hallucination are not applicable to reduce conceptual errors.”

In response, we need to focus on “teaching ChatGPT to reason,” the researchers conclude — a tall order for this current generation of AI.
