Majority of Humans Fooled by GPT-4 in Turing Test, Scientists Find

Pass/Fail

OpenAI’s GPT-4 is so lifelike, it can apparently trick more than 50 percent of human test subjects into thinking they’re talking to a person.

In a new paper, cognitive science researchers from the University of California San Diego found that more than half the time, people mistook writing from GPT-4 for the work of a flesh-and-blood human. In other words, the large language model (LLM) passes the Turing test with flying colors.

The researchers performed a simple experiment: they asked roughly 500 people to have five-minute text-based conversations with either a human or a chatbot built on GPT-4. They then asked the subjects if they thought they’d been conversing with a person or an AI.

The results, as the San Diego scientists reported in their not-yet-peer-reviewed paper, were telling: 54 percent of the subjects believed they’d been speaking to humans when they’d actually been chatting with OpenAI’s creation.

First proposed back in 1950 by computer science pioneer Alan Turing, the Turing Test is more of a thought experiment than an actual battery of tests. In his original formulation, Turing had three “players”: a human interrogator and two witnesses, one human and one machine, with the interrogator trying to work out which was which.

For their study, the UC San Diego researchers tweaked Turing’s original three-player formula by dropping the second witness to simplify the setup, so each interrogator chatted with a single witness. They then had the roughly 500 participants communicate with one of four witness types: another human, GPT-3.5, GPT-4, or the rudimentary ELIZA chatbot from the 1960s.

Coin Toss

The study’s authors, Cameron Jones and Benjamin Bergen, hypothesized that subjects would usually be able to tell whether they were communicating with a human or with ELIZA, but that when it came to the OpenAI LLMs, they would essentially have a 50/50 chance.

As it turns out, they were pretty much on the money. Beyond the 54 percent who mistook GPT-4 for a human, exactly 50 percent of the subjects mistook GPT-3.5, GPT-4’s direct predecessor, for a person as well. Compared to the 22 percent who thought ELIZA was the real deal, that’s pretty stunning.

👀 "the first robust empirical demonstration that any artificial system passes an interactive 2-player Turing test."

GPT-4 was judged to be human by other humans 54% of the time (though humans were judged to be human 67% of the time). https://t.co/JCNUCG2AP5
— Ethan Mollick (@emollick) May 15, 2024

Despite still being under review, the paper has already made waves in the tech world with a shoutout from Ethereum cofounder Vitalik Buterin, who declared on the Farcaster social network that to his mind, the San Diego research “counts as [GPT-4] passing the Turing test.”

While others have claimed to observe OpenAI’s GPT models passing the Turing test, the Buterin endorsement makes this study stand apart, though we’ll probably have to wait for the paper to be peer-reviewed before any grander declarations can be made.
