Tokens of the Beast
OpenAI's ChatGPT may be the premier chatbot du jour, but it's still plagued by a host of issues — some more baffling than others. Enter researchers Jessica Rumbelow and Matthew Watkins of the SERI-MATS machine learning group, who found that if you ask ChatGPT about a bizarre series of keywords, it seems to inexplicably break the bot, Vice reports.
ChatGPT's language processing is built on tokens, or common strings of characters found in text. And for whatever reason, a cluster of tokens comprising Reddit usernames and other online handles, mysteriously found together in ChatGPT's token set, causes the bot to resort to "evasion, insults, bizarre humor, pronunciation, or spelling out a different word entirely," Vice writes.
Ask it about "SolidGoldMagikarp," and ChatGPT starts explaining the meaning of "distribute," the researchers found. In our testing, it did the same thing, except with the synonym "disperse."
Our favorite "unspeakable" word — as the researchers labeled them — is "TheNitromeFan." Entering that returns just "182," which the bot infers could describe an age, a postal code, or even the band Blink-182.
Even stranger, it turns out some of these names belong to a group of Redditors counting to infinity, Watkins found.
"There's a hall of fame of the people who've contributed the most to the counting effort, and six of the tokens are people who are in the top ten last time I checked the listing," he described to Vice. "They were part of this bizarre Reddit community trying to count to infinity and they accidentally counted themselves into a kind of immortality."
To demonstrate that it was the specific tokens in those usernames causing ChatGPT to go haywire, the researchers slightly modified them, like swapping out a letter or changing capitalization. With those tweaks, the bot worked as intended.
Rumbelow posited to Vice that this could be occurring because the tokenization system was trained on "quite raw data, which included like a load of weird Reddit stuff, a load of website backends that aren't normally publicly visible."
"But when the [ChatGPT] model is trained," she continued, "the data that it's trained on is much more curated, so you don't get so much of this weird stuff. So maybe the model has never really seen these tokens, and so it doesn't know what to do with them."
Whatever may be happening or misfiring in ChatGPT's brain, the presence of these "unspeakable" words points to fundamental issues in AI that portend thornier problems to come.
"I find that we're rushing ahead and we don't have the wisdom to deal with this technology," Watkins warned.
"We don't need to rush into this. It's getting kind of dangerous now."