"A language model for the dark side of the internet."
OpenAI's large language models (LLMs) are trained on a vast array of datasets, pulling information from the internet's dustiest, most cobweb-covered corners.
But what if such a model were to crawl through the dark web — the internet's seedy underbelly where you can host a site without your identity being public or even available to law enforcement — instead? A team of South Korean researchers did just that, creating an AI model dubbed DarkBERT to index some of the sketchiest domains on the internet.
It's a fascinating glimpse into some of the murkiest corners of the World Wide Web, which have become synonymous with illegal and malicious activities from the sharing of leaked data to the sale of hard drugs.
It sounds like a nightmare, but the researchers say DarkBERT has noble intentions: trying to shed light on new ways of fighting cybercrime, a field that has made increasing use of natural language processing.
Perhaps unsurprisingly, it wasn't easy to make sense of the parts of the web that aren't indexed by search engines like Google and can often only be accessed via specific software.
As detailed in a yet-to-be-peer-reviewed paper titled "DarkBERT: A language model for the dark side of the internet," the team crawled the dark web through the Tor network, a system for reaching sites that ordinary browsers can't. They then built a database of the raw data the crawl turned up and used it to train their model.
The team says their new LLM was far better at making sense of the dark web than other models that were trained to complete similar tasks, including RoBERTa, which Facebook researchers designed back in 2019 to "predict intentionally hidden sections of text within otherwise unannotated language examples," according to an official description.
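For readers curious what "predicting intentionally hidden sections of text" looks like in practice, here is a minimal sketch of the masking step only. It is a toy illustration, not the DarkBERT or RoBERTa code: the function name `mask_tokens`, the `<mask>` placeholder, the whole-word splitting, and the 15 percent masking rate are all assumptions chosen for the example. Real models work on subword tokens and pair this step with a neural network that learns to guess the hidden pieces.

```python
import random

MASK = "<mask>"  # placeholder symbol; chosen for illustration


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens.

    Returns the masked sequence and a dict mapping each hidden
    position to the original token a model would have to predict.
    """
    rng = random.Random(seed)  # fixed seed keeps the example repeatable
    masked = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked[i] = MASK
            targets[i] = tok
    # Make sure at least one token is hidden so there is something to predict.
    if not targets and tokens:
        i = rng.randrange(len(tokens))
        masked[i] = MASK
        targets[i] = tokens[i]
    return masked, targets


# Whole words stand in for tokens here (a simplifying assumption).
sentence = "the model learns to predict hidden words from context".split()
masked, targets = mask_tokens(sentence)
print(" ".join(masked))
print(targets)
```

During training, a model sees only the masked sentence and is scored on how well it recovers the entries in `targets`. No labels written by humans are needed, which is why this approach works on "otherwise unannotated" text.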
"Our evaluation results show that DarkBERT-based classification model outperforms that of known pretrained language models," the researchers wrote in their paper.
The team suggests DarkBERT could be used for a variety of cybersecurity-related tasks, such as detecting sites that sell ransomware or leak confidential data. It could also be used to crawl through the countless dark web forums that get updated daily and monitor them for any exchange of illicit information.
Overall, we'll believe it when we see it. But even if the system works as intended, do we really want to start letting AI police the internet?