OpenAI Deploys Crawler to Vacuum Up Your Posts and Train AI With Them

OpenAI's "GPTBot" is hitting the streets, ready to devour anything that's ever been posted online. Luckily, it's pretty easy to block.

*Image: Getty Images*

Data Scraper

OpenAI has launched a new web crawler called “GPTBot” that will trawl the internet for content to train its large language models like GPT-4, which power ChatGPT.

“Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety,” reads a post on OpenAI’s website.

The AI juggernaut also claims that pages crawled by GPTBot are “filtered” to remove paywalled sources, sites known to gather personally identifiable information, and text that violates its policies.

Fortunately, OpenAI does provide a way to easily block GPTBot by adding an entry to a website’s robots.txt, a file that tells web crawlers, including those from search engines like Google, what they’re allowed to access.
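For the curious, here is a minimal sketch of what such an entry looks like and how a rule-following crawler would interpret it, using Python's standard-library robots.txt parser. The two-line rule is the full-block form OpenAI describes; the example.com URL and the helper name `is_allowed` are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The entry a site would add to its robots.txt to turn GPTBot away entirely.
BLOCK_GPTBOT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper for illustration; it only models crawlers that
    choose to honor robots.txt.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/posts/my-story"  # placeholder URL
    print("GPTBot allowed:", is_allowed(BLOCK_GPTBOT, "GPTBot", page))
    print("Googlebot allowed:", is_allowed(BLOCK_GPTBOT, "Googlebot", page))
```

Because the rule names GPTBot specifically, other crawlers that aren't listed in the file are unaffected. Keep in mind that robots.txt is a request, not an enforcement mechanism: it works only for crawlers that agree to follow it.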

Moreover, administrators can customize which parts of their sites GPTBot can crawl. OpenAI also publishes the IP address ranges the crawler uses, so it can be blocked at the network level, too.
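Both options can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a drop-in configuration: the directory names are invented, and the address ranges below are reserved documentation blocks standing in for OpenAI's real list, which site owners would need to copy from OpenAI's own documentation.

```python
import ipaddress
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that lets GPTBot into one directory and keeps it
# out of another. The directory names are placeholders.
SCOPED_RULES = """\
User-agent: GPTBot
Allow: /blog/
Disallow: /members/
"""

# Placeholder CIDR blocks taken from reserved documentation ranges.
# These are NOT OpenAI's addresses; substitute the published list.
PLACEHOLDER_RANGES = ["192.0.2.0/24", "198.51.100.0/24"]


def gptbot_may_fetch(robots_txt: str, url: str) -> bool:
    """Return True if robots_txt permits GPTBot to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("GPTBot", url)


def ip_in_ranges(address: str, cidr_blocks: list[str]) -> bool:
    """Return True if address falls inside any of the given CIDR blocks."""
    ip = ipaddress.ip_address(address)
    return any(ip in ipaddress.ip_network(block) for block in cidr_blocks)


if __name__ == "__main__":
    print(gptbot_may_fetch(SCOPED_RULES, "https://example.com/blog/post"))
    print(gptbot_may_fetch(SCOPED_RULES, "https://example.com/members/area"))
    print(ip_in_ranges("192.0.2.17", PLACEHOLDER_RANGES))
```

In practice the IP check would live in a firewall or web server rule rather than application code; the function above only shows the matching logic.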

Keep Out!

Until now, the large language models behind ChatGPT were trained on troves of online data gathered up to September 2021.

There’s no way to retroactively remove data that was scraped before that cutoff date, but blocking OpenAI’s new web crawler will at least keep a site’s content out of future scrapes.

And you can bet that many site owners, who probably aren’t keen on having their content hoovered up and imitated by an AI, are already taking advantage of this.

One example is popular sci-fi magazine Clarkesworld, which announced on X, formerly known as Twitter, that it was blocking GPTBot.

Tech outlet The Verge has quietly done the same, and countless articles are already circulating that advise on how to block the crawler.

Creepy Crawlies

Of course, web crawlers are, for better or for worse, the lifeblood of the modern internet and are nothing new. In many cases, websites are encouraged to let crawlers from Google and other search engines through to help bring them web traffic.

Now, though, many feel that having them scrape data to train generative AI is a bridge too far.

For example, a recent lawsuit against OpenAI argues that, since its chatbot is trained on everyone’s writing without permission (everything from books to articles available online), that training constitutes theft.

That OpenAI’s gone ahead and announced GPTBot despite the lawsuit may suggest that it’s not worried about the outcome. On the other hand, by now giving websites the option to block the crawler, it may be covering its bases, too.
