Reddit announced on Tuesday that it's updating its Robots Exclusion Protocol, or robots.txt file, which tells automated web crawlers whether they're allowed to crawl a site.
Historically, robots.txt files were used to allow search engines to index sites and direct users to their content. With the rise of AI, however, websites can now be scraped to train models without the content's actual source being acknowledged.
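To illustrate how the protocol works, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The rules shown are hypothetical examples, not Reddit's actual robots.txt file:

```python
from urllib import robotparser

# Hypothetical robots.txt rules, for illustration only -- not Reddit's real file.
rules = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# A generic crawler may fetch public pages, but not anything under /private/.
print(parser.can_fetch("SomeCrawler", "/index.html"))  # True
print(parser.can_fetch("SomeCrawler", "/private/x"))   # False

# "BadBot" is disallowed from the entire site.
print(parser.can_fetch("BadBot", "/index.html"))       # False
```

The key point, and the heart of the dispute described below, is that these rules are purely advisory: nothing technically prevents a crawler from fetching a disallowed URL anyway.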
Along with updating its robots.txt file, Reddit will continue to restrict and block unknown bots and crawlers from accessing the platform. The company told TechCrunch that bots and crawlers will be rate-limited or blocked if they don't comply with Reddit's public content policies and don't have an agreement with the platform.
Reddit says the update shouldn't affect the vast majority of users, or well-intentioned parties like researchers and organizations such as the Internet Archive. Rather, the update is intended to stop AI companies from training large language models on Reddit content. Of course, since the protocol is voluntary, AI crawlers can simply ignore Reddit's robots.txt file.
The announcement comes days after a Wired investigation found that AI-powered search startup Perplexity was stealing and scraping content. Wired noted that Perplexity appeared to be ignoring requests not to scrape websites, continuing to access them even after being blocked via robots.txt. Perplexity CEO Aravind Srinivas responded to the allegations by saying that robots.txt is not a legal framework.
Reddit's upcoming changes won't affect companies that have contracts with it. For example, Reddit has a $60 million deal with Google that allows the search giant to train its AI models on the social platform's content. The changes signal that other companies wanting to use Reddit's data for AI training will have to pay for the privilege.
“Anyone who accesses Reddit's content must comply with our policies, including those designed to protect Reddit users,” Reddit said in a blog post. “We are selective about who we work with and trust with access to Reddit's content at scale.”
The announcement comes as no surprise, as Reddit released new policies a few weeks ago designed to guide how Reddit data can be accessed and used by commercial entities and other partners.