Publicly traded cloud services provider Cloudflare has released a new free tool to prevent bots from scraping data from websites hosted on its platform to train AI models.
Some AI vendors, including Google, OpenAI, and Apple, allow website owners to block bots used for data scraping and model training by modifying their site's robots.txt, a text file that tells bots which pages they can visit on a website. But as Cloudflare points out in a post announcing its bot-fighting tools, not all AI scrapers respect this.
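For reference, such an opt-out is just a few lines of plain text at the site's root. The sketch below uses the AI-training crawler tokens the three vendors document publicly (OpenAI's GPTBot, Google's Google-Extended, Apple's Applebot-Extended); note that honoring the file is entirely voluntary on the crawler's side.

```
# robots.txt at the site root, e.g. https://example.com/robots.txt
# Tell the documented AI-training crawlers to stay off the whole site.
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

# All other crawlers may continue to visit normally.
User-agent: *
Allow: /
```

A crawler that identifies itself honestly reads this file before fetching pages; the point of Cloudflare's new tool is to deal with the ones that don't.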
“Customers don't want AI bots visiting their websites, especially not bots that are engaging in fraudulent activities,” the company wrote in a blog post. “We are concerned that some AI companies will continue to adapt relentlessly to evade bot detection as they try to get around the rules to access content.”
To address this, Cloudflare analyzed AI bot and crawler traffic and fine-tuned its automatic bot detection models, which take into account, among other signals, whether an AI bot is trying to evade detection by mimicking the appearance and behavior of a person using a web browser.
“When bad actors attempt to crawl large websites, they typically use tools and frameworks that we can fingerprint,” Cloudflare wrote. “Based on these signals, our models [are] able to properly flag traffic from evasive AI bots as bot traffic.”
Cloudflare said it has set up a form for hosts to report suspicious AI bots or crawlers, and that it will continue to manually blacklist AI bots.
The problem of AI bots has come into sharper focus as the generative AI boom fuels demand for model training data.
Many sites, wary of AI vendors training models on their content without warning or compensation, have chosen to block AI scrapers and crawlers: one study found that roughly 26% of the top 1,000 sites on the web block OpenAI's bot, while another found that more than 600 news publishers had blocked it.
However, blocking is not foolproof protection, and as we noted earlier, some vendors appear to be ignoring standard bot exclusion rules to gain an edge in the AI race: AI search engine Perplexity was recently accused of scraping content from websites while impersonating legitimate visitors, and OpenAI and Anthropic have both allegedly ignored robots.txt rules in the past.
In a letter to publishers last month, content licensing startup TollBit said that in reality “many AI agents” ignore the robots.txt standard.
Tools like Cloudflare's could help, but only if they prove accurate at detecting covert AI bots. And they won't solve the thornier problem of publishers risking the loss of referral traffic from AI tools like Google's AI Overviews, which exclude sites from summaries if they block specific AI crawlers.