The Outpost is a comprehensive collection of curated artificial intelligence software tools that cater to the needs of small business owners, bloggers, artists, musicians, entrepreneurs, marketers, writers, and researchers.
© 2025 TheOutpost.AI All rights reserved
Curated by THEOUTPOST
On Thu, 4 Jul, 12:01 AM UTC
4 Sources
[1]
Cloudflare wants to help stomp out AI scraper bots for good
AI bot blocking is about to become more effective with new Cloudflare tool

Cloudflare has announced a new feature that allows web hosting customers to block AI bots from scraping their website content without permission. Companies responsible for AI tools, particularly those involved with training, have come under fire in recent months over their use of content published online. Cloudflare says its new tool responds to widespread discontent among website owners and publishers about AI bots harvesting data and aims to protect content creators on the internet. In its announcement, Cloudflare wrote: "We hear clearly that customers don't want AI bots visiting their websites, and especially those that do so dishonestly. To help, we've added a brand new one-click to block all AI bots."

The tool is available to all customers, including those on Cloudflare's free tier. Previously, website owners have been able to use the robots.txt file to prevent bots from crawling their sites, but with varying degrees of success. The new tool offers a more robust defence against content scraping by using machine learning to identify and block bots. Cloudflare's system relies on digital fingerprinting to differentiate bots from legitimate users.

The company said that the enhanced protection comes as a response to significant AI bot activity across its network. Bytespider, GPTBot and ClaudeBot were among the most commonly observed, but despite previous blocking attempts, AI bots accessed more than one-third (39%) of the top one million sites served by Cloudflare in June. To enable the new feature, users can navigate to the Security > Bots menu and block "AI Scrapers and Crawlers" to protect their website from unwanted AI bot activity.
[2]
Cloudflare offers 1-click block against web-scraping AI bots
Cloudflare on Wednesday offered web hosting customers a way to block AI bots from scraping website content and using the data without permission to train machine learning models. It did so based on customer loathing of AI bots and, "to help preserve a safe internet for content creators," it said in a statement. "We hear clearly that customers don't want AI bots visiting their websites, and especially those that do so dishonestly. To help, we've added a brand new one-click to block all AI bots." There's already a somewhat effective method to block bots that's widely available to website owners, the robots.txt file. When placed in a website's root directory, automated web crawlers are expected to notice and comply with directives in the file that tell them to stay out. Given the widespread belief that generative AI is based on theft, and the many lawsuits attempting to hold AI companies accountable, firms trafficking in laundered content have graciously allowed web publishers to opt-out of the pilfering. Last August, OpenAI published guidance about how to block its GPTbot crawler using a robots.txt directive, presumably aware of concern about having content scraped and used for AI training without consent. Google took similar steps the following month. Also in September last year Cloudflare began offering a way to block rule-respecting AI bots, and 85 percent of customers - it's claimed - enabled this block. Now the network services biz aims to provide a more robust barrier to bot entry. The internet is "now flooded with these AI bots," it said, which visit about 39 percent of the top one million web properties served by Cloudflare. The problem is that robots.txt, like the Do Not Track header implemented in browsers fifteen years ago to declare a preference for privacy, can be ignored, generally without consequences. And recent reports suggest AI bots do just that. 
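The opt-out mechanism described above is just a plain-text file placed at a site's root. A minimal robots.txt that refuses the publicly documented AI training crawlers while leaving ordinary crawling alone might look like the sketch below. The GPTBot and Google-Extended tokens are the user agents OpenAI and Google document for this purpose; as the article notes, compliance is entirely voluntary on the bot's part.

```
# robots.txt -- placed at https://example.com/robots.txt
# Ask AI training crawlers to stay out. Well-behaved bots honor
# these directives; nothing here enforces them.

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# All other crawlers may proceed as normal.
User-agent: *
Allow: /
```

This is exactly the weakness the article describes: like the Do Not Track header, it is a request, not a barrier.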
Amazon last week said it was looking into evidence that bots working on behalf of AI search outfit Perplexity, an AWS client, had crawled websites, including news sites, and reproduced their content without suitable credit or permission. Amazon cloud customers are supposed to obey robots.txt, and Perplexity was accused of not doing that. Aravind Srinivas, CEO of the AI upstart, denied his biz was underhandedly ignoring the file, though conceded third-party bots used by Perplexity were the ones observed scraping pages against the wishes of webmasters. "Sadly, we've observed bot operators attempt to appear as though they are a real browser by using a spoofed user agent," Cloudflare said. "We've monitored this activity over time, and we're proud to say that our global machine learning model has always recognized this activity as a bot, even when operators lie about their user agent." Cloudflare said its machine-learning scoring system rated the disguised Perplexity bot below 30 consistently over a period from June 14 through June 27, indicating that it's "likely automated." This bot detection approach relies on digital fingerprinting, a technique commonly used to track people online and deny privacy. Crawlers, like individual internet users, often stand out from the crowd based on technical details that can be read through network interactions. These bots tend to use the same tools and frameworks for automating website visits. And with a network that sees an average of 57 million requests per second, Cloudflare has ample data to determine which of these fingerprints can be trusted. So this is what it's come to: machine learning models defending against bots foraging to feed AI models, available even for free tier customers. All customers have to do is click the Block AI Scrapers and Crawlers toggle button in the Security -> Bots menu for a given website.
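The fingerprint-versus-user-agent check described above can be sketched in a few lines. Everything in this sketch (the JA3-style fingerprint strings, the tiny lookup table, the category names) is an illustrative assumption, not Cloudflare's actual model or data; it only shows the shape of the idea: a request that claims to be a browser but carries the network fingerprint of an automation framework gets flagged as a bot.

```python
# Illustrative sketch: flag requests whose claimed browser does not
# match the client family their TLS handshake fingerprints to.
# Fingerprint strings and categories are invented for illustration.

# Hypothetical JA3-style TLS fingerprints mapped to the client
# family that actually produces them.
KNOWN_FINGERPRINTS = {
    "771,4865-4866,23-65281": "headless-automation",  # e.g. a scraping framework
    "771,4865-4867,0-10-11": "chrome",
}

def classify(user_agent: str, tls_fingerprint: str) -> str:
    """Classify a request from its claimed user agent plus a
    network-level fingerprint that is much harder to spoof."""
    family = KNOWN_FINGERPRINTS.get(tls_fingerprint, "unknown")
    claims_browser = "Mozilla/" in user_agent
    if claims_browser and family == "headless-automation":
        return "likely-bot"      # browser claim contradicts fingerprint
    if not claims_browser:
        return "declared-bot"    # honest, self-identifying crawler
    return "likely-human"

print(classify("Mozilla/5.0 (Windows NT 10.0)", "771,4865-4866,23-65281"))  # likely-bot
print(classify("GPTBot/1.0", "771,4865-4866,23-65281"))                     # declared-bot
```

The design point is the one Cloudflare makes: the user agent is attacker-controlled, while low-level handshake details leak the real tooling.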
"We fear that some AI companies intent on circumventing rules to access content will persistently adapt to evade bot detection," Cloudflare said. "We will continue to keep watch and add more bot blocks to our AI Scrapers and Crawlers rule and evolve our machine learning models to help keep the Internet a place where content creators can thrive and keep full control over which models their content is used to train or run inference on." ®
[3]
Cloudflare rolls out feature for blocking AI companies' web scrapers - SiliconANGLE
Cloudflare Inc. today debuted a new no-code feature for preventing artificial intelligence developers from scraping website content. The capability is available as part of the company's flagship CDN, or content delivery network. The platform is used by a sizable percentage of the world's websites to speed up page loading times for users. According to Cloudflare, the new scraping prevention feature is available in both the free and paid tiers of its CDN. Many AI companies use content from the public web to train their large language models. OpenAI, Google LLC and several other market players enable website operators to opt out of scraping. However, not all LLM developers provide such an option, which is the issue that Cloudflare hopes to address with its scraping prevention tool. The feature uses AI to detect automated content extraction attempts. According to Cloudflare, its software can spot bots that scrape content for LLM training projects even when they attempt to avoid detection. "Sadly, we've observed bot operators attempt to appear as though they are a real browser by using a spoofed user agent," Cloudflare engineers wrote in a blog post today. "We've monitored this activity over time, and we're proud to say that our global machine learning model has always recognized this activity as a bot." One of the crawlers that Cloudflare managed to detect is a bot that collects content for Perplexity AI Inc., a well-funded search engine startup. Last month, Wired reported that the manner in which the bot scrapes websites makes its requests appear as regular user traffic. As a result, website operators have struggled to block Perplexity AI from using their content. Cloudflare assigns every website visit that its platform processes a score of 1 to 99. The lower the number, the greater the likelihood that the request was generated by a bot.
According to the company, requests made by the bot that collects content for Perplexity AI consistently receive a score under 30. "When bad actors attempt to crawl websites at scale, they generally use tools and frameworks that we are able to fingerprint," Cloudflare's engineers detailed. "For every fingerprint we see, we use Cloudflare's network, which sees over 57 million requests per second on average, to understand how much we should trust this fingerprint." Cloudflare will update the feature over time to address changes in AI scraping bots' technical fingerprints and the emergence of new crawlers. As part of the initiative, the company is rolling out a tool that will enable website operators to report any new bots they may encounter.
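The 1-to-99 scoring with an under-30 "likely automated" cutoff can be illustrated with a toy scorer. Only the score range and the cutoff come from the reporting; the signal names and weights below are invented for illustration and bear no relation to Cloudflare's real model.

```python
# Toy version of a 1-99 bot score: lower = more likely automated.
# Signal names and weights are illustrative assumptions only.

def bot_score(signals: dict) -> int:
    """Combine a few request signals into a 1-99 score."""
    score = 99
    if signals.get("known_automation_fingerprint"):
        score -= 60   # handshake matches a known scraping framework
    if signals.get("spoofed_user_agent"):
        score -= 30   # claimed browser contradicts the fingerprint
    if signals.get("abnormal_request_rate"):
        score -= 20   # traffic pattern unlike a human visitor
    return max(1, score)

def is_likely_automated(signals: dict, cutoff: int = 30) -> bool:
    """Apply the under-30 threshold described in the article."""
    return bot_score(signals) < cutoff

disguised_crawler = {
    "known_automation_fingerprint": True,
    "spoofed_user_agent": True,
}
print(bot_score(disguised_crawler))            # 9
print(is_likely_automated(disguised_crawler))  # True
```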
[4]
Cloudflare launches a tool to combat AI bots | TechCrunch
Cloudflare, the publicly-traded cloud service provider, has launched a new, free tool to prevent bots from scraping websites hosted on its platform for data to train AI models. Some AI vendors, including Google, OpenAI and Apple, allow website owners to block the bots they use for data scraping and model training by amending their site's robots.txt, the text file that tells bots which pages they can access on a website. But, as Cloudflare points out in a post announcing its bot-combatting tool, not all bots respect this. "Customers don't want AI bots visiting their websites, and especially those that do so dishonestly," the company writes on its official blog. "We fear that some AI companies intent on circumventing rules to access content will persistently adapt to evade bot detection." So, in an attempt to address the problem, Cloudflare analyzed AI bot and crawler traffic to fine-tune an automatic bot detection model. The model considers, among other factors, whether an AI bot might be trying to evade detection by mimicking the appearance and behavior of someone using a web browser. "When bad actors attempt to crawl websites at scale, they generally use tools and frameworks that we are able to fingerprint," Cloudflare writes. "Based on these signals, our models [are] able to appropriately flag traffic from evasive AI bots as bots." Cloudflare has set up a form for hosts to report suspected AI bots and crawlers and says that it'll continue to manually blacklist new AI bots over time. The problem of AI bots has come into sharp relief as the generative AI boom fuels the demand for AI model training data. Many sites, wary of AI vendors training models on their content without alerting or compensating them, have opted to block AI scrapers. Around 26% of the top 1,000 sites on the web have blocked OpenAI's bot, according to one study; another found that more than 600 major news publishers had blocked the bot. Blocking isn't surefire, however.
As alluded to earlier, some vendors appear to be ignoring standard exclusion rules to gain a competitive advantage. AI search engine Perplexity was recently accused of impersonating legitimate visitors to scrape content from websites. Tools like Cloudflare's could help -- but only if they prove to be accurate in detecting clandestine AI bots.
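For comparison, the crudest self-rolled alternative to such tools is a server-side check against self-declared crawler user agents. The tokens below are real crawler names mentioned in these articles; note that this approach catches only honest, self-identifying bots and does nothing against spoofed user agents, which is exactly the gap that fingerprint-based detection like Cloudflare's tries to close.

```python
# Minimal server-side blocklist sketch: refuse requests whose
# User-Agent contains a known, self-declared AI crawler token.
# Trivially bypassed by a spoofed user agent.

AI_CRAWLER_TOKENS = ("GPTBot", "ClaudeBot", "Bytespider", "PerplexityBot")

def should_block(user_agent: str) -> bool:
    """True if the user agent self-identifies as a listed AI crawler."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in AI_CRAWLER_TOKENS)

print(should_block("Mozilla/5.0 (compatible; GPTBot/1.2)"))       # True
print(should_block("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))  # False
```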
Cloudflare has introduced a new feature aimed at helping website owners block AI companies from scraping their sites. The tool allows site owners to easily restrict access to content-scraping bots.
Cloudflare, a leading web infrastructure and security company, has launched a new feature designed to help website owners protect their content from being scraped by AI companies. The tool, announced on July 3, 2024, allows site owners to easily restrict access to bots that are scraping data to train AI models [1].
The new feature leverages Cloudflare's existing bot management tools to identify and block traffic from AI scraping bots. Website owners can enable the feature with a single click in their Cloudflare dashboard [2]. Once enabled, Cloudflare will automatically detect and mitigate bot traffic attempting to scrape the site's content for AI training purposes.
The introduction of this tool comes amidst growing concerns from publishers and content creators about their data being used, without consent or compensation, to train AI models [3]. Many AI companies rely on web scraping to gather the vast amounts of data needed to train large language models and other AI systems. Cloudflare's feature aims to give website owners more control over how their content is accessed and used.
While the development of AI technologies often relies on large datasets for training, there are important questions around intellectual property rights, fair use, and compensating content creators [4]. Cloudflare's new feature is a step towards addressing these concerns and empowering website owners to protect their content. However, the broader debate around balancing AI innovation with the rights of content creators is likely to continue as the technology advances.