ijk 10mo ago

What I want to know is if the flood of scraping everyone has been complaining about is coming from people trying to scrape for training or bots doing RAG search.

I get that everyone wants data, but presumably the big players already scraped the web. Do they really need to do it again? Or is it bit players reproducing data that's likely already in the training set? Or is it really that valuable to have your own scraped copy of internet scale data?

I feel like I'm missing something here. My expectation is that RAG traffic is going to be orders of magnitude higher than scraping for training. Not that it would be easy to measure from the outside.

mattcollins 10mo ago

I wondered about this, too.

Cloudflare have some recent data about traffic from bots (https://blog.cloudflare.com/from-googlebot-to-gptbot-whos-cr...) which indicates that, for the time being, the overwhelming majority of the bot requests are for AI training and not for RAG.

progmetaldev 10mo ago

I believe it's both. We're at a place where legislation hasn't really declared what is and isn't allowed. These scrapers are acting like Googlebot or any other search engine crawler, and trying to find any kind of new content that might be of value to their users.

New data is still being added online daily (probably hourly, if not more often) by humans, and the first ones to gain access could be the "winners," particularly if their users happen to need up-to-date data (and the service happens to have scraped it). Just like with search engines/crawlers, there are the big players that may respect your website, and there are those that don't rate-limit themselves or respect robots.txt.

wiether 10mo ago

You should ask Zuck, since, from what we've seen and what we were asked to act against, Meta is the main culprit in scraping every single page of websites, multiple times a day.

And I'm talking about ecommerce websites, with their bot scraping every variation of each product, multiple times a day.
