0 points · intellectronica · 1y ago · 0 comments

Right, but how is it related to AI?

Is there just a correlation between crawlers that don't respect indexing norms and crawlers that operate in the service of AI products?

Or is there something about the sort of indexing people do for AI that makes this nasty behaviour more likely?

7 comments · 2 top-level

blakesterz · 1y ago · 5 in thread

If normal crawlers are a light rain, AI crawlers are a hurricane. Most sites can handle some rain, but they are not built to handle hurricanes. AI crawlers can look like DDoS attacks. The worst offenders will just crawl a site as fast as possible until it goes offline.

intellectronica (OP) · 1y ago

Yes, I understand. And sorry to hear that. But I'm trying to understand how it is related to AI. How come this is happening with AI crawlers but not with traditional web index crawlers? If the pattern is so common (which is confirmed by multiple credible sources), there must be some interesting and potentially useful explanation.

graemep · 1y ago

Search engines link to websites. They want the websites up, so it's worth a little extra work to avoid harming them. LLMs seek to replace the websites.

Search engine crawlers are more mature and better written.

I suspect a lot of LLM crawling and development is done under time pressure, to get things done while the investors' money is still coming in to fund it. Do stuff in a hurry, and it will be less competently done.
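For context on the "little extra work" a well-behaved crawler does, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not taken from any real crawler: the robots.txt content, the `example-bot` user agent, the URLs, and the helper names `build_policy` and `plan_fetches` are all made up for illustration.

```python
import urllib.robotparser

# Hypothetical robots.txt content; a real crawler would download this from the site.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def build_policy(robots_txt: str) -> urllib.robotparser.RobotFileParser:
    """Parse robots.txt text into a policy object (no network access)."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def plan_fetches(parser, user_agent, urls, default_delay=1.0):
    """Return (url, wait_seconds) pairs for the URLs the site allows."""
    delay = parser.crawl_delay(user_agent)
    if delay is None:
        # The site gave no Crawl-delay, so fall back to our own assumed pause.
        delay = default_delay
    return [(url, float(delay)) for url in urls if parser.can_fetch(user_agent, url)]


policy = build_policy(ROBOTS_TXT)
plan = plan_fetches(
    policy,
    "example-bot",
    ["https://example.com/", "https://example.com/private/x"],
)
print(plan)
```

A crawler that skips this step, or ignores the delay it returns, is the kind that shows up as the "hurricane" described upthread.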

TonyTrapp · 1y ago

My understanding is that everyone wants to be first in the AI race, so they throw all the rules everyone else agreed on overboard.

thisdougb · 1y ago

https://diff.wikimedia.org/2025/04/01/how-crawlers-impact-th...

This is interesting on the subject. Maybe gives you some understanding. Posted on HN a few minutes ago.

cess11 · 1y ago

Ordinary search indices don't contain the entire target site, while LLM-style so-called AI does consume it all. I would guess some of these crawlers are subcontractors rather than "AI" companies, i.e. they compete on having the most complete and fresh dataset you could rent for "training".

Whenever the market decides the Internet is too full of slop to be usable for "training", the one that has the most copies of the pre-"AI" Internet wins. Some of the traffic is likely "AI" "tool use", i.e. bot scraping as part of running some LLM, i.e. "AI" "research".

The big scraping bots have gone from stupid to ruthless. Previously it was irritating that some of them got stuck traversing cyclical link paths on your site or on-the-fly generated pages. Now it's like your silly family blog suddenly got very popular for no good reason, and it puts a lot of load on the tiny amount of hardware it's served from.

structural · 1y ago

One major difference is that while indexing, you're generating an internal data structure that represents that site. Once done, if the site doesn't change, you don't have any need to revisit it, and in fact, fetching the site multiple times just increases your own costs.

On the other hand, an unsupervised AI training algorithm may just need raw text, and as much of it as possible. It doesn't know what site it came from or much care, and it's not building any index that links the content back to its original source. So fetching the site on each training epoch might actually be viable: why bother storing the entire internet when you can just fetch -> transform -> ingest into your model? If your crawler is distributed enough, it won't be the bottleneck, either.

If this is the architecture some companies are using, this also means that these crawlers won't ever stop, because they are fine-tuning some model by constantly updating it over time based on the "current" internet, whatever that might mean.
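To make the "index once, don't refetch unchanged sites" side of that contrast concrete, here is a rough sketch of a crawler that keeps a local copy and revalidates it instead of downloading the page again. It mimics HTTP conditional requests (`ETag` / `If-None-Match` with a `304 Not Modified` reply), but everything here is an assumption for illustration: `CachingCrawler`, the `fake_site` stand-in for an HTTP client, and the toy validator are invented, and nothing is known about how any real AI crawler is built.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple

# fetch(url, known_etag) -> (status, etag, body); a hypothetical stand-in for an HTTP client.
Fetch = Callable[[str, Optional[str]], Tuple[int, str, Optional[str]]]


@dataclass
class CachingCrawler:
    fetch: Fetch
    cache: Dict[str, Tuple[str, str]] = field(default_factory=dict)  # url -> (etag, body)
    bytes_transferred: int = 0

    def get(self, url: str) -> str:
        known = self.cache.get(url)
        status, etag, body = self.fetch(url, known[0] if known else None)
        if status == 304 and known:
            # Unchanged since the last visit: reuse the stored copy, no body sent.
            return known[1]
        assert body is not None
        self.bytes_transferred += len(body.encode())
        self.cache[url] = (etag, body)
        return body


def fake_site(url: str, etag: Optional[str]) -> Tuple[int, str, Optional[str]]:
    """Toy server: the page never changes, so revalidation always succeeds."""
    body = "hello from " + url
    current = str(len(body))  # toy validator, not a real ETag scheme
    if etag == current:
        return 304, current, None
    return 200, current, body


crawler = CachingCrawler(fake_site)
for _ in range(3):
    page = crawler.get("https://example.com/")
print(crawler.bytes_transferred)
```

With the cache, three visits cost one full download. A stateless fetch -> transform -> ingest loop would pay for the full page on every pass, which is the load pattern described above.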
