Is there just a correlation between crawlers that don't respect indexing norms and crawlers that operate in the service of AI products?
Or is there something about the sort of indexing people do for AI that makes this nasty behaviour more likely?
Search engine crawlers are more mature and better written.
I suspect a lot of LLM crawling and development is done under time pressure, to get things done while the investors' money is still coming in to fund it. Do stuff in a hurry, and it will be less competently done.
This is interesting on the subject. Maybe gives you some understanding. Posted on HN a few minutes ago.
Whenever the market decides the Internet is too full of slop to be usable for "training", whoever has the most copies of the pre-"AI" Internet wins. Some of the traffic is likely "AI" "tool use", i.e. bots scraping as part of running some LLM, i.e. "AI" "research".
The big scraping bots have gone from stupid to ruthless. Previously it was irritating that some of them got stuck traversing cyclical link paths on your site or on-the-fly generated pages. Now it's like your silly family blog suddenly got very popular for no good reason, and that puts a lot of load on the tiny amount of hardware it's served from.
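For what it's worth, the "stuck in cyclical link paths" failure is cheap to avoid. Here is a minimal sketch of how a crawler could do it, not a claim about how any real bot is written: keep a set of URLs already seen, and cap the total page count so endlessly generated pages can't trap it either. The names `crawl`, `get_links`, `max_pages` and the `example.test` site are all made up for illustration, and the "site" is an in-memory dict so nothing touches the network.

```python
from collections import deque
from urllib.parse import urljoin, urldefrag


def crawl(start, get_links, max_pages=1000):
    """Breadth-first crawl that visits each URL at most once.

    get_links(url) is a hypothetical callback returning the hrefs on a page.
    max_pages bounds the crawl even if the site generates pages forever.
    """
    seen = {start}
    queue = deque([start])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for href in get_links(url):
            # Resolve relative links and drop #fragments so trivially
            # different spellings of one page share a single key.
            target, _fragment = urldefrag(urljoin(url, href))
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


# Hypothetical three-page site with a cycle: a -> b -> c -> a
SITE = {
    "http://example.test/a": ["/b"],
    "http://example.test/b": ["/c", "/a#top"],
    "http://example.test/c": ["/a"],
}

pages = crawl("http://example.test/a", lambda u: SITE.get(u, []))

# A "site" that invents a new page for every page it serves.
endless = crawl("http://example.test/p", lambda u: [u + "x"], max_pages=5)
```

The cyclic site yields each page once, and the endless one stops at the cap. A crawler that also wanted to be polite would honour robots.txt and rate-limit per host, which this sketch leaves out.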
On the other hand, an unsupervised AI training algorithm may just need raw text, and as much of it as possible. It doesn't know what site it came from or much care, and it's not building any index that links the content back to its original source. So fetching the site on each training epoch might actually be viable: why bother storing the entire internet when you can just fetch -> transform -> ingest into your model? If your crawler is distributed enough, it won't be the bottleneck, either.
If this is the architecture some companies are using, it also means that these crawlers won't ever stop, because they are continually fine-tuning some model on the "current" internet, whatever that might mean.
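To put a rough number on that guess, here is a toy sketch of the fetch -> transform -> ingest loop with and without a local copy. It is purely illustrative and assumes nothing about what any company actually runs: `train`, `fake_fetch` and the `example.test` URLs are hypothetical, the "transform" is just lowercasing and splitting, and "ingest" only counts tokens.

```python
from collections import Counter


def train(urls, fetch, epochs, cache=None):
    """Run `epochs` passes over `urls`; return the number of tokens ingested.

    With cache=None every epoch re-fetches every URL (the streaming design).
    With a dict as cache, each URL is fetched once and reused afterwards.
    """
    ingested = 0
    for _ in range(epochs):
        for url in urls:
            if cache is not None and url in cache:
                text = cache[url]
            else:
                text = fetch(url)              # the request the site owner sees
                if cache is not None:
                    cache[url] = text
            tokens = text.lower().split()      # stand-in for "transform"
            ingested += len(tokens)            # stand-in for "ingest"
    return ingested


hits = Counter()


def fake_fetch(url):
    """Hypothetical fetcher that counts requests instead of using the network."""
    hits[url] += 1
    return "Hello from " + url


urls = ["http://example.test/a", "http://example.test/b"]

train(urls, fake_fetch, epochs=3)
streaming_requests = sum(hits.values())

hits.clear()
train(urls, fake_fetch, epochs=3, cache={})
cached_requests = sum(hits.values())
```

In this toy setup the streaming variant sends pages × epochs requests (6 here), while the cached one sends each page once (2 here). If something like the streaming design is in use, the load on a small site scales with how often the other side trains, not with how often the site changes.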