How is it related to AI? Do AI crawlers do something different from traditional search index crawlers? Or is it simply a proliferation of crawlers because of the growth of AI products?
What makes AI special in this context?
- They request every resource, vastly increasing costs compared to a normal crawler.
- They not only don't respect robots.txt; they use it as an explicit source of more links to mine.
- They request resources frequently (many reports of 100x per day), sometimes from bugs and sometimes to ensure they have the latest copy.
- There's no rate limiting. It's trivial to design a crawler that runs at full tilt overall by spreading its requests across millions of pages while still going easy on each individual site, but they don't bother, so even if everything else were fine it starts looking like a DoS attack.
- They intentionally use pools of IPs and other resources to obfuscate their identities and make themselves harder to block.
How much of that is "baby's first crawler" not being written very well, and how much is actual malice? Who knows, but the net effect is huge jumps in costs to support the AI wave.
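To make the rate limiting point concrete, here is a minimal sketch of per-host politeness: the crawler stays busy overall, but no single host is hit more than once per delay window. The function name `polite_schedule`, the `per_host_delay` parameter and the example URLs are all made up for illustration, fetches are treated as instantaneous, and the clock is simulated. A real crawler would sleep and fetch instead of returning a schedule.

```python
import heapq
from collections import defaultdict, deque
from urllib.parse import urlsplit


def polite_schedule(urls, per_host_delay=1.0):
    """Return (fetch_time, url) pairs such that no host is requested more
    often than once every per_host_delay seconds (simulated time)."""
    # Group pending URLs by host, preserving their original order.
    queues = defaultdict(deque)
    for url in urls:
        queues[urlsplit(url).netloc].append(url)

    # Min-heap of (earliest allowed fetch time, host).
    ready = [(0.0, host) for host in queues]
    heapq.heapify(ready)

    schedule = []
    now = 0.0
    while ready:
        allowed_at, host = heapq.heappop(ready)
        now = max(now, allowed_at)  # wait only if every host is cooling down
        schedule.append((now, queues[host].popleft()))
        if queues[host]:
            # This host may not be contacted again until the delay has passed.
            heapq.heappush(ready, (now + per_host_delay, host))
    return schedule


if __name__ == "__main__":
    demo = [
        "https://a.example/1",
        "https://a.example/2",
        "https://b.example/1",
        "https://a.example/3",
    ]
    for when, url in polite_schedule(demo, per_host_delay=2.0):
        print(f"t={when:4.1f}s  {url}")
```

With enough distinct hosts in the queue the crawler never idles, which is the point of the bullet above: throughput and per-site restraint are not in conflict.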
At the same time, collecting publicly and easily available quality content is a race against time. Platforms like Reddit and Xitter have already locked down with aggressive anti-bot measures and fingerprinting, and the cottage industries are following. Meanwhile, public data is being polluted by content farms using AI to produce garbage at an increasing rate.
Together this creates a perfect storm of bad incentives: (1) the data hoarders are no longer just Google and Microsoft, but probably thousands of smaller entities and (2) they’re short on time, so they scrape more invasively and at a faster rate.
Security/vulnerability scans don't request many pages, at least not existing ones, and usually come from a few IPs from time to time.
But AI crawlers can be very numerous, try to fetch all your pages, and don't always respect robots.txt or your site's performance. And they don't give you anything back. There may be exceptions, but the few you notice end up having a negative impact.
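For reference, respecting robots.txt takes very little code. Below is a rough sketch using Python's standard `urllib.robotparser`; the robots.txt contents, the `ExampleBot` user agent and the example.org URLs are invented for illustration, and the rules are parsed from a string rather than fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, inlined so the example needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

if __name__ == "__main__":
    for url in ("https://example.org/blog/post", "https://example.org/private/x"):
        verdict = "allowed" if rules.can_fetch("ExampleBot", url) else "disallowed"
        print(url, "->", verdict)
    # Crawl-delay is a non-standard but widely used directive.
    print("requested delay:", rules.crawl_delay("ExampleBot"), "seconds")
```

A well-behaved crawler would run this check before every request and treat the crawl delay as a floor on the gap between requests to that host.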
But the question I'm asking is _why_ AI crawlers behave in this different way.
Is there just a correlation between crawlers that don't respect indexing norms and crawlers that operate in the service of AI products?
Or is there something about the sort of indexing people do for AI that makes this nasty behaviour more likely?
How It Works: Providers like Proxyrack, Live Proxies, Rayobyte, and Infatica allow you to integrate their SDKs into your app. Users who agree to join the proxy network contribute their device’s bandwidth, often used for web scraping, and you get paid based on their activity—typically per monthly or daily active user.
So it need not be "compromised Android SetTop Boxes", but just millions of free apps running on users' phones.