Abusive AI Web Crawlers: Get Off My Lawn

(mythic-beasts.com)

23 points · bluehatbrit · 1y ago · 22 comments

14 comments · 4 top-level

intellectronica · 1y ago · 9 in thread

I don't understand the current thing about "AI Crawlers". Maybe someone can help educate me.

How is it related to AI? Do AI crawlers do something different from traditional search index crawlers? Or is it simply a proliferation of crawlers because of the growth of AI products?

What makes AI special in this context?

hansvm · 1y ago

The proliferation of crawlers is part of the problem. They're also more aggressive and poorly behaved than typical search engines. Some issues:

- They request every resource, vastly increasing costs compared to a normal crawler.

- They not only don't respect robots.txt; they use it as an explicit source of more links to mine.

- They request resources frequently (many reports of 100x per day), sometimes because of bugs and sometimes to ensure they have the latest copy.

- There's no rate limiting. It's trivial to create a crawler architecture where the crawler operates at full tilt, spread across millions of pages while still respecting each site, but they don't bother, so even if everything else were fine it starts looking like a DoS attack.

- They intentionally use pools of IPs and other resources to obfuscate their identities and make themselves harder to block.

How much of that is "baby's first crawler" not being written very well, and how much is actual malice? Who knows, but the net effect is huge jumps in costs to support the AI wave.
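A minimal sketch of the per-site rate limiting described in the list above, assuming a fixed minimum delay per host. The `schedule` function and the `per_host_delay` parameter are hypothetical names for illustration, not taken from any real crawler. It only plans request times; it does not fetch anything.

```python
from urllib.parse import urlsplit


def schedule(urls, per_host_delay=1.0):
    """Plan (start_time, url) pairs for a crawl.

    Requests to the same host are spaced at least per_host_delay seconds
    apart, while requests to different hosts may start at the same time.
    per_host_delay is an assumed, illustrative politeness setting.
    """
    next_free = {}  # host -> earliest time that host may be requested again
    plan = []
    for url in urls:
        host = urlsplit(url).hostname
        start = next_free.get(host, 0.0)
        next_free[host] = start + per_host_delay
        plan.append((start, url))
    # Stable sort by start time: the crawler as a whole stays busy,
    # but no single host sees back-to-back requests.
    plan.sort(key=lambda item: item[0])
    return plan


if __name__ == "__main__":
    for start, url in schedule(
        ["https://a.example/1", "https://a.example/2", "https://b.example/1"],
        per_host_delay=2.0,
    ):
        print(f"t={start:.1f}s  {url}")
```

With many hosts in the queue, total throughput stays high because the waiting time for one host is spent fetching from others, which is the sense in which such a crawler can run at full tilt without hammering any one site.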

klabb3 · 1y ago

Speculation: If there isn't one already, there will be a data broker market for public-ish data too. What I mean by that is a separation of entities where OpenAI and "legitimate" AI companies will buy shadily scraped data from data brokers, and throw them under the bus if shit hits the fan to protect the mothership. This makes sense from a corporate risk perspective, creating a gray area buffer of accountability. OpenAI and Anthropic already pleaded with the government not to take away their fair use hall pass (by invoking the magic spell "China"), but if this doesn’t work and publishers win, they’ll need to be prepared.

At the same time, publicly and easily available quality content is a race against time. Platforms like Reddit and Xitter already lock down with aggressive anti-bot measures and fingerprinting, and the cottage industries are following. Meanwhile, public data is being polluted by content farms producing garbage at an increasing rate using AI.

Together this creates a perfect storm of bad incentives: (1) the data hoarders are no longer just Google and Microsoft, but probably thousands of smaller entities and (2) they’re short on time, so they try to scrape more invasively and at a faster rate.

intellectronica · 1y ago

The incompetence hypothesis makes sense (it is often a good explanation). Web indexers like Google had decades to get really good at this, including hordes of people who work on crawlers full time. AI companies are often very young, execute with small teams, and don't consider web indexing their main activity, just something they do in support of pre-training (or maybe serving web results).

intellectronica · 1y ago

If the problem is really incompetence, then maybe a viable solution is for the community to create a really great (and well-behaved) OSS crawler. Make it easier for the AI people to do the right thing by making rolling their own crawler the more expensive, lower quality option.

gmuslera · 1y ago

Search engine crawlers, aggregators, vertical market crawlers and so on may give you visibility, are not that numerous, and are usually well-behaved (i.e. they respect robots.txt, announce themselves with a consistent user agent, etc.).

Security/vulnerability scans don't request too many pages, at least not existing ones, and usually come from a few IPs from time to time.

But there can be a great many AI crawlers; they try to get all your pages, and they aren't always respectful of robots.txt or your site's performance. And they don't give you anything back. There may be exceptions, but the few you notice end up having a negative impact.
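For reference, a minimal sketch of what "respecting robots.txt" looks like on the crawler side, using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot/1.0` user agent, and the `example.org` URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it
# from https://<host>/robots.txt before requesting anything else.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Crawl-delay: 5
"""

AGENT = "ExampleBot/1.0"  # assumed, consistent user agent string


def make_policy(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


policy = make_policy(ROBOTS_TXT)

if __name__ == "__main__":
    for url in ("https://example.org/about", "https://example.org/search/results"):
        print(url, "allowed" if policy.can_fetch(AGENT, url) else "disallowed")
    print("crawl delay:", policy.crawl_delay(AGENT))
```

A well-behaved crawler checks `can_fetch` before every request and waits at least the advertised crawl delay between requests to the same host.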

intellectronica · 1y ago

Yes, I understand that, and I'm dismayed to learn about this.

But the question I'm asking is _why_ do AI crawlers behave in this different way.

brettzky · 1y ago

Vercel has made a few posts about how substantially traffic has increased with the rise of AI crawlers, or crawlers for AI training. https://vercel.com/blog/the-rise-of-the-ai-crawler

otikik · 1y ago

They don’t respect robots.txt at all and won’t hesitate to call all the endpoints they find, repeatedly, even when they’re costly for the host. That’s basically it.

intellectronica · 1y ago

Right, but how is it related to AI?

Is there just a correlation between crawlers that don't respect indexing norms and crawlers that operate in the service of AI products?

Or is there something about the sort of indexing people do for AI that makes this nasty behaviour more likely?

PeterStuer · 1y ago · 1 in thread

You can monetize your app users by partnering with providers that offer SDKs for residential proxy networks. These services let users opt-in to share their internet connection, earning you revenue while they get benefits like ad-free experiences.

How It Works: Providers like Proxyrack, Live Proxies, Rayobyte, and Infatica allow you to integrate their SDKs into your app. Users who agree to join the proxy network contribute their device’s bandwidth, often used for web scraping, and you get paid based on their activity—typically per monthly or daily active user.

So it need not be "compromised Android SetTop Boxes", but just millions of free apps running on users' phones.

mateuszbuda · 1y ago

There are many different methods used by proxy providers to unethically source their IPs: https://scrapingfish.com/how-ips-for-web-scraping-are-source...

DarkPlayer · 1y ago

We observed the same behavior. Each request used a different IP address and a random user agent. In our case, most of the IP addresses belonged to Chinese ISPs. They went to great lengths to avoid being blocked, but at the same time used user agents such as Windows 95/98 or IE 5. Fortunately, the combination of the odd user agents and the fact that they still use HTTP/1.1 makes them somewhat easy to identify. So you can use a captcha on more expensive endpoints to block them.
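A rough sketch of the heuristic described above, assuming the server can see the user agent string and the HTTP version of each request. The `needs_challenge` function, the regular expression, and the idea of an `expensive` flag per endpoint are illustrative assumptions, not the commenter's actual setup. Plenty of legitimate clients still speak HTTP/1.1, so the sketch only challenges when the protocol coincides with an obsolete user agent.

```python
import re

# User agents claiming Windows 95/98 or Internet Explorer 1.x to 5.x.
# The pattern is an assumption for illustration and far from exhaustive.
OBSOLETE_UA = re.compile(r"Windows 9[58]|Win9[58]|MSIE [1-5]\.")


def needs_challenge(user_agent: str, http_version: str, expensive: bool) -> bool:
    """Return True if a request should be shown a captcha.

    Only expensive endpoints are protected; cheap pages are served as usual.
    """
    if not expensive:
        return False
    stale_client = bool(OBSOLETE_UA.search(user_agent))
    return stale_client and http_version == "HTTP/1.1"


if __name__ == "__main__":
    old = "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)"
    print(needs_challenge(old, "HTTP/1.1", expensive=True))
```

Restricting the check to expensive endpoints keeps false positives cheap: a real visitor with an unusual browser only sees a captcha on the pages that cost the most to serve.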

lostmsu · 1y ago

Why does the author of this post assume their increase in traffic has anything to do with "AI" specifically?
