What year are they reminiscing about here, 1999? Nothing has respected robots.txt in over twenty years.
Nofollow isn't an anti-bot measure; it's supposed to inform search engines that you don't vouch for the linked content (i.e., that you don't wish to boost its rank). Nofollow doesn't mean "you must not follow this link if you're a crawler".
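For reference, both mechanisms are purely advisory; nothing enforces either one (paths and URLs below are illustrative):

```text
# robots.txt — a request that polite crawlers honor; nothing enforces it
User-agent: *
Disallow: /private/

<!-- rel="nofollow" — a ranking hint ("I don't vouch for this"),
     not a crawling prohibition -->
<a href="https://example.com/page" rel="nofollow">linked content</a>
```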
Sooner or later we need to take back the legitimate internet from surveillance capitalism. Capitalism is great (it shares many of BitTorrent's virtues, not coincidentally) but surveillance capitalism is not.
Creating a torrent is not showy enough, because the credit is "just" another file and/or a comment in the torrent metadata.
Granted, they usually do that because they want to "kindly" advertise a way to donate to them (EDIT: or to track you, or other similar goals), and there's nothing wrong with trying to get donations, but there's clearly a conflict of interest at play here.
> The author(s) and right holder(s) of such contributions grant(s) to all users a free, irrevocable, worldwide, right of access to, and a license to copy, use, distribute, transmit and display the work publicly and to make and distribute derivative works, in any digital medium for any responsible purpose, subject to proper attribution of authorship (community standards, rather than copyright law, will continue to provide the mechanism for enforcement of proper attribution and responsible use of the published work, as they do now), as well as the right to make small numbers of printed copies for their personal use.
This guarantees that such torrents are legal unless the original authors are infringing copyright.
So there is no danger of AI bots destroying open access.
torrents are a problem for mutable artefacts because you are reliant on your peers having the latest copy, which is not guaranteed. the peers you download from might have just switched their machine on after 5 months, so their copy of the mutable artefact is 5 months out of date. as ever with distributed systems, requiring consistency introduces complexity.
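there's a concrete protocol-level reason for this: a torrent's identity (its infohash) is the SHA-1 of its bencoded info dictionary, so even a one-byte fix yields a different torrent that the old swarm never sees. a minimal Python sketch, using a toy bencoder and made-up metadata:

```python
import hashlib

def bencode(obj):
    # Toy bencoder covering only ints, byte strings, and dicts.
    if isinstance(obj, int):
        return b"i%de" % obj
    if isinstance(obj, bytes):
        return b"%d:%s" % (len(obj), obj)
    if isinstance(obj, dict):
        # The spec requires keys sorted as raw byte strings.
        return b"d" + b"".join(bencode(k) + bencode(v)
                               for k, v in sorted(obj.items())) + b"e"
    raise TypeError(f"unsupported type: {type(obj)}")

# Hypothetical info dict for a single-file torrent.
info_v1 = {b"name": b"paper.pdf", b"piece length": 16384,
           b"length": 6, b"pieces": b"A" * 20}

# "Fix a typo" in the file: its content changes, so the piece hashes change...
info_v2 = dict(info_v1)
info_v2[b"pieces"] = b"B" * 20

# ...and with them the infohash, i.e. the identity of the swarm.
h1 = hashlib.sha1(bencode(info_v1)).hexdigest()
h2 = hashlib.sha1(bencode(info_v2)).hexdigest()
```

peers seeding the old and new versions form disjoint swarms, and there is no in-protocol way to tell the old swarm "please update" (BEP 46 mutable torrents bolt this on via the DHT, but clients have to opt in).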
"open gateways" (term used by a sibling comment) provide much simpler mutability. which makes sense when there's like a simple typo in a PDF document that requires the document's replacement. just replace the document on the web server. bam! everyone now has access to the latest corrected version immediately.
also, most of the general population doesn't know how to use torrents. just because there's a part-way technical solution, doesn't mean it makes sense to switch everything over to some fancy new proposal (not the underlying tech, which is old now).
if users would struggle to use the implementation, why do it? what benefits are there except for a seemingly more technically perfect solution?
In 01994 the general population didn't know how to use the internet, but it was already very useful for researchers. Software improved over time to simplify using it.
What benefit would there be? The benefit would be that it prevents AI bots from destroying Open Access.
For now they are probably paying to use residential IP addresses that they get from other services that sell them (and these services get them from people who willingly sell some of their bandwidth for cents).
But I think it won't be long before we start seeing the AI companies having each their own swarm of residential IP addresses by selling themselves a browser extension or mobile app, saying something like:
"Get faster results (or a discount) by using our extension! By using your own internet connection to fetch the required context, you won't need to share computing resources with other users, thusly increasing the speed of your queries! Plus, since you don't use our servers, that means we can pass our savings to you as a discount!"
Then, in small print, they'd say they use your connection to help other users with their queries, or that it's more eco-friendly because of sharing, or whatever justification they come up with.
What many of us have seen is a huge increase in bot crawling traffic from highly distributed IPs, often requesting insane combinations of query params that don't actually get them useful content -- and that brings down our sites. (And they increase their volume if you scale up your resources!) They seem to have very deep pockets, in that they don't mind scraping terabytes of useless/duplicate content from me (they could get all the actual useful open content from my sitemap, and I wouldn't mind!)
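One partial mitigation for the query-param permutations is to collapse them server-side before caching. A sketch, assuming a hypothetical allowlist of the params the site actually understands:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

# Hypothetical allowlist: query params this site actually uses.
KNOWN_PARAMS = {"page", "sort", "q"}

def canonicalize(url: str) -> str:
    """Drop unknown query params so bot-generated permutations
    collapse onto one cacheable URL."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k in KNOWN_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(sorted(kept)), ""))

# Two "insane" bot variants collapse to the same canonical URL:
a = canonicalize("https://example.org/list?page=2&utm_junk=x&session=123")
b = canonicalize("https://example.org/list?session=999&page=2")
```

This doesn't stop the requests, but it lets a cache or CDN answer the duplicates instead of the origin server.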
That's what bothers me. I don't care if they scrape my site for AI purposes in polite robots.txt-respecting honest-user-agent low-volume ways. And if they are doing it the way they are doing it for something other than AI, it's just as much of a problem. (The best guess is just that it's for AI).
So I agree with you that I wouldn't have spoken of this in terms of "AI".
But it has become a huge problem.
"Fighting the AI scraperbot scourge" https://lwn.net/Articles/1008897/
"LLM crawlers continue to DDoS SourceHut" https://status.sr.ht/issues/2025-03-17-git.sr.ht-llms/
"Open Source devs say AI crawlers dominate traffic, forcing blocks on entire countries" https://arstechnica.com/ai/2025/03/devs-say-ai-crawlers-domi...
Some of us -- we think jokingly -- wonder if Cloudflare or other WAF purveyors are behind it. It is leaving most of us no choice but some kind of WAF or bot detection.
> The current generation of bots is mindless. They use as many connections as you have room for. If you add capacity, they just ramp up their requests. They use randomly generated user-agent strings. They come from large blocks of IP addresses. They get trapped in endless hallways. I observed one bot asking for 200,000 nofollow redirect links pointing at Onedrive, Google Drive and Dropbox. (which of course didn't work, but Onedrive decided to stop serving our Canadian human users). They use up server resources - one speaker at Code4lib described a bug where software they were running was using 32 bit integers for session identifiers, and it ran out!
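The 32-bit session-identifier bug in that quote is easy to quantify: an unsigned 32-bit integer gives only 2^32 ≈ 4.3 billion distinct values. A back-of-the-envelope sketch (the 200,000-request figure is from the quote; the fleet size is a made-up assumption):

```python
MAX_U32 = 2**32  # 4,294,967,296 distinct session ids

def next_session_id(current: int) -> int:
    """Unsigned 32-bit counter: wraps back to 0 instead of growing."""
    return (current + 1) % MAX_U32

# One observed bot made 200,000 requests; assume (hypothetically)
# 5,000 such bots, each opening a fresh session per request:
sessions_per_day = 200_000 * 5_000          # 1e9 ids/day
days_to_wrap = MAX_U32 / sessions_per_day   # ~4.3 days
```

At that rate the id space is exhausted in under a week, whereas human traffic would likely never come close.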
This could be at least partially solved by starting legal and cybersec (bulk blocks, flagging SDKs as malware) action against botnets for rent[0], forcing their SDKs out of app stores[1].
0 – https://spur.us/residential-proxies-the-legal-botnet-that-no...
1 – https://datadome.co/bot-management-protection/how-proxy-prov...
My issue is not to prevent anyone from obtaining a copy if they want to, and I want to ensure that users can use curl, Lynx, and other programs; I do not want to require JavaScript, CSS, Firefox, Google, etc.
My problem is that these LLM scraping bots are badly behaved, making many requests and repeating them even though there is no good reason to do so, and potentially overloading the servers. These things are mentioned in the article. Some bots are not so badly behaved, and those are not the problem.
Can we at least get rid of CAPTCHAs now since they clearly don't work?
It doesn't make the user do a puzzle; it's the kind that either works entirely automatically or, in some cases, asks the user to tick a checkbox. You have probably seen it proliferating across the internet in your personal use because, well, see above.
Also, eventually I see most people filtering their queries through something like Perplexity anyway instead of going to individual sites, so those putting up barriers will lose out on human traffic in any case. Let's ensure that the results people are able to access via AI continually improve so the "slop" term disappears that much faster.
I can't really understand the outrage here; this problem of scraping to the point of being DDoSed, which is what the author seems to contend, has existed since forever.
> They are using commercial services such as Cloudflare to outsource their bot-blocking and captchas, without knowing for sure what these services are blocking, how they're doing it, or whether user privacy and accessibility is being flushed down the toilet. But nothing seems to offer anything but temporary relief.
Also:
> The current generation of bots is mindless. They use as many connections as you have room for. If you add capacity, they just ramp up their requests. They use randomly generated user-agent strings. They come from large blocks of IP addresses. They get trapped in endless hallways. I observed one bot asking for 200,000 nofollow redirect links pointing at Onedrive, Google Drive and Dropbox. (which of course didn't work, but Onedrive decided to stop serving our Canadian human users). They use up server resources - one speaker at Code4lib described a bug where software they were running was using 32 bit integers for session identifiers, and it ran out!
Aside from the obvious disadvantages of a non-anonymous web, I also don't think it will work. How do you deal with identification and punishment of threat actors across the globe? We've been failing at that since the start. When was the internet ever high-trust?
In the 1970s and 1980s.
That's a real-world corporate entity. Recourse ends at the "limited liability" in LLC.
Make that LLC owned by another? Offshore ownership? Might take a few thousand bucks.
"Your honor, these people are visiting my website in a way that makes me sad"? I feel that we would need to encode bad behavior in a legally reasonable way first.
And not to mention that you'll have to bring legal disputes one legal entity at a time. And some of these legal entities have very deep pockets.
Unless the suggestion is that internet providers are all going to join together to stand up for the little guy? Somehow I'm not optimistic.
(Finally, IPv6 has taken decades to get to where it is today. Somehow I don't see a legally attributable IP traffic extension being ready and deployed any faster.)