What year are they reminiscing about here, 1999? Nothing has respected robots.txt in over twenty years.
Nofollow isn't an anti-bot measure; it's supposed to inform search engines that you don't vouch for the linked content (i.e., that you don't wish to boost its rank). Nofollow doesn't mean "you must not follow this link if you're a crawler".
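For reference, both mechanisms are purely advisory; nothing enforces either one (paths and URLs below are illustrative):

```text
# robots.txt — a request that polite crawlers honor; nothing enforces it
User-agent: *
Disallow: /private/

<!-- rel="nofollow" — a ranking hint ("I don't vouch for this"),
     not a crawling prohibition -->
<a href="https://example.com/page" rel="nofollow">linked content</a>
```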
Sooner or later we need to take back the legitimate internet from surveillance capitalism. Capitalism is great (it shares many of BitTorrent's virtues, not coincidentally) but surveillance capitalism is not.
Creating a torrent is not showy enough, because the credit is "just" another file and/or a comment in the torrent metadata.
Granted, they usually do that because they want to "kindly" advertise a way to donate to them (EDIT: or to track you, or other similar goals), and there's nothing wrong with trying to get donations, but there's clearly a conflict of interest at play here.
> The author(s) and right holder(s) of such contributions grant(s) to all users a free, irrevocable, worldwide, right of access to, and a license to copy, use, distribute, transmit and display the work publicly and to make and distribute derivative works, in any digital medium for any responsible purpose, subject to proper attribution of authorship (community standards, rather than copyright law, will continue to provide the mechanism for enforcement of proper attribution and responsible use of the published work, as they do now), as well as the right to make small numbers of printed copies for their personal use.
This guarantees that such torrents are legal unless the original authors are infringing copyright.
So there is no danger of AI bots destroying open access.
torrents are a problem for mutable artefacts because you are reliant on your peers having the latest copy, which is not guaranteed. the peers you download from might have just switched their machine on after 5 months, so their copy of the mutable artefact is 5 months out of date. as ever with distributed systems, requiring consistency introduces complexity.
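there's a concrete protocol-level reason for this: a torrent's identity (its infohash) is the SHA-1 of its bencoded info dictionary, so even a one-byte fix yields a different torrent that the old swarm never sees. a minimal Python sketch, using a toy bencoder and made-up metadata:

```python
import hashlib

def bencode(obj):
    # Toy bencoder covering only ints, byte strings, and dicts.
    if isinstance(obj, int):
        return b"i%de" % obj
    if isinstance(obj, bytes):
        return b"%d:%s" % (len(obj), obj)
    if isinstance(obj, dict):
        # The spec requires keys sorted as raw byte strings.
        return b"d" + b"".join(bencode(k) + bencode(v)
                               for k, v in sorted(obj.items())) + b"e"
    raise TypeError(f"unsupported type: {type(obj)}")

# Hypothetical info dict for a single-file torrent.
info_v1 = {b"name": b"paper.pdf", b"piece length": 16384,
           b"length": 6, b"pieces": b"A" * 20}

# "Fix a typo" in the file: its content changes, so the piece hashes change...
info_v2 = dict(info_v1)
info_v2[b"pieces"] = b"B" * 20

# ...and with them the infohash, i.e. the identity of the swarm.
h1 = hashlib.sha1(bencode(info_v1)).hexdigest()
h2 = hashlib.sha1(bencode(info_v2)).hexdigest()
```

peers seeding the old and new versions form disjoint swarms, and there is no in-protocol way to tell the old swarm "please update" (BEP 46 mutable torrents bolt this on via the DHT, but clients have to opt in).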
"open gateways" (term used by a sibling comment) provide much simpler mutability. which makes sense when there's like a simple typo in a PDF document that requires the document's replacement. just replace the document on the web server. bam! everyone now has access to the latest corrected version immediately.
also, most of the general population doesn't know how to use torrents. just because there's a part-way technical solution, doesn't mean it makes sense to switch everything over to some fancy new proposal (not the underlying tech, which is old now).
if users would struggle to use the implementation, why do it? what benefits are there except for a seemingly more technically perfect solution?
In 01994 the general population didn't know how to use the internet, but it was already very useful for researchers. Software improved over time to simplify using it.
What benefit would there be? The benefit would be that it prevents AI bots from destroying Open Access.
For now they are probably paying to use residential IP addresses that they get from other services that sell them (and these services get them from people who willingly sell some of their bandwidth for cents).
But I think it won't be long before we start seeing the AI companies having each their own swarm of residential IP addresses by selling themselves a browser extension or mobile app, saying something like:
"Get faster results (or a discount) by using our extension! By using your own internet connection to fetch the required context, you won't need to share computing resources with other users, thusly increasing the speed of your queries! Plus, since you don't use our servers, that means we can pass our savings to you as a discount!"
Then, in small print, they'd say they use your connection to help other users with their queries, or that it's more eco-friendly because of sharing, or whatever justification they come up with.
What many of us have seen is a huge increase in bot crawling traffic from highly distributed IPs, often requesting insane combinations of query params that don't actually get them useful content -- and that brings down our sites. (And they increase their volume if you scale up your resources!) They seem to have very deep pockets, in that they don't mind scraping terabytes of useless/duplicate content from me (they could get all the actual useful open content from my sitemap, and I wouldn't mind!)
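One partial mitigation for the query-param permutations is to collapse them server-side before caching. A sketch, assuming a hypothetical allowlist of the params the site actually understands:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

# Hypothetical allowlist: query params this site actually uses.
KNOWN_PARAMS = {"page", "sort", "q"}

def canonicalize(url: str) -> str:
    """Drop unknown query params so bot-generated permutations
    collapse onto one cacheable URL."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k in KNOWN_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(sorted(kept)), ""))

# Two "insane" bot variants collapse to the same canonical URL:
a = canonicalize("https://example.org/list?page=2&utm_junk=x&session=123")
b = canonicalize("https://example.org/list?session=999&page=2")
```

This doesn't stop the requests, but it lets a cache or CDN answer the duplicates instead of the origin server.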
That's what bothers me. I don't care if they scrape my site for AI purposes in polite robots.txt-respecting honest-user-agent low-volume ways. And if they are doing it the way they are doing it for something other than AI, it's just as much of a problem. (The best guess is just that it's for AI).
So I agree with you that I wouldn't have spoken of this in terms of "AI".
But it has become a huge problem.
"Fighting the AI scraperbot scourge" https://lwn.net/Articles/1008897/
"LLM crawlers continue to DDoS SourceHut" https://status.sr.ht/issues/2025-03-17-git.sr.ht-llms/
"Open Source devs say AI crawlers dominate traffic, forcing blocks on entire countries" https://arstechnica.com/ai/2025/03/devs-say-ai-crawlers-domi...
Some of us -- we think jokingly -- wonder if Cloudflare or other WAF purveyors are behind it. It is leaving most of us no choice but some kind of WAF or bot detection.
> The current generation of bots is mindless. They use as many connections as you have room for. If you add capacity, they just ramp up their requests. They use randomly generated user-agent strings. They come from large blocks of IP addresses. They get trapped in endless hallways. I observed one bot asking for 200,000 nofollow redirect links pointing at Onedrive, Google Drive and Dropbox. (which of course didn't work, but Onedrive decided to stop serving our Canadian human users). They use up server resources - one speaker at Code4lib described a bug where software they were running was using 32 bit integers for session identifiers, and it ran out!
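The 32-bit session-identifier bug in that quote is easy to quantify: an unsigned 32-bit integer gives only 2^32 ≈ 4.3 billion distinct values. A back-of-the-envelope sketch (the 200,000-request figure is from the quote; the fleet size is a made-up assumption):

```python
MAX_U32 = 2**32  # 4,294,967,296 distinct session ids

def next_session_id(current: int) -> int:
    """Unsigned 32-bit counter: wraps back to 0 instead of growing."""
    return (current + 1) % MAX_U32

# One observed bot made 200,000 requests; assume (hypothetically)
# 5,000 such bots, each opening a fresh session per request:
sessions_per_day = 200_000 * 5_000          # 1e9 ids/day
days_to_wrap = MAX_U32 / sessions_per_day   # ~4.3 days
```

At that rate the id space is exhausted in under a week, whereas human traffic would likely never come close.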
This could be at least partially solved by starting legal and cybersec (bulk blocks, flagging SDKs as malware) action against botnets for rent[0], forcing their SDKs out of app stores[1].
0 – https://spur.us/residential-proxies-the-legal-botnet-that-no...
1 – https://datadome.co/bot-management-protection/how-proxy-prov...
My issue is not to prevent anyone from obtaining a copy if they want to, and I want to ensure that users can use curl, Lynx, and other programs; I do not want to require JavaScript, CSS, Firefox, Google, etc.
My problem is that these LLM scraping bots are badly behaved, making many requests and repeating them even though there is no good reason to do so, and potentially overloading the servers. These things are mentioned in the article. Some bots are not so badly behaved, and those are not the problem.
Can we at least get rid of CAPTCHAs now since they clearly don't work?
It doesn't make the user do a puzzle; it's the kind that either works entirely automatically or, in some cases, asks the user to tick a checkbox. You have probably seen it proliferating across the internet in your personal use because, well, see above.
Also, eventually I see most people filtering their queries through something like Perplexity anyway instead of going to individual sites, so those putting up barriers will lose out on human traffic in any case. Let's ensure that the results people are able to access via AI continually improve so the "slop" term disappears that much faster.
I can't really understand the outrage here; this problem of scraping to the point of being DDoSed, which is what the author seems to contend, has existed since forever.
> They are using commercial services such as Cloudflare to outsource their bot-blocking and captchas, without knowing for sure what these services are blocking, how they're doing it, or whether user privacy and accessibility is being flushed down the toilet. But nothing seems to offer anything but temporary relief.
Also:
> The current generation of bots is mindless. They use as many connections as you have room for. If you add capacity, they just ramp up their requests. They use randomly generated user-agent strings. They come from large blocks of IP addresses. They get trapped in endless hallways. I observed one bot asking for 200,000 nofollow redirect links pointing at Onedrive, Google Drive and Dropbox. (which of course didn't work, but Onedrive decided to stop serving our Canadian human users). They use up server resources - one speaker at Code4lib described a bug where software they were running was using 32 bit integers for session identifiers, and it ran out!
Aside from the obvious disadvantages of a non-anonymous web, I also don't think it will work. How do you deal with identification and punishment of threat actors across the globe? We've been failing at that since the start. When was the internet ever high-trust?
In the 1970s and 1980s.
That's a real-world corporate entity. Recourse ends at the "limited liability" in LLC.
Make that LLC owned by another? Offshore ownership? Might take a few thousand bucks.
"Your honor, these people are visiting my website in a way that makes me sad"? I feel that we would need to encode bad behavior in a legally reasonable way first.
And not to mention that you'll have to bring legal disputes one legal entity at a time. And some of these legal entities have very deep pockets.
Unless the suggestion is that internet providers are all going to join together to stand up for the little guy? Somehow I'm not optimistic.
(Finally, IPv6 has taken decades to get to where it is today. Somehow I don't see a legally attributable IP traffic extension being ready and deployed any faster.)