The former relies on fairly controversial ideas about copyright and fair use to qualify as abuse, whereas the latter is direct financial damage – by your own direct competitors no less.
It's fun to poke at a seeming hypocrisy of the big bad, but the similarity in this case is quite superficial.
I bet people being fucking DDoSed by AI bots disagree.
Also, the fucking ignorance of assuming it's "static content" and not something that needs code running.
This is something I couldn't have done before, because people very often don't have the patience to answer questions. Even Google ended up in loops of "just use Google" or "closed. This is a duplicate of X, but X doesn't actually answer the question" or references to dead links.
Are there downsides to this? Sure, but imo AI is useful.
It's still up in all its glory.
Exactly. I think the unfairness can be mitigated if any model trained on public information, or on data generated by a model trained on public information, or with either of those in its ancestry, must be made public.
Then we don't have to hit (for example) Anthropic, we can download and use the models as we see fit without Anthropic whining that the users are using too much capacity.
Are you sure it's a DDoS and not just a DoS?
We implemented an anti-bot challenge and it helped for a while. Then our server collapsed again recently. The perf command showed that the actual TLS handshakes inside nginx were using over 50% of our server's CPU, starving other stuff on the machine.
It's a DDoS.
I think these days it’s ‘DAIS’, as in your site just DAIS - from Distributed/Damned AI Scraping
DDoSers who really want to cause damage now target random IPs in the same network as their actual target. That way, it can't be blackholed without blackholing the entire hosting provider.
Because ingress and compute costs often increase with every request, to the point where AI bot requests rack up bills of hundreds or thousands of dollars more than the hobbyist operator was expecting to spend.
Wild eh.
If it's not AI now, it's by default labelled "static content" and "near-zero marginal cost".
Yes, for the vast majority of the internet, serving traffic is near zero marginal cost. Not for LLMs though – those requests are orders of magnitude more expensive.
This isn't controversial at all, it's a well understood fact, outside of this irrationally angry thread at least. I don't know, maybe you don't understand the economic term "marginal cost", thus not understanding the limited scope of my statement.
If such DDoSes as you mention were common, such a scraping strategy would not have worked for the scraper at all. But no, they're rare edge cases, from a combination of shoddy scrapers and shoddy website implementations, including the lack of even basic throttling for expensive-to-serve resources.
The vast majority of websites handle AI traffic fine though, either because they don't have expensive to serve resources, or because they properly protect such resources from abuse.
If you're an edge case who is harmed by overly aggressive scrapers, take countermeasures. Everyone with that problem should, that's neither new nor controversial.
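For anyone wondering what "basic throttling" can look like, here is a rough sketch of a per-client token bucket in Python. The class name, the parameters, and the idea of keying on a client string (an IP address, say) are all assumptions made for illustration, not anything a particular server ships with. Keying on IP also won't help much against scrapers spread across many residential addresses.

```python
import time
from collections import defaultdict


class TokenBucket:
    """Hypothetical per-client limiter: `rate` requests/second, bursts up to `burst`."""

    def __init__(self, rate: float, burst: int, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        # client key -> (tokens remaining, timestamp of last update)
        self.state = defaultdict(lambda: (float(burst), clock()))

    def allow(self, client: str) -> bool:
        tokens, last = self.state[client]
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.state[client] = (tokens - 1.0, now)
            return True
        self.state[client] = (tokens, now)
        return False
```

You would call `allow(client_key)` before doing the expensive work and answer with a 429 when it returns `False`. In practice most people would reach for the rate limiting built into their web server or CDN rather than rolling their own.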
They are common. The strategy works for the LLM but not for the website owner or users who can't use a site during this attack.
The majority of sites are not handling AI fine. Getting DDoSed only part of the time is not acceptable. Countermeasures like blocking huge ranges can help but also lock out legitimate users.
Any actual evidence of the alleged scope of this problem, or just anecdotes from devs who are mad at AI, blown out of proportion?
It costs me when an LLM scrapes me.
Why should I care about the costs they have when they don't care about the costs I have?
The number of bots that try to hide who they are, and don't even bother to check robots.txt, is new.
And how much of this is users who are tired of walled gardens and enshittification? We murdered RSS, APIs and the "open web" in the name of profit and lock-in.
There is a path where "AI" turns into an ouroboros, tech eating itself, before being scaled down to run on end user devices.
Genuinely interested.
OpenAI et al seem to mostly be well-behaved.
Also, it's not just the cost of the bandwidth and processing. Information has value too. Otherwise they wouldn't bother scraping it in the first place. They compete directly with the websites featuring their training data and thus they are taking away value from them just as the bots do from ChatGPT.
In fact the more I think of it, I think it's exactly the same thing.
But what happens if gamefaqs disappears because of lack of traffic?
Can LLMs actually create content, or only regurgitate it?
Contrary to what others say, LLMs can create content. If you have a private repo you can ask the LLM to look at it and answer questions based on that. You can also have it write extra code. Both of these are examples of something that did not exist before.
In terms of gamefaqs, I could theoretically see an LLM play a game and based on that write about the game. This is theoretical, because currently LLMs are nowhere near capable enough to play video games.
GitHub Pages is one way, but there are other platforms offering similar services. Static content just isn't that expensive to host.
The troubles start when you're actually running something dynamic that pretends to be static, like WordPress or MediaWiki. You can still reduce costs significantly with CDNs / caching, but many don't bother and then complain.
https://diff.wikimedia.org/2025/04/01/how-crawlers-impact-th...
It hasn’t even been updated in years, so hell if I know why it needs to be fetched constantly and aggressively, but fuck every single one of these companies now whining about bots scraping and victimizing them. Here’s my violin.
It’s a static site that hasn’t been updated since 2016, so it’s since been moved to Cloudflare R2, where it’s getting a $0.00 bill, and it now has a disallow / directive. I’m not sure if it’s being obeyed, because the CF dash still says it’s getting 700-1300 hits a day even with all the anti-bot, “CF managed robots” stuff for AI crawlers in there.
The content is so dry and irrelevant I just can’t even fathom 1/100th of that being legitimate human interest. But I thought these things just vacuumed up and stole everyone’s content instead of nailing their pages constantly?
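For reference, honoring a disallow / directive takes a crawler a handful of lines. A minimal sketch using Python's standard `urllib.robotparser`; the helper name `may_fetch`, the `ExampleBot` user agent and the example.com URL are made up for illustration, and the robots.txt text is passed in directly instead of being fetched:

```python
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A robots.txt that disallows everything for every crawler.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

if __name__ == "__main__":
    # "ExampleBot" and the URL are placeholders, not a real crawler or site.
    print(may_fetch(BLOCK_ALL, "ExampleBot", "https://example.com/page"))
```

None of this is enforced, of course. robots.txt is purely advisory, so a bot that skips the check will keep showing up in the dashboard regardless.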
You imply that "an expensive LLM service" is harmed by abuse, but every other service is not? Because their websites are "static" and "near-zero marginal cost"?
You have no clue what you are talking about.
Never in 15 years of running the website did we have such issues, and you can be sure that cache layers were in place already for it to last this long.
https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...
What "$FOO" actually is, is irrelevant. I'm curious how you would convince people that this sort of rule is fair.
The corp can always ban users who break ToS, after all. They don't need any help. The charitable initiative can't actually do that, can they?
> near-zero marginal cost
Lol, you single-handedly created a market for Anubis, and in the past 3 years the Cloudflare captchas have multiplied at least 10-fold; now they are even on websites that were very vocal against them. Many websites are still drowning: the GNU family's sites are regularly only accessible through the Wayback Machine. Spare me your tears.
It's not possible to know in advance what is static and what is not. I have some rather stubborn bots make several requests per second to my server, completely ignoring robots.txt and rel="nofollow", using residential IPs and browser user-agents. It's just a mild annoyance for me, although I did try to block them, but I can imagine it might be a real problem for some people.
I'm not against my website getting scraped, I believe being able to do that is an important part of what the web is, but please have some decency.
(TBH it's not clear to me that their marginal costs are low. They seem to pick based on narrative.)
How do you know the content is static?
Stealing the content from the whole planet & actively reducing the incentive to visit the sites without financial restitution is pretty bad.
Stop justifying their anti-social behavior because it lines your pockets.
I obviously disagree. I mean, on top of this we are talking about not-open OpenAI.
The gall. https://weirdgloop.org/blog/clankers