nikitaga · 2mo ago

Scraping static content from a website at near-zero marginal cost to its server, vs scraping an expensive LLM service provided for free, are different things.

The former relies on fairly controversial ideas about copyright and fair use to qualify as abuse, whereas the latter is direct financial damage – by your own direct competitors no less.

It's fun to poke at a seeming hypocrisy of the big bad, but the similarity in this case is quite superficial.

84 comments · 32 top-level

PunchyHamster · 2mo ago · 31 in thread

> Scraping static content from a website at near-zero marginal cost to its server, vs scraping an expensive LLM service provided for free, are different things.

I bet people being fucking DDOSed by AI bots disagree

Also the fucking ignorance assuming it's "static content" and not something needing code running

remus · 2mo ago

I think the parent is just pointing out that these things lie on a spectrum. I have a website that consists largely of static content, and the (significant) scraping that occurs doesn't impact the site for general users, so I don't mind (and it means I get good, up-to-date answers from LLMs on the niche topic my site covers). If it did have an impact on real users, or cost me significant money, I would feel pretty differently.

0xEF · 2mo ago

Putting everything on a spectrum is what got us into this mess of zero regulation and moving goal posts. It's slippery slope thinking no matter which way we cut it, because every time someone calls for a stop sign to be put up after giving an inch, the very people who would have to stop will argue tirelessly for the extra mile.

Aerroon · 2mo ago

What mess are you talking about? The existence of LLMs? I think it's pretty neat that I can now get answers to questions I have.

This is something I couldn't have done before, because people very often don't have the patience to answer questions. Even Google ended up in loops of "just use Google" or "closed. This is a duplicate of X, but X doesn't actually answer the question" or references to dead links.

Are there downsides to this? Sure, but imo AI is useful.

daveidol · 2mo ago

I’d argue putting everything in terms of black and white is a bigger issue than understanding nuance

Den_VR · 2mo ago

I miss the www where the .html was written in vim or notepad.

mghackerlady · 2mo ago

It still can be. Do it. Go make your website in M$ Frontpage, for all I care

butlike · 2mo ago

Shameless plug: My music homepage follows the HTML 2.0 spec and is written by hand

https://sampleoffline.com/

consp · 2mo ago

Just did that for a test frontend for a module I needed to build (not my primary job, so I don't know anything about UI, but running in browsers was a requirement), so basic HTML with the bare minimum of JS and all DOM. Colleagues were very surprised. And yes, vim is still the go-to editor and will be for a long time, now that all the "IDEs" are pushing "AI" slop everywhere.

holler · 2mo ago

ahh yes, fresh off reading "Html For Dummies" I made my first tripod.com site

sdsd · 2mo ago

For me it was making a petpage for my neopets using https://lissaexplains.com/

It's still up in all its glory.

eloisius · 2mo ago

Also wild that from the tech bro perspective, the cost of journalism is just how much data transfer costs for the finished article. Authors spend their blood, sweat and tears writing and then OpenAI comes to Hoover it up without a care in the world about license, copyright or what constitutes fair use. But don’t you dare scrape their slop.

lelanthran · 2mo ago

> Also wild that from the tech bro perspective, the cost of journalism is just how much data transfer costs for the finished article.

Exactly. I think the unfairness can be mitigated if any model trained on public information, or on data generated by a model trained on public information, or with either of those in its ancestry, must be made public.

Then we don't have to hit (for example) Anthropic, we can download and use the models as we see fit without Anthropic whining that the users are using too much capacity.

eru · 2mo ago

> I bet people being fucking DDOSed by AI bots disagree

Are you sure it's a DDoS and not just a DoS?

MattJ100 · 2mo ago

Yes, it is. The worst offenders hammer us (and others) with thousands upon thousands of requests, and each request uses a unique IP address, making all per-IP limits useless.

We implemented an anti-bot challenge and it helped for a while. Then our server collapsed again recently. The perf command showed that the actual TLS handshakes inside nginx were using over 50% of our server's CPU, starving other stuff on the machine.

It's a DDoS.

troyvit · 2mo ago

You should see Cloudflare's control panel for AI bot blocking. There are dozens of different AI bots you can choose to block, and that doesn't even count the different ASNs they might use. So in this case I'd say that a DDoS is a decent description. It's not as bad as every home router on the eastern seaboard or something, but it's pretty bad.

SolarNet · 2mo ago

When every AI company does it from multiple data centers... yes it's distributed.

Bilal_io · 2mo ago

Uncoordinated DDoS, when multiple search and AI companies are hammering your server.

catoc · 2mo ago

> Are you sure it's a DDoS and not just a DoS?

I think these days it’s ‘DAIS’, as in your site just DAIS - from Distributed/Damned AI Scraping

1718627440 · 2mo ago

Off topic, but why is a DoS considered something to act on, often by just shutting down the service altogether? That results in the same DoS, just caused by the operator rather than by congestion. Actually it's worse, because now the requests will never be responded to at all, rather than after some delay. Why is the default not to just don't do anything?

pocksuppet · 2mo ago

It keeps the other projects hosted on the same server or network online. Blackhole routes are pushed upstream to the really big networks and they push them to their edge routers, so traffic to the affected IPs is dropped near the sender's ISP and doesn't cause network congestion.

DDoSers who really want to cause damage now target random IPs in the same network as their actual target. That way, it can't be blackholed without blackholing the entire hosting provider.

ImPostingOnHN · 2mo ago

> Why is the default not to just don't do anything?

Because ingress and compute costs often increase with every request, to the point where AI bot requests rack up bills of hundreds or thousands of dollars more than the hobbyist operator was expecting to spend.

echoangle · 2mo ago

I think some people use hosting that is paid per request/load, so having crawlers make unwanted requests costs them money.

lm411 · 2mo ago

> Also the fucking ignorance assuming it's "static content" and not something needing code running

Wild eh.

If it's not ai now, it's by default labelled "static content" and "near-zero marginal cost".

littlestymaar · 2mo ago

What's a database after all.

nikitaga (OP) · 2mo ago

All this reactionary outrage in the comments is funny. And lame.

Yes, for the vast majority of the internet, serving traffic is near zero marginal cost. Not for LLMs though – those requests are orders of magnitude more expensive.

This isn't controversial at all, it's a well understood fact, outside of this irrationally angry thread at least. I don't know, maybe you don't understand the economic term "marginal cost", thus not understanding the limited scope of my statement.

If such DDOSes as you mention were common, such a scraping strategy would not have worked for the scraper at all. But no, they're rare edge cases, from a combination of shoddy scrapers and shoddy website implementations, including the lack of even basic throttling for expensive-to-serve resources.

The vast majority of websites handle AI traffic fine though, either because they don't have expensive-to-serve resources, or because they properly protect such resources from abuse.

If you're an edge case who is harmed by overly aggressive scrapers, take countermeasures. Everyone with that problem should, that's neither new nor controversial.
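For readers wondering what "basic throttling for expensive-to-serve resources" could look like, here is a minimal token-bucket sketch, assuming a single-process Python server. `TokenBucket`, its parameters, and the numbers are hypothetical illustrations, not taken from any site or library. Since scrapers that rotate IP addresses defeat per-IP limits, this sketch throttles the resource as a whole rather than per client.

```python
import time
from typing import Optional


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, now: Optional[float] = None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.last = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        """Return True and consume one token if a request may proceed."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.last)
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Hypothetical limit for one expensive endpoint: 2 requests/second, bursts of 5.
expensive_endpoint = TokenBucket(rate=2.0, capacity=5.0)
```

A handler would call `expensive_endpoint.allow()` before doing the costly work and return an HTTP 429 when it comes back `False`. The `now` argument exists only so the behaviour can be checked deterministically.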

ipaddr · 2mo ago

"such DDOSes as you mention were common, such a scraping strategy would not have worked for the scraper at all"

They are common. The strategy works for the llm but not for the website owner or users who can't use a site during this attack.

The majority of sites are not handling AI fine. Getting DDoSed only part of the time is not acceptable. Countermeasures like blocking huge ranges can help but also lock out legitimate users.

nikitaga (OP) · 2mo ago

> They are common

Any actual evidence of the alleged scope of this problem, or just anecdotes from devs who are mad at AI, blown out of proportion?

fireflash38 · 2mo ago

It's not a cost for me to scrape an LLM.

It is a cost for me for an LLM to scrape me.

Why should I care about the costs that they have when they don't care about the costs I have?

grayhatter · 2mo ago

The extent of the utilization is new.

The number of bots that try to hide who they are, and don't bother to even check robots.txt is new.

expedition32 · 2mo ago

One euro is marginal for me; for someone else it is their daily meal.

juliangmp · 2mo ago

"They are rare edge cases" are we on the same internet?

not2b · 2mo ago · 6 in thread

I understand why OpenAI is trying to reduce its costs, but it simply isn't true that AI crawlers aren't creating very significant load, especially those crawlers that ignore robots.txt and hide their identities. This is direct financial damage and it's particularly hard on nonprofit sites that have been around a long time.

zer00eyz · 2mo ago

> but it simply isn't true that AI crawlers aren't creating very significant load.

And how much of this is users who are tired of walled gardens and enshittification? We murdered RSS, APIs and the "open web" in the name of profit and lock-in.

There is a path where "AI" turns into an ouroboros, tech eating itself, before being scaled down to run on end user devices.

stingraycharles · 2mo ago

These are ChatGPT and Claude Desktop crawlers we’re talking about? Or what is it exactly? Are these really creating significant load while not honoring robots.txt?

Genuinely interested.

63stack · 2mo ago

Is this the first time you are reading HN? Every day there are posts from people describing how AI crawlers are hammering their sites, with no end. Filtering user agents doesn't work because they spoof it, filtering IPs doesn't work because they use residential IPs. Robots.txt is a summer child's dream.

miki123211 · 2mo ago

They seem to mostly be third-party upstarts with too much money to burn, willing to do what it takes to get data, probably in hopes of later selling it to big labs. Maaaybe Chinese AI labs too, I wouldn't put it past them.

OpenAI et al seem to mostly be well-behaved.

cruffle_duffle · 2mo ago

I bet dollars to doughnuts that 95% of the traffic is from Claude and ChatGPT desktop / mobile and not literal content scraping for training.

crote · 2mo ago

That wouldn't explain the 1000x increase in traffic for extremely obscure content, or seeing it download every single page on a classic web forum.

wolvoleo · 2mo ago · 5 in thread

It's more ironic because without all the scraping OpenAI has done, there would have been no ChatGPT.

Also, it's not just the cost of the bandwidth and processing. Information has value too. Otherwise they wouldn't bother scraping it in the first place. They compete directly with the websites featuring their training data and thus they are taking away value from them just as the bots do from ChatGPT.

In fact the more I think of it, I think it's exactly the same thing.

expedition32 · 2mo ago

This leads me to thinking: I ask ChatGPT a question and they get the answer from GameFAQs.

But what happens if gamefaqs disappears because of lack of traffic?

Can LLM actually create or only regurgitate content.

Aerroon · 2mo ago

>Can LLM actually create or only regurgitate content.

Contrary to what others say, LLMs can create content. If you have a private repo you can ask the LLM to look at it and answer questions based on that. You can also have it write extra code. Both of these are examples of something that did not exist before.

In terms of gamefaqs, I could theoretically see an LLM play a game and based on that write about the game. This is theoretical, because currently LLMs are nowhere near capable enough to play video games.

wolvoleo · 2mo ago

It will remain in their scraped data, so they can keep including it in later training datasets if they wish. However, it won't be able to do live internet searches anymore. And it will not generate new content, of course, especially not about games released after the site goes down, since it won't know about them. Though it could of course correlate data from other sources that talk about the game in question.

stefanka · 2mo ago

They cannot create original content.

wolvoleo · 2mo ago

Well, they can make some up, like hallucinations. That's an additional problem: when the original site that provided the training data is gone, how can they verify the AI output to make sure it's correct?

sandeepkd · 2mo ago · 4 in thread

Let's not try to qualify the wrongs by picking a metric and evaluating just one side of it. A static website owner could be running on a very small budget, and the scraping from bots can bring down their business too. The chances of a static website owner burning through their own life savings are probably higher.

expedition32 · 2mo ago

Perhaps the long play is to destroy all small hobby websites until only an AI-directed web is left.

miki123211 · 2mo ago

If you're truly running a static site, you can run it for free, no matter how much traffic you're getting.

GitHub Pages is one way, but there are other platforms offering similar services. Static content just isn't that expensive to host.

The troubles start when you're actually running something dynamic that pretends to be static, like WordPress or MediaWiki. You can still reduce costs significantly with CDNs / caching, but many don't bother and then complain.

ezrast · 2mo ago

Setting aside the notion that a site presenting live-editability as its entire core premise is "pretending to be static", do the actual folks at Wikimedia, who have been running a top 10 website successfully for many years, and who have a caching system that worked well in the environment it was designed for, and who found that that system did not, in fact, trivialize the load of AI scraping, have any standing to complain? Or must they all just be bad at their jobs?

https://diff.wikimedia.org/2025/04/01/how-crawlers-impact-th...

jazzyjackson · 2mo ago

It's true it can be done, but many business owners are not hip to Cloudflare R2 buckets or GitHub Pages. Many are still paying for a whole dedicated server to run Apache (and WordPress!) to serve static files. These sites will go down when hammered by unscrupulous bots.

razingeden · 2mo ago · 4 in thread

It is direct financial damage if my server's not on an unmetered connection: after years of bills coming in around $3/mo, I got a surprise >$800 bill on a site nobody on earth appears to care about besides AI scrapers.

It hasn’t even been updated in years, so hell if I know why it needs to be fetched constantly and aggressively, but fuck every single one of these companies now whining about bots scraping and victimizing them. Here’s my violin.

gzread · 2mo ago

If you can identify the scraper you should have a valid legal case to recover damages.

thisislife2 · 2mo ago

Only if they had a robots.txt for their site.

gzread · 2mo ago

No, it's still illegal to DDoS sites that don't have robots.txt.

razingeden · 2mo ago

I hadn’t even considered that. Don’t know why that comment is greyed out or downvoted.

It’s a static site that hasn’t been updated since 2016, so it has since been moved to Cloudflare R2, where it’s getting a $0.00 bill, and it now has a disallow / directive. I’m not sure if it’s being obeyed, because the CF dash still says it’s getting 700-1300 hits a day even with all the anti-bot, “cf managed robots” stuff for AI crawlers in there.

The content is so dry and irrelevant I just can’t even fathom 1/100th of that being legitimate human interest, but I thought these things just vacuumed up and stole everyone’s content instead of nailing their pages constantly?
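For reference, a `disallow /` rule like the one mentioned above asks every compliant crawler to stay away from the whole site. The sketch below uses Python's standard `urllib.robotparser` to show how a well-behaved crawler would read such a file. The robots.txt text, bot name, and URL are made-up placeholders, and nothing here stops a crawler that never reads robots.txt in the first place.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents: ask all crawlers to stay out entirely.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent: str, url: str) -> bool:
    """Return True if the parsed rules permit `user_agent` to fetch `url`."""
    return parser.can_fetch(user_agent, url)


# A compliant crawler would get False here and skip the page.
print(may_fetch("ExampleBot", "https://example.com/any/page.html"))
```

The check runs on the crawler's side, which is why continued hits after adding the directive suggest bots that skip it.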

lm411 · 2mo ago · 1 in thread

That is ridiculous.

You imply that "an expensive llm service" is harmed by abuse, but, every other service is not? Because their websites are "static" and "near-zero marginal cost"?

You have no clue what you are talking about.

camillomiller · 2mo ago

Well he’s a simp

nslsm · 2mo ago · 1 in thread

The issue is that there are so many awful webmasters that have websites that take hundreds of milliseconds to generate and are brought down by a couple requests a second.

bakugo · 2mo ago

OpenAI must be the most awful webmasters of all, then, to need such sophisticated protections.

cicko · 2mo ago

Interesting how other people's cost is "near-zero marginal cost" while yours is "an expensive LLM service". Also, others' rights are "fairly controversial ideas about copyright and fair use" while yours is "direct financial damage". I like how you frame this.

alsetmusic · 2mo ago

Have you not seen the multiple posts that have reached the front page of HN with people taking self-hosted Git repos offline or having their personal blogs hammered to hell? Cause if you haven't, they definitely exist and get voted up by the community.

bakugo · 2mo ago

The cost is so marginal that many, many websites have been forced to add Cloudflare captchas or PoW checks before letting anyone access them, because the server would slow to a crawl from 1000 scrapers hitting it at once otherwise.

AmbroseBierce · 2mo ago

It's not like those models are expensive because of the usefulness that they extracted from scraping others without permission, right? You are not even scratching the surface of the hypocrisy.

VadimPR · 2mo ago

Getting scraped by abusive bots who bring down the website because they overload the DB with unique queries is not marginal. I spent a good half of last year adding extra layers of caching, Cloudflare, you name it, because our little hobby website kept getting DDoS'd by the bots scraping the web for training data.

Never in 15 years of running the website did we have such issues, and you can be sure that cache layers were in place already for it to last this long.

unsungNovelty · 2mo ago

"near-zero marginal costs". For whom exactly????

https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...

lelanthran · 2mo ago

I don't think a rule along the lines of "Doing $FOO to a corporate is forbidden, but doing $FOO to a charitable initiative is fine" is at all fair.

What "$FOO" actually is, is irrelevant. I'm curious how you would convince people that this sort of rule is fair.

The corp can always ban users who break ToS, after all. They don't need any help. The charitable initiative can't actually do that, can they?

ungreased0675 · 2mo ago

You’re describing the tragedy of the commons. No single raindrop thinks it’s responsible for the flood.

the_sleaze_ · 2mo ago

60% of our traffic is bot, on average. Sometimes almost 100%.

not_your_vase · 2mo ago

> net-zero marginal cost

Lol, you single-handedly created a market for Anubis, and in the past 3 years the Cloudflare captchas have multiplied at least 10-fold; now they are even on websites that were very vocal against them. Many websites are still drowning: the GNU family is regularly only accessible through the Wayback Machine.

Spare me your tears.

grishka · 2mo ago

> Scraping static content from a website at near-zero marginal cost to its server

It's not possible to know in advance what is static and what is not. I have some rather stubborn bots making several requests per second to my server, completely ignoring robots.txt and rel="nofollow", using residential IPs and browser user-agents. It's just a mild annoyance for me, although I did try to block them, but I can imagine it might be a real problem for some people.

I'm not against my website getting scraped; I believe being able to do that is an important part of what the web is, but please have some decency.

xmcqdpt2 · 2mo ago

AI providers also claim to have small marginal costs. The cost of tokens is supposedly based on pricing in model training, so not that different from e.g. your server costs being low but the content production costs being high. And in many cases AI companies are direct competitors (artists, musicians, etc.).

(TBH it's not clear to me that their marginal costs are low. They seem to pick based on narrative.)

SkiFire13 · 2mo ago

> Scraping static content

How do you know the content is static?

ori_b · 2mo ago

My website serving git that only works from Plan 9 is serving about a terabyte of web traffic monthly. Each page load is about 10 to 30 kilobytes. Do you think there's enough organic, non-scraper interest in the site that scrapers are a near-zero part of the cost?
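To put those figures in perspective, here is a rough back-of-the-envelope calculation. It assumes decimal units (1 TB = 10^12 bytes, 1 KB = 1000 bytes) and a 30-day month, and it ignores protocol overhead; the function name is hypothetical.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumes a 30-day month


def page_loads(total_bytes: float, bytes_per_page: float) -> float:
    """Number of page loads needed to account for `total_bytes` of traffic."""
    return total_bytes / bytes_per_page


monthly_bytes = 1e12  # "about a terabyte", decimal units assumed

for page_kb in (10, 30):
    loads = page_loads(monthly_bytes, page_kb * 1000)
    per_second = loads / SECONDS_PER_MONTH
    print(f"{page_kb} KB/page: {loads:,.0f} loads/month, {per_second:.1f} loads/second")
```

Under these assumptions that works out to roughly 33 to 100 million page loads a month, or about 13 to 39 per second around the clock.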

make3 · 2mo ago

Absolutely not, the former relies on controversial ideas to qualify as legal.

Stealing the content from the whole planet & actively reducing the incentive to visit the sites without financial restitution is pretty bad.

foobiekr · 2mo ago

You are, of course, ignoring the production costs of the static content that OpenAI is stealing.

Stop justifying their anti-social behavior because it lines your pockets.

swagmoney1606 · 2mo ago

And yet I have to pay in my time and cash to handle the constant DDoSes from the constant LLM scraping.

AtlasBarfed · 2mo ago

Because you say it is?

I obviously disagree. I mean, on top of this we are talking about not-open OpenAI.

gmerc · 2mo ago

It’s not for techbros to decide at what threshold of theft it’s actually theft. “My GPU time is more valuable than your CPU time” isn’t a thing, and Wikipedia’s latest numbers on scraping show that marginal costs at scale are a valid concern.

mcfedr · 2mo ago

I'm sure the copyright holders would consider your use of their content as direct financial damage

nozzlegear · 2mo ago

Are they, actually?

nickphx · 2mo ago

Speak for yourself.

karlshea · 2mo ago

I don’t know what world you live in but it’s not this one.

andrepd · 2mo ago

> Scraping static content from a website at near-zero marginal cost to its server

The gall. https://weirdgloop.org/blog/clankers

platybubsy · 2mo ago

Bait or genuine techbro? Hard to say
