I bet people being fucking DDoSed by AI bots disagree
Also the fucking ignorance of assuming it's "static content" and not something that needs code running
This is something I couldn't have done before, because people very often don't have the patience to answer questions. Even Google searches ended up in loops of "just use Google" or "closed. This is a duplicate of X, but X doesn't actually answer the question" or references to dead links.
Are there downsides to this? Sure, but imo AI is useful.
Just prompt it.
It's still up in all its glory.
Exactly. I think the unfairness can be mitigated if any model trained on public information, or on data generated by a model trained on public information, or with either of those anywhere in its ancestry, must be made public.
Then we don't have to hit (for example) Anthropic's servers; we can download and use the models as we see fit, without Anthropic whining that users are using too much capacity.
Are you sure it's a DDoS and not just a DoS?
We implemented an anti-bot challenge and it helped for a while, but recently our server collapsed again. The perf command showed that TLS handshakes inside nginx alone were using over 50% of the server's CPU, starving everything else on the machine.
It's a DDoS.
I think these days it’s ‘DAIS’, as in your site just DAIS: Distributed/Damned AI Scraping
DDoSers who really want to cause damage now target random IPs in the same network as their actual target. That way, it can't be blackholed without blackholing the entire hosting provider.
Because ingress and compute costs often increase with every request, to the point where AI bot requests rack up bills hundreds or thousands of dollars higher than the hobbyist operator was expecting to spend.
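To put rough numbers on that, here's a back-of-envelope sketch; the pricing and traffic figures are assumptions for illustration, not anyone's actual bill:

    # Back-of-envelope scraping cost; all numbers are assumed.
    EGRESS_PER_GB = 0.09     # USD/GB, a common cloud egress rate
    AVG_RESPONSE_MB = 0.5    # assumed average response size
    requests = 10_000_000    # assumed: a month of aggressive scraping

    gb = requests * AVG_RESPONSE_MB / 1024
    print(f"{gb:,.0f} GB egress -> ${gb * EGRESS_PER_GB:,.0f}")
    # ~4,883 GB -> ~$439/month, before any compute or database cost

And that's just bandwidth; if each request also triggers real compute, the bill climbs further.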
Wild eh.
If it's not AI, these days it gets labelled "static content" and "near-zero marginal cost" by default.
Yes, for the vast majority of the internet, serving traffic is near zero marginal cost. Not for LLMs though – those requests are orders of magnitude more expensive.
This isn't controversial at all; it's a well-understood fact, outside of this irrationally angry thread at least. I don't know, maybe you don't understand the economic term "marginal cost", and thus the limited scope of my statement.
If DDoSes like the ones you mention were common, that scraping strategy wouldn't have worked for the scrapers at all. But no, they're rare edge cases, caused by a combination of shoddy scrapers and shoddy website implementations, including the lack of even basic throttling for expensive-to-serve resources.
The vast majority of websites handle AI traffic fine, though, either because they don't have expensive-to-serve resources or because they properly protect such resources from abuse.
If you're an edge case who is harmed by overly aggressive scrapers, take countermeasures. Everyone with that problem should; that's neither new nor controversial.
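For the record, "basic throttling" here means something as simple as a per-IP token bucket in front of the expensive endpoints. A minimal Python sketch (the rate and burst values are made up; real deployments would usually do this at the reverse proxy, e.g. nginx's limit_req, but the principle is the same):

    import time
    from collections import defaultdict

    RATE = 1.0   # assumed: tokens refilled per second, per client
    BURST = 5.0  # assumed: maximum bucket size (short bursts allowed)

    buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

    def allow(client_ip: str) -> bool:
        """Token bucket: refill by elapsed time, spend one token per request."""
        b = buckets[client_ip]
        now = time.monotonic()
        b["tokens"] = min(BURST, b["tokens"] + (now - b["last"]) * RATE)
        b["last"] = now
        if b["tokens"] >= 1.0:
            b["tokens"] -= 1.0
            return True
        return False  # caller should answer 429 Too Many Requests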
They are common. The strategy works for the LLM company, but not for the website owner or the users who can't use the site during the attack.
The majority of sites are not handling AI traffic fine. Getting DDoSed even part of the time is not acceptable. Countermeasures like blocking huge IP ranges can help, but they also lock out legitimate users.
It is a cost for me when an LLM scrapes my site.
Why should I care about the costs they have when they don't care about the costs I have?
The number of bots that try to hide who they are and don't even bother to check robots.txt is new.
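And checking robots.txt is trivial; Python even ships a parser in the standard library. A sketch of what a well-behaved crawler would do (the bot name and URLs are placeholders):

    from urllib import robotparser

    USER_AGENT = "ExampleBot/1.0 (+https://example.com/bot)"  # honest UA, placeholder

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt

    url = "https://example.com/some/expensive/page"
    if rp.can_fetch(USER_AGENT, url):
        print("allowed to fetch", url)
    else:
        print("robots.txt disallows", url)  # a polite bot stops here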