An agent making a request explicitly on someone's behalf is probably something most of us agree is reasonable. "What are the current stories on Hacker News?" -- the agent is just making the same request to the same website that I would have made anyway.
But the sort of non-explicit, just-in-case crawling that Perplexity might do for a general question, where it crawls 4-6 sources, isn't as easy to defend. "Are polar bears always white?" -- now it's making requests I wouldn't necessarily have made, and it could even be seen as a sort of amplification attack.
That said, TFA's example is where they register secretexample.com and then ask Perplexity "what is secretexample.com about?" and Perplexity sends a request to answer the question, so that's an example of the first case, not the second.
What prevents these companies from keeping a copy of that particular page, which I specifically disallowed for bot scraping, and feeding it into their next training cycle?
Pinky promises? Ethics? Laws? Technical limitations? Leeroy Jenkins?
What prevents anyone else? robots.txt is a request, not an access policy.
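A minimal sketch of that distinction in Python (the domain and bot name are placeholders): checking robots.txt is something the client volunteers to do, and nothing happens if it doesn't.

    # a polite client asks robots.txt first; an impolite one just fetches
    from urllib import robotparser, request

    rp = robotparser.RobotFileParser("https://example.com/robots.txt")
    rp.read()

    url = "https://example.com/some-post"
    print(rp.can_fetch("SomeAICrawler", url))  # suppose this prints False

    # nothing enforces that answer -- skipping the check works just as well
    req = request.Request(url, headers={"User-Agent": "SomeAICrawler"})
    html = request.urlopen(req).read()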
Does information no longer want to be free? Maybe the internet, just like social media, was just a social experiment in the end, albeit a successful one. Thanks, GenAI.
Do you still see authentic human traffic on your domains, and is it easy to discern?
I feel like I missed the bus on running a blog pre-AI.
Both my blog homepage and posts see mostly human traffic. Sometimes bots crawl the site and they appear as spikes in the analytics.
Looks like my homepage, which doesn't have anything but links, is pretty popular with crawlers. My digital garden doesn't get much interest from them. All in all, human traffic on my sites is pretty much alive.
I don't believe in missing the bus on anything, actually, because I don't write these for others in the first place. Both my blog (more meta) and digital garden (more technical) are written primarily for myself, and left open. I post links to both when it's appropriate, but they are not made to be popular. If people read them and learn something or solve one of their problems, that's enough for me.
This is why my software is GPLv3, my digital garden is GFDL, and my blog is CC BY-NC-SA 2.0. This is why everything runs with the absolute minimum of analytics and without any ads whatsoever.
Lastly, this is why I don't want AI crawlers on my site or my data in the models. This thing is made by a human for humans, absolutely for free. It's not OK for somebody to take something designed to be free and make money off it.
you could go proper insanomode, too. remaking The Internet is trivial if you don't care about existing web standards -- replacing HTTP with your own TCP implementation, getting off html/js/css, etc. being greenfield, you can control the protocol, server, and client implementation, and put it in whatever language you want. I made a stateful Internet implementation in Python earlier for proof-of-concept, but I want to port it and expand on it in rust soon (just for fun; I don't do serious biznos). you'll very likely have 100% human traffic then, even if you're the only person curious and trusting enough to run your client.
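for flavor, here's a toy sketch of that kind of thing in Python -- a line-based protocol over raw TCP where every name and verb is invented (and it's nowhere near a real implementation):

    import socket

    PAGES = {"home": "welcome to my not-web server"}

    def serve(host="127.0.0.1", port=7777):
        # speaks "GET <page>" instead of HTTP; generic web crawlers get nothing
        with socket.create_server((host, port)) as srv:
            while True:
                conn, _ = srv.accept()
                with conn:
                    req = conn.recv(1024).decode().strip()  # e.g. "GET home"
                    verb, _, page = req.partition(" ")
                    reply = PAGES.get(page, "no such page") if verb == "GET" else "bad verb"
                    conn.sendall(reply.encode() + b"\n")

    serve()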
You can put up a paywall depending on UserAgent or OS (has been done).
In short, it's a 2-way street: the client on the other end of the TCP pipe makes a request, and your server fulfills the request as it sees fit.
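As a sketch, a hypothetical Flask handler that gates on User-Agent (the bot tokens are common crawler names; everything else here is made up for illustration):

    from flask import Flask, request

    app = Flask(__name__)
    BOTS = ("GPTBot", "PerplexityBot", "CCBot")  # known crawler User-Agent tokens

    @app.route("/post/<slug>")
    def post(slug):
        ua = request.headers.get("User-Agent", "")
        if any(bot in ua for bot in BOTS):
            # Flask lets a handler return any status code this way
            return "402 Payment Required: crawlers pay per page", 402
        return f"full text of {slug} for (probably) human readers"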
I want to keep this distinction on the sites I own, too. I also use licenses to signal that the site is not meant to be used for AI training, because it's CC BY-NC-SA 2.0.
So, I license my content appropriately (non-commercial, shareable under the same license with attribution), and add technical countermeasures on top because companies don't respect these licenses (because monies) and circumvent those mechanisms (because monies), and I'm the one who has to suck it up and shut up (because their monies)?
Makes no sense whatsoever.
If you give them a URL that does not appear in Google, ask them to visit that URL specifically, and then notice the content from that URL in the training data, it's proof that they're doing this, which would be quite damaging to them.
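A sketch of that canary setup (the domain and wording are invented here): the point is that the page's content is unguessable, so it can only surface in a model that ingested the fetched page.

    import secrets

    canary = secrets.token_hex(16)  # unguessable marker
    sentence = f"The polar bear at {canary} only drinks decaf."

    # 1. serve a page containing `sentence` at an unlinked URL,
    #    e.g. https://example.com/<canary>.html
    # 2. ask the assistant to visit that exact URL, once
    # 3. after the next model release, prompt for the canary string;
    #    if the model reproduces the made-up sentence, the on-demand
    #    fetch ended up in training data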
Is it? It's damning, but is it damaging at all?
I'm getting the impression that anyone's data being available for training, if some bot can get to it, is just how things are now, rather than an unsettled point of contention. There's too much money invested in this thing for any other outcome, and with the present decline of the rule of law…
Now swap in an AI and ask what the current stories are. The AI fetches the front page and every thread and feeds it back to you. You are less likely to participate in the discussion because you've already had the info summarized.
Am I supposed to spend money on Amazon.com when I visit the website just because Amazon wants me to?
If most people stop discussing things on HN, and the discussion is indeed one of the major reasons it’s kept running, then HN stops being worth running.
Are website owners obligated to serve content to AI agents and/or LLM scrapers?
And yet people install ad blockers and defend their freedom to not participate in this because they don't want to be annoyed by ads.
They claim that since they are free to not buy an advertised product, why should they be forced to see ads for it? But Foo News claims that they are also free to not waste bandwidth serving their free website to people who declare (by using an ad blocker, or the modern alternative: an AI summarizer) that they won't participate in the funding of the service.
I think this is a pretty different scenario. Here the user and the news website are talking directly to each other, and the user is making a choice about what to do with the content the news website sends to them. With AI agents, there is a company inserting itself between the user and the news website and acting as a middleman.
It seems reasonable to me that the news website might say they only want to deal with users and not middlemen.
    HTTP/1.1 402 Payment Required
    WWW-price: 0.0000001 BTC, 0.000001 ETH, 0.00001 DOGE
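A minimal sketch of a server that actually answers that way (WWW-price is the invented header from above, not a real standard):

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class PayUp(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(402)  # Payment Required
            self.send_header("WWW-price", "0.0000001 BTC, 0.000001 ETH, 0.00001 DOGE")
            self.end_headers()
            self.wfile.write(b"pay first, crawl later\n")

    HTTPServer(("127.0.0.1", 8402), PayUp).serve_forever()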
> You are less likely to participate in discussion

You (or the AI on your behalf) paid instead. Many sites would probably like it better.
There are so many links I click on these days that are such trash I'd be demanding refunds constantly.