bayindirh 9mo ago

As a person who has a couple of sites out there, and witnesses AI crawlers coming and fetching pages from these sites, I have a question:

What prevents these companies from keeping a copy of a particular page, which I specifically disallowed for bot scraping, and feeding it to their next training cycle?

Pinky promises? Ethics? Laws? Technical limitations? Leeroy Jenkins?

Aeolun 9mo ago

> What prevents these companies from keeping a copy of that particular page, which I specifically disallowed for bot scraping, and feed it to their next training cycle?

What prevents anyone else? robots.txt is a request, not an access policy.
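
To illustrate why robots.txt is only a request: the check runs on the crawler's side, using something like Python's standard `urllib.robotparser`. The sketch below assumes a made-up robots.txt and example URLs; `is_allowed` is a hypothetical helper, and a crawler that never calls it is not stopped by anything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration; the rules are assumptions.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if a *compliant* client may fetch the URL.

    The server never runs this; a well-behaved crawler does, voluntarily.
    """
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed("GPTBot", "https://example.com/post"))
    print(is_allowed("Mozilla/5.0", "https://example.com/post"))
```

Nothing here touches the network or the server's access control, which is the point: honoring the file is the client's choice.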

utbabya 9mo ago

This honor system mostly worked at scale because interests aligned, which no longer seems to be the case.

Does information no longer want to be free? Maybe the internet, just like social media, was just a social experiment in the end, albeit a successful one. Thanks, GenAI.

egypturnash 9mo ago

“Information Wants To Be Free. Information also wants to be expensive. ...That tension will not go away.” - the full aphorism

https://en.wikipedia.org/wiki/Information_wants_to_be_free

windexh8er 9mo ago

Can the Terms of Service of individual content creators leverage a "death of a thousand cuts" model to produce a legal honeypot which would require organizations like Perplexity to be bound up in tens of thousands of conciliation court cases?

Big Tech has hidden behind ToS for years. Now it seems as though ToS only work for them, not against them. This seems easy to orchestrate and prove, forcing these companies into a legal nightmare or risking insolvency under the load of cases filed against them.

Why couldn't something like this be used to flip the table? A conciliation brigading, of sorts.

accrual 9mo ago

Thanks for sharing your experience. A little off-topic, but I'd like to start hosting some personal content, guides/tutorials, etc.

Do you still see authentic human traffic on your domains? Is it easy to discern?

I feel like I missed the bus on running a blog pre-AI.

bayindirh (OP) 9mo ago

I intentionally don't keep detailed analytics on my homepage server and my digital garden, because I respect my users and don't want to push unnecessary JavaScript on them. The blog platform I use (Mataroa) keeps rudimentary analytics (essentially page hit counters, nothing more) on the index, RSS and per post.

Both my blog homepage and posts see mostly human traffic. Sometimes bots crawl the site and they appear as spikes in the analytics.

Looks like my homepage, which doesn't have anything but links, is pretty popular with crawlers. My digital garden doesn't get much interest from them. All in all, human traffic on my sites is pretty much alive.

I don't believe in missing the bus in anything actually, because I don't write these for others, first. Both my blog (more meta) and digital garden (more technical) are written for myself primarily, and left open. I post links to both when it's appropriate, but they are not made to be popular. If people read it and learn something or solve one of their problems, that's enough for me.

This is why my software is GPLv3, Digital Garden is GFDL and blog is CC BY-NC-SA 2.0. This is why everything is running with absolutely minimum analytics and without any ads whatsoever.

Lastly, this is why I don't want AI crawlers on my site and my data in the models. This thing is made by a human for humans, absolutely for free. It's not OK for somebody to sell something designed to be free and make money off it.

accrual 9mo ago

> I intentionally don't keep detailed analytics on my homepage server and my digital garden, because I respect my users and don't want to push unnecessary JavaScript on them.

Absolutely, I'm in agreement here. I want to run a JS-free blog, just plain old static HTML. I plan to use GoAccess to parse the access logs but that's it. I think I would find it encouraging to see real human traffic.

> I don't write these for others, first. Both my blog (more meta) and digital garden (more technical) are written for myself primarily, and left open.

That is a great way to view it, thank you.

kldg 9mo ago

if you do analytics, it is not so hard, but then you need to store user data (if not directly, then worse, with a third party), which should be viewed as a liability. I see ~2/3 human traffic, ~1/3 bot traffic (I just parse user agent strings and count whitelisted browsers as human), but my main landing page is all dynamically populated webgl. I just asked Gemini what it sees on the website, and it states "The page appears to be loading, with the text "Loading room data...". There are also labels for "BG", "FG", and "CURSOR", and a background weather animation." so I can feel reasonably confident I don't need to worry about AI, for now; it needs a machine-friendly frontend.

you could go proper insanomode, too. remaking The Internet is trivial if you don't care about existing web standards -- replacing HTTP with your own TCP implementation, getting off html/js/css, etc. being greenfield, you can control the protocol, server, and client implementation, and put it in whatever language you want. I made a stateful Internet implementation in Python earlier for proof-of-concept, but I want to port it and expand on it in rust soon (just for fun; I don't do serious biznos). you'll very likely have 100% human traffic then, even if you're the only person curious and trusting enough to run your client.
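
A rough sketch of the user-agent counting described above (parse the string, count allow-listed browsers as human). The token lists and the `classify` and `traffic_share` helpers are assumptions for illustration, not kldg's actual code, and user agents are trivially spoofed, so the split is only an estimate.

```python
from collections import Counter

# Assumed token lists; real allow-lists would be longer and site-specific.
BROWSER_TOKENS = ("firefox/", "chrome/", "safari/", "edg/")
BOT_TOKENS = ("bot", "crawler", "spider")


def classify(user_agent: str) -> str:
    """Label a User-Agent string as 'human' or 'bot'."""
    ua = user_agent.lower()
    # Self-declared bots win even if they also mimic a browser token.
    if any(token in ua for token in BOT_TOKENS):
        return "bot"
    if any(token in ua for token in BROWSER_TOKENS):
        return "human"
    # Anything not on the allow-list is counted as a bot.
    return "bot"


def traffic_share(user_agents):
    """Return the fraction of requests classified as human and as bot."""
    counts = Counter(classify(ua) for ua in user_agents)
    total = sum(counts.values())
    if total == 0:
        return {"human": 0.0, "bot": 0.0}
    return {kind: counts[kind] / total for kind in ("human", "bot")}
```

Feeding it the User-Agent column of an access log gives a human/bot ratio without storing anything else about visitors.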

71bw 9mo ago

> I made a stateful Internet implementation in Python earlier for proof-of-concept

Is there a repo or some other form of public access? I'd like to see this.

1024core 9mo ago

It's your server. You're free to do whatever you want. You can serve different versions of the page depending on the User-Agent (has been done many times before).

You can put up a paywall depending on User-Agent or OS (has been done).

In short, it's a 2-way street: the client on the other end of the TCP pipe makes a request, and your server fulfills the request as it sees fit.
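
As a minimal sketch of "the server fulfills the request as it sees fit", here is a standard-library handler that picks a response from the User-Agent header. The blocked tokens, response bodies, and the `choose_response` helper are assumptions for illustration; a client can simply send a different header.

```python
from http.server import BaseHTTPRequestHandler

# Assumed list of crawler tokens to refuse; adjust to taste.
BLOCKED_AGENT_TOKENS = ("gptbot", "ccbot")


def choose_response(user_agent: str) -> tuple[int, str]:
    """Decide the status code and body based on the User-Agent header."""
    ua = (user_agent or "").lower()
    if any(token in ua for token in BLOCKED_AGENT_TOKENS):
        return 403, "Automated access is not permitted.\n"
    return 200, "Hello, human reader.\n"


class Handler(BaseHTTPRequestHandler):
    """Serve a different page depending on who claims to be asking."""

    def do_GET(self):
        status, body = choose_response(self.headers.get("User-Agent", ""))
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
```

To try it locally, pass `Handler` to `http.server.HTTPServer(("127.0.0.1", 8000), Handler)` and call `serve_forever()`; the address and port are arbitrary.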

tempfile 9mo ago

The way to prevent people from downloading your pages and using them is to take them off the public internet. There are laws to prevent people from violating your copyright or from preventing access to your service (by excessive traffic). But there is (thankfully) no magical right that stops people from reading your content and describing it.

bayindirh (OP) 9mo ago

Many site operators want people to access their content, but want to prevent AI companies from scraping their sites for training data. People who think like that made tools like Anubis, and it works.

I also want to keep this distinction on the sites I own. I also use licenses to signal that this site is not to be used for AI training, because it's CC BY-NC-SA 2.0.

So, I license my content appropriately (non-commercial, attribution required, shareable only under the same license), add technical countermeasures on top because companies don't respect these licenses (because monies) and circumvent these mechanisms (because monies), and I'm the one who has to suck this up and shut up (because their monies)?

Makes no sense whatsoever.

zzo38computer 9mo ago

I don't want AI companies to scrape my sites (or use the files I wrote) for training data either, but that is not specifically what I am trying to stop (unless the files are supposed to be private and unpublished). I should not stop them from using the files for what they want, once they have them. (I also specifically do not want to block use of lynx, curl, Dillo, etc.)

What I want to stop is excessive crawling and scraping of my server. Once they have the file they can do what they want with it. Another comment (44786237) mentions that robots.txt is only for restricting recursive access; I agree, and that is what should be blocked. They also should not access the same file several times in quick succession, since it should be unnecessary to do so, just as they should not access all of the files. (If someone wants to make a mirror of the files, there may be other ways, e.g. an archive file, if one is available, to download many files at once (possibly because the site operator made their own index and packaged it this way). If it is a git repository, then it can be cloned.)
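
One way to express "not several times quickly, and not everything at once" is a per-client sliding-window limit. This is a minimal in-memory sketch; the class name, the limits, and keying by a client identifier such as an IP address are all assumptions, and it does not block lynx, curl, or Dillo as long as they stay under the limit.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client_id -> timestamps of recent hits

    def allow(self, client_id: str, now: Optional[float] = None) -> bool:
        """Record a request and report whether it is within the limit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

A server would call `allow(client_ip)` per request and answer with HTTP 429 when it returns False; what happens to a file after it has been served is outside what this controls.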

tempfile 9mo ago

Of course some people want that. And at the moment they can prevent it. But those methods may stop working. Will it then be alright to do it? Of course not, so why bother mentioning that they are able to prevent it now - just give a justification.

Your license is probably not relevant. I can go to the cinema and watch a movie, then come on this website and describe the whole plot. That isn't copyright infringement. Even if I told it to the whole world, it wouldn't be copyright infringement. Probably the movie seller would prefer it if I didn't tell anyone. Why should I care?

I actually agree that AI companies are generally bad and should be stopped - because they use an exorbitant amount of bandwidth and harm the services for other users. At least they should be heavily taxed. I don't even begrudge people for using Anubis, at least in some cases. But it is wrong-headed (and actually wrong in fact) to try to say someone may or may not use my content for some purpose because it hurts my feelings or it messes with my ad revenue. We have laws against copyright infringement, and to prevent service disruption. We should not have laws that say, yes you can read my site but no you can't use it to train an LLM, or to build a search index. That would be unethical. Call for a windfall tax if they piss you off so much.

account42 9mo ago

> Many site operators want people to access their content, but prevent AI companies from scraping their sites for training data.

That is unfortunately not a distinction that is currently legally enforceable. Until that changes all other "solutions" are pointless and only cause more harm.

> People who think like that made tools like Anubis, and it works.

It works to get real humans like myself to stop visiting your site while scrapers will have people whose entire job is to work around such "protections". Just like traditional DRM inconveniences honest customers and not pirates. And to be clear, what you are advocating for is DRM.

> I also want to keep this distinction on the sites I own. I also use licenses to signal that this site is not good to use for AI training, because it's CC BY-NC-SA-2.0.

If AI crawlers cared about that we wouldn't be talking about this issue. A license can only give more permissions than there are without one.

hombre_fatal 9mo ago

I guess that's a question that might be answered by the NYT vs OpenAI lawsuit at least on the enforceability of copyright claims if you're a corporation like NYT.

If you don't have the funds to sue an AI corp, I'd probably think of a plan B. Maybe poison the data for unauthenticated users. Or embrace the inevitability. Or see the bright side of getting embedded in models as if you're leaving your mark.

miki123211 9mo ago

The fact that it would be discovered almost immediately.

If you give them a URL that does not appear in Google, ask them to visit that URL specifically, and then notice the content from that URL in the training data, it's proof that they're doing this, which would be quite damaging to them.
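
The test described above could be set up roughly like this: publish a page at an unguessable, unlinked URL containing a unique phrase, then later look for that phrase in model output or a published dataset. The `/canary/` path, the phrase format, and both helper names are hypothetical.

```python
import secrets


def make_canary(base_url: str) -> dict:
    """Create an unlinked URL and a unique phrase to plant on that page."""
    token = secrets.token_hex(16)  # 32 hex characters, practically unguessable
    return {
        "token": token,
        "url": f"{base_url.rstrip('/')}/canary/{token}",
        "phrase": f"canary-{token}",
    }


def contains_canary(text: str, canary: dict) -> bool:
    """Check whether some text (e.g. model output) reproduces the phrase."""
    return canary["phrase"] in text
```

A match only shows that the page was fetched and its content retained; it says nothing on its own about how the page was discovered, so the URL should be shared with exactly one party.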

Freak_NL 9mo ago

> […] it's proof that they're doing this, which would be quite damaging to them.

Is it? It's damning, but is it damaging at all?

I'm getting the impression that anyone's data being available for training if some bot can get to it is just how things are now, rather than an unsettled point of contention. There's too much money invested in this thing for any other outcome, and with the present decline of the rule of law…

autoexec 9mo ago

Nothing, and that's why I expect they all do it.

tintor 9mo ago

technical limitations / data poisoning measures
