Even if they are honoring it right now, it would be a quick switch for them to simply ignore it.
Maybe your question is "how do we know whether whatever system GPTBot feeds downstream didn't just get your content via something else that crawls your site?" I am not sure we have any defense against that, other than signalling via robots.txt that our content is not intended for AI use.
Imo, the best solution would be to license your content so crawlers pay a fee for crawling and using it.
Does the Internet Archive itself block crawlers? It doesn't look like it according to their robots.txt, which even goes so far as to say "Please crawl our files."
What would stop an actor from maliciously complying with a robots.txt file by simply going to the Internet Archive instead?
Regardless, my original question still stands. These companies have already shown a lack of care about the data they train on. So if ethics have already gone out the window, what is to stop them from ignoring this file, if they aren't ignoring it already?
Also - out of curiosity - do you use any AI yourself?
AI from any project will allow AI to be used commercially, and thus I oppose it. Moreover, I oppose AI on various other principles, even independent of this: it further isolates people and can be used to develop other technologies that are too powerful for us to handle. In short, I believe human beings en masse are too stupid to use AI.
> Also - out of curiosity - do you use any AI yourself?
I do not, or at least I try my best not to. In fact, I hate AI with a passion. Obviously, there may be products here and there that have used AI and that I in turn use; what can you do? But I attempt to minimize any contact I have with AI: I don't use Grammarly or any form of auto-suggest, I use an ancient phone (and I RARELY use it; I hate smartphones), I don't use AI features in software such as AI noise reduction, and I turn off all automatic features in software that may have some AI behind them.
If I find out a website uses AI for content generation, I ban it and never visit again.
The other day I downloaded a text editor that looked cool but I deleted it because I realized it has an AI-console (even though I never used it).
I also work for a business and I convinced them not to use AI. We're an online magazine and it turns out the vast majority of our readers supported that decision.
In short, I am against AI because I believe it provides virtually no benefits to humanity, only detriments.
I automated my site's robots.txt[0] by scraping your site. It would be extra nice if darkvisitor.com exposed a plain-text or JSON representation of the list.
[0] https://tbeseda.com/blog/automating-my-robots-txt-to-block-a...
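The generation step can be sketched roughly like this. Note this is a minimal illustration with a hardcoded agent list, since no machine-readable feed exists yet; a real setup would scrape or fetch the names from a maintained source.

```python
# Sketch: generate a robots.txt that disallows known AI crawlers.
# The agent names below are a hypothetical hardcoded list, standing in
# for whatever list you scrape from a maintained source.
AI_AGENTS = ["GPTBot", "CCBot", "Google-Extended", "anthropic-ai"]

def build_robots_txt(agents):
    """Render one User-agent/Disallow record per crawler name."""
    lines = []
    for agent in agents:
        lines.append(f"User-agent: {agent}")
        lines.append("Disallow: /")
        lines.append("")  # blank line between records
    return "\n".join(lines)

print(build_robots_txt(AI_AGENTS))
```

Regenerating this file on a schedule keeps the block list current as new crawlers appear.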
Also, why are "AI" crawlers worse than "normal" crawlers?
Either way, this is an exercise in futility.
Is it really? Every drop of opposition towards AI in my book is a good thing. This robots.txt thing is a small drop maybe, but over time public hatred for AI can build and it might in fact be taken down. Especially outside the tech bubble, many people are ambivalent towards AI.
Yes, in modern society we are taught to value innovation and ignore its downsides, but the more vocal its opponents are, the more those downsides will become apparent. Hopefully, it will bring the ruin of all AI companies and research.
What's needed is indifference, not hate.
Crazy world.
A search engine will index your content to bring people to it through search. An AI crawler will take your content to recapitulate it and sell it to others. Obviously it's more complicated than this, but that is how someone who wishes to use this file might see it.
> Either way, this is an exercise in futility.
Not necessarily disqualifying. Laws against theft are also futile, in the sense that honest people don't need them and dishonest people don't follow them, and history since at least Hammurabi has been replete with examples of such laws not stopping theft. And yet. Seems worth the calories it costs to say "for the record, I do not give my consent for what you're doing".
Search engines and AI products are typically owned by the same company, and the AIs are fed with the data collected by the search engine. The only difference is whether the AI gets the data in real time or waits for the search engine to collect another data dump.
Fighting windmills as I see it.
I would back a general move to block crawlers from non-open models (whatever that means and if such a thing was practical) as it might be a strong lever to encourage good behaviour.
- Cloudflare
- Webserver-level user-agent blocking (Apache, nginx)
- Application-level user-agent blocking (`if request.user_agent == 'OpenAI'`)
None of them are ideal, since a crawler can simply change its user agent, but all of them seem like better options than robots.txt to me.
E.g. nginx's $http_user_agent variable.
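A minimal nginx sketch of that approach, using a `map` on `$http_user_agent`; the agent names here are illustrative, not an authoritative list:

```nginx
# Mark requests whose User-Agent matches known AI crawlers
# (case-insensitive regex matches; names are examples only).
map $http_user_agent $block_ai {
    default      0;
    ~*GPTBot     1;
    ~*CCBot      1;
}

server {
    listen 80;

    if ($block_ai) {
        return 403;
    }

    # ... rest of server config
}
```

Unlike robots.txt, this refuses the request outright rather than asking politely, though it still falls over the moment the crawler spoofs a browser user agent.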