Even if they are honoring it right now, it would be a quick switch for them to simply ignore it.
Maybe your question is "how do we know whether whatever system GPTBot feeds downstream didn't just get your content via something else that crawls your site?" I am not sure we have any defense against that, other than signalling via robots.txt that our content is not intended for AI use.
Imo, the best solution would be to license your content so crawlers pay a fee for crawling and using it.
Does the Internet Archive itself block crawlers? It doesn't look like it according to their robots.txt, which even goes so far as to say "Please crawl our files."
What would stop an actor from maliciously complying with a robots.txt file by simply going to the Internet Archive instead?
Regardless, my original question still stands. These companies have already shown a lack of care about the data they train on. So if ethics have already gone out the window, what is to stop them from ignoring this file, if they aren't ignoring it already?
Also - out of curiosity - do you use any AI yourself?
AI from any project will allow AI to be used commercially, and thus I oppose it. Moreover, I oppose AI on various other principles, even independent of this: it further isolates people and can be used to develop other technologies that are too powerful for us to handle. In short, I believe human beings en masse are too stupid to use AI.
> Also - out of curiosity - do you use any AI yourself?
I do not, or at least I try my best not to. In fact, I hate AI with a passion. Obviously, there may be products here and there that have used AI and that I in turn use; what can you do? But I attempt to minimize any contact I have with AI: I don't use Grammarly or any form of auto-suggest, I use an ancient phone (and I RARELY use it; I hate smartphones), I don't use AI features in software such as AI noise reduction, and I turn off all automatic features in software that may have some AI behind them.
If I find out a website uses AI for content generation, I ban it and never visit again.
The other day I downloaded a text editor that looked cool but I deleted it because I realized it has an AI-console (even though I never used it).
I also work for a business and I convinced them not to use AI. We're an online magazine and it turns out the vast majority of our readers supported that decision.
In short, I am against AI because I believe it provides virtually no benefits to humanity, only detriments.
I automated my site's robots.txt[0] by scraping your site. It would be extra nice if darkvisitor.com exposed a plain-text or JSON representation of the list.
[0] https://tbeseda.com/blog/automating-my-robots-txt-to-block-a...
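The generation step can be sketched roughly like this. Note this is a minimal illustration with a hardcoded agent list, since no machine-readable feed exists yet; a real setup would scrape or fetch the names from a maintained source.

```python
# Sketch: generate a robots.txt that disallows known AI crawlers.
# The agent names below are a hypothetical hardcoded list, standing in
# for whatever list you scrape from a maintained source.
AI_AGENTS = ["GPTBot", "CCBot", "Google-Extended", "anthropic-ai"]

def build_robots_txt(agents):
    """Render one User-agent/Disallow record per crawler name."""
    lines = []
    for agent in agents:
        lines.append(f"User-agent: {agent}")
        lines.append("Disallow: /")
        lines.append("")  # blank line between records
    return "\n".join(lines)

print(build_robots_txt(AI_AGENTS))
```

Regenerating this file on a schedule keeps the block list current as new crawlers appear.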
Also, why are "AI" crawlers worse than "normal" crawlers?
Either way, this is an exercise in futility.
Is it really? Every drop of opposition towards AI in my book is a good thing. This robots.txt thing is a small drop maybe, but over time public hatred for AI can build and it might in fact be taken down. Especially outside the tech bubble, many people are ambivalent towards AI.
Yes, in modern society we are taught to value innovation and ignore its downsides, but the more vocal its opponents are, the more those downsides will become apparent. Hopefully, it will bring the ruin of all AI companies and research.
What's needed is indifference, not hate.
Crazy world.
A search engine will index your content to bring people to it through search. An AI crawler will take your content to recapitulate it and sell it to others. Obviously it's more complicated than this, but that is how someone who wishes to use this file might see it.
> Either way, this is an exercise in futility.
Not necessarily disqualifying. Laws against theft are also futile, in the sense that honest people don't need them and dishonest people don't follow them, and history since at least Hammurabi has been replete with examples of such laws not stopping theft. And yet. Seems worth the calories it costs to say "for the record, I do not give my consent for what you're doing".
Search engines and AI products are typically owned by the same company, and the AIs are fed with the data collected by the search engine. The only difference is whether the AI gets the data in real time or waits for the search engine to collect another data dump.
Fighting windmills as I see it.
I would back a general move to block crawlers from non-open models (whatever that means and if such a thing was practical) as it might be a strong lever to encourage good behaviour.
- Cloudflare
- Webserver-level user-agent blocking (Apache, nginx)
- Application-level user-agent blocking (`if request.user_agent == 'OpenAI'`)
None of them are ideal, since a crawler can simply change its user agent, but all of them seem like better options than robots.txt to me.
E.g. nginx's $http_user_agent variable.
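A minimal nginx sketch of that approach, using a `map` on `$http_user_agent`; the agent names here are illustrative, not an authoritative list:

```nginx
# Mark requests whose User-Agent matches known AI crawlers
# (case-insensitive regex matches; names are examples only).
map $http_user_agent $block_ai {
    default      0;
    ~*GPTBot     1;
    ~*CCBot      1;
}

server {
    listen 80;

    if ($block_ai) {
        return 403;
    }

    # ... rest of server config
}
```

Unlike robots.txt, this refuses the request outright rather than asking politely, though it still falls over the moment the crawler spoofs a browser user agent.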