Ask HN: Is there a robots.txt equivalent for LLMs? like LICENSEME.txt?

4 points · devrob · 2y ago · 2 comments

If you run a blog and don't want to allow LLM crawlers to train on your content, do you have options?

2 comments · 2 top-level

I guess robots.txt is still the way to go if you selectively allow only crawlers that promise not to use the data that way.

Either way, you end up allowing or blocking specific bots by name. However, just as with ordinary web crawlers, honoring robots.txt is optional.
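For what it's worth, here is a minimal sketch of the robots.txt approach, checked with Python's standard urllib.robotparser. The user-agent tokens (GPTBot, CCBot, Google-Extended) are examples of tokens that AI-related crawlers have published; treat the list as an assumption and check each vendor's current documentation. example.com and the is_allowed helper are placeholders for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: block a few AI-related crawler tokens, allow everyone else.
# The token list is an assumption and will go stale; verify it before use.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str = "https://example.com/post/1") -> bool:
    """Return True if the rules above permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "CCBot", "Google-Extended", "Googlebot"):
        print(agent, "allowed" if is_allowed(agent) else "blocked")
```

This only shows what a well-behaved crawler would conclude from the file. It does nothing against a bot that ignores robots.txt.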

The insidious thing about AI models is that it is difficult, or practically impossible, to prove that a model was trained on your data.

It is difficult to establish a standard like robots.txt. There was also .well-known/security.txt, proposed by Edwin Foudil and later published as RFC 9116. Some sites serve it, but it hasn't really become universal.

legrande2y ago

Ironically my blog is written with the help of an LLM, so AI scraper bots are trained on their own output.

But if you are concerned there's a good resource here for blocking them: https://darkvisitors.com/
