Ask HN: Does ChatGPT respect robots.txt?
I was looking for more information on whether ChatGPT (and similar LLMs) respect robots.txt directives.
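For context, by "directives" I mean the kind of rules a site owner puts in robots.txt to opt crawlers out. A sketch of what that looks like (the user-agent name here is hypothetical; I couldn't find a documented one for whatever OpenAI used):

  # Block a specific crawler from the whole site
  User-agent: ExampleGPTCrawler
  Disallow: /

  # Block all crawlers from one directory
  User-agent: *
  Disallow: /private/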
I couldn't find any details about the crawlers used while ChatGPT was being trained, or the rules those crawlers followed.
When I asked ChatGPT directly, it said: "As an AI language model, I do not have the ability to crawl the web on my own. However, as a general principle, web crawlers should follow the rules specified in the website's". When asked further about who created the dataset and what method was used, the response was: "The dataset used to train me was created by OpenAI, the organization that developed and maintains my system. OpenAI's team of researchers and engineers collected the training data from a wide variety of sources, including books, articles, websites, and other publicly available text data."
So, no clear answer as far as I can tell. Building a dataset of that size manually would be near impossible, even given the strength of the team, so my assumption is that crawlers of some kind were used. If anyone knows more or can shed light on this, it would be great.
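To be concrete about what "respecting robots.txt" would mean in practice: a compliant crawler checks each URL against the site's robots.txt before fetching it. A minimal sketch using Python's standard-library urllib.robotparser (the user-agent string is made up, since that's exactly the unknown here):

  from urllib import robotparser

  # Hypothetical user-agent; the real one (if any) is what I'm asking about
  UA = "ExampleGPTCrawler"

  rp = robotparser.RobotFileParser()
  rp.set_url("https://example.com/robots.txt")
  rp.read()  # fetch and parse the site's robots.txt

  url = "https://example.com/some/article.html"
  if rp.can_fetch(UA, url):
      print("allowed:", url)
  else:
      print("disallowed by robots.txt:", url)

A crawler that skips every URL failing this check is respecting robots.txt. Whether whatever assembled ChatGPT's training data did something equivalent is the question.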