I think this is unfortunately much less true than expected... lawyers using ChatGPT... teachers using ChatGPT... even professors using ChatGPT... as if it's a source of truth.
This usually happens because people don’t read the documentation to understand why something isn’t working in the first place, or because the documentation isn’t clear.
If anything, an LLM makes this sort of stuff more accessible.
Anecdotally, I find using something like ChatGPT to rubber duck engineering problems with various libraries to be much more enjoyable and useful than going to Stack Overflow or mucking through overly verbose (or not verbose enough) docs.
I was happy that they'd be including documents on topics I found interesting, and things I wrote, in the word-adjacency training of their foundation model. That'd mean the model would be more useful to me. But the robots.txt stuff is weird. Maybe it's because I've had
User-agent: Killer AI
Disallow: /~superkuh/
in there for the last 10 years? /s

All of these “we’re not letting bots crawl our site!” posts make me feel like I’ve travelled back in time to when having web spiders crawl your site was a big deal. You can’t really prevent people from using tools wrong, and it’s odd that so many people care about this futile attempt to insulate themselves from stupid users that it made the front page of HN.
The worst part is, if an LLM has already read your docs and the interaction you fear your users having with LLMs comes to pass, they will have misapprehensions based on an old version of your docs, which will be even more wrong.
Allow me to prepare you for the future now, before you have to hear it from someone else: you will soon be getting email spam about LLM Algorithm Optimization. LLMAO firms are probably already organizing around the bend of time; we’re just a little before they become visible.
Documentation, even good documentation, usually only answers the question “What does this method/class/general idea do?” Really good docs will come with some examples of connecting A and B. But they will often not include examples of connecting A to E when you have to transform via P because of business requirements, and almost never tell you how to incorporate third-party libraries X, Y, and Z.
As an engineer, I can read the docs and figure out the bits, but having an LLM suggest some of the intermediary or glue steps, even if it's wrong sometimes, is a benefit I don't get from good documentation alone.
Generally speaking, though, you can also cut back on hallucination by asking a second LLM for a source, using good retrospection, and adding system messages to ensure that if the model doesn't know an answer it says so rather than making one up.
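As a rough illustration of that last point, here is a minimal sketch in the OpenAI chat-message format; the exact system-prompt wording is an assumption, and you'd pass the resulting list to whatever chat-completion call you use:

```python
# Sketch: wrap every user question with a "don't guess" system message.
# The prompt text is an assumption -- tune it for your use case.

DONT_GUESS = (
    "You are a careful assistant. If you do not know the answer, "
    "say 'I don't know' instead of inventing one, and cite a source "
    "for any factual claim you make."
)

def build_messages(question: str) -> list[dict]:
    """Prepend the don't-guess system prompt to a user question."""
    return [
        {"role": "system", "content": DONT_GUESS},
        {"role": "user", "content": question},
    ]

msgs = build_messages("What does the --frobnicate flag do?")
# msgs[0] is the system message; pass msgs as the `messages` argument
# of your chat-completion call.
```

This doesn't eliminate hallucination, but a standing instruction to decline rather than guess measurably changes the model's default behavior.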
Really, I think hallucination is the wrong word; bullshitting or gaslighting might be better. You're asking it something, and it thinks you want an answer, any answer, so if it doesn't know, it makes one up. Similar to people who confess to crimes they didn't commit because of distressing interrogation tactics.
My docs will include tutorial links at the top, and those tutorials will focus on accomplishing common tasks.
I believe that's a good jumping off point.
I sympathise. I've recently discovered that apparently I have enough Internet clout that ChatGPT knows about me. As in I can carefully construct a prompt and it will unmistakably reference me in particular. Don't even need to provide my name in the prompt.
Except, every fucking detail of what it "knows" about me is 100% false, and there's nothing I can do to correct it. I'm from a wrong country, I did things in my career that I absolutely didn't, etc.
Needless to say, I also blocked its crawler.
The guy who posted about blocking OpenAI, so it won't answer questions about his software wrongly (meaning incompletely), ignores that his documentation is inaccessible to many less technically literate people. LLMs help bridge the gap, getting newbies using software before they can understand the manuals.
Another thing is that GPT-4 can do live retrieval of websites in response to users' questions. I imagine that's a different crawler. Are they going to block that too?
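For what it's worth, OpenAI documents these as separate user agents: GPTBot for training crawls and ChatGPT-User for live browsing on a user's behalf, so blocking one doesn't block the other. A robots.txt that turns both away might look something like:

```
# Block OpenAI's training crawler
User-agent: GPTBot
Disallow: /

# Block the separate agent used for live browsing in ChatGPT
User-agent: ChatGPT-User
Disallow: /
```

Of course, this only deters crawlers that choose to honor robots.txt in the first place.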
You are correct, but if I demonstrate that I have done what I could to deny OpenAI access, and they still have it in their model, then I probably have more legal recourse against them.
I wonder if they can selectively block or remove specific content from the LLM. Personally, I think it's a fool's errand to even try.
AI chat is the new interface to search; I use AI-powered search engines for 90 percent of my searches. Sometimes I still go to the source website, so there's still a chance of search engines bringing sites revenue.
Personally, I think there should be a way for them to reward sites, in a Medium-like program where views, or uses as a resource, earn points toward a share of the month's revenue or something.
There is no way to know that, and even if it ends up being true, blocking OpenAI will likely make the problem worse; e.g., the AI's answers will be worse without access to the documentation.
For example, it gave me really wrong info when I asked about the latest version of Next.js. I asked it to double-check on their website at the URL, and it said sorry, here's the correct info, and all was good. I've never gotten wrong answers I couldn't have it fix, assuming it has internet access.
Not having that information in the system at all will only degrade the answers, not change who is asking.
I wish I could, but I bet most crawlers would just ignore robots.txt.