I think this is unfortunately much less true than expected... lawyers using ChatGPT... teachers using ChatGPT... even professors using ChatGPT... as if it's a source of truth.
This usually happens because people don’t read the documentation to understand why something isn’t working in the first place, or because the documentation isn’t clear.
If anything, an LLM makes this sort of stuff more accessible.
Anecdotally, I find using something like ChatGPT to rubber duck engineering problems with various libraries to be much more enjoyable and useful than going to Stack Overflow or mucking through overly verbose (or not verbose enough) docs.
I was happy that they'd be including documents on topics I found interesting, and things I wrote, in the word-adjacency training of their foundation model. That'd mean the model would be more useful to me. But the robots.txt stuff is weird. Maybe it's because I've had
User-agent: Killer AI
Disallow: /~superkuh/
in there for the last 10 years? /s

All of these “we’re not letting bots crawl our site!” posts make me feel like I’ve travelled back in time to when having web spiders crawl your site was a big deal. You can’t really prevent people from using tools wrong, and it’s odd that so many people care about this futile attempt to insulate themselves from stupid users that it made the front page of HN.
The worst part is, if an LLM has already read your docs and the interaction you fear your users having with LLMs comes to pass, they will have misapprehensions based on an old version of your docs, which will be even more wrong.
Allow me to prepare you for the future now, before you have to hear it from someone else: you will soon be getting email spam about LLM Algorithm Optimization. LLMAO firms are probably already organizing around the bend of time; we’re just a little before they become visible.
Documentation, even good documentation, usually only answers the question “What does this method/class/general idea do?” Really good docs will come with some examples of connecting A and B. But they will often not include examples of connecting A to E when you have to transform via P because of business requirements, and almost never tell you how to incorporate third-party libraries X, Y, and Z.
As an engineer, I can read the docs and figure out the bits, but having an LLM suggest some of the intermediary or glue steps, even if it's wrong sometimes, is a benefit I don't get from good documentation alone.
Generally speaking, though, you can also cut back on hallucination by asking a second LLM for a source, using good retrospection, and adding system messages to ensure that if the model doesn't know an answer it says so rather than making one up.
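As a rough illustration of that last point, here is a minimal sketch in the OpenAI chat-message format; the exact system-prompt wording is an assumption, and you'd pass the resulting list to whatever chat-completion call you use:

```python
# Sketch: wrap every user question with a "don't guess" system message.
# The prompt text is an assumption -- tune it for your use case.

DONT_GUESS = (
    "You are a careful assistant. If you do not know the answer, "
    "say 'I don't know' instead of inventing one, and cite a source "
    "for any factual claim you make."
)

def build_messages(question: str) -> list[dict]:
    """Prepend the don't-guess system prompt to a user question."""
    return [
        {"role": "system", "content": DONT_GUESS},
        {"role": "user", "content": question},
    ]

msgs = build_messages("What does the --frobnicate flag do?")
# msgs[0] is the system message; pass msgs as the `messages` argument
# of your chat-completion call.
```

This doesn't eliminate hallucination, but a standing instruction to decline rather than guess measurably changes the model's default behavior.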
Really, I think hallucination is the wrong word; bullshitting or gaslighting might be better. You're asking it something, and it thinks you want an answer, any answer, so if it doesn't know, it makes one up. Similar to people who confess to crimes they didn't commit because of distressing interrogation tactics.
My docs will include tutorial links at the top, and those tutorials will focus on accomplishing common tasks.
I believe that's a good jumping off point.
I sympathise. I've recently discovered that apparently I have enough Internet clout that ChatGPT knows about me. As in I can carefully construct a prompt and it will unmistakably reference me in particular. Don't even need to provide my name in the prompt.
Except, every fucking detail of what it "knows" about me is 100% false, and there's nothing I can do to correct it. I'm from a wrong country, I did things in my career that I absolutely didn't, etc.
Needless to say, I also blocked its crawler.
The guy who posted about blocking OpenAI, so it won't answer questions about his software wrongly (meaning incompletely), ignores that his documentation is inaccessible to many less technically literate people. LLMs help bridge the gap, getting newbies using software before they can understand the manuals.
Another thing is that GPT-4 can do live retrieval of websites in response to users' questions. I imagine that's a different crawler. Are they going to block that too?
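For what it's worth, OpenAI documents these as separate user agents: GPTBot for training crawls and ChatGPT-User for live browsing on a user's behalf, so blocking one doesn't block the other. A robots.txt that turns both away might look something like:

```
# Block OpenAI's training crawler
User-agent: GPTBot
Disallow: /

# Block the separate agent used for live browsing in ChatGPT
User-agent: ChatGPT-User
Disallow: /
```

Of course, this only deters crawlers that choose to honor robots.txt in the first place.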
You are correct, but if I demonstrate that I have done what I could to deny OpenAI access, and they still have it in their model, then I probably have more legal recourse against them.
I wonder if they can selectively block or remove specific content from the LLM. Personally, I think it's a fool's errand to even try.
AI chat is the new interface to search; I use AI-powered search engines for 90 percent of my searches. Sometimes I still go to the source website, so there's still a chance of search engines bringing sites revenue.
Personally, I think there should be a way for them to reward sites, in a Medium-like program where views, or uses as a resource, earn points toward a share of the month's revenue or something.
There is no way to know that, and even if it ends up being true, blocking OpenAI will likely make the problem worse; e.g., the AI's answers will be worse without access to the documentation.
For example, it gave me really wrong info when I asked about the latest version of Next.js. I asked it to double-check on their website at the URL, and it said sorry, here's the correct info, and all was good. I've never gotten wrong answers I couldn't have it fix, assuming it has internet access.
Not having that information in the system at all will only degrade the answers, not change who is asking.
I wish I could, but I bet most crawlers would just ignore robots.txt.