undefined | Better HN

0 pointsimiric10mo ago0 comments

> If you define "hallucinations" to mean "any mistakes at all" then yes, a compiler won't catch them for you.

That's not quite my definition. If we're judging these tools by the same criteria we use to judge human programmers, then mistakes and bugs should be acceptable. I'm fine with this to a certain extent, even though these tools are being marketed as having superhuman abilities. But the problem is that LLMs create an entirely unique class of issues that most humans don't. Using nonexistent APIs is just one symptom of it. Like I mentioned in the comment below, they might hallucinate requirements that were never specified, or fixes for bugs that don't exist, all the while producing code that compiles and runs without errors.

But let's assume that we narrow down the definition of hallucination to usage of nonexistent APIs. Your proposed solution is to feed the error back to the LLM. Great, but can you guarantee that the proposed fix will also not contain hallucinations? As I also mentioned, in most occasions when I've done this the LLM simply produces more hallucinated code, and I get stuck in a neverending loop where the only solution is for me to dig into the code and fix the issue myself. So the LLM simply wastes my time in these cases.

> The new Phoenix.new coding agent actively tests the web applications it is writing using a headless browser

That's great, but can you trust that it will cover all real world usage scenarios, test edge cases and failure scenarios, and do so accurately? Tests are code as well, and it can have the same issues as application code.

I'm sure that we can continue to make these tools more useful by working around these issues and using better adjacent tooling as mitigation. But the fundamental problem of hallucinations still needs to be solved. Mainly because it affects tasks other than code generation, where it's much more difficult to deal with.

0 comments

simonw10mo ago

> Your proposed solution is to feed the error back to the LLM. Great, but can you guarantee that the proposed fix will also not contain hallucinations?

You do it in a loop. Keep looping and fixing until the code runs.

> but can you trust that it will cover all real world usage scenarios, test edge cases and failure scenarios, and do so accurately?

Absolutely not. Most of my blog entry about why code hallucinations aren't as dangerous as other mistakes talks about that as being the real problem humans need to solve when using LLMs to write code: https://simonwillison.net/2025/Mar/2/hallucinations-in-code/...

From the start of that article:

> The real risk from using LLMs for code is that they’ll make mistakes that aren’t instantly caught by the language compiler or interpreter. And these happen all the time!

imiricOP10mo ago

> You do it in a loop. Keep looping and fixing until the code runs.

I suppose, but we shouldn't need to brute force our tools into working...

And as you point out in that article, some of these issues won't be caught by the compiler or interpreter. Where we disagree is that I think most of these are introduced by the inherent problem of hallucination, not because the model is not large enough, or wasn't trained on the right data. I.e. I don't think this is something we can engineer our way out of, but that solving it will require changes at the architectural level.

Yes, ultimately we still need existing software engineering practices to confirm that the output is correct, but in the age of "vibe coding", when people are deploying software that they've barely inspected or tested (many of whom don't even have the skills or experience to do so!), built by tools that can produce thousands of lines of code in an instant, all of those practices go out the window. This should scare all of us, since it will inevitably make the average quality of software go down.

I reckon that the amount of experienced programmers who will actually go through that effort is miniscule. Realistically, reviewing and testing code requires a great deal of effort, and is often not the fun part of the job. If these tools can't be relied on to help me with tasks I sometimes don't want to do, and if I have to babysit them at every step of the way, then how much more productive are they making me? There's a large disconnect between how they are being promoted and how they're actually used in the real world.

Anyway, it's clear that we have different views on this topic, and we use LLMs very differently, but thanks for the discussion. I appreciate the work you're doing, and your content is always informative. Cheers!

j / k navigate · click thread line to collapse