Hallucinations are not unexpected in LLMs and cannot be fixed by correcting an error in the code. Instead, they are a fundamental property of the computing paradigm that was chosen, one that has to be worked around.
It's closer to network lag than to a missing bounds check: an undesirable characteristic, but one we knew about when we chose to build a networked application. We'll do our best to mitigate it to acceptable levels, but it's certainly not a bug; it's just a fact of the paradigm.
It all depends on whose specification you're assessing the "bugginess" against: the inference code as written, the research paper, the colloquial understanding in technical circles, or how the product is pitched and presented to users.
And this is why I feel it's so important to fix the way we talk about hallucinations. Engineers need to be extremely clear with product owners, salespeople, and other business folks about the inherent limitations of LLMs: about the fact that certain qualities, like factual accuracy, may asymptotically approach 100% but will never reach it; about the fact that even getting asymptotically close to 100% is extremely (most likely prohibitively) expensive. And once they've chosen a non-zero failure rate, they have to be clear about what the consequences of that failure rate are.
Before engineers can communicate that to the business side, they have to have it straight in their own heads. Then they can set expectations with the business and ensure it understands that once you've chosen a failure rate, individual 'hallucinations' can't be treated as bugs to troubleshoot. Instead you need an industrial-style QC process that measures trends and reacts only when your process produces results outside a set of well-defined tolerances.
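To make that concrete, here's a minimal sketch of such a QC loop. All the numbers are made up, and the audited labels stand in for whatever human or automated review process you actually run:

```python
MAX_RATE = 0.03  # the non-zero failure rate the business signed off on

def hallucination_rate(labels):
    """labels: 1 if an audited response was judged a hallucination, else 0."""
    return sum(labels) / len(labels)

def qc_alert(daily_batches):
    """Flag only the days whose measured rate exceeds the agreed tolerance."""
    return [
        (day, rate)
        for day, labels in enumerate(daily_batches)
        if (rate := hallucination_rate(labels)) > MAX_RATE
    ]

# Two days of 100 audited samples each: a 2% day, then a 5% day.
history = [[0] * 98 + [1] * 2, [0] * 95 + [1] * 5]
print(qc_alert(history))  # [(1, 0.05)] -- react to the trend, not the ticket
```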
(Yes, I'm aware that many organizations are so thoroughly broken that engineering has no influence over what business tells customers. But those businesses are hopeless anyway, and many businesses do listen to their engineers.)
You are wrong here - my company can fix individual responses by adding specific, targeted data to the RAG prompt. So a JIRA ticket for a wrong response can be fixed in 2 days.
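For what it's worth, a rough sketch of that kind of targeted fix. The names here (`overrides`, `Index`, `retrieve`) and the keyword-overlap search are stand-ins, not any real RAG library:

```python
overrides = {
    # keyed by a phrase from the offending question in the ticket
    "return policy for opened items":
        "Opened items may be returned within 14 days with a receipt.",
}

class Index:
    """Stand-in for a vector store; real code would do embedding search."""
    def __init__(self, docs):
        self.docs = docs

    def search(self, question, k=3):
        # naive keyword overlap instead of embeddings, for the sketch
        words = set(question.lower().split())
        scored = sorted(self.docs,
                        key=lambda d: -len(set(d.lower().split()) & words))
        return scored[:k]

def retrieve(question, index):
    """Check curated corrections first, then fall back to normal search."""
    for pattern, correction in overrides.items():
        if pattern in question.lower():
            return [correction]       # pin the vetted text into the context
    return index.search(question)

def build_prompt(question, index):
    context = "\n".join(retrieve(question, index))
    return f"Answer using only this context:\n{context}\n\nQ: {question}"

index = Index(["General returns FAQ ...", "Shipping times ..."])
print(build_prompt("What is the return policy for opened items?", index))
```

Note this patches the specific known-bad responses that get reported; it doesn't change what happens on questions nobody has filed a ticket for yet.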
Because it gives you two ends from which to approach the business challenge. You can improve the fitness—the functionality itself. But you can also adjust the purpose—what people expect it to do.
I think a lot of the concerns about LLMs come down to unrealistic expectations: oracles, Google killers, etc.
Google has problems finding and surfacing good info. LLMs are way better at that… but they err in the opposite direction. They are great at surfacing fake info too! So they need to be thought of (marketed) in a different way.
Their promise needs to be better aligned with how the technology actually works. Which is why it’s helpful to emphasize that “hallucinations” are a fundamental attribute, not an easily fixed mistake.
People also blithely trust other humans even against all evidence that they're untrustworthy. Some things just aren't fixable.
Spec: Do X in situation Y.
Correctness bug: It doesn't do X in situation Y.
Fitness-for-purpose (FFP) bug: It does X in situation Y, but, knowing this, you decide you don't actually want it to do X in situation Y.
Hallucination is an FFP bug.
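A toy illustration of the distinction, with a made-up function:

```python
def dedupe(items):
    """Spec: return `items` with duplicates removed."""
    return list(set(items))

# Correctness bug: the function fails its spec.
#   If dedupe([1, 1, 2]) returned [1, 1, 2], that would be a correctness bug.
#
# Fitness-for-purpose bug: the function meets its spec, but the spec turns
# out not to be what you wanted.
#   dedupe([3, 1, 2]) can legally return [1, 2, 3]; the spec says nothing
#   about order. If you needed order preserved, the behavior is "to spec"
#   and still wrong for your purpose.
```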
If you ask a math question and get a random incorrect equation, it's not unfit for purpose, just incorrect.
FFP would be returning misinformation from the model, which is not a hallucination per se. Or the model misunderstanding the question and returning a correct answer to a related question.
Except for art generators.
This is not true for many applications of LLMs. Generating legal documents, for example: it is not acceptable for it to hallucinate laws that do not exist. Recipes: it is not acceptable for it to tell people to make pizza with glue, or to mix up mustard gas to remove stains. Or, in my case: it is not acceptable for a developer-assistant AI to hallucinate into existence libraries that are not real and that not only will not solve my problem, but will cause me to lose hours of my day trying to figure out where to get said library.
If pneumatic tires failed to hold air as often as LLMs hallucinate, we wouldn't use them. That's not to say a tire can't blow out; sure they can, it happens all the time. It's about the rate of failure. Or hell, to bring it back to your metaphor, if your network experienced high latency at the rate most LLMs hallucinate, I might actually suggest you not network computers, or at the very least, I'd say you should be replaced at whatever company you work for, since you're clearly unqualified to manage a network.
The important thing isn't that the rate of failure be below a specific threshold before the technology is adopted anywhere, the important thing is that engineers working on this technology have an understanding of the fundamental limitations of the computing paradigm and design accordingly—up to and including telling leadership that LLMs are a fundamentally inappropriate tool for the job.
Also, how do you mitigate a lawyer writing whatever they want (aka: hallucinating) when writing legal documents? Double-checking??
Of course they are supposed to double- and triple- and multiply-check as they think and write, documentation and references at hand, _exactly_ as you are supposed to do anywhere from trivial informal contexts up to critical ones - exactly the same: you check the details and the whole, multiple times.
Legislators pass incoherent legislation every day. "Hallucination" is the de facto standard for human behavior (and for law).
Maybe you bump your car because you stopped an inch too far. Perhaps it's because the tires on your car were from a lower performing but still in spec batch. Those tires weren't defective or bugged, but instead the product of a system with statistical outputs (manufacturing variation) rather than software-like deterministic ones (binary yes/no output).
Which goes back to OP's initial point: SWE types aren't used to working in fully statistical output environments.
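A toy model of that tire example, with invented numbers:

```python
import random

SPEC_MIN, SPEC_MAX = 95.0, 105.0          # acceptance band, arbitrary units

def make_batch(mean, sd=1.0, n=100_000):
    """Simulate one manufacturing batch with normal variation."""
    return [random.gauss(mean, sd) for _ in range(n)]

def in_spec_rate(batch):
    return sum(SPEC_MIN <= x <= SPEC_MAX for x in batch) / len(batch)

good_batch = make_batch(mean=102.0)       # higher-performing batch
weak_batch = make_batch(mean=97.5)        # lower-performing, still in spec

# Neither batch is "defective or bugged"; they differ only statistically.
# QC asks one question: does the out-of-spec rate stay within tolerance?
print(in_spec_rate(good_batch), in_spec_rate(weak_batch))
```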
> If not all bugs can be fixed it seems better to toss the entire concept of a "bug" out the window as no longer useful for describing the behavior of software.
Then those are bugs you can't fix. It is just lying to yourself to call them not bugs ... if they are bugs.