The issue here is people trying to use language models as deterministic problem solvers, rather than for what they actually excel at (semi-creative text generation).
https://youtu.be/MiqLoAZFRSE?si=tIQ_ya2tiMCymiAh&t=901
To quote from the slide:
* Probability e that any produced token takes us outside the set of correct answers
* Probability that answer of length n is correct
* P(correct) = (1-e)^n
* This diverges exponentially
* It's not fixable (without a major redesign)
https://futurist.com/2023/02/13/metas-yann-lecun-thoughts-la...
(Speaking of a "law" is rhetoric, but the idea is pretty clear.)
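To put rough numbers on the slide's formula, here is a small sketch. The error rates and answer lengths are made-up values for illustration, and the assumption that every token fails independently with the same probability e comes from the slide, not from any measurement.

```python
def p_correct(e: float, n: int) -> float:
    """Probability that an n-token answer stays correct, using the slide's
    formula P(correct) = (1 - e)^n.

    Assumes each token independently leaves the set of correct answers
    with the same probability e (the slide's simplification).
    """
    return (1.0 - e) ** n


# Illustrative (made-up) per-token error rates and answer lengths.
for e in (0.001, 0.01):
    for n in (10, 100, 1000):
        print(f"e={e:<6} n={n:<5} P(correct)={p_correct(e, n):.4f}")
```

Under this model, with e = 0.01 a 100-token answer is correct roughly 37% of the time, and a 1000-token answer almost never is.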
One way I explain it to people: Imagine a corporation that only has a PR department. Extremely good at generating press releases and answering reporter questions. But without the rest of the company, the output text isn't constrained by anything meaningful.
In an alternate universe, one where people understood this, people would be using LLMs for nothing serious, but a whole lot of fun little art projects.
you only need to solve fusion correctly once
No, it's not "hallucinating". It's not lying, or making things up, or anything like that either. It's spitting out data according to what triggers the underlying weights. If this were a regular JSON API endpoint, you wouldn't say the API is hallucinating, you'd say "This API is shit" because it's broken.
I'd argue the opposite: people think a person's mind is in "deep thought" when it's actually just a ball of statistics.
Yet, humans managed to do that (albeit over many generations)
Ergo, humans are not just balls of statistics
We all confabulate to some degree, as any neural system must, since no training data is stored perfectly.
Human "hallucinations" in contrast, are a particular kind of breakdown in our sensory feedback loops. Which is not a process LLMs even have.
Hallucinations occur when our internal sensory feedback loops overpower actual sensory input, resulting in a stream of false sensory experience/signals being generated and processed. The false running experience might still incorporate some actual sensory information or not.
When we dream, we are hallucinating - our sensory experience loop running free of our actual senses - to a productive purpose.
The reason our senses have feedback is so that we can use our interpretation of sensory input as cues to make interpreting the next moment's input easier. But it's important that our running interpretation can reset when new input significantly diverges from our expectations so it can quickly reorient.
(Not only is it important to revert to a raw input interpretation to ensure our running interpretation keeps up with actual context changes and corrects misinterpretations, but such resets signal that something novel or unexpected has happened, so they likely trigger learning.)
So "hallucinations" was an unfortunate and misleading choice of terminology.
A couple papers that use it in this way prior to LLMs:
- 2021: The Curious Case of Hallucinations in Neural Machine Translation (https://arxiv.org/abs/2104.06683)
- 2019: Identifying Fluently Inadequate Output in Neural and Statistical Machine Translation (https://aclanthology.org/W19-6623/)
Sees space shuttle "pff, it's just a pile of engineering."
I don't see any mention of weight release unfortunately.
They might have done that for o1, but the bigger change is the "runtime train of thought": once the model has received the prompt, and before giving a definitive answer, it "thinks" in words and readjusts at runtime.
At least that's my understanding from these two approaches, and if that's true, then it's not similar.
AFAIK, OpenAI has been doing reinforcement learning since the first version of ChatGPT and for every model since; that's why you can leave feedback in the UI in the first place.
> Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn’t working.
That's incredibly similar to this paper, which discusses the difficulty of finding a training method that guides the model to learn a self-correcting technique (in which subsequent attempts learn from and improve on previous attempts), instead of just "collapsing" into a mode of trying to get the answer right on the very first try.
Since OpenAI did not specify what exactly is in their reasoning trace, it's not clear what if any difference there is between the approaches. They could be vastly different, or they could be slight variations of each other. Without details from OpenAI, it's not currently possible to tell.
sorry as a practitioner i’m having trouble understanding what point/distinction you are trying to make
I don't think LLMs can self-correct without remembering their own training in some way.
(If someone tries this and it works, I’m quitting my phd and going back to camp counseling)
So if one were to improve an LLM along those lines, I believe it would be something like: 1) LLM is asked a question. 2) LLM comes up with an initial response. 3) LLM retrieves the related "learning" history behind that answer and related portions of the corpus. 4) LLM compares the initial answer with the richer set of information, looking for conflicts between the initial answer and the broader set, or "learning" choices that may be false. 5) LLM generates a better answer and gives it. 6) LLM incorporates this new "learning".
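As a rough sketch of that loop, under heavy assumptions: every callable passed in below (`generate`, `retrieve_sources`, `find_conflicts`, `revise`, `remember`) is a hypothetical placeholder, not an existing API, and nothing here says how an LLM would actually retrieve its own "learning" history.

```python
from typing import Callable


def answer_with_self_check(
    question: str,
    generate: Callable[[str], str],                        # hypothetical: draft an answer
    retrieve_sources: Callable[[str, str], list[str]],     # hypothetical: fetch related training material
    find_conflicts: Callable[[str, list[str]], list[str]], # hypothetical: compare draft against sources
    revise: Callable[[str, str, list[str]], str],          # hypothetical: produce a better answer
    remember: Callable[[str, str], None],                  # hypothetical: store the new "learning"
) -> str:
    """Draft, check the draft against retrieved sources, revise if needed, then record the result."""
    draft = generate(question)
    sources = retrieve_sources(question, draft)
    conflicts = find_conflicts(draft, sources)
    # Only revise when the check actually found something to fix.
    final = revise(question, draft, conflicts) if conflicts else draft
    remember(question, final)
    return final
```

The hard part is hidden inside `retrieve_sources`; the control flow itself is trivial.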
And that strikes me as a pretty reasonable long-term approach, if not one that fits within the constraints of the current gold rush.
From the abstract:
> ... To give LLMs such ability, we explore source-aware training -- a recipe that involves (i) training the LLM to associate unique source document identifiers with the knowledge in each document, followed by (ii) an instruction-tuning stage to teach the LLM to cite a supporting pretraining source when prompted.
See also: https://www.sciencedirect.com/science/article/pii/S157106452...
o1's training regime is described by the "strange particle" model in this formulation.
But even if it's embedded in a framework, say CS, the qualia fade in the background as time passes. E.g. like everybody in CS, I'm pretty much able to quote O() performance characteristics of a sizeable number of algorithms off the bat. If you ask me where I learned it, for that specific algorithm - that's long receded into the past.
When humans self-correct, the normal process isn't "gauging whether you know the thing" or the even more impressive feat of calling up if you heard it from a "less than reliable source". There's a fuzzy sense of "I don't fully understand it", and self-correction means re-verifying the info from a trusted source.
So, no, I don't think the qualia matter for recall as much as you think.
About the terms themselves, "confabulate" means "exchanging stories", while "hallucinate" is less clear but probably means "to err". In psychiatry, "hallucinate" was apparently introduced by Esquirol and "confabulate" by Wernicke and Bonhoeffer; neither concept seems to be akin to the substance of the phenomenon of "stochastic parrots bullshitting an unchecked narrative through formal plausibility".
See: "Hallucinations and related concepts - their conceptual background" - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4515540/
and: "The Confabulating Mind: How the Brain Creates Reality" - https://psychiatryonline.org/doi/full/10.1176/appi.ajp.2008....
We want to improve LLMs' ability to give correct answers to hard problems. One theory is that we can do that by training a "self-correcting" behavior into the models, where they take a wrong answer as input and improve it into a better/correct answer.
This has been explored previously, by trying to train this behavior with various reinforcement learning techniques where the reward is based on how good the "corrected" answer is. So far it hasn't worked well, and the trained behavior doesn't generalize well.
The thesis of the paper is that this is because when the model is presented with a training example of `Answer 1, Reasoning, Corrected Answer`, and a signal of "Make Corrected Answer Better" it actually has _two_ perfectly viable ways to do that. One is to improve `Reasoning, Corrected Answer`, which would yield a higher reward and is what we want. The other, just as valid solution, is to simply improve `Answer 1` and have `Corrected Answer` = `Answer 1`.
The latter is what existing research has shown happens, and why so far attempts to train the desired behavior have failed. The models just try to improve their answers, not their correcting behaviors. This paper's solution is to change the training regimen slightly to encourage the model to use the former approach. And thus, hopefully, get the model to actually train the desired behavior of correcting previous answers.
This is done by doing two stages of training. In the first stage, the model is forced (by KL divergence loss) to keep its first answers the same, while being rewarded for improving the second answer. This helps keep the model's distribution of initial answers the same, avoiding the issue later where the model doesn't see as many "wrong" answers because wrong answers were trained out of the model. But it helps initialize the "self correcting" behavior into the model.
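A toy sketch of what a stage-one objective of this kind could look like, assuming the first answer is drawn from a small discrete distribution. The function names, the `beta` weight, and the example distributions are all made up for illustration; this is not the paper's actual loss.

```python
import math


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) for two discrete distributions over the same outcomes.

    Assumes q is strictly positive wherever p is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def stage1_objective(second_reward: float,
                     first_dist: list[float],
                     base_first_dist: list[float],
                     beta: float = 1.0) -> float:
    """Reward for the second answer, minus a penalty for letting the
    first-answer distribution drift away from the base model's.

    beta is a hypothetical weight on the KL penalty.
    """
    return second_reward - beta * kl_divergence(first_dist, base_first_dist)


base = [0.5, 0.5]  # made-up base-model distribution over two first answers
print(stage1_objective(1.0, [0.5, 0.5], base))  # no drift, no penalty
print(stage1_objective(1.0, [0.9, 0.1], base))  # drifted first answer is penalized
```

The same second-answer reward scores lower once the first-answer distribution has moved, which is the pressure that keeps the first answers where they were.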
In the second stage the model is free to change the first answer, but they tweak the reward function to give higher rewards for "flips" (where answer 1 was bad, but answer 2 was good). So in this second stage it can use both strategies, improving its first answer or improving its self correcting, but it gets more rewards for the latter behavior. This seems to be a kind of refinement on the model, to improve things overall, while still keeping the self correcting behavior intact.
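One simple way such a tweak could be shaped, as a sketch only: binary correctness rewards plus a fixed bonus for a flip. The `flip_bonus` parameter and its value are assumptions for illustration, not the paper's reward function.

```python
def stage2_reward(first_correct: bool, second_correct: bool,
                  flip_bonus: float = 0.5) -> float:
    """Base reward for a correct final answer, plus a bonus when a wrong
    first answer was corrected (a "flip").

    flip_bonus is a hypothetical value chosen for illustration.
    """
    reward = 1.0 if second_correct else 0.0
    if second_correct and not first_correct:
        reward += flip_bonus
    return reward


# Enumerate all four (answer 1, answer 2) outcomes.
for first in (False, True):
    for second in (False, True):
        print(f"answer1_correct={first!s:<5} answer2_correct={second!s:<5} "
              f"reward={stage2_reward(first, second)}")
```

In this sketch a flip earns 1.5, while getting both answers right earns 1.0.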
Anyway, blah blah blah, metrics showing the technique working better and generalizing better.
Seems reasonable to me. I'd be a bit worried about, in Stage 2, the model learning to write _worse_ answers for Answer 1 so it can maximize the reward for flipping answers. So you'd need some kind of balancing to ensure Answer 1 doesn't get worse. Not sure if that's in their reward function or not, or if it's even a valid concern in practice.
Isn't improving "Answer 1" the whole point?
Your write-up makes it sound like "Answer 1" is an input, but isn't it an output from the LLM?
Sure it’s sorting through garbage more elegantly but it’s still garbage at the end of the day.
I was hoping the RL-like approach replaced the transformers-like approach or something but that’s a pipe dream.