Training Language Models to Self-Correct via Reinforcement Learning

(arxiv.org)

230 points · weirdcat · 1y ago · 92 comments

47 comments · 7 top-level

optimalsolver · 1y ago · 12 in thread

Spoiler: You're never going to get rid of hallucinations in the autoregressive, next token prediction paradigm (aka LeCun's Law).

The issue here is people trying to use language models as deterministic problem solvers, rather than for what they actually excel at (semi-creative text generation).

plewd · 1y ago

Is LeCun's Law even a thing? Searching for it doesn't yield many results, except for a HN comment where it has a different definition. I guess it could be from some obscure paper, but with how poorly it's documented it seems weird to bring it up in this context.

YeGoblynQueenne · 1y ago

I think the OP may be referring to this slide that Yann LeCun has presented on several occasions:

https://youtu.be/MiqLoAZFRSE?si=tIQ_ya2tiMCymiAh&t=901

To quote from the slide:

  * Probability e that any produced token takes us outside the set of correct answers
  * Probability that answer of length n is correct
  * P(correct) = (1-e)^n
  * This diverges exponentially
  * It's not fixable (without a major redesign)
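To get a feel for the numbers behind the slide's formula, here is a minimal sketch. It takes the slide's own assumption at face value, namely that every token independently has the same probability e of leaving the set of correct answers. The function name and the sample values of e and n are illustrative and do not come from the talk.

```python
def p_correct(e: float, n: int) -> float:
    """Probability that an n-token answer is still correct.

    Assumes each token independently leaves the set of correct answers
    with probability e (the slide's assumption, and the part critics dispute).
    """
    if not 0.0 <= e <= 1.0:
        raise ValueError("e must be a probability in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    return (1.0 - e) ** n


if __name__ == "__main__":
    # Illustrative values only: a 1% per-token error rate.
    for n in (10, 100, 1000):
        print(f"e=0.01, n={n}: P(correct) = {p_correct(0.01, n):.4f}")
```

Under these assumptions P(correct) shrinks exponentially toward zero as n grows. Whether the independence assumption holds for real models is exactly what the replies argue about.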

vjerancrnjak · 1y ago

“Label bias” or “observation bias”: a phenomenon where going outside of the learned path leaves little room for error correction. LeCun talks about the lack of joint learning in LLMs.

whimsicalism · 1y ago

It’s a thing in that he said it, but it’s not an actual law and it has several obvious logical flaws. It applies equally to human utterances.

mdp2021 · 1y ago

A reference could be this:

https://futurist.com/2023/02/13/metas-yann-lecun-thoughts-la...

(Speaking of "law" is rhetoric, but the idea is pretty clear.)

shawnz · 1y ago

Does anyone here know whether anyone has tried something like feeding the perplexity of previous tokens back into the model, so that it has a way of knowing when it's going off the rails? Maybe it could be trained to start responding less confidently in those cases, reducing its desire to hallucinate.
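For reference, the signal being proposed is cheap to compute from the log-probabilities a model assigns to its own emitted tokens. The sketch below only computes a sliding-window perplexity and flags the first step where it spikes; how (or whether) to feed that back into the model is the open question in the comment and is not addressed here. The function names, window size, and threshold are all hypothetical.

```python
import math
from typing import List, Optional, Sequence


def running_perplexity(token_logprobs: Sequence[float], window: int = 8) -> List[float]:
    """Perplexity over a sliding window of the most recent tokens.

    token_logprobs[i] is the natural-log probability the model assigned to
    the token it emitted at step i. Perplexity is exp(-mean(logprob)).
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    out = []
    for i in range(len(token_logprobs)):
        recent = token_logprobs[max(0, i - window + 1): i + 1]
        out.append(math.exp(-sum(recent) / len(recent)))
    return out


def first_off_rails_step(
    token_logprobs: Sequence[float], window: int = 8, threshold: float = 20.0
) -> Optional[int]:
    """Index of the first step whose windowed perplexity exceeds threshold, else None."""
    for i, ppl in enumerate(running_perplexity(token_logprobs, window)):
        if ppl > threshold:
            return i
    return None
```

A low windowed perplexity means the model was confident about its recent tokens. A sudden rise is one crude proxy for "going off the rails", though confident errors would not show up in it at all.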

famouswaffles · 1y ago

Models already know when they are going off the rails. https://news.ycombinator.com/item?id=41504226. That's not the problem. The problem is that they don't care to tell you.

wpietri · 1y ago

Very nice to see this point being made.

One way I explain it to people: Imagine a corporation that only has a PR department. Extremely good at generating press releases and answering reporter questions. But without the rest of the company, the output text isn't constrained by anything meaningful.

In an alternate universe, one where people understood this, people would be using LLMs for nothing serious, but a whole lot of fun little art projects.

whimsicalism · 1y ago

LeCun's argument is seriously flawed. It is not at all a rigorous one and you should not make such sweeping statements based on nothing.

barbarr · 1y ago

At this point I just invert everything LeCun says about AI. Chances are he'll flip flop on his own statement a few months later anyways.

seydor · 1y ago

"never" is not itself a problem, people do the same

you only need to solve fusion correctly once

famouswaffles · 1y ago

If you're talking about label bias then you don't need to solve label bias to 'solve' hallucinations when the model has already learnt internally when it's bullshitting or going off the rails.

sensanaty · 1y ago · 11 in thread

I hate that the AI pundits have succeeded in popularizing the notion of "hallucination", anthropomorphizing these balls of statistics into something that seems like it's actually in some sort of deep thought process akin to a person's mind.

No, it's not "hallucinating". It's not lying, or making things up, or anything like that either. It's spitting out data according to what triggers the underlying weights. If this were a regular JSON API endpoint, you wouldn't say the API is hallucinating, you'd say "This API is shit" because it's broken.

Philpax · 1y ago

Do we really need to have this discussion in every thread about LLMs?

sensanaty · 1y ago

As long as AI-bros are pushing for making AI models seem like more than they are to pad their wallets, there'll be someone like me pointing out that, no, it's not "hallucinating", it's spitting bad data.

qudat · 1y ago

> I hate that the AI pundits have succeeded in popularizing the notion of "hallucination", anthropomorphizing these balls of statistics into something that seems like it's actually in some sort of deep thought process akin to a person's mind.

I'd argue the opposite: people think a person's mind is in "deep thought" when it's actually just a ball of statistics.

whiplash451 · 1y ago

Do you think that an LLM would spit out Latin and English if you trained it with homo sapiens mumbling?

Yet, humans managed to do that (albeit over many generations)

Ergo, humans are not just balls of statistics

Nevermark · 1y ago

The right word is "confabulation". Which is when we fill in missing information but may not be aware that we are doing it.

We all confabulate to some degree, as any neural system must, since no training data is stored perfectly.

Human "hallucinations" in contrast, are a particular kind of breakdown in our sensory feedback loops. Which is not a process LLMs even have.

Hallucinations occur when our internal sensory feedback loops overpower actual sensory input, resulting in a stream of false sensory experience/signals being generated and processed. The false running experience might still incorporate some actual sensory information or not.

When we dream, we are hallucinating - our sensory experience loop running free of our actual senses - to a productive purpose.

The reason our senses have feedback is so that we can use our interpretation of sensory input as cues to make interpreting the next moment's input easier. But it's important that our running interpretation can reset when new input significantly diverges from our expectations, so it can quickly reorient.

(Not only is it important to revert to a raw input interpretation to ensure our running interpretation keeps up with actual context changes and corrects misinterpretations, but such resets signal that something novel or unexpected has happened, so they likely trigger learning.)

So "hallucinations" was an unfortunate and misleading choice of terminology.

numeri · 1y ago

I've got bad news for you – that term was used in deep learning research well before LLMs came on the scene. It has nothing to do with pundits trying to popularize anything or trying to justify LLMs' shortcomings, it was just a label researchers gave to a phenomenon they were trying to study.

A couple papers that use it in this way prior to LLMs:

- 2021: The Curious Case of Hallucinations in Neural Machine Translation (https://arxiv.org/abs/2104.06683)

- 2019: Identifying Fluently Inadequate Output in Neural and Statistical Machine Translation (https://aclanthology.org/W19-6623/)

whimsicalism · 1y ago

can we make a siloed version of HN for your political faction? it’s tiresome reading these in every thread

hmmmhmmmhmmm · 1y ago

Maybe an evolutionary / structuralist lens is helpful here: terms that rapidly diffuse through discourse are those that people like most, and most people like to anthropomorphize, so "hallucination" has come to take on a new meaning, and we all (to different degrees) know what it is referring to.

bongodongobob · 1y ago

Give it a rest. Everything is statistics.

Sees space shuttle "pff, it's just a pile of engineering."

frakt0x90 · 1y ago

Yeah it's simply model error. All models from Linear Regression to LLMs have error. I guess because this type of error is in the form of deceptively reasonable human language, it gets a different moniker. It's also notably harder to quantify so it might warrant a different name.

seydor · 1y ago

do you really want to have a discussion about 'thought' and 'mind'? i don't

elcomet · 1y ago · 7 in thread

It's a similar approach to OpenAI's o1 model (it's not cited, but there's no available paper for o1).

I don't see any mention of weight release unfortunately.

diggan · 1y ago

I think this submission paper is talking about reinforcement learning as part of/after the main training, then the model does inference as normal.

They might have done that for o1, but the bigger change is the "runtime train of thought": once the model has received the prompt and before giving a definitive answer, it "thinks" with words and readjusts at runtime.

At least that's my understanding from these two approaches, and if that's true, then it's not similar.

AFAIK, OpenAI has been doing reinforcement learning since the first version of ChatGPT for all future models, that's why you can leave feedback in the UI in the first place.

numeri · 1y ago

OpenAI stated [1] that one of the breakthroughs needed for o1's train of thought to work was reinforcement learning to teach it to recover from faulty reasoning.

> Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn’t working.

That's incredibly similar to this paper, which discusses the difficulty of finding a training method that guides the model to learn a self-correcting technique (in which subsequent attempts learn from and improve on previous attempts), instead of just "collapsing" into a mode of trying to get the answer right with the very first try.

[1]: https://openai.com/index/learning-to-reason-with-llms/

josh-sematic · 1y ago

They are indeed similar and OpenAI did indeed use RL at training time in a way that has not been done before, as does this approach. Yes both also involve some additional inference-time generation, but the problem is that (at least as of now) you can't get standard LLMs to actually do well with extra inference-time generation unless you have a training process that uses RL to teach them to do so effectively. I'm working on a blog post to explain more about this aimed at HN-level audiences. Stay tuned!

nsagent · 1y ago

Both models generate an answer after multiple turns, where each turn has access to the outputs from a previous turn. Both refer to the chain of outputs as a trace.

Since OpenAI did not specify what exactly is in their reasoning trace, it's not clear what if any difference there is between the approaches. They could be vastly different, or they could be slight variations of each other. Without details from OpenAI, it's not currently possible to tell.

whimsicalism · 1y ago

you are describing the same thing?

sorry as a practitioner i’m having trouble understanding what point/distinction you are trying to make

WithinReason · 1y ago

how is it similar?

littlestymaar · 1y ago

https://x.com/karpathy/status/1821277264996352246

plaguuuuuu · 1y ago · 7 in thread

LLMs have no direct recollection of the qualia of their own training. That recollection is at least one major way that I self-correct: if I'm about to talk about something I know, I'll try and figure out how/why I know that thing and, in so doing, try to gauge whether I actually know that thing, if I'm hallucinating, or if I actually heard it from a less than reliable source etc.

I don't think LLMs can self-correct without remembering their own training in some way.

QuadmasterXLII · 1y ago

So you’re saying the solution is to prefix each training batch with a description of a sensory experience (You read the following in a Paris cafe in 1997. While you read, you have an excellent baguette and some boiled eggs, and over-roasted coffee. The woman one table over is wearing a beautiful blue hat) and then post-train the final model into recalling the setting where it read any piece of text, or failing to recall any experience when presented with text it didn’t read?

(If someone tries this and it works, I’m quitting my PhD and going back to camp counseling)

wpietri · 1y ago

I don't think that's what they're saying at all. They're talking not about qualia in the human sense, but specifically about "the qualia of their own training". That is, the corpus that LLMs "learn" from and the "experiences" of those texts that are generalized during the training process. Both the raw data and the memory of "learning" are discarded.

So if one were to improve an LLM along those lines, I believe it would be something like: 1) LLM is asked a question. 2) LLM comes up with an initial response. 3) LLM retrieves the related "learning" history behind that answer and related portions of the corpus. 4) LLM compares the initial answer with the richer set of information, looking for conflicts between the initial answer and the broader set, or "learning" choices that may be false. 5) LLM generates a better answer and gives it. 6) LLM incorporates this new "learning".

And that strikes me as a pretty reasonable long-term approach, if not one that fits within the constraints of the current gold rush.

numeri · 1y ago

Sort of like this? It does help: Source-Aware Training Enables Knowledge Attribution in Language Models (https://arxiv.org/abs/2404.01019)

From the abstract:

> ... To give LLMs such ability, we explore source-aware training -- a recipe that involves (i) training the LLM to associate unique source document identifiers with the knowledge in each document, followed by (ii) an instruction-tuning stage to teach the LLM to cite a supporting pretraining source when prompted.

triclops200 · 1y ago

Strong disagree: https://mypapers.nyc3.cdn.digitaloceanspaces.com/the_phenome...

See also: https://www.sciencedirect.com/science/article/pii/S157106452... o1's training regime is described by the "strange particle" model in this formulation

groby_b · 1y ago

I think you're overweighting the value of that in day-to-day use. As folks accumulate knowledge, a common pattern (especially for things not embedded in a framework - trivia-like data) is a "I have no idea why I'd know this, but the answer is X".

But even if it's embedded in a framework, say CS, the qualia fade in the background as time passes. E.g. like everybody in CS, I'm pretty much able to quote O() performance characteristics of a sizeable number of algorithms off the bat. If you ask me where I learned it, for that specific algorithm - that's long receded into the past.

When humans self-correct, the normal process isn't "gauging whether you know the thing" or the even more impressive feat of calling up if you heard it from a "less than reliable source". There's a fuzzy sense of "I don't fully understand it", and self-correction means re-verifying the info from a trusted source.

So, no, I don't think the qualia matter for recall as much as you think.

williamcotton · 1y ago

Unless you’re under the influence of something or having a severe mental health crisis you are not hallucinating, you’re confabulating.

mdp2021 · 1y ago

According to which philologist? In short: they are both weak terms, 'hallucination' and 'confabulation', and we are using them in this context very loosely (and it should be in the open).

About the terms themselves, "confabulate" means "exchanging stories", while "hallucinate" is less clear but probably means "to err". In psychiatry, "hallucinate" was apparently introduced by Esquirol and "confabulate" by Wernicke and Bonhoeffer; neither concept seems to be akin to the substance of the phenomenon of "stochastic parrots bullshitting an unchecked narrative through formal plausibility".

See: "Hallucinations and related concepts - their conceptual background" - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4515540/

and: "The Confabulating Mind: How the Brain Creates Reality" - https://psychiatryonline.org/doi/full/10.1176/appi.ajp.2008....

fpgaminer · 1y ago · 2 in thread

I found the paper a tad difficult to understand because it spends a lot of time circling around the main thesis instead of directly describing it. So, to the best of my understanding:

We want to improve LLMs' ability to give correct answers to hard problems. One theory is that we can do that by training a "Self Correcting" behavior into the models, where they can take a wrong answer as input and improve it to a better/correct answer.

This has been explored previously, by trying to train this behavior using various reinforcement learning techniques where the reward is based on how good the "corrected" answer is. So far it hasn't worked well, and the trained behavior doesn't generalize well.

The thesis of the paper is that this is because when the model is presented with a training example of `Answer 1, Reasoning, Corrected Answer`, and a signal of "Make Corrected Answer Better" it actually has _two_ perfectly viable ways to do that. One is to improve `Reasoning, Corrected Answer`, which would yield a higher reward and is what we want. The other, just as valid solution, is to simply improve `Answer 1` and have `Corrected Answer` = `Answer 1`.

The latter is what existing research has shown happens, and why so far attempts to train the desired behavior have failed. The models just try to improve their answers, not their correcting behaviors. This paper's solution is to change the training regimen slightly to encourage the model to use the former approach. And thus, hopefully, get the model to actually train the desired behavior of correcting previous answers.

This is done by doing two stages of training. In the first stage, the model is forced (by KL divergence loss) to keep its first answers the same, while being rewarded for improving the second answer. This helps keep the model's distribution of initial answers the same, avoiding the issue later where the model doesn't see as many "wrong" answers because wrong answers were trained out of the model. But it helps initialize the "self correcting" behavior into the model.
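A toy sketch of what such a first-stage objective could look like is below. It treats the model's first attempt as a small discrete distribution over candidate answers, which is a simplification made for illustration; real training would work on token-level distributions. The function names, the `beta` weight, and the direction of the KL term are assumptions, not details taken from the paper.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions over the same outcomes."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi <= 0.0:
                return math.inf  # p puts mass where q has none
            total += pi * math.log(pi / qi)
    return total


def stage1_objective(
    second_attempt_reward: float,
    first_attempt_policy: Sequence[float],
    first_attempt_base: Sequence[float],
    beta: float = 1.0,
) -> float:
    """Quantity to maximize in this toy stage 1.

    Rewards a good second attempt, and penalizes the trained model's
    first-attempt distribution for drifting away from the base model's.
    """
    drift = kl_divergence(first_attempt_policy, first_attempt_base)
    return second_attempt_reward - beta * drift
```

With a large enough `beta`, the cheapest way to raise this quantity is to improve the second attempt while leaving the first attempt alone, which matches the intent described above.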

In the second stage the model is free to change the first answer, but they tweak the reward function to give higher rewards for "flips" (where answer 1 was bad, but answer 2 was good). So in this second stage it can use both strategies, improving its first answer or improving its self correcting, but it gets more rewards for the latter behavior. This seems to be a kind of refinement on the model, to improve things overall, while still keeping the self correcting behavior intact.
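One simple way to express a reward that favors "flips" is to add a bonus proportional to the change in correctness between the two attempts. The sketch below assumes binary correctness and a hypothetical weight `alpha`; the exact shaping used in the paper may differ.

```python
def shaped_second_attempt_reward(
    first_correct: bool, second_correct: bool, alpha: float = 2.0
) -> float:
    """Toy second-attempt reward with a bonus for improving on attempt 1.

    Base reward is 1.0 for a correct second answer, else 0.0. The bonus
    alpha * (r2 - r1) is positive for wrong -> right flips, negative for
    right -> wrong flips, and zero when correctness does not change.
    """
    r1 = 1.0 if first_correct else 0.0
    r2 = 1.0 if second_correct else 0.0
    return r2 + alpha * (r2 - r1)
```

With `alpha = 2.0` the four cases come out as: wrong to right 3.0, right to right 1.0, wrong to wrong 0.0, right to wrong -2.0. A genuine correction therefore earns more than simply repeating a correct first answer.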

Anyway, blah blah blah, metrics showing the technique working better and generalizing better.

Seems reasonable to me. I'd be a bit worried about, in Stage 2, the model learning to write _worse_ answers for Answer 1 so it can maximize the reward for flipping answers. So you'd need some kind of balancing to ensure Answer 1 doesn't get worse. Not sure if that's in their reward function or not, or if it's even a valid concern in practice.

jasfi · 1y ago

Circling around the idea in a response describes what I see in a lot of LLM output quite well. I haven't tried o1 myself, but it does seem to fix that problem.

kick_in_the_dor · 1y ago

Can you explain what you mean by: "The other, just as valid solution, is to simply improve `Answer 1` and have `Corrected Answer` = `Answer 1`."

Isn't improving "Answer 1" the whole point?

Your write-up makes it sound like "Answer 1" is an input, but isn't it an output from the LLM?

textlapse · 1y ago · 1 in thread

Using an intelligent algorithm to guide a dumb non-intelligent next word predictor is still a non-intelligent algorithm at the end of the day.

Sure it’s sorting through garbage more elegantly but it’s still garbage at the end of the day.

I was hoping the RL-like approach replaced the transformers-like approach or something but that’s a pipe dream.

devoutsalsa · 1y ago

PolishedTurd.ai

ziofill · 1y ago

Is this effectively some sort of knowledge distillation?
