Better HN

YeGoblynQueenne1y ago

I think the OP may be referring to this slide that Yann LeCun has presented on several occasions:

https://youtu.be/MiqLoAZFRSE?si=tIQ_ya2tiMCymiAh&t=901

To quote from the slide:

  * Probability e that any produced token takes us outside the set of correct answers
  * Probability that answer of length n is correct
  * P(correct) = (1-e)^n
  * This diverges exponentially
  * It's not fixable (without a major redesign)
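
To get a feel for how fast the slide's formula shrinks, here is a minimal Python sketch. It assumes, as the slide does, that the per-token error probability e is constant and independent across tokens; the function name `p_correct` and the value e = 0.01 are illustrative choices, not from the talk.

```python
def p_correct(e: float, n: int) -> float:
    """Probability that an n-token answer is still correct, assuming a
    constant, independent per-token error probability e (the slide's model)."""
    return (1.0 - e) ** n


if __name__ == "__main__":
    # e = 0.01 is an arbitrary illustrative value, not a measured one.
    for n in (10, 100, 1000):
        print(f"n={n:5d}  P(correct)={p_correct(0.01, n):.6f}")
```

Even with a 1% per-token error rate, P(correct) is already below 0.37 at 100 tokens and below 0.0001 at 1000 tokens under this model.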

21 comments · 7 top-level

roboboffin1y ago· 5 in thread

Is this similar to the effect that I have seen when you have two different LLMs talking to each other, they tend to descend into nonsense? A single error in one LLM's output then pushes the other LLM out of distribution.

A kind of oscillatory effect, where the train of tokens moves further and further out of the distribution of correct tokens.

vjerancrnjak1y ago

This is equivalent to the problem of maximum entropy Markov models and their application to sequence output.

After some point you’re conditioning your next decision on tokens that are severely out of the learned path and you don’t even see it’s that bad.

Usually this was fixed with cost-sensitive learning, or with increased sampling of weird distributions during learning and then making the model learn to correct the mistake.

Another approach was to have an inference algorithm that maximizes the output probability, but these algorithms are expensive (Viterbi and other dynamic programming methods).

Feature modeling in NNs somewhat allowed us to ignore these issues and get good performance but they will show up again.
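
To illustrate the difference between step-by-step decoding and an inference algorithm that maximizes the probability of the whole output, here is a small sketch of Viterbi decoding next to greedy decoding. Everything in it is an assumption made for illustration: a toy chain with two states, made-up `init`, `trans` and `emit` tables, and strictly positive probabilities so the logs are defined.

```python
import math


def viterbi(init, trans, emit):
    """Return the state path with the highest joint probability.

    init[s]     : P(first state = s)
    trans[p][s] : P(next state = s | previous state = p)
    emit[t][s]  : P(observation at step t | state s)
    All probabilities are assumed to be strictly positive.
    """
    states = range(len(init))
    score = [math.log(init[s]) + math.log(emit[0][s]) for s in states]
    back = []  # back[t-1][s] = best predecessor of state s at step t
    for t in range(1, len(emit)):
        pointers, new_score = [], []
        for s in states:
            best = max(states, key=lambda p: score[p] + math.log(trans[p][s]))
            pointers.append(best)
            new_score.append(
                score[best] + math.log(trans[best][s]) + math.log(emit[t][s])
            )
        back.append(pointers)
        score = new_score
    # Follow the back-pointers from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def greedy(init, trans, emit):
    """Pick the locally best state at every step, never revisiting a choice."""
    states = range(len(init))
    path = [max(states, key=lambda s: init[s] * emit[0][s])]
    for t in range(1, len(emit)):
        prev = path[-1]
        path.append(max(states, key=lambda s: trans[prev][s] * emit[t][s]))
    return path


# Toy numbers (assumed): state 0 looks better at first, but state 1 leads
# to a more probable sequence overall.
init = [0.55, 0.45]
trans = [[0.6, 0.4], [0.1, 0.9]]
emit = [[0.5, 0.5], [0.5, 0.5]]

if __name__ == "__main__":
    print("greedy :", greedy(init, trans, emit))
    print("viterbi:", viterbi(init, trans, emit))
```

With these numbers greedy decoding commits to state 0 and stays there, while the dynamic program finds the more probable path through state 1. The cost is that Viterbi has to score every state pair at every step, which is the expense mentioned above.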

diggan1y ago

> Is this similar to the effect that I have seen when you have two different LLMs talking to each other, they tend to descend into nonsense?

Is that really true? I'd expect that with high temperature values, but otherwise I don't see why this would happen. I've experimented with pitting the same model against itself and also different models against each other, but haven't come across that particular problem.

roboboffin1y ago

I think this is similar to this point: https://news.ycombinator.com/item?id=41601738

That the chain-of-thought diverges from accepted truth as an incorrect token pushes it into a line of thinking that is not true. The use of RL is there to train the LLM to implement strategies to bring it back from this. In effect, two LLMs talking to each other would behave the same way and would slowly diverge into nonsense. Maybe it is something that is not so much of a problem anymore.

Yann LeCun talks about how the correct way to fix this is to use an internally consistent model of the truth; then the chain-of-thought exists as a loop within that consistent model, meaning it cannot diverge. The language is a decoded output of this internal model resolution. He speaks about this here: https://www.youtube.com/watch?v=N09C6oUQX5M

Anyway, that's my understanding. I'm no expert.

reportgunner1y ago

Can you show examples? In any AI-related discussion there are only claims by people and never examples of the AI working well.


sharemywin1y ago

this is like the human game of telephone.

sharemywin1y ago· 4 in thread

Wouldn't this apply to all prediction machines that make errors?

Humans make bad predictions all the time but we still seem to manage to do some cool stuff here and there.

Part of an agent's architecture will be for it to minimize e and then ground the prediction loop against a reality check.

Making LLMs bigger gets you a lower e with scale of data and compute, but you will still need it to check against reality. Test-time compute will also play a role, as it can run through multiple scenarios and "search" for an answer.

YeGoblynQueenneOP1y ago

The difference between LLMs and other kinds of predictive models, or humans, is that those kinds of systems do not produce their output one token at a time, but all in one go, so their error basically stays constant. LeCun's argument is that LLM error increases with every cycle of appending a token to the last cycle's output. That's very specific to LLMs (or, well, to LLM-based chatbots to be more precise).

>> part of an agents architecture will be for it to minimize e and then ground the prediction loop against a reality check.

The problem is that web-scale LLMs can only realistically be trained to maximise the probability of the next token in a sequence, but not the factuality, correctness, truthfulness, etc. of the entire sequence. That's because web-scale data is not annotated with such properties. So they can't do a "reality check" because they don't know what "reality" is, only what text looks like.

The paper above uses an "oracle" instead, meaning they have a labelled dataset of correct answers. They can only train their RL approach because they have this source of truth. This kind of approach just doesn't scale as well as predicting the next token. It's really a supervised learning approach hiding behind RL.

psb2171y ago

"The difference between LLMs and other kinds of predictive models, or humans, is that those kinds of systems do not produce their output one token at a time, but all in one go, so their error basically stays constant." -- This is a big, unproven assumption. Any non-autoregressive model can be trivially converted to an autoregressive model by: (i) generating a full output sequence, (ii) removing all tokens except the first one, (iii) generating an output sequence of the full length minus one, conditioned on the first token, and repeating. This wraps the non-autoregressive model in an "MPC loop", thereby converting it to an autoregressive model where per-token error is no greater than that of the wrapped non-AR model. The explicit MPC planning behavior might reduce error per token compared to current naive applications of AR transformers, but the MPC-wrapped model is still an AR model, so the problem is not AR per se.
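
The wrapping in steps (i) to (iii) can be sketched in a few lines of Python. This is only an illustration under assumptions: the `NonARModel` interface (a callable that takes a prefix and a length and returns a full continuation in one shot) and the `toy_model` stand-in are hypothetical, not any real library's API.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: given the tokens fixed so far and the number of
# tokens still needed, return a complete continuation in one shot.
NonARModel = Callable[[Sequence[str], int], List[str]]


def mpc_decode(model: NonARModel, total_len: int) -> List[str]:
    """Wrap a non-autoregressive model in a replanning ("MPC") loop."""
    prefix: List[str] = []
    while len(prefix) < total_len:
        # (i) generate a full plan for everything that is still missing
        plan = model(prefix, total_len - len(prefix))
        # (ii) keep only the first token of that plan
        prefix.append(plan[0])
        # (iii) the next iteration replans, conditioned on the longer prefix
    return prefix


def toy_model(prefix: Sequence[str], length: int) -> List[str]:
    """Toy stand-in for a real model: counts upward from the prefix length."""
    start = len(prefix)
    return [str(start + i) for i in range(length)]


if __name__ == "__main__":
    print(mpc_decode(toy_model, 4))
```

The wrapper emits one token per iteration and conditions each step on its own previous output, so from the outside it behaves like an autoregressive model even though the wrapped model plans whole sequences.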

LeCun's argument has some decent points, e.g., allocating compute per token based solely on location within the sequence (due to increasing cost of attention ops for later locations) is indeed silly. However, the points about AR being unavoidably flawed due to exponential divergence from the true manifold are wrong and lazy. They're not wrong because AR models don't diverge; they're wrong because this sort of divergence is also present in other models.


throwawaymaths1y ago

No. Many prediction machines can give you a confidence value on the full outcome. By the nature of tokenization and causal inference (you build the output one token at a time, and the tokens are not really semantically connected except in the kv cache lookups, which are generally hidden from the user), the confidence values are thrown out in practice and even a weak confidence value would be hard to retrieve.

I don't think it's impossible to obtain content with confidence assessments with the transformer architecture but maybe not in the way it's done now (like maybe another layer on top).


slashdave1y ago

Humans self-correct (they can push the delete button)

littlestymaar1y ago· 3 in thread

> * P(correct) = (1-e)^n * This diverges exponentially

I don't get it, 1-e is between 0 and 1, so (1-e)^n converges to zero. Also, a probability cannot diverge since it's bounded by 1!

I think the argument is that 1 - e^n converges to 1, which is what the law is about.

vbarrielle1y ago

P(correct) converges to zero, so you get almost certainly incorrect, at an exponential rate. The original choice of terms is not the most rigorous, but the reasoning is sound (under the assumption that e is a constant).

hackerlight1y ago

P(correct) doesn't go down with token count if you have self-correction. It can actually go up with token count.

littlestymaar1y ago

Ah yes, I didn't notice that it was the probability of being correct. I misread it as the probability of being incorrect, since the claim was that it diverged.

atq21191y ago· 1 in thread

Doesn't that argument make the fundamentally incorrect assumption that the space of produced output sequences has pockets where all output sequences with a certain prefix are incorrect?

Design your output space in such a way that every prefix has a correct completion and this simplistic argument no longer applies. Humans do this in practice by saying "hold on, I was wrong, here's what's right".

Of course, there's still a question of whether you can get the probability mass of correct outputs large enough.
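
One way to make this concrete is a two-state toy model: at each token the sequence is either "on track" or "off track", it falls off track with probability e, and it recovers with probability r (the "hold on, I was wrong" move). This is a sketch under those assumptions only; the recovery probability r and the function name `p_on_track` are introduced here for illustration and are not part of LeCun's slide.

```python
def p_on_track(e: float, r: float, n: int) -> float:
    """Probability of being on track after n tokens in a two-state chain.

    e: probability of leaving the set of correct continuations at a token
    r: probability of recovering at a token once off track (assumed knob)
    """
    p = 1.0  # start on track
    for _ in range(n):
        # stay on track with prob (1 - e), or recover from off track with prob r
        p = p * (1.0 - e) + (1.0 - p) * r
    return p


if __name__ == "__main__":
    for r in (0.0, 0.01, 0.05):
        print(f"r={r:.2f}  P(on track after 1000 tokens)={p_on_track(0.01, r, 1000):.4f}")
```

With r = 0 this reduces to (1-e)^n and goes to zero. With any r > 0 it instead settles at r / (e + r), so whether the probability mass of correct outputs is large enough depends on how r compares to e.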

marcosdumay1y ago

How do you do this in something where the only memory is the last few things it said or heard?

ziofill1y ago· 1 in thread

Doesn’t this assume that the probability of a correct answer is iid? It can’t be that simple.

vbarrielle1y ago

Yes, the main flaw of this reasoning is supposing that e does not depend on previous output. I think this was a good approximation to characterize vanilla LLMs, but the kind of RL in this paper is done with the explicit goal of making e depend on prior output (and specifically to lower it given a long enough chain of thought).
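
A quick numerical sketch of why the constant-e assumption matters: if the per-token error is allowed to vary, P(correct) becomes the product of (1 - e_k) over the tokens, and that product need not go to zero. The decaying schedule e_k = 0.01 / k^2 below is purely an assumption chosen for illustration; it is not taken from the paper or from LeCun's slide.

```python
import math
from typing import Iterable


def p_correct_varying(errors: Iterable[float]) -> float:
    """P(correct) when token k has its own error probability e_k."""
    return math.prod(1.0 - e for e in errors)


n = 1000
# Constant error rate: the slide's model, (1 - e)^n.
constant = [0.01] * n
# Assumed schedule where the error rate shrinks as the output gets longer.
decaying = [0.01 / (k * k) for k in range(1, n + 1)]

if __name__ == "__main__":
    print("constant e :", p_correct_varying(constant))
    print("decaying e :", p_correct_varying(decaying))
```

The constant case collapses towards zero, while the decaying case stays above 0.98 for these numbers, because the e_k sum to a finite value.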

hackerlight1y ago

It's quite fitting that the topic of this thread is self-correction. Self-correction is a trivial existence proof that refutes what LeCun is saying, because all the LLM has to say is "I made a mistake, let me start again".

slashdave1y ago

Simplistic, since it assumes probabilities are uncorrelated, when they clearly aren't. Also, there are many ways of writing the correct solution to a problem (you do not need to replicate an exact sequence of tokens).
