I don't think you get it.
> Encoded truth: Recent work suggests that LLMs encode more truthfulness than previously understood, with certain tokens concentrating this information, which improves error detection. However, this encoding is complex and dataset-specific, hence limiting generalization. Notably, models may be encoding the correct answers internally despite generating errors, highlighting areas for targeted mitigation strategies.
Linking to this paper: https://arxiv.org/pdf/2410.02707
"Recent studies have demonstrated that LLMs’ internal states encode information regarding the truthfulness of their outputs, and that this information can be utilized to detect errors. In this work, we show that the internal representations of LLMs encode much more information about truthfulness than previously recognized."
This was already known years ago, by the way. The meme that LLMs just generate statistically plausible text is wrong and has been from the start. That's not how they work.
Did you read that paper? It doesn't support discarding this "meme" at all. More importantly, I don't think it adequately supports the claim that LLMs "know facts".
FFS, the actual paper is about training models on the LLM state to predict whether its actual output is correct. The interesting finding to them is that their models predict about a 75% chance of being correct even before the LLM starts generating text, that the conversational part of the answer has a low predicted chance of being correct, and that the "exact answer", a term they've coined, is usually where the chance the LLM is correct (according to their trained model) peaks.
What they have demonstrated is that you can build a model that looks at in-memory LLM state and guess, with about 75% accuracy, whether the LLM will produce the correct answer based on how the model reacts to the prompt. Even taking as a given (which you shouldn't in a science paper) that there's no trickery going on in the probe models, accidental or otherwise, this is perfectly congruent with the statement that LLMs only "generate statistically probable text in the context of their training corpus and the prompt".
Notably, why don't they demonstrate that you can predict whether a trained but completely unprompted model will "know" the answer? Why does the LLM have to process the conversation before you can predict, with a >90% chance, whether it will produce the answer? If the LLM stores facts in its weights, you should be able to demonstrate that completely at rest.
IMO, what they've actually done is produce "probe models" that can, 75% of the time, correctly predict whether an LLM will produce a certain token or set of tokens in its generation. That is coherent with an LLM being, broadly speaking, a model of how tokens relate to each other from the point of view of language. The main quibble in these discussions is that this doesn't constitute "knowing", IMO. LLMs are a model of language, not reality. That's why they are good at producing accurate language, and bad at producing accurate reality. That most facts are expressed in language doesn't mean language IS facts.
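For what it's worth, the kind of probe being argued about here can be sketched in a few lines. This is a toy, not the paper's method: the "hidden states" are synthetic vectors with a planted "truthful" direction, and the probe is a plain logistic regression trained by gradient descent on them.

```python
import numpy as np

# Toy sketch of a correctness probe. Real probes train on actual LLM
# activations; here the activations and labels are synthetic stand-ins.
rng = np.random.default_rng(0)

dim, n = 32, 500
truth_dir = rng.normal(size=dim)          # pretend this direction encodes "will be correct"
states = rng.normal(size=(n, dim))        # fake hidden-state vectors
labels = (states @ truth_dir > 0).astype(float)  # 1 = answer will be correct

# Logistic-regression probe trained with plain gradient descent.
w, b, lr = np.zeros(dim), 0.0, 0.5
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(states @ w + b)))
    w -= lr * (states.T @ (p - labels) / n)
    b -= lr * np.mean(p - labels)

preds = ((states @ w + b) > 0).astype(float)
accuracy = np.mean(preds == labels)
print(f"probe accuracy on synthetic states: {accuracy:.2f}")
```

Note the probe only reads the state; it says nothing about whether the underlying model "knows" anything, which is exactly the point of contention above.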
A question: Why don't LLMs produce garbage grammar when they "hallucinate"?
The answer to what? You have to ask a question to test whether the answer will be accurate, and that's the prompt. I don't understand this objection.
> If the LLM stores facts in its weights, you should be able to demonstrate that completely at rest.
Sure, with good enough interpretability systems, and those are being worked on. Anthropic can already locate which parts of the model fire on specific topics or themes and force them on or off by manipulating the activation vectors.
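The steering idea is simple enough to sketch. Assuming you have already identified a unit direction in activation space associated with some concept (the hard part, which interpretability work addresses), forcing it "on" or "off" is just projecting it out and re-injecting it at a chosen strength. Everything below is synthetic illustration, not Anthropic's actual tooling:

```python
import numpy as np

# Toy activation steering: given a known unit "feature direction", set the
# activation's component along it to an arbitrary strength.
rng = np.random.default_rng(1)

dim = 16
feature_dir = rng.normal(size=dim)
feature_dir /= np.linalg.norm(feature_dir)   # unit vector

activation = rng.normal(size=dim)            # fake residual-stream activation

def steer(act, direction, strength):
    """Remove the current component along `direction`, re-add it at `strength`."""
    coeff = act @ direction
    return act + (strength - coeff) * direction

suppressed = steer(activation, feature_dir, 0.0)   # feature forced off
amplified = steer(activation, feature_dir, 5.0)    # feature forced on, strongly

print(suppressed @ feature_dir)   # component driven to 0
print(amplified @ feature_dir)    # component driven to 5
```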
> A question: Why don't LLMs produce garbage grammar when they "hallucinate"?
Early models did.
One of these days, someone will figure out how to include that in the training/inference loop. It's probably important for communication and reasoning, considering a similar concept happens in my head (some sort of sparsity detection).