
matt_kantor4mo ago

> their claim may be something more complex, after reading the paper. I'm not sure that their result applies to the final output, or it's restricted to knowing the internal state at some pre-output layer.

It's the internal state; that's what they mean by "hidden activations".

If the claim were just about the output it'd be easy to falsify. For example, the prompts "What color is the sky? Answer in one word." and "What color is the "B" in "ROYGBIV"? Answer in one word." should both result in the same output ("Blue") from any reasonable LLM.


simiones4mo ago

Even that is not necessarily true. The output of the LLM is not "Blue". It is something like "probability of 'Blue' is 0.98131". And it may well be 0.98132 for the other question. Certainly they only talk about the internal state in one layer of the LLM; they don't need the values from the entire LLM.

joaohaas4mo ago

That's exactly what the quoted answer is saying though?

simiones4mo ago

The point I'm trying to make is this: the LLM output is a set of activations. Those are not "hidden" in any way: that is the plain result of running the LLM. Displaying the word "Blue" based on the LLM output is a separate step, one that the inference server performs, completely outside the scope of the LLM.

However, what's unclear to me from the paper is if it's enough to get these activations from the final output layer; or if you actually need some internal activations from a hidden layer deeper in the LLM, one that does require analyzing the internal state of the LLM.
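To make the distinction concrete, here is a toy sketch of the two places one could read activations from. The network, its weights, and the names `W_HIDDEN`, `W_OUT`, and `forward` are all made up for illustration and have nothing to do with the paper's models: `hidden` is an intermediate layer you only see by instrumenting the model, while `logits` is what the model returns anyway.

```python
import math

# Toy 2-layer network. All weights are arbitrary and purely illustrative.
W_HIDDEN = [[0.5, -1.0], [1.5, 0.25], [-0.75, 2.0]]  # 3 hidden units, 2 inputs
W_OUT = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]          # 2 output logits, 3 hidden units

def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def forward(x):
    """Return both the intermediate activations and the final-layer scores."""
    hidden = [math.tanh(v) for v in matvec(W_HIDDEN, x)]  # internal state
    logits = matvec(W_OUT, hidden)                        # plain model output
    return hidden, logits

hidden, logits = forward([0.2, -0.4])
print("hidden activations:", hidden)
print("output logits:", logits)
```

Which of the two the paper's result needs is exactly the open question here; the sketch only shows that they are different objects.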

BurningFrog4mo ago

There are also billions of possible Yes/No questions you can ask that won't get unique answers.

simiones4mo ago

The LLM proper will never answer "yes" or "no". It will answer something like "Yes - 99.75%; No - 0.0007%; Blue - 0.0000007%; This - 0.000031%", etc., for all possible tokens. It is this complete response that is apparently unique.

With regular LLM interactions, the inference server then takes this output and actually picks one of these responses using the probabilities. Obviously, that is a lossy and non-injective process.
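A minimal sketch of that last step, using a hypothetical four-token vocabulary and invented logits (none of these numbers come from the paper or from a real model): two different score vectors give two different probability distributions, yet picking the most likely token shows the same word for both.

```python
import math

# Hypothetical 4-token vocabulary; the logits below are invented for illustration.
VOCAB = ["Yes", "No", "Blue", "This"]

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(probs):
    """Return the single most likely token, discarding everything else."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return VOCAB[best]

logits_a = [9.0, 1.0, -3.0, 0.5]  # pretend output for prompt A
logits_b = [9.1, 1.0, -3.0, 0.5]  # pretend output for prompt B

probs_a = softmax(logits_a)
probs_b = softmax(logits_b)

print("distributions differ:", probs_a != probs_b)
print("displayed tokens:", greedy_pick(probs_a), greedy_pick(probs_b))
```

Greedy selection is used here only to keep the example deterministic; sampling from the probabilities throws away the same information.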

__MatrixMan__4mo ago

If the authors are correct (I'm not equipped to judge) then there must be additional output which is thrown away before the user is presented with their yes/no, which can be used to recover the prompt.

It would be pretty cool if this were true. One could annotate results with this metadata as a way of citing sources.
