(80% of the time) The answer to the expression 2 + 2 is 4
(15% of the time) The answer to the expression 2 + 2 is Four
(5% of the time) The answer to the expression 2 + 2 is certainly
(and then, 95% of the time) The answer to the expression 2 + 2 is certainly Four
This is how you can ask ChatGPT the same question a few times and get different words each time, all still correct.
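A minimal sketch of that sampling behavior in Python, using the made-up probabilities from the example above (not real model output):

```python
import random

# Toy next-token distribution after "The answer to the expression 2 + 2 is"
# (invented numbers from the example above, not real model output)
first_token = {"4": 0.80, "Four": 0.15, "certainly": 0.05}
# Continuation distribution once "certainly" has been sampled
after_certainly = {"Four": 0.95, "4": 0.05}

def sample(dist):
    """Draw one token from a {token: probability} dict."""
    return random.choices(list(dist), weights=list(dist.values()))[0]

for _ in range(5):
    token = sample(first_token)
    if token == "certainly":
        token += " " + sample(after_certainly)
    print("The answer to the expression 2 + 2 is", token)
```

Run it a few times and you get "4", "Four", or "certainly Four": different words, each a correct answer.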
I think a more accurate explanation is that increasing temperature doesn't increase the probability of a truly incorrect answer in proportion to the temperature increase (because the same correct answer can be represented by many different token sequences), but if the model assigns non-zero probability to any incorrect output after softmax (which it almost certainly does), raising the temperature does increase the chance of that incorrect output being sampled.
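To see the mechanism, here's a small sketch of temperature-scaled softmax; the tokens and logit values are invented for illustration:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the next token after "2 + 2 ="
tokens = ["4", "Four", "5"]   # "5" is wrong but gets non-zero probability
logits = [5.0, 3.0, 0.5]

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{tok}={p:.4f}" for tok, p in zip(tokens, probs)))
```

The logit for "5" never changes, but its sampling probability climbs from roughly zero at T=0.2 to about 7% at T=2.0: higher temperature flattens the distribution and hands more mass to every non-zero output, wrong ones included.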
So maybe a prompt like "It's a well-known fact in the smith community that 2 + 2 =" could realistically yield "5" as the next token.
For example, if you ask a model what 0^0 is, the highest-probability output may be "1", which is incorrect. The next most probable outputs may be words like "although", "because", "due to", "unfortunately", etc., as the model prepares to explain to the user that the value of the expression is undefined. Because there are many more ways to express and explain the undefined answer than there are to express a naively incorrect answer, the correct answer's probability mass is split across more tokens. So even if, e.g., the softmax value of "1" is 0.1 while "although" + "because" + "due to" + "unfortunately" together exceed 0.3, at a temperature of 0, "1" gets chosen. At slightly higher temperatures, sampling across all outputs would increase the probability of a correct answer.
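Putting rough numbers on that (again invented to match the figures above, not real model output):

```python
# Toy post-softmax probabilities for the next token after "What is 0^0?"
probs = {
    "1": 0.10,             # naively incorrect, yet the single most likely token
    "Although": 0.09,      # openers for an "it's undefined" explanation:
    "Because": 0.09,       # each one is less likely than "1" on its own,
    "Due to": 0.08,        # but together they carry more probability mass
    "Unfortunately": 0.07,
    # (the remaining mass is spread over many other tokens)
}

greedy_pick = max(probs, key=probs.get)
explanation_mass = sum(p for tok, p in probs.items() if tok != "1")

print("Greedy decoding (T=0) picks:", greedy_pick)          # "1"
print(f"Mass on explanation openers: {explanation_mass:.2f}")  # 0.33 > 0.10
```

Greedy decoding commits to the 0.10 token even though three times as much mass points toward a correct explanation; sampling at a modest temperature lets that larger mass win more often.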
So it's true that increasing the temperature increases the probability that the model outputs tokens other than the single most likely one, but that might be what you want. Temperature purely controls the distribution of tokens, not "answers".
This is where the semi-ambiguity of human language helps a lot.
There are multiple acceptable ways to answer with "4", so the output just needs to be close enough to the desired outcome to work. There isn't a single point that has to be hit precisely, but a broader region of space that's relatively easy to land in.
The hefty tolerances, redundancies, & general lossiness of human language act as a metaphorical gravity well, dragging LLMs toward the most probable answer.
> 2 + 2
You really couldn't come up with an actual example of something that would be dangerous? I'd appreciate one, because I'm not seeing any reason to believe that an "output beyond the most likely one" would ever end up being dangerous, as in harming someone or putting someone's life at risk.
Thanks.