
0 points · nicklecompte · 2y ago · 0 comments

It's because floating-point arithmetic isn't deterministic, which becomes salient when (speaking loosely) the difference between the likelihoods of two different tokens is less than the precision of the FPU.
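To make the near-tie scenario concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the token names, the logit values, and the helpers `reduce_in_order` and `greedy_token` are made up for illustration and are not taken from any real model or library. It assumes the nondeterminism enters through the order in which the same terms are summed.

```python
def reduce_in_order(terms):
    """Sum the terms strictly left to right in double precision."""
    total = 0.0
    for t in terms:
        total += t
    return total


def greedy_token(logits):
    """temp=0 decoding: return the token with the highest logit."""
    return max(logits, key=logits.get)


# Hypothetical contributions to token A's logit. Mathematically they sum to 1.0,
# but 1e16 + 1.0 rounds back to 1e16 in double precision.
logit_b = 0.5

run1 = {"A": reduce_in_order([1e16, 1.0, -1e16]), "B": logit_b}  # A comes out as 0.0
run2 = {"A": reduce_in_order([1e16, -1e16, 1.0]), "B": logit_b}  # A comes out as 1.0

print(greedy_token(run1), greedy_token(run2))  # B A
```

The magnitudes are exaggerated so the flip is easy to see. The point is only that when two logits are close enough, a change in reduction order can change which token wins at temp=0.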

I am not sure to what extent this effect has been quantified.


6 comments · 3 top-level

memhole · 2y ago · 2 in thread

I’m so glad to see LLMs spark these conversations lately. It’s been a huge gripe of mine that we don’t question the underlying precision in other areas of AI/ML.

wongarsu · 2y ago

The last couple of years have been a steady journey of us discovering that in most neural networks precision only matters in a couple key places, and everything else can get away with astonishingly little.

We started out training everything in full (f32) or double precision (f64), then around 2020 everyone switched to half precision (f16) with some stuff in full precision, now we are starting to move to quarter precision, and the newest Nvidia card even supports f4 (eighth precision?). And then of course there's the 1.58bit LLM paper.

So there has been a steady stream of people questioning the underlying precision, and most of the time the answer they came back with was: there's more precision than we need, and a larger network with less precision is faster and better than a smaller network with more precision.

nicklecompte (OP) · 2y ago

To be clear there’s a distinction between the quality of the results and the determinism of the results. If a low-precision LLM is wildly stochastic but the variation is mostly linguistic rather than factual or deductive (e.g. coin tosses on synonyms or presenting independent facts in a different order), then there’s not really a contradiction.

AFAIK the determinism side of floating-point precision hasn’t been well-addressed, but it’s been a while since I skimmed those papers.

chessgecko · 2y ago · 1 in thread

Having played with this stuff, it's definitely spots in the expert buffers (the other comment in the thread has the link to an explanation) and not the extremely small differences in floating-point arithmetic. The effect of those floating-point differences is much, much smaller than that of any change in quantization, i.e. almost impossible to see from the outputs.

nicklecompte (OP) · 2y ago

I guess the root cause of my claim is that OpenAI won't tell us whether or not GPT-3.5 is an MoE model, and I assumed it wasn't. Since GPT-3.5 is clearly nondeterministic at temp=0, I believed the nondeterminism was due to FPU stuff, and this effect was amplified with GPT-4's MoE. But if GPT-3.5 is also MoE then that's just wrong.

What makes this especially tricky is that small models are truly 100% deterministic at temp=0 because the relative likelihoods are too coarse for FPU issues to be a factor. I had thought 3.5 was big enough that some of its token probabilities were too fine-grained for the FPU. But that's probably wrong.

On the other hand, it's not just GPT. There are currently floating-point difficulties in vLLM which significantly affect the determinism of any model run on it: https://github.com/vllm-project/vllm/issues/966 Note that a suggested fix is upcasting to float32. So it's possible that GPT-3.5 is using an especially low-precision float and introducing nondeterminism as a side effect of saving money on compute costs.
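As a rough illustration of why a wider accumulator is less sensitive to rounding, here is a small Python sketch that emulates float16 and float32 accumulation with the standard `struct` module (format codes `"e"` and `"f"`). It is not the vLLM code or the fix discussed in that issue. The helpers `round_to` and `accumulate` are hypothetical names, and the input is a toy list of ones.

```python
import struct


def round_to(fmt, x):
    """Round a Python float to the nearest value representable in the given struct format."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def accumulate(values, fmt):
    """Running sum where the accumulator is rounded to `fmt` after every addition."""
    total = 0.0
    for v in values:
        total = round_to(fmt, total + v)
    return total


ones = [1.0] * 4096

half_sum = accumulate(ones, "e")    # float16 accumulator
single_sum = accumulate(ones, "f")  # float32 accumulator

# float16 has an 11-bit significand, so 2048 + 1 rounds back to 2048 and the sum stalls.
print(half_sum, single_sum)  # 2048.0 4096.0
```

The narrower the accumulator, the larger the rounding error per step, so any run-to-run difference in operation order has more room to change the final value.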

Sadly I do not have the money[1] to actually run a test to falsify any of this. It seems like this would be a good little research project.

[1] Or the time, or the motivation :) But this stuff is expensive.

exe34 · 2y ago

They can be made deterministic on CPU, but not on GPU (unless you want to give up on the speedup). With floating-point numbers, operations like addition are not associative: a + (b + c) is not always the same as (a + b) + c. So on CPU, you can make sure the order is always the same and the result is deterministic. On GPU, the order is not guaranteed, and thus the output is not deterministic.
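The non-associativity is easy to check in plain Python, which uses IEEE 754 doubles. The sketch below is only an illustration on the CPU: the random data and the helper `ordered_sum` are made up, and shuffling the list stands in for a reduction order that is not guaranteed.

```python
import math
import random

a, b, c = 0.1, 0.2, 0.3
left_first = (a + b) + c   # 0.6000000000000001
right_first = a + (b + c)  # 0.6
print(left_first == right_first)  # False


def ordered_sum(values):
    """Add values strictly left to right, like a single-threaded loop."""
    total = 0.0
    for v in values:
        total += v
    return total


rng = random.Random(0)  # fixed seed; the data is illustrative only
values = [rng.uniform(-1.0, 1.0) * 10 ** rng.randint(-8, 8) for _ in range(1000)]
shuffled = values[:]
rng.shuffle(shuffled)

# Same order in, same bits out: a fixed order is reproducible.
assert ordered_sum(values) == ordered_sum(values)

# A different order may round differently; the gap is usually tiny but often nonzero.
print(ordered_sum(values) - ordered_sum(shuffled))

# math.fsum returns the correctly rounded sum, so it does not depend on the order.
assert math.fsum(values) == math.fsum(shuffled)
```

Fixing the order (or using an exactly rounded sum such as `math.fsum`) buys reproducibility at the cost of the parallel speedup, which is the trade-off described above.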

This is because of the
