Floats are not distributed evenly across the number line. The number of floats between 0 and 1 is the same as the number of floats between 1 and 3, then between 3 and 7 and so on. Quantising well to integers means that you take this sensitivity into account since the spacing between integers is always the same.
No, the number of floats between 0 and 1 is (approximately) the same as the number of floats between 1 and positive infinity. And this is the correct way for it to work: 1/x has roughly the same range and precision as x, so you don't need (as many) stupid obfuscatory algebraic transforms in your formulas to keep your intermediate values from over- or under-flowing.
Output:
Count between 0 and 1: 1065353215
Count between 1 and +inf: 1073741824
Ratio: 1.0
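For reference, counts like the ones above can be reproduced without any looping, because the bit patterns of positive IEEE 754 floats sort in the same order as their values. This is only a sketch, not the script that produced the output: the helper name float32_bits is made up for illustration, and it assumes 32-bit floats, counting (0, 1) exclusively and [1, +inf) with 1.0 included.

```python
import struct

def float32_bits(x: float) -> int:
    """Return the raw 32-bit pattern of x encoded as an IEEE 754 float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

one = float32_bits(1.0)            # 0x3F800000
inf = float32_bits(float("inf"))   # 0x7F800000

# Positive floats strictly between 0 and 1 use patterns 1 .. one - 1
# (this includes the subnormals).
count_0_to_1 = one - 1

# Floats in [1, +inf) use patterns one .. inf - 1.
count_1_to_inf = inf - one

ratio = count_0_to_1 / count_1_to_inf

print("Count between 0 and 1:", count_0_to_1)
print("Count between 1 and +inf:", count_1_to_inf)
print("Ratio:", ratio)
```

The unrounded ratio comes out at about 0.992, which would be consistent with the "Ratio: 1.0" line if that figure was rounded to one decimal place.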
But a more theoretical approach will probably be needed to see if the same ratio exists for 64 bit floats.
So you have xxxxx E xxx as an example of a 5 bit mantissa and 3 bit exponent.
You have 2^5 floating point numbers for each possible exponent.
So no, you're wrong. For exponent 0 you have 2^5 mantissas, and for exponents 1, 10 and 11 (binary) you have the same number. Exponent 0b (0d) therefore contains the same number of possible mantissas as 1b (1d), 10b (2d) and 11b (3d). Which means that there are as many mantissas in [0,1) as there are in [1,3).
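To make the toy format concrete, here is a small sketch that enumerates every value of a 5-bit-mantissa, 3-bit-exponent float and groups them by exponent. The decoding rule (implicit leading 1, exponent bias of 3, no subnormals, infinities or NaNs) is an assumption made purely for illustration; the comment above does not specify one, and a different rule would move the interval boundaries.

```python
MANT_BITS = 5
EXP_BITS = 3
BIAS = 3  # assumed bias, chosen only for this illustration

def decode(exponent: int, mantissa: int) -> float:
    """Decode one toy float: (1 + m / 2^5) * 2^(e - BIAS)."""
    return (1 + mantissa / 2**MANT_BITS) * 2.0 ** (exponent - BIAS)

# All values of the format, grouped by exponent field.
values_by_exponent = {
    e: [decode(e, m) for m in range(2**MANT_BITS)]
    for e in range(2**EXP_BITS)
}

# Gap between neighboring values within each exponent.
spacing_by_exponent = {
    e: vals[1] - vals[0] for e, vals in values_by_exponent.items()
}

for e, vals in values_by_exponent.items():
    print(
        f"exponent {e:03b}: {len(vals)} values "
        f"in [{vals[0]}, {2 * vals[0]}), spacing {spacing_by_exponent[e]}"
    )
```

Under these assumptions every exponent holds the same 2^5 values, while the interval it covers and the gap between neighbors both double with each step of the exponent.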
Intuitively, I like the idea of asymmetric scales as well. Treating all values as equal seems like it's probably wasteful in terms of memory. It would be interesting to see where typical values fall statistically in an LLM. I bet it's nowhere near a random distribution of values.
This advantage is paid for with larger distances between neighboring numbers inside the subranges, because the number of representable numbers is the same for floating-point and fixed-point, but the floating-point numbers are spread over a wider dynamic range.
Depending on the application, either the greater dynamic range or the smaller rounding errors matter more, and that determines the choice between floating-point and integers (actually fixed-point). When floating-point numbers are chosen, one can allocate more or fewer bits to the exponent depending on whether the dynamic range or the rounding errors are more important.
For ML/AI applications, it appears that the dynamic range is much more important than the rounding errors. That has led to the use of the Google BF16 format, which has a large dynamic range and large rounding errors, instead of IEEE FP16, which has a smaller dynamic range and smaller rounding errors. FP16 remains preferable for other applications, like graphics (mainly for color component encoding), where the rounding errors of BF16 would be unacceptable.
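The trade-off can be seen on two sample values. The sketch below rounds a number to FP16 with the standard library's struct "e" format and emulates BF16 by keeping the top 16 bits of the float32 encoding with round-to-nearest-even. The helper names are invented for illustration, and the BF16 emulation is an assumption that ignores NaN and overflow corner cases; it is not how any particular ML framework implements the conversion.

```python
import struct

def round_to_fp16(x: float) -> float:
    """Round x to IEEE half precision (5 exponent bits, 10 mantissa bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def round_to_bf16(x: float) -> float:
    """Emulate BF16 (8 exponent bits, 7 mantissa bits) via the float32 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def fits_fp16(x: float) -> bool:
    """True if x can be packed as a finite FP16 value."""
    try:
        round_to_fp16(x)
        return True
    except (OverflowError, struct.error):
        return False

small = 0.1
fp16_error = abs(round_to_fp16(small) - small)
bf16_error = abs(round_to_bf16(small) - small)
print("rounding error at 0.1: FP16", fp16_error, "BF16", bf16_error)

large = 1e10
print("1e10 fits in FP16:", fits_fp16(large))
print("1e10 in BF16:", round_to_bf16(large))
```

FP16 lands closer to 0.1 than BF16 does, but it cannot hold 1e10 at all, while BF16 keeps it with a coarse relative error.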
In the parent article, there is a figure that is confusing, because in it the dynamic range appears to be the difference between the positive number and the negative number with the greatest absolute values.
This is very wrong. The dynamic range is the ratio between the (strictly) positive numbers with the greatest and the smallest absolute values. The dynamic range can be computed by subtraction only on a logarithmic scale, which is why in practice it is frequently expressed in decibels.
For instance, for INT8, the dynamic range is not (+127)-(-127)=254 as it appears in that figure, but it is 127 divided by 1, i.e. 127. Similarly, for FP16, the dynamic range is not (+65504)-(-65504)=131008 as it appears in that figure, but it is 65504 divided by 2^(-14), i.e. 1073217536, a much larger value, which demonstrates the advantage in dynamic range of FP16 over INT16 (the dynamic range of the latter is 32767).
With a dynamic range defined like in that figure, there would be no advantage for floating-point or for BF16, because with an implicit scale factor taken into account, one could make that "dynamic range" as great as desired for any integer format, including INT8. Nothing would prevent the use of an implicit scale factor of one billion, making the "dynamic range" of INT8 254 billion, or of an implicit scale factor of 10^100, resulting in a "dynamic range" of INT8 much larger than that of FP32.
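The figures quoted above are easy to check. This is a minimal sketch: the function names are made up, the FP16 minimum is the smallest normal value 2^(-14) as used in the comment (subnormals are ignored), and the decibel conversion assumes the 20*log10 amplitude convention.

```python
import math

def dynamic_range(largest: float, smallest_positive: float) -> float:
    """Ratio between the largest and the smallest strictly positive magnitude."""
    return largest / smallest_positive

def in_decibels(ratio: float) -> float:
    # Assumption: amplitude convention, 20 * log10(ratio).
    return 20 * math.log10(ratio)

formats = {
    "INT8": (127, 1),
    "INT16": (32767, 1),
    "FP16": (65504, 2.0 ** -14),  # smallest normal value, subnormals ignored
}

ranges = {name: dynamic_range(hi, lo) for name, (hi, lo) in formats.items()}
for name, ratio in ranges.items():
    print(f"{name}: ratio {ratio:.0f}, about {in_decibels(ratio):.1f} dB")

# An implicit scale factor multiplies both ends of the range,
# so the ratio (the actual dynamic range) does not change.
scale = 1e9
scaled_int8 = dynamic_range(127 * scale, 1 * scale)
print("INT8 with a scale factor of one billion:", scaled_int8)
```

The last line is the point of the argument about scale factors: the subtraction-based "range" grows to 254 billion, but the ratio stays at 127.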
A sibling commenter gave a better, more detailed answer, but I will share a succinct tl;dr in case that’s more what you’re after.
INT32 maximum value: 2,147,483,647
FP32 maximum value: 3.4028235 x 10^38
If you need to represent every integer between 10,000,000 and 1,000,000,000 exactly, then INT32 will handle it fine, but FP32 won’t (it only represents every integer exactly up to 2^24 = 16,777,216). But if instead you need to represent a range of values from 1.00 to 35,003,986,674,493.00 and it’s OK to be only approximately accurate, FP32 has you covered.
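A quick way to see both halves of that claim is to round-trip values through a 32-bit float. This is just a sketch; the helper name to_float32 is invented for illustration and uses the standard library's struct module to do the rounding.

```python
import struct

def to_float32(x: float) -> float:
    """Round x to the nearest IEEE 754 float32 and return it as a Python float."""
    return struct.unpack("<f", struct.pack("<f", float(x)))[0]

# 2^24 is the last point up to which every integer is exactly representable.
exact = to_float32(16_777_216)
inexact = to_float32(16_777_217)
print(exact == 16_777_216)    # still exact
print(inexact == 16_777_217)  # no longer exact

# A much larger value survives only approximately.
big = 35_003_986_674_493
approx = to_float32(big)
relative_error = abs(approx - big) / big
print(approx, relative_error)
```

The absolute error on the large value can be sizeable, but the relative error stays within the float32 rounding bound of 2^(-24).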
It uses asymmetric quantization and does so layer by layer, such that each layer is processed independently before continuing to the next.
GPTQ also supports symmetric quantization and almost everyone uses it. The problem with GPTQ asymmetric quantization is that all popular implementations have a bug [1] where all zero/bias values of 0 are reset to 1 during packing (out of 16 possible biases in 4-bit quantization), leading to quite a large loss in quality. Interestingly, it seems that people initially observed that symmetric quantization worked better than asymmetric quantization (which is very counter-intuitive, but made GPTQ symmetric quantization far more popular) and only discovered later that it is due to a bug.
[1] https://notes.danieldk.eu/ML/Formats/GPTQ#Packing+integers
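For readers unfamiliar with the two schemes being compared, here is a generic round-to-nearest sketch of 4-bit symmetric versus asymmetric quantization of one group of weights. It is not GPTQ itself (no error compensation and no packing, so the bug described in [1] is not reproduced), and the function names and sample weights are invented for illustration.

```python
def quantize_asymmetric(weights, bits=4):
    """Map [min, max] onto 0 .. 2^bits - 1 using a scale and a zero point."""
    qmax = 2**bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0          # avoid a zero scale
    zero = min(qmax, max(0, round(-lo / scale)))
    q = [min(qmax, max(0, round(w / scale) + zero)) for w in weights]
    return q, scale, zero

def quantize_symmetric(weights, bits=4):
    """Map [-max|w|, +max|w|] onto -(2^(bits-1) - 1) .. 2^(bits-1) - 1."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [min(qmax, max(-qmax, round(w / scale))) for w in weights]
    return q, scale, 0

def dequantize(q, scale, zero):
    return [(v - zero) * scale for v in q]

def max_error(weights, q, scale, zero):
    restored = dequantize(q, scale, zero)
    return max(abs(w - r) for w, r in zip(weights, restored))

# Sample weights skewed toward positive values (made up for this example).
weights = [-0.3, 0.0, 0.22, 0.7, 1.2]

q_asym, s_asym, z_asym = quantize_asymmetric(weights)
q_sym, s_sym, z_sym = quantize_symmetric(weights)

asym_err = max_error(weights, q_asym, s_asym, z_asym)
sym_err = max_error(weights, q_sym, s_sym, z_sym)
print("asymmetric:", q_asym, "zero point", z_asym, "max error", asym_err)
print("symmetric: ", q_sym, "max error", sym_err)
```

On skewed data like this the asymmetric variant wastes fewer levels on values that never occur, which is why the reported advantage of symmetric quantization looked counter-intuitive.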
See also: Hadamard transform, Walsh functions.