> Using a commercially available 28-nanometer ASIC process technology, we have profiled (8, 1, 5, 5, 7) log ELMA as 0.96x the power of int8/32 multiply-add for a standalone processing element (PE).
> Extended to 16 bits this method uses 0.59x the power and 0.68x the area of IEEE 754 half-precision FMA
In other words, interesting but not earth-shattering. Great to see people working in this area, though!
Latency is also a lot less than an IEEE or posit floating point FMA. (This isn't in the paper: the results were reported at 500 MHz because the float FMA couldn't meet timing closure at 750 MHz or higher in a single cycle, and the paper had to be pretty short for the deadline, so I couldn't explore the whole frontier and show 1-cycle vs 2-cycle vs N-cycle pipelined implementations.)
The floating point tapering trick applied on top of this can help with the primary chip power problem, which is moving bits around: when your encoding matches your data distribution better, you can solve more problems with a smaller word size. Posits are a partial but not complete answer to this problem, if you are willing to spend more area/energy on the encoding/decoding (I briefly mention a learned encoding on this point).
A floating point implementation that is more efficient than typical integer math, but in which one can still do lots of interesting work, is very useful too (it provides an alternative for cases where you'd be tempted to use a wider fixed point representation for dynamic range, or a 16+ bit floating point format).
They allow you to get away with much smaller word sizes than would otherwise be the case while preserving dynamic range (and precision!). This is what the "word size"/tapering discussion in the blog post is about, and it is what makes 8 bit floating point work in this case as a drop-in replacement via round-to-nearest-even. Without either the Kulisch accumulator or entropy coding you have to change significantly more to get 8 bit FP to work, as you have to make much different tradeoffs between precision and dynamic range.
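A side note on the Kulisch accumulator, since it comes up a few times in this thread: the idea is that every product is converted to a wide fixed point value and summed exactly, so the only rounding in a dot product happens once, at the very end. A toy Python sketch of mine, with Python's arbitrary-precision ints standing in for the wide hardware register (the width here is illustrative, not the paper's design):

    FRAC = 40  # fractional bits of the fixed point frame (illustrative width)

    def to_fixed(x):
        # exact whenever x * 2**FRAC is an integer; fine for a demo
        return round(x * (1 << FRAC))

    def kulisch_dot(xs, ys):
        acc = 0  # arbitrary-precision int stands in for the wide register
        for x, y in zip(xs, ys):
            acc += to_fixed(x) * to_fixed(y)  # exact product, exact add
        # the one and only rounding: Python's int/int division is
        # correctly rounded to the nearest float
        return acc / (1 << (2 * FRAC))

No intermediate rounding also means no order-of-summation effects, which is part of why such small formats can hold up here.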
"Users of floating point are seldom concerned simultaneously with with loss of accuracy and with overflow" (or underflow for that matter) [1]
The paper and blog post consider 4-5 different techniques, not all of which need be combined, and some of which can be considered completely independently. The paper is a little bit gimmicky in that I combine all of them together, but that need not be the case.
(the log significand fraction map (LNS), posit/Huffman/other entropy encoding, Kulisch accumulation, and the ELMA hybrid log/linear multiply-add as a replacement for pure log domain arithmetic)
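To make the last item concrete, here is a rough Python sketch of how the hybrid log/linear multiply-accumulate fits together (bit widths and names are illustrative, not the actual RTL): the multiply is just a fixed point add of log values, the log-to-linear step is a small LUT (modeled with round() here), and accumulation is exact in a wide Kulisch-style integer.

    P = 5          # fractional bits of the log significand
    ALPHA = P + 1  # linear fractional bits after log->linear (the +1 matters; see below)
    ACC_FRAC = 32  # fractional bits of the linear accumulator (illustrative)

    def elma_mac(acc, a, b):
        # a, b: (sign, log2 magnitude scaled by 2**P) pairs; acc: wide int
        sign = a[0] * b[0]
        x = a[1] + b[1]                    # log domain multiply = fixed point add
        e, f = x >> P, x & ((1 << P) - 1)  # floor/remainder split, works for x < 0
        # log->linear LUT: significand 2**(f / 2**P) in [1, 2), ALPHA fractional bits
        lin = round(2.0 ** (f / (1 << P)) * (1 << ALPHA))
        shift = ACC_FRAC - ALPHA + e       # align to the accumulator's fixed point
        term = lin << shift if shift >= 0 else lin >> -shift  # truncates tiny values
        return acc + sign * term           # exact, Kulisch-style accumulation

Reading the accumulator back out (acc / 2**ACC_FRAC, then rounding to the output format) is the only place a rounding decision has to be made.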
[1] Morris, Tapered floating point: a new floating point representation (1971) https://ieeexplore.ieee.org/abstract/document/1671767
Interesting times...
This is independent of posits or any other encoding issue (i.e., it has nothing to do with posits specifically).
(I'm the author)
Do you think there might be an analytic trick you could use for larger ELMA numbers that yields semi-accurate results for machine learning purposes? Although to be honest, I still think that with a Kulisch FMA and an extra operation for fused exponent add (e.g., for softmax) you can cover most things you'd need 32 bits for with 8.
If you didn't want to interpolate with an accurate slope, and instead just used a linear interpolation with a slope of 1 (using the approximations 2^x ~= 1 + x and log_2(1 + x) ~= x for x in [0, 1)), then there's the issue that I discuss with the LUTs.
In the paper I mention that you need at least one more bit in the linear domain than in the log domain (i.e., the `alpha` parameter in the paper is 1 + the log significand fractional precision) for the values to be unique, such that log(linear(log_value)) == log_value, because the slope varies significantly from 1. If you instead just took the remainder bits and used them as a linear extension with a slope of 1 (i.e., pasted the remainder bits on the end, with `alpha` == the log significand fractional precision), then log(linear(log_value)) != log_value almost everywhere. Whether or not this is a real problem is debatable, but it probably has some effect on numerical stability if you don't preserve the identity.
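You can see both of these with a quick exhaustive check (a toy Python script of mine; P = 4 so it's small enough to enumerate):

    import math

    P = 4  # log significand fractional bits (small for an exhaustive check)

    def to_linear(f_int, alpha):
        # accurate log->linear: quantize 2**f - 1 to alpha fractional bits
        return round((2.0 ** (f_int / (1 << P)) - 1.0) * (1 << alpha))

    def to_log(l_int, alpha):
        # accurate linear->log: quantize log2(1 + l) back to P fractional bits
        return round(math.log2(1.0 + l_int / (1 << alpha)) * (1 << P))

    def failures(to_lin, alpha):
        return sum(to_log(to_lin(f), alpha) != f for f in range(1 << P))

    print("slope-1 paste, alpha = P  :", failures(lambda f: f, P))
    print("accurate map,  alpha = P  :", failures(lambda f: to_linear(f, P), P))
    print("accurate map,  alpha = P+1:", failures(lambda f: to_linear(f, P + 1), P + 1))

With alpha = P + 1 every value round-trips; with alpha = P, distinct log values collide in the linear domain and the identity breaks.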
Based on my tests I'm skeptical about training in 8 bits for general problems even with the exact linear addition; it doesn't work well. If you know what the behavior of the network should be, then you can tweak things enough to make it work (as people can do today with simulated quantization during training, or with int8 quantization for instance), but generally today when someone tries something new and it doesn't work, they tend to blame their architecture rather than the numerical behavior of IEEE 754 binary32 floating point. There are some things even today in ML (e.g., Poincaré embeddings) that can have issues even at 32 bits (in both dynamic range and precision). It would be a lot harder to know what the problem is in 8 bits when everything is under question if you don't know what the outcome should be.
This math type can and should also be used for many more things than neural network inference or training though.
Google announced something along these lines at their AI conference last September and released the video today on YouTube. Here's the link to the segment where their approach is discussed: https://www.youtube.com/watch?v=ot4RWfGTtOg&t=330s
It's been a number of years since I've implemented low-level arithmetic, but when you use fixed point, don't you usually choose a power of 2? I don't see why you'd need multiplication/division instead of bit shifters.
But when multiplying two arbitrary floating point numbers, the typical case is multiplying base-2 significands that are not powers of 2, like 1.01110110 by 1.10010101, and that requires a real multiplier.
General floating point addition, multiplication and division thus require fixed-point adders, multipliers and dividers on the significands.
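For example, multiplying the two significands above as 9-bit fixed point integers (1 integer bit, 8 fraction bits), in toy Python:

    a = 0b101110110  # 1.01110110 = 374/256
    b = 0b110010101  # 1.10010101 = 405/256

    prod = a * b     # full 9x9 integer multiply: 18 bits, 16 fraction bits

    # the product of two significands in [1, 2) lies in [1, 4), so
    # normalization is at most one right shift plus an exponent bump
    # (truncating the dropped bit here; a real FPU would round it)
    exp_adjust = 0
    if prod >> 17:   # result >= 2.0
        prod >>= 1
        exp_adjust = 1

    print(f"{prod / (1 << 16)} x 2^{exp_adjust}")  # ~1.1556 x 2^1 ~= 2.311

That 9x9 multiplier array is the cost you can't shift away.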