Chuck Moore of Forth fame demonstrated taking the value, say 1.6 multiplied by 4.1 and doing all the intermediate calculations via integers (16 * 41) and then formatting the output by putting the decimal point back in the "right place"; this worked as long as the range of floating point values was within a range that multiplying by 10 didn't exceed 65536 (16 bit integers), for instance. For embedded chips where for instance, you have an analog reading with 10 bits precision to quickly compute multiple times per second, this worked well.
I also recall talking many years ago with a Microsoft engineer who had worked with the Microsoft Streets and Trips program (https://archive.org/details/3135521376_qq_CD1 for a screenshot) and that they too had managed to fit what would normally be floating point numbers and the needed calculations into some kind of packed integer format with only the precision that was actually needed, that was faster on the CPUs of the day as well as more easily compressed to fit on the CDROM.
Proper finance related code should use it, but in my experience in that industry it doesn't seem very common unless you're running mainframes.
Funnily enough, I've seen a lot more fixed point arithmetic in software rasterizers than anywhere else. FreeType, GDI, WPF, WARP (D3D11 reference rasterizer) all use it heavily.
I'm working on control code one an ARM cortex-M4f. I wrote it all in fixed point because I don't trust an FPU to be faster, and I also like to have a 32bit accumulator instead of 24bit. I recently converted it all to floating point since we have the M4f part (f indicate FPU), and it's a little slower now. I did get to remove some limit checking since I can rely on the calculations being inside the limits but it's still a little slower than my fixed point implementation.
The IEEE 754 floating point standard is a very well thought out standard that is suitable for representing money as-is. If you have requirements such as compliance/legal/regulatory needs that mandate a minimum precision, then you can either opt to use decimal floating point or use binary floating point where you adjust the decimal place up to whatever legally required precision you are required to handle.
For example the common complaint about binary floating point is that $1.10 can't be represented exactly so you should instead use a fixed integer representation in terms of cents and represent it as 110. But if your requirement is to be able to represent values exactly to the penny, then you can simply do the same thing but using a floating point to represent cents and represent $1.10 as the floating point 110.0. The fixed integer representation conveys almost no benefit over the floating point representation, and once you need to work with and mix currencies that are significantly out of proportion to one another, you begin to really appreciate the nuances and work that went into IEEE 754 for taking into account a great deal of corner cases that a fixed integer representation will absolutely and spectacularly fail to handle.
On occasion I've seen people who didn't know any better use floats. One time I had to fix errors of single satoshis in a customer's database because their developer used 1.0 to represent 1 BTC.
https://arxiv.org/html/2306.11975v4
Interesting AF.
But performance often matters, so you trade off precision for performance. I think people are wrong to dismiss floating point numbers in favor of fixed point arithmetic, and I've seen plenty of fixed point arithmetic that has failed spectacularly because people think if you use it, it magically solves all your problems...
Whatever approach you take other than going all in with arbitrary precision fractions, you will need to have a good fundamental understanding of your representation and its trade-offs. For me personally I use floating point binary and adjust the decimal point so I can exactly represent any value to 6 decimal places. It's a good trade-off between performance, flexibility, and precision.
It's also what the main Bitcoin implementation uses.
: organize; gforth
Gforth 0.7.3, Copyright (C) 1995-2008 Free Software Foundation, Inc.
Gforth comes with ABSOLUTELY NO WARRANTY; for details type `license'
Type `bye' to exit
: %* d>s 10 m*/ ; : %. <# # [char] . hold #s #> type ; ok
1.6 4.1 %* %. 6.5 ok
Note that the correct answer is 6.56, so the result 6.5 is incorrectly rounded. Here's how this works.(If you're not familiar with Forth, Forth's syntax is that words are separated by spaces. "ok" is the prompt, ":" defines a subroutine terminated with ";", and you use RPN, passing parameters and receiving results on a stack.)
In standard Forth, putting a decimal point in a number makes it a double-precision number, occupying two cells on the stack, and in most Forths the number of digits after the decimal point is stored (until the next number) in the non-standardized variable dpl, decimal point location. Here I've just decided that all my numbers are going to have one decimal place. This means that after a multiplication I need to divide by 10, so I define a subroutine called %* to do this operation. (Addition and subtraction can use the standard d+ and d- subroutines; I didn't implement division, but it would need to pre-multiply the dividend by the scale factor 10.)
"%*" is defined in terms of the standard subroutine m*/, which multiplies a double-precision number by a single-precision number and divides the result by a divisor, and the standard subroutine d>s, which converts a double-precision number to a single-precision number. (There's probably a better way to do %*. I'm no Forth expert.)
I also need to define a way to print out such numbers, so I define a subroutine called "%.", using Forth's so-called "pictured numeric output", which prints out an unsigned double-precision number inserting a decimal point in the right place with "hold", after printing out the least significant digit. (In PNO we write the format backwards, starting from the least significant digit.) The call to "type" types out the formatted number from the hold space used by PNO.
Then I invoked %* on 1.6 and 4.1 and %. on its result, and it printed out 6.5 before giving me the "ok" prompt.
If you want to adapt this to use two decimal places:
: %* d>s 100 m*/ ; : %. <# # # [char] . hold #s #> type ; redefined %* redefined %. ok
1.60 4.10 %* %. 6.56 ok
Note, however, that a fixed-point multiplication still involves a multiplication, requiring potentially many additions, not just an addition. The paper, which I haven't read yet, is about how to approximate a floating-point multiplication by using an addition, presumably because in multiplication you add the mantissas, or maybe using a table of logarithms.Forth's approach to decimal numbers was a clever hack for the 01970s and 01980s on sub-MIPS machines with 8-bit and 16-bit ALUs, where you didn't want to be invoking 32-bit arithmetic casually, and you didn't have floating-point hardware. Probably on 32-bit machines it was already the wrong approach (a double-precision number on a 32-bit Forth is 64 bits, which is about 19 decimal digits) and clearly it is on 64-bit machines, where you don't even get out of the first 64-bit word until that many digits:
0 1 %. 184467440737095516.16 ok
GForth and other modern standard Forths do support floating-point, but for backward compatibility, they treat input with decimal points as double-precision integers.It this were about convolutional nets then optimizing compute would be a much bigger deal. Transformers are lightweight on compute and heavy on memory. The weakest link in the chain is fetching the model weights into the cores. The 95% and 80% energy reductions cited are for the multiplication operations in isolation, not for the entire inference process.
On fp8, the estimated gate count of fp8 multipliers is 296 vs. 157 with their technique, so the power gain on the multipliers will be much lower (50% would be a more reasonable estimation), but again for fp8 the additions in the dot products are a large part of the operations.
Overall, its really disingenuous to claim 80% power gain and small drop in accuracy, when the power gain is only for fp32 operations and the small drop in accuracy is only for fp8 operators. They don't analyze the accuracy drop in fp32, and don't present the power saved for fp8 dot product.
Every major inference provider other than Hyperbolic Labs runs Llama 3.1 405b at FP8, FWIW (e.g. Together, Fireworks, Lepton), so to compare against FP32 is misleading to say the least. Even Hyperbolic runs it at BF16.
Pretraining is typically done in FP32, although some labs (e.g. Character AI, RIP) apparently train in INT8: https://research.character.ai/optimizing-inference/
So the multiplications+additions are done on fp8/int8/int4/whatever (when the hardware support those operators of course) and accumulated in a fp32 or similar, and only the final accumulator is multiplied by the rescale factor in fp32.
http://tom7.org/grad/murphy2023grad.pdf
Also in video form: https://www.youtube.com/watch?v=Ae9EKCyI1xU
GradIEEEnt half decent - https://news.ycombinator.com/item?id=35780921 - May 2023 (32 comments)
I am asking not to dismiss it, I genuinely feel I don't understand logarithms on a fundamental level (of logic gates etc.). If multiplication can be replaced with table lookup and addition, then there has to be a circuit that gives you difficult addition and easy multiplication, or any combination of those tradeoffs.
This part is easy and anyone can implement hardware to do this. The tricky bit is always the staying in log space while doing accumulations, especially ones across a large range.
Seeing the name de Vries in the first paragraph didn't help my sense of confidence either.
http://blog.zorinaq.com/bitcoin-electricity-consumption/
It's a long read to go over multiple years worth of posts and comments but gives you a measure of the man.
(from footnote in method section)
What about over time? If this L-Mul (the matrix operation based on integer addition) operation proved to be much more energy efficient and became popular, would new hardware be created that was faster?
We've known about neural architectures since the 70s, but we couldn't build them big enough to be actually useful until the advent of the GPU.
Similarly, the LLM breakthrough was because someone decided it was worth spending millions of dollars to train one. Efficiency improvements lower that barrier for all future development (or alternatively, allow us to build even bigger models for the same cost.)
When there is an order of magnitude improvement in hardware, the AI labs will figure out an algorithm to best take advantage of it.
Nvidia funds most research around LLMs, and they also fund other companies that fund other research. If transformers were to use addition and remova all usage of floating point multiplication, there's a good chance the gpu would no longer be needed, or in the least, cheaper ones would be good enough. If that were to happen, no one would need nvidia anymore and their trillion dollar empire would start to crumble.
University labs get free gpus from nvidia -> University labs don't want to do research that would make said gpus obsolete because nvidia won't like that.
If this were to be true, it would mean that we are stuck on an inificient research path due to corporate greed. Imagine if this really was the next best thing, and we just don't explore it more because the ruling corporation doesn't want to lose their market cap.
Hopefully I'm wrong.
https://www.youtube.com/watch?v=gofI47kfD28
A lot of their work was published but went by unnoticed. But in fact the majority of their performance increase in new architecture is resulting from this work.
Reading between the lines, it seems that they came to the conclusion that a 4 bit representation with a group exponent ("FP4") is the most efficient representation of weights for inference. Reducing the number of bits in weights has the biggest impact on LLMs inference, since they are mostly memory bound. At these low bit numbers, the impact of using multiplication or other approaches is not really significiant anymore.
(multiplying a 4 bit wight with a larger activation is effectively 4 additions, barely more than what the paper proposes)
Given LLM performance seems to scale with their size, this would result in more powerful models, which would grow the applicability, use and importance of AI, which would in turn grow the use and importance of Nvidia's hardware.
So this theory doesn't really stack up for me.
Next gen nvidia chips would have more adders and fewer multipliers.
The CUDA tooling and ecosystem, VLSI architecture, organizational prowess… all matter at multiple orders of magnitude more.
You mean you have a conspiracy theory.
Why wouldn't other companies that buy Nvidia GPU fund these researches? It would greatly cut their cost.