Addition is all you need for energy-efficient language models

(arxiv.org)

334 points · InvisibleUp · 1y ago · 126 comments


77 comments · 19 top-level

visarga · 1y ago · 15 in thread

> can potentially reduce 95% energy cost by elementwise floating point tensor multiplications and 80% energy cost of dot products

If this were about convolutional nets then optimizing compute would be a much bigger deal. Transformers are lightweight on compute and heavy on memory. The weakest link in the chain is fetching the model weights into the cores. The 95% and 80% energy reductions cited are for the multiplication operations in isolation, not for the entire inference process.

woadwarrior01 · 1y ago

Pre-fill (even in the single batch case) and multi-batch decoding are still compute dominated. The oft repeated trope of "decoder only transformer inference is bottle-necked on memory bandwidth" is only strictly true in the single batch decoding case, because you're mostly doing vector matrix mults when the batch size is one.
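A rough way to see the batch-size effect described here is to count FLOPs per byte moved for a single linear layer. This is a back-of-the-envelope sketch, not a profile of any real GPU: the function name, the 4096-wide layer, and the 2-byte values are all assumptions chosen for illustration.

```python
def arithmetic_intensity(batch, d_in, d_out, bytes_per_value=2):
    """FLOPs per byte for y = x @ W, with x of shape (batch, d_in).

    Assumes every weight, input and output value crosses the memory bus
    exactly once (no cache reuse across calls), which is a simplification.
    """
    flops = 2 * batch * d_in * d_out  # one multiply and one add per weight use
    bytes_moved = bytes_per_value * (d_in * d_out + batch * d_in + batch * d_out)
    return flops / bytes_moved


for batch in (1, 8, 64, 256):
    print(batch, round(arithmetic_intensity(batch, 4096, 4096), 1))
```

At batch size 1 the ratio comes out at roughly one FLOP per byte, so the weights are read once and barely reused; at larger batches each weight is reused many times and the same layer shifts toward being compute-bound.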

ein0p · 1y ago

Not even single batch. If you want reasonable latency per token (TPOT) even larger batches do not give you high compute utilization during extend. It’s only when you don’t care about TPOT at all, and your model is small enough to leave space for a large batch on an 8 GPU host, that’s when you could get decent utilization. That’s extend only - it’s easy to get high utilization in prefill.

SuchAnonMuchWow · 1y ago

It's worse than that: the energy gains are when comparing computations made with fp32, but for fp8 the multipliers are really tiny and the adders/shifters represent a larger part of the operators (energy-wise and area-wise), so this paper will only have small gains.

On fp8, the estimated gate count of fp8 multipliers is 296 vs. 157 with their technique, so the power gain on the multipliers will be much lower (50% would be a more reasonable estimate), but again for fp8 the additions in the dot products are a large part of the operations.

Overall, it's really disingenuous to claim an 80% power gain and a small drop in accuracy, when the power gain is only for fp32 operations and the small drop in accuracy is only for fp8 operators. They don't analyze the accuracy drop in fp32, and don't present the power saved for fp8 dot products.

bobsyourbuncle · 1y ago

I’m new to neural nets, when should one use fp8 vs fp16 vs fp32?


lifthrasiir · 1y ago

I'm also sure that fp8 is small enough that multiplication can really be done in a much simpler circuit than larger fp formats. Even smaller formats like fp4 would be able to just use a lookup table, and that makes them more like sort-of-standardized quantization schemes.

tankenmate · 1y ago

i suspect that you could do fp8 with log tables and interpolation if you really wanted to (compared to the memory required for the model it's peanuts), it just turns into a LUT (log table look up) and bit shift (interpolation). so again, memory bandwidth is the limiting factor for transformers (as far as energy is concerned).


brilee · 1y ago

fp4/fp8 for neural networks don't work the way you think they do - they are merely compression formats - a set of, say, 256 fp32 weights from 1 neuron are lossily turned into 1 max value (stored in fp32 precision) and 256 fp4/fp8 numbers. Those compressed numbers are multiplied by the fp32 number at runtime to restore the original weights and full fp32 multiplication + additions are executed.
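The block-scaling scheme described above can be sketched in a few lines. This is a minimal illustration that assumes signed integer codes in the range -7..7 in place of a real fp4 encoding; `quantize_block` and `dequantize_block` are hypothetical names, not any library's API.

```python
def quantize_block(weights, levels=7):
    """Lossily compress a block into one full-precision scale plus small codes."""
    scale = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    codes = [round(w / scale * levels) for w in weights]  # each fits in 4 bits
    return scale, codes


def dequantize_block(scale, codes, levels=7):
    """Restore approximate weights by multiplying the codes by the scale."""
    return [c * scale / levels for c in codes]


weights = [0.5, -1.0, 0.25, 0.0]
scale, codes = quantize_block(weights)
print(scale, codes)
print(dequantize_block(scale, codes))
```

The per-weight error is bounded by half a quantization step, i.e. `scale / (2 * levels)`, which is why a single outlier in a block hurts the precision of every other weight in it.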


bee_rider · 1y ago

What is fp4? 3 bits of exponent and one of mantissa?


api · 1y ago

Sounds like the awesome architecture for transformers would be colocation of memory and compute.

Joker_vD · 1y ago

Yes, that's why we generally run them on GPUs.


imjonse · 1y ago

That is true for single user/light inference only. For training and batch inference you can get compute bound fast enough.

saagarjha · 1y ago

That really depends on what you're doing. Trying to feed a tensor core is pretty hard–they're really fast.

kendalf89 · 1y ago

Maybe this technique can be used for training then since that is a lot more compute intensive?

mikewarot · 1y ago

Imagine if you had a systolic array large enough that all the weights would only have to be loaded once at startup. Eliminating the memory-compute bottleneck of the von Neumann architecture could make this quite a bit more efficient.

h_tbob · 1y ago

Bro... they are NOT lightweight on compute!

shrubble · 1y ago · 12 in thread

I remember that many years ago, when floating point computation was expensive for Intel CPUs to do, there were multiple ways that programmers used integer trickery to work around this.

Chuck Moore of Forth fame demonstrated taking the value, say 1.6 multiplied by 4.1 and doing all the intermediate calculations via integers (16 * 41) and then formatting the output by putting the decimal point back in the "right place"; this worked as long as the range of floating point values was within a range that multiplying by 10 didn't exceed 65536 (16 bit integers), for instance. For embedded chips where for instance, you have an analog reading with 10 bits precision to quickly compute multiple times per second, this worked well.
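The scaled-integer trick in the 1.6 × 4.1 example can be sketched as follows. This is only an illustration of the idea, assuming non-negative values stored as an integer count of tenths; the helper names are made up for the example.

```python
def mul_tenths(a, b):
    """Multiply two values stored as integer tenths.

    The product of two tenths-scaled integers counts hundredths, so the
    new scale (100) is returned alongside it.
    """
    return a * b, 100


def format_scaled(value, scale):
    """Put the decimal point back in the 'right place' for printing."""
    digits = len(str(scale)) - 1  # 100 -> two fractional digits
    whole, frac = divmod(value, scale)
    return f"{whole}.{frac:0{digits}d}"


product, scale = mul_tenths(16, 41)  # 1.6 and 4.1 as tenths
print(format_scaled(product, scale))  # 6.56
```

The intermediate value here is 656, comfortably inside a 16-bit integer, which is the range constraint mentioned above.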

I also recall talking many years ago with a Microsoft engineer who had worked with the Microsoft Streets and Trips program (https://archive.org/details/3135521376_qq_CD1 for a screenshot) and that they too had managed to fit what would normally be floating point numbers and the needed calculations into some kind of packed integer format with only the precision that was actually needed, that was faster on the CPUs of the day as well as more easily compressed to fit on the CDROM.

dajoh · 1y ago

What you're describing is called fixed point arithmetic, a super cool technique I wish more programmers knew about.

Proper finance related code should use it, but in my experience in that industry it doesn't seem very common unless you're running mainframes.

Funnily enough, I've seen a lot more fixed point arithmetic in software rasterizers than anywhere else. FreeType, GDI, WPF, WARP (D3D11 reference rasterizer) all use it heavily.

kccqzy · 1y ago

I have worked on firmware that has plenty of fixed point arithmetic. The firmware usually runs on processors without hardware floating point units. For example certain Tesla ECUs use 32-bit integers where they divide it into four bits of integer part and 28 bits of fractional part. So values are scaled by 2^28.
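For readers who have not seen this layout, here is a minimal sketch of arithmetic with 28 fractional bits. It only illustrates the scaling by 2^28; signedness, rounding mode and overflow handling in any real firmware are not known from the comment and are assumptions here.

```python
FRAC_BITS = 28
ONE = 1 << FRAC_BITS  # the value 1.0 in this format


def to_q28(x):
    """Convert a float to a fixed-point integer scaled by 2**28."""
    return round(x * ONE)


def from_q28(n):
    """Convert back to a float, for display only."""
    return n / ONE


def q28_mul(a, b):
    """Multiply two fixed-point values.

    The raw product carries 56 fractional bits (and needs a 64-bit
    intermediate on real hardware), so shift right by 28 to rescale.
    """
    return (a * b) >> FRAC_BITS


a, b = to_q28(1.5), to_q28(2.25)
print(from_q28(a + b))         # addition needs no rescaling
print(from_q28(q28_mul(a, b)))
```

Addition and subtraction are plain integer operations; only multiplication and division need the extra shift.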


aatd86 · 1y ago

What do they use? Not float I hope. Plus given that some currencies have different precisions... Don't tell me it's rounding errors over trillion monies?! :o)


EGreg · 1y ago

Smart contracts on EVM and other blockchains all use fixed point, for the simple reason that all machines have to get exactly the same result.

myst · 1y ago

Every half-competent software engineer knows about fixed point arithmetic, my friend.


andrewla · 1y ago

I recall playing with FRACTINT, which was a fractal generator that existed before floating point coprocessors were common, that used fixed point math to calculate and display fractals. That was back when fractals were super cool and everyone wanted to be in the business of fractals, and all the Nobel Prizes were given out to fractal researchers.

touisteur · 1y ago

Ozaki has been doing fp64 matrix-multiplication using int8 tensor cores

https://arxiv.org/html/2306.11975v4

Interesting AF.

candiddevmike · 1y ago

AFAIK this is still the best way to handle money/financial numbers.

amanda99 · 1y ago

That's got nothing to do with perf tho.


dwattttt · 1y ago

That particular trick is known as fixed point arithmetic (not to be confused with a fixed point of a function)

asadalt · 1y ago

this is still true for many embedded projects. like pi pico (2040) uses a table.

kragen · 1y ago

Sure, FRACTINT is called FRACTINT because it uses fixed-point ("integer") math. And fixed-point math is still standard in Forth; you can do your example in GForth like this:

    : organize; gforth
    Gforth 0.7.3, Copyright (C) 1995-2008 Free Software Foundation, Inc.
    Gforth comes with ABSOLUTELY NO WARRANTY; for details type `license'
    Type `bye' to exit
    : %* d>s 10 m*/ ;  : %. <# # [char] . hold #s #> type ;  ok
    1.6 4.1 %* %. 6.5 ok

Note that the correct answer is 6.56, so the result 6.5 is incorrectly rounded. Here's how this works.

(If you're not familiar with Forth, Forth's syntax is that words are separated by spaces. "ok" is the prompt, ":" defines a subroutine terminated with ";", and you use RPN, passing parameters and receiving results on a stack.)

In standard Forth, putting a decimal point in a number makes it a double-precision number, occupying two cells on the stack, and in most Forths the number of digits after the decimal point is stored (until the next number) in the non-standardized variable dpl, decimal point location. Here I've just decided that all my numbers are going to have one decimal place. This means that after a multiplication I need to divide by 10, so I define a subroutine called %* to do this operation. (Addition and subtraction can use the standard d+ and d- subroutines; I didn't implement division, but it would need to pre-multiply the dividend by the scale factor 10.)

"%*" is defined in terms of the standard subroutine m*/, which multiplies a double-precision number by a single-precision number and divides the result by a divisor, and the standard subroutine d>s, which converts a double-precision number to a single-precision number. (There's probably a better way to do %*. I'm no Forth expert.)

I also need to define a way to print out such numbers, so I define a subroutine called "%.", using Forth's so-called "pictured numeric output", which prints out an unsigned double-precision number inserting a decimal point in the right place with "hold", after printing out the least significant digit. (In PNO we write the format backwards, starting from the least significant digit.) The call to "type" types out the formatted number from the hold space used by PNO.

Then I invoked %* on 1.6 and 4.1 and %. on its result, and it printed out 6.5 before giving me the "ok" prompt.

If you want to adapt this to use two decimal places:

    : %* d>s 100 m*/ ;  : %. <# # # [char] . hold #s #> type ; redefined %*  redefined %.   ok
    1.60 4.10 %* %. 6.56 ok

Note, however, that a fixed-point multiplication still involves a multiplication, requiring potentially many additions, not just an addition. The paper, which I haven't read yet, is about how to approximate a floating-point multiplication by using an addition, presumably because in multiplication you add the mantissas, or maybe using a table of logarithms.
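To make the "multiplication as an addition" idea concrete, here is the classic bit-pattern trick (Mitchell's logarithmic approximation), shown in Python for illustration. It is not the paper's L-Mul algorithm, which is not reproduced in this thread; it only shows why adding the integer encodings of two floats lands near their product. It assumes positive, normal float32 inputs and ignores overflow.

```python
import struct

ONE_BITS = 0x3F800000  # IEEE 754 single-precision bit pattern of 1.0


def float_to_bits(x):
    """Reinterpret a float32 value as an unsigned 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(n):
    """Reinterpret an unsigned 32-bit integer as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", n))[0]


def approx_mul(a, b):
    """Approximate a * b with one integer addition and one subtraction.

    Adding the bit patterns adds the exponents exactly and the mantissa
    fractions linearly; subtracting ONE_BITS removes the doubled bias.
    """
    return bits_to_float(float_to_bits(a) + float_to_bits(b) - ONE_BITS)


print(approx_mul(2.0, 8.0))   # exact for powers of two
print(approx_mul(1.6, 4.1))   # close to 6.5, versus the true 6.56
```

The result always underestimates the true product, by up to about 11% in the worst case (for example 1.5 × 1.5 comes out as 2.0), because the cross term of the two mantissa fractions is dropped.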

Forth's approach to decimal numbers was a clever hack for the 01970s and 01980s on sub-MIPS machines with 8-bit and 16-bit ALUs, where you didn't want to be invoking 32-bit arithmetic casually, and you didn't have floating-point hardware. Probably on 32-bit machines it was already the wrong approach (a double-precision number on a 32-bit Forth is 64 bits, which is about 19 decimal digits) and clearly it is on 64-bit machines, where you don't even get out of the first 64-bit word until that many digits:

    0 1 %. 184467440737095516.16 ok

GForth and other modern standard Forths do support floating-point, but for backward compatibility, they treat input with decimal points as double-precision integers.

ranguna · 1y ago · 12 in thread

I've seen this claim a few times over the last couple of years and I have a pet theory why this isn't explored a lot:

Nvidia funds most research around LLMs, and they also fund other companies that fund other research. If transformers were to use addition and remove all usage of floating point multiplication, there's a good chance the GPU would no longer be needed, or at the least, cheaper ones would be good enough. If that were to happen, no one would need Nvidia anymore and their trillion dollar empire would start to crumble.

University labs get free GPUs from Nvidia -> University labs don't want to do research that would make said GPUs obsolete because Nvidia won't like that.

If this were true, it would mean that we are stuck on an inefficient research path due to corporate greed. Imagine if this really was the next best thing, and we just don't explore it more because the ruling corporation doesn't want to lose their market cap.

Hopefully I'm wrong.

cpldcpu · 1y ago

I have to disagree. Nvidia spent a lot of effort on researching improved numerical representations. You can see a summary in this talk:

https://www.youtube.com/watch?v=gofI47kfD28

A lot of their work was published but went by unnoticed. But in fact the majority of their performance increase in new architecture is resulting from this work.

Reading between the lines, it seems that they came to the conclusion that a 4 bit representation with a group exponent ("FP4") is the most efficient representation of weights for inference. Reducing the number of bits in weights has the biggest impact on LLM inference, since it is mostly memory bound. At these low bit counts, the impact of using multiplication or other approaches is not really significant anymore.

(multiplying a 4 bit weight with a larger activation is effectively 4 additions, barely more than what the paper proposes)
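The "4 additions" remark can be illustrated with a plain shift-and-add multiplier. This is a sketch assuming an unsigned 4-bit integer weight and an integer activation; a real FP4 format with sign and exponent bits would need extra handling that is not shown.

```python
def mul_4bit(weight, activation):
    """Multiply by a 4-bit weight using at most four shifted additions."""
    if not 0 <= weight < 16:
        raise ValueError("weight must fit in 4 unsigned bits")
    total = 0
    for bit in range(4):
        if (weight >> bit) & 1:         # is this weight bit set?
            total += activation << bit  # add activation * 2**bit
    return total


print(mul_4bit(13, 1234) == 13 * 1234)
```

In hardware the loop becomes four partial products feeding an adder tree, so the multiplier is already little more than a handful of adders at this width.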

nayroclade · 1y ago

"Good enough" for what? We're in the middle of an AI arms race. Why do you believe people would choose to run the same LLMs on cheaper equipment instead of using the greater efficiency to train and run even larger LLMs?

Given LLM performance seems to scale with their size, this would result in more powerful models, which would grow the applicability, use and importance of AI, which would in turn grow the use and importance of Nvidia's hardware.

So this theory doesn't really stack up for me.

chpatrick · 1y ago

It's still a massively parallel problem suited to GPUs, whether it's float or int, or addition or multiplication doesn't really matter.

londons_explore · 1y ago

If an addition-only LLM performed better, nvidia would probably still be the market leader.

Next gen nvidia chips would have more adders and fewer multipliers.

yunohn · 1y ago

Google & Apple already run custom chips, Meta and MS are deploying their own soon too. Your theory is that none of them have researched non-matrix-multiplication solutions before investing billions?

miohtama · 1y ago

There are several patents on this topic so they have

twoodfin · 1y ago

I’d estimate that the fraction of Nvidia’s dominance that’s dependent on their distinctive advantages in kernel primitives (add vs. multiply) would be a rounding error in FP8.

The CUDA tooling and ecosystem, VLSI architecture, organizational prowess… all matter at multiple orders of magnitude more.

teaearlgraycold · 1y ago

NVidia GPUs support integer operations specifically for use with deep learning models.

WrongAssumption · 1y ago

So let me get this straight. Universities don’t want to show that Nvidia gpus are obsolete, so they can receive a steady stream of obsolete gpus? For what possible reason, that doesn’t make sense.

iamgopal · 1y ago

No matter how fast CPUs, networks and browsers have become, websites are still slow. We will run out of data to train on much earlier than people will stop inventing even larger models.

yieldcrv · 1y ago

Alternatively, other people fund LLM research

raincole · 1y ago

> I have a pet theory

You mean you have a conspiracy theory.

Why wouldn't other companies that buy Nvidia GPUs fund this research? It would greatly cut their costs.

A4ET8a8uTh0 · 1y ago · 6 in thread

Uhh.. I hate to be the one to ask this question, but shouldn't we be focused on making LLMs work well first and then focus on desired optimizations? Using everyone's car analogy, it is like making sure early cars use a lower amount of coal. It is a fool's errand.

itishappy · 1y ago

Coal (and even wood!) powered cars actually existed long before Ford, but didn't take off because they were too heavy and unwieldy. The Model T was the result of a century of optimization.

https://en.wikipedia.org/wiki/Nicolas-Joseph_Cugnot

lukev · 1y ago

Also, making neural networks faster/cheaper is a big part of how they advance.

We've known about neural architectures since the 70s, but we couldn't build them big enough to be actually useful until the advent of the GPU.

Similarly, the LLM breakthrough was because someone decided it was worth spending millions of dollars to train one. Efficiency improvements lower that barrier for all future development (or alternatively, allow us to build even bigger models for the same cost.)

spencerchubb · 1y ago

Cheaper compute is basically a prerequisite to making better models. You can get some improvements on the margins by making algorithms better with current hardware, but not an order of magnitude improvement.

When there is an order of magnitude improvement in hardware, the AI labs will figure out an algorithm to best take advantage of it.

Maken · 1y ago

The optimizations described could easily work on other models, not just transformers. Following your analogy, this is optimizing plumbing, pistons and valves on steam engines, it could be useful for whatever follows.

fennecfoxy · 1y ago

You're also welcome to contribute. There are many people doing many things at once in this space, I don't think experiments like this are a problem at all.

andrewchambers · 1y ago

What if working well means making them efficient enough to run more 'neurons' on our current hardware?

tantalor · 1y ago · 2 in thread

[2023] GradIEEEnt half decent: The hidden power of imprecise lines

http://tom7.org/grad/murphy2023grad.pdf

Also in video form: https://www.youtube.com/watch?v=Ae9EKCyI1xU

dang · 1y ago

GradIEEEnt half decent: The hidden power of imprecise lines [video] - https://news.ycombinator.com/item?id=36806970 - July 2023 (9 comments)

GradIEEEnt half decent - https://news.ycombinator.com/item?id=35780921 - May 2023 (32 comments)

indrora · 1y ago

I had hoped that they would reference this in their paper as some kind of "supporting previous exploration" but no, alas.

js8 · 1y ago · 2 in thread

Haven't read it, but isn't this just logarithmic tables in some form?

I am asking not to dismiss it, I genuinely feel I don't understand logarithms on a fundamental level (of logic gates etc.). If multiplication can be replaced with table lookup and addition, then there has to be a circuit that gives you difficult addition and easy multiplication, or any combination of those tradeoffs.

sabhiram · 1y ago

Log space is nice, multiplication can be replaced by addition.

This part is easy and anyone can implement hardware to do this. The tricky bit is always the staying in log space while doing accumulations, especially ones across a large range.
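A small sketch of why accumulation is the awkward part, assuming positive values stored as base-2 logarithms (the function names are illustrative only). Multiplication is a single addition, but adding two values needs an extra function evaluation, which hardware designs typically approximate with a lookup table.

```python
import math


def log_mul(la, lb):
    """Multiply in log space: log2(a * b) = log2(a) + log2(b)."""
    return la + lb


def log_add(la, lb):
    """Add in log space without leaving it.

    log2(a + b) = hi + log2(1 + 2**(lo - hi)); the correction term is the
    expensive part that has to be computed or tabulated.
    """
    hi, lo = max(la, lb), min(la, lb)
    return hi + math.log2(1.0 + 2.0 ** (lo - hi))


l3, l5 = math.log2(3.0), math.log2(5.0)
print(2.0 ** log_mul(l3, l5))  # about 15
print(2.0 ** log_add(l3, l5))  # about 8
```

When the two operands differ by a large factor the correction term shrinks toward zero, so the smaller value can vanish entirely at limited precision, which is the range problem mentioned above.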

pclmulqdq · 1y ago

Yes, this is logarithmic number systems at work.

cpldcpu · 1y ago · 2 in thread

It puzzles me that there does not seem to be a proper derivation and discussion of the error term in the paper. It's all treated indirectly via inference results.

Lerc · 1y ago

The paper has an odd feel about it to me too. Doing a gate estimation as a text explanation without a diagram makes it too easy to miss some required part. It wouldn't need to be a full gate level explanation but blocks labeled 'adder'.

Seeing the name de Vries in the first paragraph didn't help my sense of confidence either.

brcmthrowaway · 1y ago

Because of the twisted mentat?


ein0p · 1y ago · 2 in thread

More than 10x the amount of energy is spent moving bytes around. Compute efficiency is not as big of an issue as people think. It’s just that the compute is in the wrong place now - it needs to be right next to memory cells, bypassing the memory bus, at least in the initial aggregations that go into dot products.

entropicdrifter · 1y ago

This could still be useful for battery constrained devices, right?

ein0p · 1y ago

It’s even worse in battery constrained devices - they tend to also be memory constrained and run with batch size 1 during extend. IOW the entire model (or parts thereof, if the model is MoE), gets read for every generated token. Utilization of compute is truly abysmal in that case and almost all energy is spent pushing bytes through the memory bus, which on battery powered devices doesn’t have high throughput

CGamesPlay · 1y ago · 1 in thread

I believe this reduces the compute required, but still uses 8 bits per value, so it does not reduce the memory requirements required to run inference, so it doesn’t particularly make the models more accessible for inference. Is this storage method suitable for training? That could potentially be an interesting application.

Manabu-eo · 1y ago

It actually is about 0.5 bits less efficient per weight in terms of precision/range, something the paper never highlights.

presspot · 1y ago · 1 in thread

From my experience, the absolute magicians in fixed point math were the 8-bit and 16-bit video game designers. I was in awe of the optimizations they did. They made it possible to calculate 3D matrix maths in real time, for example, in order to make the first flight simulators and first person shooter games.

hinkley · 1y ago

Redefining degrees to be 2pi = 256 was a pretty clever trick.
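For anyone who hasn't seen the trick, here is a sketch of 256-unit "binary angles" with a sine lookup table. The table size, the 127 amplitude and the helper names are assumptions for illustration, not taken from any particular game.

```python
import math

# One full turn is 256 units, so an angle fits in a single byte and
# wraps around for free when it overflows.
SINE_TABLE = [round(127 * math.sin(2 * math.pi * i / 256)) for i in range(256)]


def bsin(angle):
    """Sine scaled to -127..127 for an angle in 1/256ths of a turn."""
    return SINE_TABLE[angle & 0xFF]  # the mask replaces a modulo by 2*pi


def bcos(angle):
    """Cosine is the same table read a quarter turn (64 units) ahead."""
    return SINE_TABLE[(angle + 64) & 0xFF]


print(bsin(64), bcos(0), bsin(64 + 256))
```

On an 8-bit CPU the `& 0xFF` is not even needed, since the register wraps on its own.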

cpldcpu · 1y ago · 1 in thread

Bill Dally from Nvidia introduced a log representation that basically allows replacing a multiplication with an add, without loss of accuracy (in contrast to the proposal above)

https://youtu.be/gofI47kfD28?t=2248

nickpsecurity · 1y ago

Paper?

https://research.nvidia.com/publication/2022-12_lns-madam-lo...

scotty79 · 1y ago · 1 in thread

All You Need is Considered Harmful.

TaurenHunter · 1y ago

We will need a paper titled '"Considered Harmful" Articles is All You Need' to complete that cycle.

md_rumpf · 1y ago · 1 in thread

The return of the CPU?!

anticensor · 1y ago

The reign of Threadripper!


pjc50 · 1y ago

"We recommend training and hosting L-Mul-based models on devices integrated with specialized architectural designs. Patent pending"

(from footnote in method section)

Buttons840 · 1y ago

Would using this neural network based on integer addition be faster? The paper does not claim it would be faster, so I'm assuming not?

What about over time? If this L-Mul (the matrix operation based on integer addition) operation proved to be much more energy efficient and became popular, would new hardware be created that was faster?

concrete_head · 1y ago

Just to add an alternative addition-based architecture into the mix.

https://www.youtube.com/watch?v=VqXwmVpCyL0

dwrodri · 1y ago

7 years of the same title format is all you need.

m3kw9 · 1y ago

So instead of say 2x3 you go 2+2+2?
