I could imagine FPGA designs might be competitive.
And dedicated ASICs would almost certainly beat both by a decent margin.
The main reason why we run this stuff on GPUs is their memory bandwidth, anyway.
Total ≈ 1085 gates for the 16-bit multiply. The reality is probably far more, because you're going to want carry-lookahead and pipelining.
Whereas 1-bit multiplies and adds into, say, a 16-bit accumulator use... 16 gates! (And probably half that, since you can likely use scheduling tricks to skip past the zeros, at the expense of variable latency.)
So if 1-bit math uses only ~1/100th of the silicon area of 16-bit math, and (according to this paper) gets the same results, the future is clearly silicon that can do 1-bit math.
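The reason the multiplier disappears: with weights restricted to {-1, 0, +1} (the ternary scheme these "1-bit" papers actually use), a multiply-accumulate collapses into add/subtract/skip. A toy sketch (names hypothetical, just to illustrate the point):

```python
def ternary_dot(activations, weights):
    """Dot product where every weight is -1, 0, or +1.

    No multiplier needed: +1 is an add, -1 is a subtract,
    and 0 is skipped entirely (the scheduling trick above,
    which trades silicon for variable latency).
    """
    acc = 0
    for a, w in zip(activations, weights):
        if w == 1:
            acc += a
        elif w == -1:
            acc -= a
        # w == 0: do nothing
    return acc

# Matches an ordinary multiply-accumulate:
a = [3, -2, 7, 1]
w = [1, 0, -1, 1]
assert ternary_dot(a, w) == sum(x * y for x, y in zip(a, w))  # -3
```

In hardware the accumulator is the only wide datapath left, which is where the ~16-gates-per-MAC figure comes from.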
- as most work is inference, we might not need as many GPUs
- consumer cards (24 GB) could possibly run the big models