0 points · metadat · 3y ago · 0 comments

Interesting, though apparently the OPT-175B model is 350GB:

> You will need at least 350GB GPU memory on your entire cluster to serve the OPT-175B model. For example, you can use 4 x AWS p3.16xlarge instances, which provide 4 (instance) x 8 (GPU/instance) x 16 (GB/GPU) = 512GB memory.

https://alpa.ai/tutorials/opt_serving.html

(Scroll down to the second "Note", not far from the top)

I wonder what FlexGen is doing. A naive guess is a mix of SSD and system memory. I'm definitely curious what FlexGen's underlying strategy translates to in terms of actual data paths.

5 comments · 1 top-level

SekstiNi · 3y ago · 4 in thread

> Interesting, though apparently the OPT-175B model is 350GB:

Only in FP16. In the paper they use int4 quantization to reduce it to a quarter of that. In addition to the model weights, there's also a KV cache that takes up considerable amounts of memory, and they use int4 on that as well.

> I wonder what FlexGen is doing. A naive guess is a mix of SSD and system memory.

That's correct, but other approaches have done this as well. What's "new" here seems to be the optimized data access pattern in combination with some other interesting techniques (prefetching, int4 quantization, CPU offload).
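The sizes quoted above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the helper name `weights_gb` is made up for illustration, it counts weights only (no KV cache or activations), and it ignores the small overhead that quantization metadata such as scales would add.

```python
def weights_gb(n_params: float, bits_per_param: int) -> float:
    """Raw weight storage in GB (10^9 bytes), ignoring any metadata overhead."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 175e9  # OPT-175B parameter count

fp16_gb = weights_gb(N_PARAMS, 16)  # 2 bytes per weight
int4_gb = weights_gb(N_PARAMS, 4)   # half a byte per weight

# Cluster from the Alpa note: 4 instances x 8 GPUs/instance x 16 GB/GPU
cluster_gb = 4 * 8 * 16

print(f"FP16 weights: {fp16_gb:.1f} GB")
print(f"int4 weights: {int4_gb:.1f} GB")
print(f"Cluster GPU memory: {cluster_gb} GB")
print(f"FP16 weights fit on the cluster: {fp16_gb <= cluster_gb}")
```

The FP16 figure matches the 350GB in the Alpa note, and int4 comes out at a quarter of that.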

stevenhuang · 3y ago

I want to emphasize how fascinating I find it that going from 16-bit to 4-bit quantization results in negligible performance loss. That's huge. Is the original FP16 not compressed?

That such coarse quantization works seems to suggest the "bottleneck" is in some other aspect of the system, and maybe until that is addressed, higher-precision weights do not improve performance.

Or maybe it's the relative values/ratio between weights that is important, and as long as the intended ratio between weights can be expressed, the exact precision of the weights themselves may not be important?

I found an interesting paper on this, linked below. There's doubtless heavy research underway in this area.

- https://www.researchgate.net/publication/367557918_Understan...

stevenhuang · 3y ago

Here's a recent discussion I found on int4. It definitely looks like this is the new hotness. Very exciting!

https://news.ycombinator.com/item?id=34404859

t-vi · 3y ago

In my understanding, at a very high level and omitting many crucial details, the key is that when you have mainly largish matrix multiplications (as in transformers), well-behaved (mean-zero, uncorrelated, random or so) quantization errors cancel out. People do/did experiment with 1- or 2-bit compression of gradients/updates in the context of distributed training, but there it has generally been deemed useful to keep track of compression errors locally.
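The cancellation argument above can be illustrated with a toy experiment. This is a rough sketch under loud assumptions, not the quantization scheme from the FlexGen paper: weights are drawn uniformly from [-1, 1], inputs are standard Gaussian, and `quantize` is a plain round-to-nearest 4-bit grid made up for this example. It compares the actual error of a quantized dot product with the worst case in which every per-element error has the same sign.

```python
import random


def quantize(w: float, bits: int = 4) -> float:
    """Round w in [-1, 1] to the nearest point on a uniform grid of 2**bits levels."""
    step = 2.0 / (2 ** bits - 1)
    return round((w + 1.0) / step) * step - 1.0


def cancellation_ratio(n: int, trials: int = 50, seed: int = 0) -> float:
    """Mean of |actual dot-product error| / (worst-case error) for length-n vectors.

    A ratio of 1 means no cancellation; smaller values mean more cancellation.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        actual = 0.0  # signed sum of per-element errors
        worst = 0.0   # sum of absolute per-element errors
        for _ in range(n):
            w = rng.uniform(-1.0, 1.0)  # assumed weight distribution
            x = rng.gauss(0.0, 1.0)     # assumed input distribution
            err = (quantize(w) - w) * x
            actual += err
            worst += abs(err)
        total += abs(actual) / worst
    return total / trials


small = cancellation_ratio(16)
large = cancellation_ratio(4096)
print(f"n=16:   actual/worst-case error = {small:.3f}")
print(f"n=4096: actual/worst-case error = {large:.3f}")
```

Under these assumptions the ratio should shrink roughly like 1/sqrt(n), which is the sense in which errors "cancel out" in large matrix multiplications. Real weight distributions have outliers, so this says nothing about how well any particular model tolerates int4.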

inciampati · 3y ago

Very insightful! Now I'm curious what the bottleneck is.
