superkuh · 3y ago

As far as I know, yes. https://arxiv.org/abs/2210.17323

"Specifically, GPTQ can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth down to 3 or 4 bits per weight, with negligible accuracy degradation relative to the uncompressed baseline."

This would be 175 billion 3-bit weights instead of 175 billion 16-bit (or 32-bit!) weights. That massively reduces the size of the model and makes loading it into RAM on consumer computers feasible. The number of parameters stays the same.
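To put rough numbers on that, here is a minimal sketch of the arithmetic. It counts only raw weight storage and assumes no overhead for per-group scales, zero points, or activations, so real checkpoints will be somewhat larger. The helper name `model_size_gb` is made up for illustration.

```python
def model_size_gb(n_params: int, bits_per_weight: int) -> float:
    """Raw weight storage in GB (10^9 bytes), ignoring quantization metadata."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 175_000_000_000  # parameter count quoted from the abstract

# Same number of parameters in every case; only the bits per weight change.
sizes = {bits: model_size_gb(N_PARAMS, bits) for bits in (32, 16, 4, 3)}

for bits, gb in sizes.items():
    print(f"{bits:>2}-bit weights: {gb:7.1f} GB")
```

Going from 16 bits to 3 bits per weight is a factor of 16/3, a bit over 5x, with the parameter count untouched.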

rnosov · 3y ago

> https://arxiv.org/abs/2210.17323

I've read the paper and, to be honest, I'm not sure what to make of it. Their headline benchmark is perplexity on WikiText2, which would not be particularly relevant to most users. If you look at the tables in Appendix A.4 with some more relevant benchmarks, you'll sometimes find that straight RTN (round-to-nearest) 4-bit quantisation beats both GPTQ and even the full 16-bit original! The paper gives no explanation for this.

sebzim4500 · 3y ago

Some of those benchmarks have a pretty small sample size IIRC. It might just be coincidence that the noise introduced by RTN happens to slightly improve them.

GPTQ beats RTN on almost every benchmark at almost every size, though.
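For anyone unfamiliar with the baseline, RTN just snaps each weight to the nearest point on a uniform grid. Below is a minimal sketch, assuming a single asymmetric min/max grid over the whole list of weights; the exact grouping used in the paper's tables may differ, and this is not GPTQ itself, which additionally adjusts the remaining weights to compensate for the rounding error. The function names and sample weights are made up for illustration.

```python
def rtn_quantize(weights, bits):
    """Round each weight to the nearest point on a uniform grid over [min, max]."""
    levels = 2 ** bits - 1                 # highest integer code, e.g. 15 for 4 bits
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Map integer codes back to approximate floating-point weights."""
    return [lo + c * scale for c in codes]


weights = [-0.42, -0.10, 0.03, 0.27, 0.55]   # toy values, not from any model
codes, scale, lo = rtn_quantize(weights, bits=4)
restored = dequantize(codes, scale, lo)

# The rounding error per weight is at most half a grid step.
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, max_err)
```

That rounding error is the "noise" in question: bounded by half a grid step per weight, and with no regard for how much each weight matters to the output.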

coeneedell · 3y ago

I wonder if reducing the bit depth of parameters, as we have been doing, acts as a kind of normalization in these huge deep models.

rcme · 3y ago

The number of parameters stays the same, but the amount of information encodable by those parameters is not the same.

thomasahle · 3y ago

But they have to expand it back out to actually use it, right? Or does NVIDIA support 3-bit matrix multiplication?
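As a rough illustration of what "expanding it back out" could look like, here is a sketch that packs 3-bit codes tightly into bytes and unpacks them into floats before use. It says nothing about what any particular GPU kernel actually does; the bit layout, the function names, and the `scale` and `zero` values are all assumptions made up for the example.

```python
def pack_codes(codes, bits=3):
    """Pack small unsigned integers into bytes, `bits` bits each (little-endian)."""
    acc = 0
    for i, c in enumerate(codes):
        if not 0 <= c < (1 << bits):
            raise ValueError(f"code {c} does not fit in {bits} bits")
        acc |= c << (i * bits)
    n_bytes = (len(codes) * bits + 7) // 8   # round up to whole bytes
    return acc.to_bytes(n_bytes, "little")


def unpack_codes(packed, count, bits=3):
    """Recover `count` integer codes from the packed bytes."""
    acc = int.from_bytes(packed, "little")
    mask = (1 << bits) - 1
    return [(acc >> (i * bits)) & mask for i in range(count)]


codes = [5, 0, 7, 3, 1, 6, 2, 4]             # eight 3-bit codes = 24 bits
packed = pack_codes(codes)                   # stored form: 3 bytes
unpacked = unpack_codes(packed, len(codes))

# Hypothetical scale and zero point, used to expand codes into usable weights.
scale, zero = 0.1, -0.35
expanded = [zero + c * scale for c in unpacked]
print(len(packed), unpacked, expanded)
```

In this sketch the compact form is only the storage format: the weights take 3 bytes at rest, and the floats exist only after the unpack step.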
