undefined | Better HN

0 pointsDer_Einzige3y ago0 comments

I think a lot of it is that they are intentionally not measuring the "degradation" in quality experienced. I've noticed that 8 bit quantization of a model like dolly is significantly worse than the 32bit version of it. Seen similar results with using quantization with stable diffusion - the images really are worse, just so little at half percision that it's worth the trade-off.

0 comments

readyplayeremma3y ago

What size model are you quantizing and comparing? The interesting thing about quantization, is how the larger the number of parameters, the less of a difference it makes to quantize the weights, even to an extreme degree when working with the largest parameter models. For small models is can be a disaster though.

quickthrower23y ago

Thanks. I am more surprised it works at all.

So do they use the weights that are say 32 bit floats and just round them to the nearest something putting them in a range 0-255? I guess I can see how it could work if weights are all close to zero, so -1 to 1 is mapped to 0-255.

But I would have though the model relied on the higher accuracy during training. So losing that would screw it up.

MacsHeadroom3y ago

That commenter is just wrong. We have empirical tests of quality loss due to quantization and even down to 4bits the loss is so negligible no human would ever be able to detect it. The loss only even registers on the benchmarks after generating tens of thousands of full context generations.

>So do they use the weights that are say 32 bit floats and just round them to the nearest

That's how they used to do it, and still how 8bit quantization works. That's called "Round to Nearest" or RTN quantization. That's not how it works anymore though.

The current algorithms (GPTQ, RTPQ, etc.) are more complex, including things like lining up the weights in order of least to greatest, placing them in bins (typically 32 or 128 weights per bin), and then computing an offset for each bin which is added to the RTN value. In some cases bins are identical and redundant and can be re-used without saving the same identical bin twice. These are just a few of the space saving measures which go into effective low-bit quantization without sacrificing quality.

It's very similar to state of the art video codecs or image compression algorithms. A raw photograph taken by my digital camera is 60MB, but a PNG of the same photo is 30x smaller at 2MB without a single artifact. It should be no surprise that we can reduce models by 4x, 8x, or even more without sacrificing quality.

Der_EinzigeOP3y ago

I am not wrong, you are wrong. The fact is that NLP and other fields are FULL of people using automated benchmarks to claim that they are "state of the art". They are incentivized to downplay or trivialize any quality losses. Scores like ROUGE and BLEU are terrible and the whole community knows it, but they're still used because we have nothing "better".

I can actually see jpg artifacts on the jpg variants of the png files that I generate in Stable Diffusion, and the impacts from quantization down to 3,2, even 1 bit are FAR more than the impacts of switching from png to jpg.

Also, I actually have published peer reviewed research on LLMs and spend a majority of my time on this earth thinking about and coding for them. I know what I'm talking about and you shouldn't try to dismiss my criticisms so quickly.

Even the coomers at civitai have done polls where their own users find dreambooth models better than lora models on average, likely because the likeness of a person can be more properly trained when heavier/stronger methods are utilized. Same dynamic here with quantization.

Yes, as a model scales up in size quantization hurts it less. The claims made that extreme quantization is not noticable at all when the model is super large is just pathetically wrong.

alpaca1283y ago

> But I would have though the model relied on the higher accuracy during training. So losing that would screw it up.

Yes, during training, where you need to make tiny adjustments to weights. But as far as I understand it inference can still work well because of the sheer number of weights. Give a black-and-white image a high resolution and you can represent any shade of gray if you zoom out a bit.

quickthrower23y ago

This seems to explain the process: https://medium.com/towards-data-science/how-to-accelerate-an...

j / k navigate · click thread line to collapse

0 comments

readyplayeremma3y ago

quickthrower23y ago

Thanks. I am more surprised it works at all.

But I would have though the model relied on the higher accuracy during training. So losing that would screw it up.

MacsHeadroom3y ago

>So do they use the weights that are say 32 bit floats and just round them to the nearest

That's how they used to do it, and still how 8bit quantization works. That's called "Round to Nearest" or RTN quantization. That's not how it works anymore though.

Der_EinzigeOP3y ago

Yes, as a model scales up in size quantization hurts it less. The claims made that extreme quantization is not noticable at all when the model is super large is just pathetically wrong.

alpaca1283y ago

> But I would have though the model relied on the higher accuracy during training. So losing that would screw it up.

quickthrower23y ago

This seems to explain the process: https://medium.com/towards-data-science/how-to-accelerate-an...

j / k navigate · click thread line to collapse