undefined | Better HN

0 pointsBaculumMeumEst3y ago0 comments

Wow you can run a 30B model on 16gb ram? Is it hitting swap?

0 comments

7 comments · 2 top-level

mcbuilder3y ago· 5 in thread

These models can run FP16, with LLM quantization going down to Int8 and beyond.

BaculumMeumEstOP3y ago

i'm just starting to get into deep learning so i look forward to understanding that sentence

sp3323y ago

Training uses gradient descent, so you want to have good precision during that process. But once you have the overall structure of the network, https://arxiv.org/abs/2210.17323 (GPTQ) showed that you can cut down the precision quite a bit without losing a lot of accuracy. It seems you can cut down further for larger models. For the 13B Llama-based ones, going below 5 bits per parameter is noticeably worse, but for 30B models you can do 4 bits.

The same group did another paper https://arxiv.org/abs/2301.00774 which shows that in addition to reducing the precision of each parameter, you can also prune out a bunch of parameters entirely. It's harder to apply this optimization because models are usually loaded into RAM densely, but I hope someone figures out how to do it for popular models.

2 more replies

mike006323y ago

How much resources are required is directly related to the memory size devoted to each weight. If the weights are stored as 32-bit floating points then each weight is 32 bits which adds up when we are talking about billions of weights. But if the weights are first converted to 16-bit floating point numbers (precise to fewer decimal places) then fewer resources are needed to store and compute the numbers. Research has shown that simply chopping off some of the precision of the weights still yields good AI performance in many cases.

Note too that the numbers are standardized, e.g. floats are defined by IEEE 754 standard. Numbers in this format have specialized hardware to do math with them, so when considering which number format to use it's difficult to get outside of the established ones (foat32, float16, int8).

balder19913y ago

You’ll notice that a lot of language allow you more control when dealing with number representations, such as C/C++, Numpy in Python etc.

Ex: Since C and C++ number sizes depend on processor architecture, C++ has types like int16_t and int32_t to enforce a size regardless of architecture, Python always uses the same side, but Numpy has np.int16 and np.int32, Java also uses the same size but has short for 16-bit and int for 32-bit integers.

It just happens that some higher level languages hide this abstraction from the programmers and often standardize in one default size for integers.

MobiusHorizons3y ago

FP16 and Int8 are about how many bits are being used for floating point and integer numbers. FP16 is 16bit floating point. The more bits the better the precision, but the more ram it takes. Normally programmers use 32 or 64bit floats so 16bit floats have significantly reduced precision, but take up half the space of fp32 which is the smallest floating point format for most CPUs. similarly 8 bit integers have only 256 total possibilities and go from -128 to 127.

sp3323y ago

Most people are running these at 4 bits per parameter for speed and RAM reasons. That means the model would take just about all of the RAM. But instead of swap (writing data to disk and then reading it again later), I would expect a good implementation to only run into cache eviction (deleting data from RAM and then reading it back from disk later), which should be a lot faster and cause less wear and tear on SSDs.

j / k navigate · click thread line to collapse

0 comments

7 comments · 2 top-level

mcbuilder3y ago· 5 in thread

These models can run FP16, with LLM quantization going down to Int8 and beyond.

BaculumMeumEstOP3y ago

i'm just starting to get into deep learning so i look forward to understanding that sentence

sp3323y ago

2 more replies

mike006323y ago

balder19913y ago

You’ll notice that a lot of language allow you more control when dealing with number representations, such as C/C++, Numpy in Python etc.

It just happens that some higher level languages hide this abstraction from the programmers and often standardize in one default size for integers.

MobiusHorizons3y ago

sp3323y ago

j / k navigate · click thread line to collapse