
0 points · sp332 · 3y ago · 0 comments

Training uses gradient descent, so you want good numerical precision during that process. But once the network is trained, GPTQ (https://arxiv.org/abs/2210.17323) showed that you can cut the precision of the weights quite a bit without losing much accuracy. Larger models seem to tolerate lower precision: for the 13B Llama-based ones, going below 5 bits per parameter is noticeably worse, but for 30B models you can go down to 4 bits.
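
To make the precision trade-off concrete, here is a minimal sketch of plain uniform round-to-nearest quantization. This is not GPTQ's algorithm, just the simplest baseline, and the helper names (`quantize_rtn`, `mean_abs_error`) and the random Gaussian "weights" are made up for illustration. It only shows that the reconstruction error grows as the bit width shrinks.

```python
import random

def quantize_rtn(weights, bits):
    """Snap each weight to the nearest of 2**bits evenly spaced levels,
    then map back to floats (quantize followed by dequantize)."""
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    # Guard against a constant vector, where hi == lo.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]  # integers in [0, levels]
    return [lo + c * scale for c in codes]

def mean_abs_error(a, b):
    """Average absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Stand-in for real weights (assumption): 4096 Gaussian samples.
random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]

for bits in (8, 5, 4, 3):
    err = mean_abs_error(weights, quantize_rtn(weights, bits))
    print(f"{bits} bits: mean abs error {err:.4f}")
```

Real quantizers typically use a separate scale per group of weights rather than one scale for the whole tensor, which is part of why they hold up better at 4 bits than this toy version would.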

The same group published another paper, SparseGPT (https://arxiv.org/abs/2301.00774), which shows that in addition to reducing the precision of each parameter, you can also prune a large share of the parameters entirely. This optimization is harder to apply because models are usually loaded into RAM as dense arrays, so zeroed weights still take up space, but I hope someone figures out how to do it for popular models.
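
A rough sketch of why dense loading is the sticking point. It uses simple magnitude pruning as a stand-in (not the method from the paper), and all function names and the example row are hypothetical. Pruning only saves memory if the zeros are actually dropped from storage, which means switching to some sparse layout such as (index, value) pairs.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    n_drop = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_drop])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

def to_sparse(weights):
    """Store only the nonzero entries as (index, value) pairs."""
    return [(i, w) for i, w in enumerate(weights) if w != 0.0]

def sparse_dot(sparse_row, x):
    """Dot product that only visits the stored nonzero weights."""
    return sum(w * x[i] for i, w in sparse_row)

# Made-up weight row for illustration.
row = [0.9, -0.05, 0.3, 0.01, -0.7, 0.02, 0.0, 0.4]
pruned = prune_by_magnitude(row, 0.5)   # still 8 floats in memory
sparse = to_sparse(pruned)              # 4 (index, value) pairs

x = [1.0] * len(row)
dense_result = sum(w * xi for w, xi in zip(pruned, x))
print(len(pruned), len(sparse))
print(dense_result, sparse_dot(sparse, x))
```

Note that the indices cost memory too, so a layout like this only pays off once a large enough fraction of the weights is gone.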


3 comments · 2 top-level

mycall · 3y ago · 1 in thread

I wonder if specializing the LLM is another way to reduce the RAM requirements. For example, if you can tell which nodes are touched across billions of web searches on a topic, then you can delete the ones that are never touched.
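
A toy sketch of that idea, assuming "touched" means a ReLU unit produced a nonzero output on at least one input from the topic. The layer, the inputs, and every function name here are hypothetical; in a real transformer almost every weight takes part in every forward pass, so "touched" would need a looser definition, such as activation above some threshold.

```python
def relu_layer(W, b, x):
    """One dense layer with ReLU: out[j] = max(0, W[j] . x + b[j])."""
    return [max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def untouched_units(W, b, inputs):
    """Indices of units that never produced a positive activation."""
    touched = [False] * len(W)
    for x in inputs:
        for j, a in enumerate(relu_layer(W, b, x)):
            if a > 0.0:
                touched[j] = True
    return [j for j, t in enumerate(touched) if not t]

def drop_units(W, b, dead):
    """Remove the rows (and biases) of the given units.
    The next layer would also need its matching input columns removed."""
    dead = set(dead)
    keep = [j for j in range(len(W)) if j not in dead]
    return [W[j] for j in keep], [b[j] for j in keep]

# Made-up 3-unit layer; the third unit can only fire on negative inputs.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, -0.5]
topic_inputs = [[1.0, 2.0], [0.5, 0.1], [3.0, 0.0]]  # stand-in for on-topic traffic

dead = untouched_units(W, b, topic_inputs)
W_small, b_small = drop_units(W, b, dead)
print(dead, len(W_small))
```

The smaller layer gives the same outputs on the sampled inputs, but nothing guarantees that for inputs outside the sample, which is the main risk of specializing this way.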

opyate · 3y ago

Kind of like "tree shaking" for weights? Like dead code elimination.

jimmySixDOF · 3y ago

Some people are having some success speeding up token rates and clawing back VRAM using a 0- group size flag, but YMMV; I have not tested this yet. (They were discussing GPTQ, btw.)
