0 points · lumost · 2y ago

Aye, this was on quantized weights using GPTQ.

LoganDark · 2y ago

Try GGML; llama.cpp is pretty fast.

Makes sense. I ultimately need to train the weights, so I was focusing on GPTQ. I'll try out GGML and see if the latency is better. I have some flexibility on whether I run inference and training on the same model instance. What context length were you using? I was maxing out at ~2048 tokens, which may also explain the apparent latency.

LoganDark · 2y ago

llama.cpp builds a prefix cache so the only latency is on the first generation :)
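
To make the prefix-cache point concrete, here is a toy sketch of the idea. It is not llama.cpp's actual code: `ToyPrefixCache` and `common_prefix_len` are hypothetical names, and the "cache" only counts tokens rather than storing any real model state. It assumes that tokens shared with the previous request's prefix can be skipped, so only the new suffix costs anything.

```python
def common_prefix_len(a, b):
    """Length of the longest shared prefix of two token lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class ToyPrefixCache:
    """Toy model: counts how many tokens would need evaluating per request.

    Hypothetical illustration only; a real runtime keeps per-token
    attention state instead of the token list itself.
    """

    def __init__(self):
        self.cached = []  # tokens assumed to be already processed

    def evaluate(self, tokens):
        keep = common_prefix_len(self.cached, tokens)
        new_work = len(tokens) - keep  # only the unseen suffix is "evaluated"
        self.cached = list(tokens)
        return new_work


cache = ToyPrefixCache()
prompt = list(range(2048))                      # stand-in for a ~2048-token context
first = cache.evaluate(prompt)                  # cold: every token is new
second = cache.evaluate(prompt + [9001, 9002])  # warm: only the 2 appended tokens
print(first, second)  # 2048 2
```

Under these assumptions the full 2048-token cost is paid once, and a follow-up request that extends the same prompt only pays for what was appended.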
