0 points · itake · 3y ago · 0 comments

Do you know how well it performs compared to llama.cpp?

4 comments · 1 top-level

rain1 · 3y ago · 3 in thread

My understanding is that the engine used (the PyTorch transformers library) is still faster than llama.cpp, even when llama.cpp runs 100% of its layers on the GPU.

qeternity · 3y ago

It's the Hugging Face transformers library, which is implemented in PyTorch.

In terms of speed, yes, running fp16 will indeed be faster with a vanilla GPU setup. However, most people are running 4-bit quantized versions, and the GPU quantization landscape has been a mess (see the GPTQ-for-llama project). llama.cpp has taken a totally different approach, and it looks like they are currently able to match native GPU perf via cuBLAS with much less effort and brittleness.
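For a rough sense of what fp16 versus 4-bit means for the weights, here is a back-of-the-envelope sketch. The 7B parameter count is just an assumed example size, and the estimate covers the weights only: it ignores quantization scales, activations, and the KV cache, so real usage will be somewhat higher.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    bytes_total = n_params * bits_per_weight / 8
    return bytes_total / 2**30


# Assumed model size, chosen only for illustration.
N_PARAMS = 7e9

fp16_gib = weight_memory_gib(N_PARAMS, 16)  # 16 bits per weight
q4_gib = weight_memory_gib(N_PARAMS, 4)     # 4 bits per weight, overhead ignored

print(f"fp16: {fp16_gib:.1f} GiB, 4-bit: {q4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights take a quarter of the fp16 footprint, which is often the difference between a model fitting on a consumer card or not.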

itake (OP) · 3y ago

I only have an M1.

rain1 · 3y ago

I don't think the integrated GPU on that supports CUDA, so you will need to use CPU mode only.
