
0 points · tyfon · 3y ago · 0 comments

llama.cpp does not use the GPU; it runs fine on the CPU (if the CPU is fast enough).

I've scoured the web page for RAM requirements for the various models but can't find anything. Will it be able to run, say, the 30B Open Assistant LLaMA or the 65B raw LLaMA model on a consumer GPU (say, a 3060 with 12 GB of VRAM) using this?
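A rough way to approach the question above is to estimate the size of the weights alone at a few bit widths and compare that with the 12 GB card. This is only a back-of-envelope sketch: the parameter counts are the nominal "30B" and "65B" figures, the function name is made up for illustration, and activations, the KV cache, and runtime overhead are ignored, so real requirements will be higher.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


VRAM_GB = 12.0  # the 3060 mentioned in the question

# Nominal parameter counts (assumption); real checkpoints differ slightly.
for name, n_params in [("30B", 30e9), ("65B", 65e9)]:
    for bits in (16, 8, 4, 3):
        size = weight_footprint_gb(n_params, bits)
        verdict = "fits" if size <= VRAM_GB else "does not fit"
        print(f"{name} @ {bits}-bit: {size:.1f} GB, {verdict} in {VRAM_GB:.0f} GB")
```

Under these assumptions only the 30B model at around 3 bits per weight squeezes under 12 GB, and that leaves almost no room for anything besides the weights.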

Not trying to take anything away, but I feel the README etc. is very lacking in actual technical details unless you read through the code or actually test it.


5 comments · 3 top-level

junrushao1994 · 3y ago · 2 in thread

Thanks for the feedback! This is definitely something we need to do. To share some data: the default model is currently Vicuna-7b, aggressively quantized to 2.9 GB.

We are expanding coverage to more models. In particular, Dolly and StableLM are just around the corner and only need some cleanup work.

As a fresh new project, we are just starting to collect data points on which GPU models are well supported and to fix the issues being reported. Please don't hesitate to report problems in our GitHub issues!

tyfon (OP) · 3y ago

I see. The 2.9 GB requirement seems to imply 3-bit weights?
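The arithmetic behind that guess can be sketched by dividing the quoted size by the parameter count. This assumes a nominal 7e9 parameters, reads "2.9 GB" as 2.9e9 bytes, and treats the whole file as weights; none of that is confirmed in the thread, and the function name is only illustrative.

```python
def implied_bits_per_weight(size_bytes: float, n_params: float) -> float:
    """Average bits per parameter if the whole file were weights."""
    return size_bytes * 8 / n_params


# Assumptions: 2.9 GB means 2.9e9 bytes, and Vicuna-7b has a nominal 7e9 parameters.
bits = implied_bits_per_weight(2.9e9, 7e9)
print(f"{bits:.2f} bits per weight")
```

The result lands between 3 and 4 bits per weight, which is consistent with 3-bit weights plus some per-group metadata or a few tensors kept at higher precision, though the thread does not say which scheme is used.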

In any case, I am happy to see these projects taking form. Perhaps one could eventually make the level of quantization dynamic, based on the available VRAM etc. :)

I will definitely play around with it (on Linux though, not a phone!)

int_19h · 3y ago

When people tried 3-bit quantization for 7B models before, it did not exactly go well; the side effects were detrimental. Are you using some new quantization techniques that mitigate that?

eulers_secret · 3y ago

The LocalLLaMA subreddit wiki has good info about RAM requirements: https://www.reddit.com/r/LocalLLaMA/wiki/models/

azeirah · 3y ago

llama.cpp recently added partial GPU acceleration. Model dequantization as well as some BLAS operations have been moved to the GPU.

It runs a lot faster if you compile with cuBLAS (NVIDIA) or CLBlast (other vendors). The amount of GPU VRAM doesn't matter much, since it doesn't offload the model to VRAM.
