Better HN

Eisenstein · 2y ago · 0 points

You can load a 7B-parameter model quantized to Q4_K_M in GGUF format. I don't know ollama, but you can load it in koboldcpp: use cuBLAS, set GPU layers to 100 and context to 2048, and it should all fit into 8GB of VRAM. For quantized models, look at TheBloke on Hugging Face. Mistral 7B is a good one to try.
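For a rough sanity check on the 8GB claim, here is a back-of-the-envelope estimate. The numbers are my assumptions, not something stated above: about 4.85 bits per weight for Q4_K_M, Mistral 7B's published shape (roughly 7.24e9 parameters, 32 layers, 8 KV heads, head dimension 128), and an fp16 KV cache. It ignores compute buffers and whatever else is using the GPU, so treat it as a lower bound.

```python
# Rough VRAM estimate for a 7B model at Q4_K_M with a 2048-token context.
# Assumptions (not from the comment): ~4.85 bits per weight for Q4_K_M,
# Mistral 7B's shape (about 7.24e9 parameters, 32 layers, 8 KV heads,
# head dimension 128), and an fp16 KV cache (2 bytes per element).

GIB = 1024 ** 3


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in bytes."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: int = 2) -> int:
    """Keys plus values (the factor of 2) for every layer and position."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


weights = weight_bytes(7.24e9, 4.85)
kv = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=2048)
total = weights + kv

print(f"weights ~{weights / GIB:.2f} GiB")
print(f"KV cache ~{kv / GIB:.2f} GiB")
print(f"total ~{total / GIB:.2f} GiB (before compute buffers)")
```

Under those assumptions the weights dominate and the KV cache at 2048 context is small, which is consistent with everything fitting on an 8GB card with room to spare.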


1 comment · 1 top-level

dizhn · 2y ago

If I am not mistaken, layer offloading is a llama.cpp feature, so many of the frontends/loaders built on it support it too. I use it with koboldcpp and text-generation-webui.
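As an illustration of the same knob in another llama.cpp-based loader, here is a minimal sketch using the llama-cpp-python bindings. That package is my choice for the example and is not mentioned in the thread; the helper names and the model path are hypothetical placeholders.

```python
# Sketch of layer offloading through llama-cpp-python (an assumed loader,
# not one named in the thread). Helper names and the model path are
# hypothetical; only llama_cpp.Llama and its keyword arguments are real.


def offload_kwargs(model_path: str, gpu_layers: int = -1,
                   context: int = 2048) -> dict:
    """Build keyword arguments for llama_cpp.Llama.

    n_gpu_layers=-1 asks llama.cpp to offload every layer to the GPU;
    a smaller positive number keeps the remaining layers on the CPU.
    """
    return {
        "model_path": model_path,
        "n_gpu_layers": gpu_layers,
        "n_ctx": context,
    }


def load_model(model_path: str, gpu_layers: int = -1, context: int = 2048):
    """Load a GGUF model with the requested number of layers on the GPU."""
    # Imported lazily so offload_kwargs works without the package installed.
    from llama_cpp import Llama
    return Llama(**offload_kwargs(model_path, gpu_layers, context))
```

Calling `load_model("path/to/model.gguf")` would try to put every layer on the GPU. If that runs out of VRAM, passing a smaller `gpu_layers` value splits the model between GPU and CPU.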
