Over the last few months I've been working with some folks on a tool named Ollama (https://github.com/jmorganca/ollama) to run open-source LLMs like Llama 2, Code Llama and Falcon locally, starting with macOS.
The biggest ask since then has been "how can I run Ollama on Linux?" with GPU support out of the box. Setting up and configuring CUDA and then compiling and running llama.cpp (which is a fantastic library and runs under the hood) can be quite painful on different combinations of Linux distributions and Nvidia GPUs. The goal for Ollama's Linux version was to automate this process to make it easy to get up and running.
This is the first Linux release! There's still lots to do, but I wanted to share it here to see what everyone thinks. Thanks to anyone who has given it a try and sent feedback!
I get the boot concern, and the maintenance concern (!!!), but as you say, these models are already quite huge anyway :)
They basically just ship executables for different llama.cpp backends and select the correct one with a python script, which is fine, as the executables are really small.
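For anyone curious what that kind of selection script might look like, here is a minimal sketch. It is not the project's actual script: the executable names and the probe tools used to guess the hardware are assumptions picked for illustration.

```python
import shutil
from typing import Callable, Optional

# Hypothetical executable names; the real project's file names may differ.
# Each entry pairs a probe tool with the backend build to use if it is found.
BACKENDS = [
    ("nvidia-smi", "llama-cuda"),   # NVIDIA driver tools present -> CUDA build
    ("rocminfo", "llama-rocm"),     # ROCm tools present -> ROCm build
    ("clinfo", "llama-opencl"),     # OpenCL tools present -> OpenCL build
]
FALLBACK = "llama-cpu"              # plain CPU build if nothing else matches


def pick_backend(which: Callable[[str], Optional[str]] = shutil.which) -> str:
    """Return the first backend whose probe tool is found on PATH."""
    for probe, executable in BACKENDS:
        if which(probe) is not None:
            return executable
    return FALLBACK


if __name__ == "__main__":
    print(pick_backend())
```

Checking for a tool on PATH is only a rough heuristic (a tool can be installed without a working GPU), so a real launcher would likely want a fallback if the chosen executable fails to start.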
https://github.com/YellowRoseCx/koboldcpp-rocm
Some other projects support rocm less explicitly, and not as easily.
I saw this on HN before, but I thought it was another from-scratch llama implementation... Which is fine, but much less interesting to me, as a from-scratch implementation is probably not as fast/feature packed as llama.cpp or the TVM implementation.
Keeping up with llama.cpp's rapid evolution is very difficult, and there's a need for projects like this.
* https://github.com/jmorganca/ollama/blob/main/docs/api.md
* https://github.com/ggerganov/llama.cpp/blob/master/examples/...
Some like c0sogi/llama-api are pretty neat because they support concurrency and multiple backends (llama.cpp and Exllama, although it could be expanded).
While you might lose out on some low-level configurability, being able to easily swap between OpenAI and local models is a big win in my book.
- https://github.com/openai/triton
- https://github.com/NVIDIA/TensorRT
TVM and other compiler-based approaches seem to perform really well and make supporting different backends really easy. A good friend who's been in this space for a while told me llama.cpp is sort of a "hand crafted" version of what these compilers could output, which I think speaks to the craftsmanship Georgi and the ggml team have put into llama.cpp, but also the opportunity to "compile" versions of llama.cpp for other model architectures or platforms.
https://github.com/vllm-project/vllm
As Ollama uses a llama.cpp fork on the backend, I'd expect its memory usage to be very similar to that.
Somewhat related note -- does anyone know what the performance differences are for GPU-only inference using this loader (llama.cpp + GGUF/GGML models) vs exllama using GPTQ? My understanding is that exllama/GPTQ gets a lot higher tok/s on a consumer GPU like a [34]090.
It would save me many gigabytes of test downloads if someone knew.
I'd say that you should pick the backend that has the quantized models or other features (sampler, context window, API compatibility, etc) that suit you best.
But the EXL2 quantization is very new, and you will have to quantize many models yourself.
But it's missing some killer features of llama.cpp, like grammar based sampling.
Current standing is something like:
- vLLM is the fastest overall with batching, and has decent (but not SOTA) 4 bit quantization.
- Llama.cpp has the best hybrid CPU/GPU inference by far, has the most bells and whistles, has good and very flexible quantization, and is reasonably fast in CUDA without batching (but is getting batching soon). It has opencl and rocm backends, but support is focused on CUDA/Metal/CPU. It's the best backend for dGPUs that won't fit the whole model, and is otherwise a jack of all trades.
- MLC-LLM (with the TVM Vulkan backend) is the king of speed on IGPs, mobile devices and AMD/Intel dGPUs without having to fuss with a ROCM install. It's extremely fast on Nvidia dGPUs even without CUDA. It theoretically has "easy" support for webGPU and exotic hardware like FPGAs or AI blocks. But its 4-bit quantization was not as good as llama.cpp, last I checked.
- exLLAMAv2 has, by far, the best quantization for squeezing models onto small GPUs, and is the fastest CUDA (and ROCM?) backend with no batching. It's feature rich with a frontend like text-gen-ui.
- Plain HF Transformers is... a fine default, but the master of none. The best use case is probably for testing research implementations.
He has benchmarks on an A6000 which should be roughly in line w/ a 3090 if you want to compare to my numbers (I test mlc as well, although my 3090 results are slower since I'm testing a llama2-7b @ 4K context and mlc currently slows down significantly w/ longer context): https://docs.google.com/spreadsheets/d/1kT4or6b0Fedd-W_jMwYp...
I haven't done benchmarking vs. vLLM, but it's quite fast; in my tests on an A100-80g w/ llama2 70b I was getting over 25 tok/sec which is just mind blowing. I was even getting around 30 tok/sec on llama2 7b on an old GTX 1070, which is equally crazy.
Show HN: Ollama – Run LLMs on your Mac - https://news.ycombinator.com/item?id=36802582 - July 2023 (94 comments)
(about this see https://news.ycombinator.com/showhn.html)
We needed to change the main architecture to support different GPUs out-of-the-box. We thought this was ShowHN worthy as other tools require users to manually install nvidia toolkit / drivers. [It sounds really simple, but to do it across the board on different distros was a lot of work]
Also curious, do you plan to support speculative sampling if/when the feature is merged into llama.cpp? Excited about the possibility of running a 34b at high speeds on a standard laptop
What about https://github.com/ggerganov/llama.cpp ?
It compiles and runs easily on Linux.
Though doesn't currently support GPU.
for those that haven't used ollama, being able to specify how a model behaves via a "modelfile" is pretty darned awesome. I have a chef, a bartender, and a programmer that I use, personally.
makes it very convenient.
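To make the "modelfile" idea concrete, here is a small sketch that writes one out from Python. The `FROM`, `PARAMETER` and `SYSTEM` instructions are my understanding of the Modelfile format; the persona text, the temperature value and the `render_modelfile` helper are made up for illustration, so check the Ollama docs before relying on it.

```python
def render_modelfile(base_model: str, system_prompt: str, temperature: float = 0.8) -> str:
    """Build the text of a Modelfile that gives a base model a persona.

    Assumes the system prompt does not itself contain triple quotes.
    """
    return (
        f"FROM {base_model}\n"
        f"PARAMETER temperature {temperature}\n"
        f'SYSTEM """\n{system_prompt}\n"""\n'
    )


if __name__ == "__main__":
    # Hypothetical "chef" persona, in the spirit of the comment above.
    text = render_modelfile(
        "llama2",
        "You are a professional chef. Answer with practical cooking advice.",
        temperature=0.7,
    )
    with open("Modelfile", "w", encoding="utf-8") as fh:
        fh.write(text)
```

After that, something like `ollama create chef -f ./Modelfile` followed by `ollama run chef` should give you the persona as its own named model.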
Getting started was literally as easy as:
pacman -S ollama
ollama serve
ollama run llama2:13b 'insert prompt'
You guys are doing the lord's work here
How? By forcing users into a custom model serialization format (GGUF) that is claimed to literally contain "magic"?
Dark lord, maybe.
As a solutions developer not so much interested in training models but leveraging them in a pipeline, I hadn’t bothered to try to run anything locally due to the complexity of setup, even with llama.cpp. You enabled me to be up and running in just a few minutes.
As an app dev, we have 2 choices:
(1) Build our own support for LLMs, GPU/CPU execution, model downloading, inference optimizations, etc.
(2) Just tell users "run Ollama" and have our app hit the Ollama API on localhost (or shell out to `ollama`).
Obviously choice 2 is much, much simpler. There are some things in the middle, like less polished wrappers around llama.cpp, but Ollama is the only thing that 100% of the people I've told about it have been able to install without any problems.
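As a rough idea of what choice 2 looks like from the app side, here is a sketch using only the Python standard library. It assumes Ollama is listening on its default local port (11434) and that `/api/generate` streams one JSON object per line with `response` and `done` fields, which is my reading of the API docs; the helper names are mine, not part of Ollama.

```python
import json
import urllib.request
from typing import Iterable, Iterator

# Assumed default local endpoint; adjust if your install is configured differently.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build the POST request carrying the model name and prompt as JSON."""
    body = json.dumps({"model": model, "prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def iter_tokens(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a stream of newline-delimited JSON chunks."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        text = chunk.get("response", "")
        if text:
            yield text
        if chunk.get("done"):
            break  # the final chunk signals the end of the stream


def generate(model: str, prompt: str) -> str:
    """Send a prompt to the local server and return the full completion."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return "".join(iter_tokens(resp))


if __name__ == "__main__":
    print(generate("llama2", "Why is the sky blue?"))
```

Keeping the parsing in `iter_tokens` separate from the network call makes it easy to show tokens in a UI as they arrive, and to test the parsing without a running server.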
That's huge because it's finally possible to build real apps that use local LLMs—and still reach a big userbase. Your userbase is now (pretty much) "anyone who can download and run a desktop app and who has a relatively modern laptop", which is a big population.
I'm really excited to see what people build on Ollama.
(And Ollama will simplify deploying server-side LLM apps as well, but right now from participating in the community, it seems most people are only thinking of it for local apps. I expect that to change when people realize that they can ship a self-contained server app that runs on a cheap AWS/GCP instance and uses an Ollama-executed LLM for various features.)
[1] Shameless plug for the WIP PR where I'm implementing Ollama support in Cody, our code AI app: https://github.com/sourcegraph/cody/pull/905.
This is either for backup purpose, or to share model files with other applications. Those model files are large!