Ollama for Linux – Run LLMs on Linux with GPU Acceleration

(github.com)

173 points · jmorgan · 2y ago · 54 comments

Hi HN,

Over the last few months I've been working with some folks on a tool named Ollama (https://github.com/jmorganca/ollama) to run open-source LLMs like Llama 2, Code Llama and Falcon locally, starting with macOS.

The biggest ask since then has been "how can I run Ollama on Linux?" with GPU support out of the box. Setting up and configuring CUDA and then compiling and running llama.cpp (which is a fantastic library and runs under the hood) can be quite painful on different combinations of linux distributions and Nvidia GPUs. The goal for Ollama's linux version was to automate this process to make it easy to get up and running.

This is the first Linux release! There's still lots to do, but I wanted to share it here to see what everyone thinks. Thanks to anyone who has given it a try and sent feedback!

47 comments · 16 top-level

jrm4 · 2y ago · 7 in thread

Very cool. Does anyone know exactly how out of luck us AMD folk are? I know there are efforts out there, but I'm kind of hoping for something "as easy as this?"

Patrick_Devine · 2y ago

We don't currently compile in CLBlast or ROCm support, but if there's a lot of demand for this, we'll definitely add it in the future. One concern is not wanting to bloat the binary size too much (CUDA is already huge!), but given how big the LLM models are anyway, maybe it's not a huge concern.

globuous · 2y ago

AMD support would be amazing <3

I get the bloat concern, and the maintenance concern (!!!), but as you say, these models are already quite huge anyway :)

capableweb · 2y ago

Offer two builds :) One AMD and one NVIDIA.

brucemacd · 2y ago

It could be possible to take a similar installer approach in the future for cuBLAS or hipBLAS for AMD GPUs, so there is hope.

brucethemoose2 · 2y ago

Koboldcpp does this: https://github.com/LostRuins/koboldcpp/releases/tag/v1.44.2

They basically just ship executables for different llama.cpp backends and select the correct one with a python script, which is fine, as the executables are really small.
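
A minimal sketch of that kind of selection step, not Koboldcpp's actual script. The binary names are hypothetical, and checking whether `nvidia-smi` or `rocminfo` is on the PATH is only an assumed, rough way to guess which GPU stack is installed.

```python
import shutil
from typing import Callable, Optional

# Hypothetical executable names; a real project would ship its own.
# Each entry pairs a probe tool with the backend build to launch if it is found.
BACKENDS = [
    ("nvidia-smi", "llama-cuda"),  # NVIDIA driver tools present -> CUDA build
    ("rocminfo", "llama-rocm"),    # ROCm tools present -> hipBLAS build
]
CPU_FALLBACK = "llama-cpu"


def pick_backend(which: Callable[[str], Optional[str]] = shutil.which) -> str:
    """Return the name of the executable to launch on this machine."""
    for probe, binary in BACKENDS:
        if which(probe) is not None:
            return binary
    return CPU_FALLBACK


if __name__ == "__main__":
    print(pick_backend())
```

Passing `which` as a parameter keeps the probe swappable, so the choice can be checked without the GPU tools installed.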

spmurrayzzz · 2y ago

llama.cpp added ROCm linux support a little over a month ago. Details on that can be found in their readme: https://github.com/ggerganov/llama.cpp#hipblas (no luck for windows users though)

brucethemoose2 · 2y ago

This one is basically SOTA for AMD, if you can install rocm properly:

https://github.com/YellowRoseCx/koboldcpp-rocm

Some other projects support rocm less explicitly, and not as easily.

brucethemoose2 · 2y ago · 6 in thread

Oh, this is a llama.cpp frontend. Y'all should have led with that!

I saw this on HN before, but I thought it was another from-scratch llama implementation... Which is fine, but much less interesting to me, as a from-scratch implementation is probably not as fast or feature-packed as llama.cpp or the TVM implementation.

Keeping up with llama.cpp's rapid evolution is very difficult, and there's a need for projects like this.

mgreg · 2y ago

I'm actually using Ollama for its REST API endpoint. Llama.cpp does now have its own server implementation. Unfortunately, the two have different endpoints and behave a little differently.

* https://github.com/jmorganca/ollama/blob/main/docs/api.md

* https://github.com/ggerganov/llama.cpp/blob/master/examples/...

lhl · 2y ago

I put together a list of OpenAI API compatibility layers for local LLMs recently: https://llm-tracker.info/books/llms/page/openai-api-compatib...

Some, like c0sogi/llama-api, are pretty neat because they support concurrency and multiple backends (llama.cpp and Exllama, although it could be expanded).

While you might lose out on some low-level configurability, being able to easily swap between OpenAI and local models is a big win in my book.

jmorgan (OP) · 2y ago

There's a ton of cool opportunity in the runtime layer. I've been keeping my eye on the compiler-based approaches. From what I've gathered many of the larger "production" inference tools use compilers:

- https://github.com/openai/triton

- https://github.com/NVIDIA/TensorRT

TVM and other compiler-based approaches seem to perform really well and make supporting different backends really easy. A good friend who's been in this space for a while told me llama.cpp is sort of a "hand crafted" version of what these compilers could output, which I think speaks to the craftsmanship Georgi and the ggml team have put into llama.cpp, but also to the opportunity to "compile" versions of llama.cpp for other model architectures or platforms.

andy99 · 2y ago

Hi, FYI I am working on a from-scratch implementation in Fortran (currently focused on llama2 on Linux). It is CPU only, but in my initial tests about as fast as llama.cpp. It currently has fp16 and 4-bit quantization, and I hope to finish support for ggml files this week. I based it on Karpathy's llama2.c, so it uses that format for now, which is not great. Llama.cpp is the leader and has more diverse hardware support, but for CPU inference and simplicity (the complexity of llama.cpp has exploded) I believe there is still room for competition. https://github.com/rbitr/llama2.f90 Once I make it a bit easier to use I want to promote it more.

UncleOxidant · 2y ago

Where does one find this TVM implementation you mention?

capableweb · 2y ago

Maybe they're talking about https://github.com/mlc-ai/mlc-llm which is used for web-llm (https://github.com/mlc-ai/web-llm)? Seems to be using TVM.

aftbit · 2y ago · 3 in thread

How does this compare to vLLM or exllama? Can it run llama2 30B on one 3090 24G or 70B on two 3090 24G?

https://github.com/vllm-project/vllm

https://github.com/turboderp/exllama

https://github.com/turboderp/exllamav2

harph · 2y ago

Llama2 was not released with 30B parameters, or was it?

lhl · 2y ago

While the llama2-34b base model hasn't been released, CodeLlama2 is effectively a fine-tuned version of 34b and there are some people working with that.

As Ollama uses a llama.cpp fork on the backend, I'd expect its memory usage to be very similar to that.

aftbit · 2y ago

Oh nope you are 100% correct, I was thinking of the first llama. My buddy is running the 70B llama 2 on two 3090s and the 30B llama 1 on one 3090.
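
As a rough check on why those pairings work, here is a back-of-the-envelope sketch of the weights-only memory footprint. The flat 4 bits per weight and the 2 GiB of per-GPU overhead are assumptions made for illustration. Real quantization formats use slightly more than their nominal bit width, and the KV cache grows with context length.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits(params_billion: float, bits_per_weight: float, gpus: int,
         vram_gib_each: float = 24.0, overhead_gib: float = 2.0) -> bool:
    """True if the weights fit in total VRAM after a flat per-GPU overhead.

    overhead_gib is an assumed allowance for context, buffers and the driver.
    """
    budget = gpus * (vram_gib_each - overhead_gib)
    return weight_gib(params_billion, bits_per_weight) <= budget


if __name__ == "__main__":
    for params, gpus in [(30, 1), (70, 1), (70, 2)]:
        size = weight_gib(params, 4)
        print(f"{params}B at 4-bit: {size:.1f} GiB, fits on {gpus}x 24G: {fits(params, 4, gpus)}")
```

Under these assumptions a 4-bit 70B model is roughly 33 GiB of weights, which is too large for one 24G card but fits across two. A 4-bit 30B model is roughly 14 GiB and fits on one.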

kelvie · 2y ago · 3 in thread

Amazing! I use text-generation-webui to play with LLMs, but was always jealous of this much simpler interface.

Somewhat related note -- does anyone know what the performance differences are for GPU-only inference using this loader (llama.cpp + GGUF/GGML models) vs exllama using GPTQ? My understanding is that exllama/GPTQ gets a lot higher tok/s on a consumer GPU like a [34]090.

Would save me many gigabytes of downloads of testing if someone knew.

lhl · 2y ago

The numbers are always changing, but from my testing, they're close enough that it doesn't really matter. My most recent benchmarks: https://docs.google.com/spreadsheets/d/1kT4or6b0Fedd-W_jMwYp...

I'd say that you should pick the backend that has the quantized models or other features (sampler, context window, API compatibility, etc) that suits you best.

qeternity · 2y ago

It is hardware and use case dependent but I would say roughly that ExLlama is 10-20% faster than llama.cpp and ExLlama v2 is 10-20% faster than ExLlama (my experiences at 4 bit quantization).
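
Taking those two ranges at face value, the relative speedups multiply rather than add. A small sketch of the arithmetic, using only the 10-20% figures quoted above:

```python
def compound(*speedups: float) -> float:
    """Combine successive relative speedups (0.10 means 10% faster) into one."""
    total = 1.0
    for s in speedups:
        total *= 1.0 + s
    return total - 1.0


low = compound(0.10, 0.10)   # both steps at the low end of the range
high = compound(0.20, 0.20)  # both steps at the high end of the range

if __name__ == "__main__":
    print(f"ExLlama v2 vs llama.cpp: {low:.0%} to {high:.0%} faster")
```

So the implied gap between ExLlama v2 and llama.cpp is about 21% to 44%, with the same hardware and use-case caveats.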

brucethemoose2 · 2y ago

exLLAMAv2 and GPTQ are different implementations, and currently exLLAMA is definitely the best for discrete Nvidia (and AMD?) GPUs. It's faster and uses less VRAM for the same perplexity than pretty much anything else.

But the EX2 quantization is very new, and you will have to quantize many models yourself.

But it's missing some killer features of llama.cpp, like grammar-based sampling.

binarymax · 2y ago · 3 in thread

Congrats on the launch! I'll give it a try. I've been using vLLM on Linux so far but have wanted to be able to use a ggml backend - have you done any perf comparisons?

brucethemoose2 · 2y ago

vLLM has far more throughput with batching, but that is also a WIP feature for llama.cpp.

Current standing is something like:

- vLLM is the fastest overall with batching, and has decent (but not SOTA) 4 bit quantization.

- Llama.cpp has the best hybrid CPU/GPU inference by far, has the most bells and whistles, has good and very flexible quantization, and is reasonably fast in CUDA without batching (but is getting batching soon). It has OpenCL and ROCm backends, but support is focused on CUDA/Metal/CPU. It's the best backend for dGPUs that won't fit the whole model, and is otherwise a jack of all trades.

- MLC-LLM (with the TVM Vulkan backend) is the king of speed on IGPs, mobile devices and AMD/Intel dGPUs without having to fuss with a ROCm install. It's extremely fast on Nvidia dGPUs even without CUDA. It theoretically has "easy" support for WebGPU and exotic hardware like FPGAs or AI blocks. But its 4-bit quantization was not as good as llama.cpp's, last I checked.

- exLLAMAv2 has, by far, the best quantization for squeezing models onto small GPUs, and is the fastest CUDA (and ROCm?) backend with no batching. It's feature rich with a frontend like text-gen-ui.

- Plain HF Transformers is... a fine default, but the master of none. The best use case is probably for testing research implementations.

lhl · 2y ago

Hamel Husain hasn't done testing vs llama.cpp, but this still might be of interest (includes mlc which is roughly in line w/ llama.cpp batch=1 perf): https://hamel.dev/notes/llm/inference/03_inference.html

He has benchmarks on an A6000 which should be roughly in line w/ a 3090 if you want to compare to my numbers (I test mlc as well, although my 3090 results are slower since I'm testing a llama2-7b @ 4K context and mlc currently slows down significantly w/ longer context): https://docs.google.com/spreadsheets/d/1kT4or6b0Fedd-W_jMwYp...

Patrick_Devine · 2y ago

One nice thing about Ollama vs. stock llama.cpp is Ollama supports both ggml and gguf models. If you've still got a lot of old ggml bins around you can easily create a model file and use them.

I haven't done benchmarking vs. vLLM, but it's quite fast; in my tests on an A100-80g w/ llama2 70b I was getting over 25 tok/sec, which is just mind blowing. I was even getting around 30 tok/sec on llama2 7b on an old GTX 1070, which is equally crazy.

dang · 2y ago · 2 in thread

It looks like great work but this isn't different enough from the recent Show HN to make a new Show HN:

Show HN: Ollama – Run LLMs on your Mac - https://news.ycombinator.com/item?id=36802582 - July 2023 (94 comments)

(about this see https://news.ycombinator.com/showhn.html)

mchiang · 2y ago

Hey Dang, sorry about this. Just wanted to clarify that this was a major overhaul to Ollama. In the past, we did not support Linux with GPU support.

We needed to change the main architecture to support different GPUs out-of-the-box. We thought this was ShowHN worthy as other tools require users to manually install nvidia toolkit / drivers. [It sounds really simple, but to do it across the board on different distros was a lot of work]

jmorgan (OP) · 2y ago

Thanks Dang and sorry!!

sestinj · 2y ago · 2 in thread

This is a huge deal, congrats! We've had a ton of users asking how to run their own LLMs on Linux and the unfortunate answer was always that the existing options were slightly complicated. Having a single-click to download option is going to open this up for so many more people! If anyone is looking for a way to use Ollama inside VS Code, one option (what I've been working on) is https://continue.dev

Also curious, do you plan to support speculative sampling if/when the feature is merged into llama.cpp? Excited about the possibility of running a 34b at high speeds on a standard laptop

dorfsmay · 2y ago

> run their own LLMs on Linux and the unfortunate answer was always that the existing options were slightly complicated

What about https://github.com/ggerganov/llama.cpp ?

It compiles and runs easily on Linux.

vorticalbox · 2y ago

Or https://gpt4all.io/index.html

Though doesn't currently support GPU.

jerrysievert · 2y ago · 2 in thread

this is awesome, congrats on an amazingly useful release!

for those that haven't used ollama, being able to specify how a model behaves via a "modelfile" is pretty darned awesome. I have a chef, a bartender, and a programmer that I use, personally.

all2 · 2y ago

Do you run these separately or as a mixture of experts? Are they fine-tuned for these behaviours or just prompted to behave a certain way?

jerrysievert · 2y ago

they are just prompted differently - you can choose the model and a prompt, and ollama presents it as its own "model", so instead of `llama2:7b`, it gets presented as `bartender:latest`.

makes it very convenient.
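
A minimal sketch of what such a persona could look like, assuming the Modelfile `FROM` and `SYSTEM` instructions. The helper function and the bartender prompt text are hypothetical and only for illustration.

```python
def render_modelfile(base_model: str, system_prompt: str) -> str:
    """Build Modelfile text: a base model plus a system prompt."""
    return f'FROM {base_model}\nSYSTEM """{system_prompt}"""\n'


if __name__ == "__main__":
    # Hypothetical persona text, not taken from the thread.
    print(render_modelfile(
        "llama2:7b",
        "You are a friendly bartender who suggests cocktails.",
    ))
    # Save the output as ./Modelfile, then from a shell:
    #   ollama create bartender -f ./Modelfile
    #   ollama run bartender
```

The weights stay those of the base model. Only the prompt wrapper changes, which is why it can be presented under its own name.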

ForkMeOnTinder · 2y ago · 1 in thread

Huge fan of ollama. Although this is the first official linux release, I've been using it on linux already for a few months now with no issues (through the arch package which builds from source).

Getting started was literally as easy as:

  pacman -S ollama
  ollama serve
  ollama run llama2:13b 'insert prompt'

You guys are doing the lord's work here

factibicongue · 2y ago

"You guys are doing the lord's work here"?

How? By forcing users into a custom model serialization format (GGUF) that is claimed to literally contain "magic"?

Dark lord, maybe.

biddit · 2y ago · 1 in thread

I’ve been using this on my MacBook Pro for the last couple weeks and want to say thank you!

As a solutions developer not so much interested in training models but leveraging them in a pipeline, I hadn’t bothered to try to run anything locally due to the complexity of setup, even with llama.cpp. You enabled me to be up and running in just a few minutes.

jerrysievert · 2y ago

give Dumbar a try, since you're on macOS! https://github.com/JerrySievert/Dumbar

politelemon · 2y ago · 1 in thread

How is WSL2 able to work with GPUs?

Patrick_Devine · 2y ago

NVidia provides driver support for CUDA inside of WSL2. More details are here: https://docs.nvidia.com/cuda/wsl-user-guide/index.html

sqs · 2y ago

Ollama is awesome. I am part of a team building a code AI application[1], and we want to give devs the option to run it locally instead of only supporting external LLMs from Anthropic, OpenAI, etc. Those big remote LLMs are incredibly powerful and probably the right choice for most devs, but it's good for devs to have a local option as well—for security, privacy, cost, latency, simplicity, freedom, etc.

As an app dev, we have 2 choices:

(1) Build our own support for LLMs, GPU/CPU execution, model downloading, inference optimizations, etc.

(2) Just tell users "run Ollama" and have our app hit the Ollama API on localhost (or shell out to `ollama`).

Obviously choice 2 is much, much simpler. There are some things in the middle, like less polished wrappers around llama.cpp, but Ollama is the only thing that 100% of people I've told about have been able to install without any problems.
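
For a sense of how small choice 2 can be, here is a sketch that uses only the Python standard library. It assumes a local Ollama server on its default port and the `/api/generate` endpoint, which streams newline-delimited JSON objects carrying `response` and `done` fields. The helper names are made up for this example.

```python
import json
import urllib.request
from typing import Iterable

# Assumed default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Prepare a POST request asking the server to complete a prompt."""
    body = json.dumps({"model": model, "prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def join_stream(lines: Iterable[bytes]) -> str:
    """Concatenate the "response" fragments of a newline-delimited JSON stream."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


def generate(model: str, prompt: str) -> str:
    """Send the prompt and return the full completion (needs a running server)."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return join_stream(resp)
```

With a server running and a model pulled, `generate("llama2", "Why is the sky blue?")` would return the completion as one string.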

That's huge because it's finally possible to build real apps that use local LLMs—and still reach a big userbase. Your userbase is now (pretty much) "anyone who can download and run a desktop app and who has a relatively modern laptop", which is a big population.

I'm really excited to see what people build on Ollama.

(And Ollama will simplify deploying server-side LLM apps as well, but right now from participating in the community, it seems most people are only thinking of it for local apps. I expect that to change when people realize that they can ship a self-contained server app that runs on a cheap AWS/GCP instance and uses an Ollama-executed LLM for various features.)

[1] Shameless plug for the WIP PR where I'm implementing Ollama support in Cody, our code AI app: https://github.com/sourcegraph/cody/pull/905.

WiSaGaN · 2y ago

Ollama is awesome. I am however still waiting for the support of controlling the model cache location: https://github.com/jmorganca/ollama/issues/153

This is either for backup purpose, or to share model files with other applications. Those model files are large!

hathym · 2y ago

There is also https://faraday.dev/

aglazer · 2y ago

Ollama is fantastic. Thanks for building it!

agilob · 2y ago

On NVidia GPU*
