Nowadays you get TTS, STT, and text and image generation, and image editing should also be possible. It can run via ROCm, Vulkan, or on CPU, GPU, and NPU, so quite a lot of options. They keep a good, pragmatic pace of development. Really recommend this for AMD hardware!
Edit: The OpenAI-compatible (and, I think, nowadays also Ollama-compatible) endpoints let me use it in VS Code Copilot as well as in e.g. Open WebUI. More options are shown in their docs.
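For example, any OpenAI-compatible client can target the local server just by changing the base URL. A minimal sketch in Python using only the standard library; the port, path, and model name here are assumptions, so check the Lemonade docs for your actual defaults:

```python
import json
import urllib.request

# Assumed values -- adjust BASE_URL and MODEL to match what your
# local Lemonade server actually exposes.
BASE_URL = "http://localhost:8000/api/v1"
MODEL = "qwen3.5-27b"  # hypothetical model id

def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style /chat/completions request for a local server."""
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("Hello")
# urllib.request.urlopen(req) would send it once the server is running.
```

Anything that speaks the OpenAI wire format (Copilot's custom endpoint setting, Open WebUI's OpenAI connection, etc.) works the same way.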
On the performance side, Lemonade comes bundled with ROCm and Vulkan builds of llama.cpp, sourced from https://github.com/lemonade-sdk/llamacpp-rocm and https://github.com/ggml-org/llama.cpp/releases respectively.
Lemonade has a Web UI to set the context size and llama.cpp args. You need to set the context to a proper number, or just to 0 so it uses the model's default. If it's too low, it won't work with agentic coding.
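The "0 means use the default" convention described above can be expressed as a tiny helper. This is a sketch of the rule, not Lemonade's actual code:

```python
def effective_context(requested: int, model_default: int) -> int:
    """Resolve a context-size setting where 0 means 'use the model default'.

    Sketch of the convention described above; not Lemonade's actual code.
    """
    if requested < 0:
        raise ValueError("context size cannot be negative")
    return model_default if requested == 0 else requested

# Agentic coding tools stuff large system prompts and tool schemas into
# the context, so a few thousand tokens is usually not enough.
assert effective_context(0, 32768) == 32768
assert effective_context(4096, 32768) == 4096
```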
I will try some Claw app, but first I need to research the field a bit. Meanwhile I am using different models in Open WebUI. GPT-OSS 120B is fast, but Qwen3.5 27B is fine too.
The 27B is supposed to be really good, but it's so slow I gave up on it (11-12 tg/s at Q4).
Running Qwen3.5 122B at 35 t/s as a daily driver using Vulkan llama.cpp on kernel 7.0.0rc5 on a Framework Desktop board (Strix Halo 128).
Also a pair of AMD AI Pro R9700 cards as my workhorses for zimageturbo, Qwen TTS/ASR, and other accessory functions and experiments.
Finally, I have a Radeon 6900 XT running Qwen3.5 32B at 60+ t/s as a fast all-arounder.
If I buy anything NVIDIA, it will be only for compatibility testing. AMD hardware is 100% the best option now for cost, freedom, and security for home users.
The interesting part to me isn’t just local inference, but how much orchestration it’s trying to handle (text, image, audio, etc). That’s usually where things get messy when running models locally.
Curious how much of this is actually abstraction vs just bundling multiple tools together. Also wondering if the AMD/NPU optimizations end up making it less portable compared to something like Ollama in practice.
It's portable in the sense that it will install on any supported OS using the CPU or Vulkan backends. But out of the box it only supports ROCm builds and AMD NPUs. There is a way to override which llama.cpp build it uses if you want to run it on CUDA, but that adds more overhead to manage.
If you have an AMD machine and want to run local models with minimal headache…it’s really the easiest method.
This runs on my NAS, handles my home assistant setup.
I have a strix halo and another server running various CUDA cards I manage manually by updating to bleeding edge versions of llama.cpp or vllm.
My three NVIDIA cards are more power efficient than my one AMD card, both at idle and during usage.
Official ROCm is like pulling teeth, with poor support for desktop cards. Debian, a volunteer-led project, has better ROCm CI than AMD and supports more cards.
Look at any benchmarks: NVIDIA midrange cards are faster than AMD's and at least a generation ahead. Owning a 7900 XTX is an embarrassing disappointment.
I like AMD and want them to succeed, but they are way behind NV in this area.
I agree with most of your post and fled the AMD ecosystem some time ago because of the machine learning situation, but their problem seemed to be more the firmware bugs and memory management of compute shaders than the higher level libraries.
The obvious solution would be not to use ROCm. ROCm has always been a bit of a train wreck for small users, and it doesn't seem to do anything special anyway. The way forward would be something more like Vulkan, which the server that today's link points to seems to be using. The existence of a badly managed software package doesn't mean users have to use it; they can use an alternative.
It would be nice if AMD sorted themselves out, though. The NVIDIA driver situation on Linux is painful, and if AMD could reliably run LLMs without the hardware locking up, I'd much rather move back to their products.
This is answered by their Project Roadmap on GitHub[0]:
Recently Completed: macOS (beta)
Under Development: MLX support
[0] https://github.com/lemonade-sdk/lemonade?tab=readme-ov-file#...
It also has endpoints compatible with OpenAI, Ollama, and Anthropic, so you can point any tool that speaks those APIs at it and it will just run.
https://github.com/lemonade-sdk/llamacpp-rocm
But I'm not doing anything with images or audio. I get about 50 tokens a second with GPT OSS 120B. As others have pointed out, the NPU is used for low-powered, small models that are "always on", so it's not a huge win for the standard chatbot use case.
Maybe the assumption is that container-oriented users can build their own if given native packages?
I suppose a Dockerfile could be included but that also seems unconventional.
Under the hood they are both running llama.cpp, but Lemonade ships specific builds for different GPUs. Not sure if the 9070 is one of them; I am running it on 370 and 395 APUs.
Model: qwen3.59b
Prompt: "Hey, tell me a story about going to space"
Ollama: completed in about 1:44
Lemonade: completed in about 1:14
So it seems faster in this very limited test.
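Converting those wall-clock times into a rough speedup (same prompt, same model) works out to roughly 1.4x:

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a m:ss timing into total seconds."""
    return minutes * 60 + seconds

ollama = to_seconds(1, 44)    # 1:44 -> 104 s
lemonade = to_seconds(1, 14)  # 1:14 -> 74 s
speedup = ollama / lemonade
print(f"Lemonade was about {speedup:.2f}x faster")  # ~1.41x
```

A single prompt is a noisy benchmark, of course; token counts, warm-up, and sampling settings can all shift these numbers.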
Thanks for that data point. I should experiment with ROCm.
I use an older Google Coral TPU in my home lab, used by Frigate NVR for object detection on security cameras. It's more efficient, but less flexible, than running it on the GPU.
Don't know if I need an NPU for my daily driver computer, but I would want one for my next home server.
AMD employees work on it and have been making blog posts about it for a while.
Found this on the github readme.
[1]: https://github.com/lemonade-sdk/lemonade/releases/tag/v10.0....
This way software adoption will be very limited.
"FastFlowLM (FLM) support in Lemonade is in Early Access. FLM is free for non-commercial use, however note that commercial licensing terms apply. "
Lemonade is really just a management plane/proxy. It translates Ollama/Anthropic APIs to OpenAI format for llama.cpp, runs different backends for STT/TTS and image generation, and lets you manage it all in one place.
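The "translate Anthropic to OpenAI format" part is roughly this kind of mapping. A simplified sketch, not Lemonade's actual code; it ignores streaming, tool use, and content blocks:

```python
def anthropic_to_openai(req: dict) -> dict:
    """Map an Anthropic-style messages request onto OpenAI chat format.

    Simplified sketch: ignores streaming, tool use, and content blocks.
    """
    messages = []
    # Anthropic carries the system prompt in a top-level field;
    # OpenAI expects it as the first message in the list.
    if "system" in req:
        messages.append({"role": "system", "content": req["system"]})
    messages.extend(req.get("messages", []))
    return {
        "model": req["model"],
        "messages": messages,
        # Anthropic requires max_tokens; OpenAI treats it as optional.
        "max_tokens": req.get("max_tokens"),
    }
```

The Ollama-to-OpenAI direction is a similar field-renaming exercise, which is why a thin proxy can cover all three APIs in front of a single llama.cpp backend.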
AMD are doing God's work here.