Nowadays you get TTS, STT, and text and image generation, and image editing should also be possible. It can run via ROCm, Vulkan, or on CPU, GPU, and NPU, so quite a lot of options. They keep a good, pragmatic pace of development. Really recommend this for AMD hardware!
Edit: The OpenAI-compatible (and, I think, nowadays also Ollama-compatible) endpoints let me use it in VS Code Copilot as well as in e.g. Open WebUI. More options are shown in their docs.
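For example, any OpenAI-compatible client can target the local server just by changing the base URL. A minimal sketch in Python using only the standard library; the port, path, and model name here are assumptions, so check the Lemonade docs for your actual defaults:

```python
import json
import urllib.request

# Assumed values -- adjust BASE_URL and MODEL to match what your
# local Lemonade server actually exposes.
BASE_URL = "http://localhost:8000/api/v1"
MODEL = "qwen3.5-27b"  # hypothetical model id

def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style /chat/completions request for a local server."""
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("Hello")
# urllib.request.urlopen(req) would send it once the server is running.
```

Anything that speaks the OpenAI wire format (Copilot's custom endpoint setting, Open WebUI's OpenAI connection, etc.) works the same way.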
On the performance side, Lemonade comes bundled with ROCm and Vulkan builds of llama.cpp, sourced from https://github.com/lemonade-sdk/llamacpp-rocm and https://github.com/ggml-org/llama.cpp/releases respectively.
Lemonade has a Web UI to set the context size and llama.cpp args. You need to set the context to a proper number, or just to 0 so it uses the model's default. If it's too low, it won't work with agentic coding.
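The "0 means use the default" convention described above can be expressed as a tiny helper. This is a sketch of the rule, not Lemonade's actual code:

```python
def effective_context(requested: int, model_default: int) -> int:
    """Resolve a context-size setting where 0 means 'use the model default'.

    Sketch of the convention described above; not Lemonade's actual code.
    """
    if requested < 0:
        raise ValueError("context size cannot be negative")
    return model_default if requested == 0 else requested

# Agentic coding tools stuff large system prompts and tool schemas into
# the context, so a few thousand tokens is usually not enough.
assert effective_context(0, 32768) == 32768
assert effective_context(4096, 32768) == 4096
```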
I will try some Claw app, but first I need to research the field a bit. Meanwhile I am using different models in Open WebUI. GPT-OSS 120B is fast, but Qwen3.5 27B is fine too.
The 27B is supposed to be really good, but it's so slow I gave up on it (11-12 tg/s at Q4).
Running Qwen3.5 122B at 35 t/s as a daily driver using Vulkan llama.cpp on kernel 7.0.0rc5 on a Framework Desktop board (Strix Halo 128).
Also a pair of AMD AI Pro R9700 cards as my workhorses for zimageturbo, Qwen TTS/ASR, and other accessory functions and experiments.
Finally, I have a Radeon 6900 XT running Qwen3.5 32B at 60+ t/s as a fast all-arounder.
If I buy anything NVIDIA, it will be only for compatibility testing. AMD hardware is 100% the best option now for cost, freedom, and security for home users.
The interesting part to me isn’t just local inference, but how much orchestration it’s trying to handle (text, image, audio, etc). That’s usually where things get messy when running models locally.
Curious how much of this is actually abstraction vs just bundling multiple tools together. Also wondering if the AMD/NPU optimizations end up making it less portable compared to something like Ollama in practice.
It's portable in the sense that it will install on any supported OS using the CPU or Vulkan backends. But out of the box it only supports ROCm builds and AMD NPUs. There is a way to override which llama.cpp build it uses if you want to run it on CUDA, but that adds more overhead to manage.
If you have an AMD machine and want to run local models with minimal headache…it’s really the easiest method.
This runs on my NAS, handles my home assistant setup.
I have a strix halo and another server running various CUDA cards I manage manually by updating to bleeding edge versions of llama.cpp or vllm.
My three NVIDIA cards are more power efficient than my one AMD card, both at idle and during usage.
Official ROCm is like pulling teeth, with poor support for desktop cards. Debian, a volunteer-led project, has better ROCm CI than AMD and supports more cards.
Look at any benchmarks: NVIDIA midrange cards are faster than AMD's and at least a generation ahead. Owning a 7900 XTX is an embarrassing disappointment.
I like AMD and want them to succeed, but they are way behind NV in this area.
I agree with most of your post and fled the AMD ecosystem some time ago because of the machine learning situation, but their problem seemed to be more the firmware bugs and memory management of compute shaders than the higher level libraries.
The obvious solution would be not to use ROCm. ROCm has always been a bit of a train wreck for small users, and it doesn't seem to do anything special anyway. The way forward would be something more like Vulkan, which the server that today's link points to seems to be using. The existence of a badly managed software package doesn't mean users have to use it; they can use an alternative.
It would be nice if AMD sorted themselves out, though. The NVIDIA driver situation on Linux is painful, and if AMD could reliably run LLMs without the hardware locking up, I'd much rather move back to their products.
This is answered by their Project Roadmap on GitHub[0]:
Recently Completed: macOS (beta)
Under Development: MLX support
[0] https://github.com/lemonade-sdk/lemonade?tab=readme-ov-file#...
It also has endpoints compatible with OpenAI, Ollama, and Anthropic, so you can point any tool that speaks those APIs at it and it will just run.
https://github.com/lemonade-sdk/llamacpp-rocm
But I'm not doing anything with images or audio. I get about 50 tokens a second with GPT OSS 120B. As others have pointed out, the NPU is used for low-powered, small models that are "always on", so it's not a huge win for the standard chatbot use case.
Maybe the assumption is that container-oriented users can build their own if given native packages?
I suppose a Dockerfile could be included but that also seems unconventional.
Under the hood they are both running llama.cpp, but Lemonade ships specific builds for different GPUs. Not sure if the 9070 is one of them; I am running it on 370 and 395 APUs.
Model: qwen3.59b
Prompt: "Hey, tell me a story about going to space"
Ollama: completed in about 1:44
Lemonade: completed in about 1:14
So it seems faster in this very limited test.
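Converting those wall-clock times into a rough speedup (same prompt, same model) works out to roughly 1.4x:

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a m:ss timing into total seconds."""
    return minutes * 60 + seconds

ollama = to_seconds(1, 44)    # 1:44 -> 104 s
lemonade = to_seconds(1, 14)  # 1:14 -> 74 s
speedup = ollama / lemonade
print(f"Lemonade was about {speedup:.2f}x faster")  # ~1.41x
```

A single prompt is a noisy benchmark, of course; token counts, warm-up, and sampling settings can all shift these numbers.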
Thanks for that data point. I should experiment with ROCm.
I use an older Google Coral TPU in my home lab, used by Frigate NVR for object detection on security cameras. It's more efficient, but less flexible, than running it on the GPU.
Don't know if I need an NPU for my daily driver computer, but I would want one for my next home server.
AMD employees work on it and have been making blog posts about it for a while.
Found this on the github readme.
[1]: https://github.com/lemonade-sdk/lemonade/releases/tag/v10.0....
This way software adoption will be very limited.
"FastFlowLM (FLM) support in Lemonade is in Early Access. FLM is free for non-commercial use, however note that commercial licensing terms apply. "
Lemonade is really just a management plane/proxy. It translates Ollama/Anthropic APIs to OpenAI format for llama.cpp, runs different backends for STT/TTS and image generation, and lets you manage it all in one place.
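The "translate Anthropic to OpenAI format" part is roughly this kind of mapping. A simplified sketch, not Lemonade's actual code; it ignores streaming, tool use, and content blocks:

```python
def anthropic_to_openai(req: dict) -> dict:
    """Map an Anthropic-style messages request onto OpenAI chat format.

    Simplified sketch: ignores streaming, tool use, and content blocks.
    """
    messages = []
    # Anthropic carries the system prompt in a top-level field;
    # OpenAI expects it as the first message in the list.
    if "system" in req:
        messages.append({"role": "system", "content": req["system"]})
    messages.extend(req.get("messages", []))
    return {
        "model": req["model"],
        "messages": messages,
        # Anthropic requires max_tokens; OpenAI treats it as optional.
        "max_tokens": req.get("max_tokens"),
    }
```

The Ollama-to-OpenAI direction is a similar field-renaming exercise, which is why a thin proxy can cover all three APIs in front of a single llama.cpp backend.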
AMD are doing God's work here.