pu_pe 8mo ago

That's really great performance! Could you share more details about the implementation (i.e. which quantized version of the model, how much RAM, etc.)?

kgeist 8mo ago

Model: Qwen3 32B

GPU: RTX 5090 (no ROPs missing), 32 GB VRAM

Quants: Unsloth Dynamic 2.0, which is 4-6 bits depending on the layer.

RAM: 96 GB. More RAM makes a difference even if the model fits entirely in VRAM: the filesystem pages containing the model on disk stay cached in RAM, so when you switch models (we use other models as well) the overhead of unloading/loading is 3-5 seconds.

The KV cache is also quantized to 8 bits (going lower degrades quality considerably).

This gives you 1 generation with 64k context, or 2 concurrent generations with 32k each. Everything takes 30 GB VRAM, which also leaves some space for a Whisper speech-to-text model (turbo & quantized) running in parallel as well.
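A rough back-of-envelope check of the KV cache share of that budget, as a sketch only. The architecture numbers below (64 layers, 8 KV heads, head dimension 128) are assumptions about Qwen3 32B that are not stated in this thread, and real 8-bit cache formats add a small per-block overhead for scales, so treat the result as an estimate.

```python
GIB = 1024 ** 3

def kv_cache_bytes(n_tokens, n_layers=64, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=1.0):
    """Estimate KV cache size for n_tokens of context.

    The defaults are ASSUMED Qwen3 32B values, not taken from the thread.
    bytes_per_elem=1.0 approximates an 8-bit cache, 2.0 a 16-bit cache.
    """
    # Each token stores one K and one V vector per layer,
    # each of size n_kv_heads * head_dim.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return n_tokens * per_token

one_64k = kv_cache_bytes(64 * 1024)        # 1 generation, 64k context
two_32k = 2 * kv_cache_bytes(32 * 1024)    # 2 generations, 32k each
fp16_64k = kv_cache_bytes(64 * 1024, bytes_per_elem=2.0)

print(f"8-bit, 1 x 64k: {one_64k / GIB:.1f} GiB")
print(f"8-bit, 2 x 32k: {two_32k / GIB:.1f} GiB")
print(f"16-bit, 1 x 64k: {fp16_64k / GIB:.1f} GiB")
```

Under these assumptions the cache costs the same whether it is one 64k generation or two 32k generations, and a 16-bit cache would need about twice the memory, which is why the cache precision matters on a 32 GB card.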

pu_pe (OP) 8mo ago

Thanks a lot. Interesting that without concurrent requests the context can be doubled; 64k is pretty decent for working on a few files at once. A local LLM server is something a lot of companies should be looking into, I think.

oceansweep 8mo ago

Are you doing this with vLLM? If you're using llama.cpp/Ollama, you could likely see some pretty massive improvements by switching.

kgeist 8mo ago

We're using llama.cpp. We use all kinds of different models besides Qwen3, and vLLM startup when switching models is prohibitively slow (several times slower than llama.cpp, which already takes 5 sec).

From what I understand, vLLM is best when there's only 1 active model pinned to the GPU and you have many concurrent users (4, 8, etc.). But with just a single 32 GB GPU you have to switch models pretty often, and you can't fit more than 2 concurrent users anyway without sacrificing the context length considerably (4 users = just 16k context each, 8 users = 8k context each), so I think vLLM isn't worth it for us so far. Once we have several cards, we may switch to vLLM.
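The per-user numbers are just a fixed context budget divided evenly among concurrent generations. A minimal sketch of that arithmetic, assuming the total stays at the 64k tokens mentioned earlier and that every user gets an equal share (`context_per_user` is a hypothetical helper, not part of llama.cpp or vLLM):

```python
TOTAL_CONTEXT_TOKENS = 64 * 1024  # total budget from the setup described above

def context_per_user(n_users, total_tokens=TOTAL_CONTEXT_TOKENS):
    """Tokens of context each user gets if the budget is split evenly."""
    if n_users < 1:
        raise ValueError("n_users must be at least 1")
    return total_tokens // n_users

for n in (1, 2, 4, 8):
    print(f"{n} user(s): {context_per_user(n) // 1024}k context each")
```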
