undefined | Better HN

0 pointstrvz24d ago0 comments

If you have to ask then your GPU is too small.

With 16 GB you'll be only able to run a very compressed variant with noticable quality loss.

0 comments

Not true. With a MoE, you can offload quite a bit of the model to CPU without losing a ton of performance. 16GB should be fine to run the 4-bit (or larger) model at speeds that are decent. The --n-cpu-moe parameter is the key one on llama-server, if you're not just using -fit on.

boppo124d ago

I've been way out of the local game for a while now, what's the best way to run models for a fairly technical user? I was using llama.cpp in the command line before and using bash files for prompts.

adrian_b24d ago

Running llama-server (it belongs to llama.cpp) starts a HTTP server on a specified port.

You can connect to that port with any browser, for chat.

Or you can connect to that port with any application that supports the OpenAI API, e.g. a coding assistant harness.

palmotea24d ago

> If you have to ask then your GPU is too small.

What's the minimum memory you need to run a decent model? Is it pretty much only doable by people running Macs with unified memory?

giobox24d ago

It's worth noting now there are other machines than just Apple that combine a powerful SoC with a large pool of unified memory for local AI use:

> https://www.dell.com/en-us/shop/cty/pdp/spd/dell-pro-max-fcm...

> https://marketplace.nvidia.com/en-us/enterprise/personal-ai-...

> https://frame.work/products/desktop-diy-amd-aimax300/configu...

etc.

But yes, a modern SoC-style system with large unified memory pool is still one of the best ways to do it.

jchw24d ago

32 GiB of VRAM is possible to acquire for less than $1000 if you go for the Arc Pro B70. I have two of them. The tokens/sec is nowhere near AMD or NVIDIA high end, but its unexpectedly kind of decent to use. (I probably need to figure out vLLM though as it doesn't seem like llama.cpp is able to do them justice even seemingly with split mode = row. But still, 30t/s on Gemma 4 (on 26B MoE, not dense) is pretty usable, and you can do fit a full 256k context.)

When I get home today I totally look forward to trying the unsloth variants of this out (assuming I can get it working in anything.) I expect due to the limited active parameter count it should perform very well. It's obviously going to be a long time before you can run current frontier quality models at home for less than the price of a car, but it does seem like it is bound to happen. (As long as we don't allow general purpose computers to die or become inaccessible. Surely...)

dist-epoch24d ago

NVIDIA 5070 Ti can run Gemma 4 26B at 4-bit at 120 tk/s.

Arc Pro B70 seems unexpectedely slow? Or are you using 8-bit/16-bit quants.

1 more reply

zozbot23424d ago

New versions of llama.cpp have experimental split-tensor parallelism, but it really only helps with slow compute and a very fast interconnect, which doesn't describe many consumer-grade systems. For most users, pipeline parallelism will be their best bet for making use of multi-GPU setups.

1 more reply

angoragoats24d ago

Macs with unified memory are economical in terms of $/GB of video memory, and they match an optimized/home built GPU setup in efficiency (W/token), but they are slow in terms of absolute performance.

With this model, since the number of active parameters is low, I would think that you would be fine running it on your 16GB card, as long as you have, say 32GB of regular system memory. Temper your expectations about speed with this setup, as your system memory and CPU are multiple times slower than the GPU, so when layers spill over you will slow down.

To avoid this, there's no need to buy a Mac -- a second 16GB GPU would do the trick just fine, and the combined dual GPU setup will likely be faster than a cheap mac like a Mac mini. Pay attention to your PCIe slots, but as long as you have at least an x4 slot for the second GPU, you'll be fine (LLM inference doesn't need x8 or x16).

TechSquidTV24d ago

My Mac Studio with 96GB of RAM is maybe just at the low end of passable. It's actually extremely good for local image generation. I could somewhat replace something like Nano Banana comfortably on my machine.

But I don't need Nano Banana very much, I need code. While it can, there's no way I would ever opt to use a local model on my machine for code. It makes so much more sense to spend $100 on Codex, it's genuinely not worth discussing.

For non-thinking tasks, it would be a bit slower, but a viable alternative for sure.

slopinthebag24d ago

You just need to adjust your workflow to use the smaller models for coding. It's primarily just a case of holding them wrong if you end up with worse outputs.

layer824d ago

It’s also doable with AMD Strix Halo.

bfivyvysj24d ago

A bit like asking how long is a piece of string.

latentsea24d ago

It's twice as long as from one end to the middle.

palmotea24d ago

More like "about how long of a string do I need to run between two houses in the densest residential neighborhood of single-family homes in the US?"

littlestymaar24d ago

No, GP is excessively restrictive. Llama.cpp supports RAM offloading out of the box.

It's going to be slower than if you put everything on your GPU but it would work.

And if it's too slow for your taste you can try the quantized version (some Q3 variant should fit) and see how well it works for you.

utilize180824d ago

Obviously going to depend on your definition of "decent". My impression so far is that you will need between 90GB to 100GB of memory to run medium sized (31B dense or ~110B MoE) models with some quantization enabled.

cjbgkagh24d ago

I’m running Gemma4 31B (Q8) on my 2 4090s (48GB) with no problem.

1 more reply

FusionX24d ago

Aren't 4bits model decent? Since, this is an MOE model, I'm assuming it should have respectable tk/s, similar to previous MOE models.

gunalx24d ago

Running q3 xss with full and quantizised context as options on a 16gb gpu and still has pretty decent quality and fitting fine with up to 64k context.

j / k navigate · click thread line to collapse

0 comments

coder54324d ago

boppo124d ago

I've been way out of the local game for a while now, what's the best way to run models for a fairly technical user? I was using llama.cpp in the command line before and using bash files for prompts.

adrian_b24d ago

Running llama-server (it belongs to llama.cpp) starts a HTTP server on a specified port.

You can connect to that port with any browser, for chat.

Or you can connect to that port with any application that supports the OpenAI API, e.g. a coding assistant harness.

palmotea24d ago

> If you have to ask then your GPU is too small.

What's the minimum memory you need to run a decent model? Is it pretty much only doable by people running Macs with unified memory?

giobox24d ago

It's worth noting now there are other machines than just Apple that combine a powerful SoC with a large pool of unified memory for local AI use:

> https://www.dell.com/en-us/shop/cty/pdp/spd/dell-pro-max-fcm...

> https://marketplace.nvidia.com/en-us/enterprise/personal-ai-...

> https://frame.work/products/desktop-diy-amd-aimax300/configu...

etc.

But yes, a modern SoC-style system with large unified memory pool is still one of the best ways to do it.

jchw24d ago

dist-epoch24d ago

NVIDIA 5070 Ti can run Gemma 4 26B at 4-bit at 120 tk/s.

Arc Pro B70 seems unexpectedely slow? Or are you using 8-bit/16-bit quants.

1 more reply

zozbot23424d ago

1 more reply

angoragoats24d ago

Macs with unified memory are economical in terms of $/GB of video memory, and they match an optimized/home built GPU setup in efficiency (W/token), but they are slow in terms of absolute performance.

TechSquidTV24d ago

For non-thinking tasks, it would be a bit slower, but a viable alternative for sure.

slopinthebag24d ago

You just need to adjust your workflow to use the smaller models for coding. It's primarily just a case of holding them wrong if you end up with worse outputs.

layer824d ago

It’s also doable with AMD Strix Halo.

bfivyvysj24d ago

A bit like asking how long is a piece of string.

latentsea24d ago

It's twice as long as from one end to the middle.

palmotea24d ago

More like "about how long of a string do I need to run between two houses in the densest residential neighborhood of single-family homes in the US?"

littlestymaar24d ago

No, GP is excessively restrictive. Llama.cpp supports RAM offloading out of the box.

It's going to be slower than if you put everything on your GPU but it would work.

And if it's too slow for your taste you can try the quantized version (some Q3 variant should fit) and see how well it works for you.

utilize180824d ago

cjbgkagh24d ago

I’m running Gemma4 31B (Q8) on my 2 4090s (48GB) with no problem.

1 more reply

FusionX24d ago

Aren't 4bits model decent? Since, this is an MOE model, I'm assuming it should have respectable tk/s, similar to previous MOE models.

gunalx24d ago

Running q3 xss with full and quantizised context as options on a 16gb gpu and still has pretty decent quality and fitting fine with up to 64k context.

j / k navigate · click thread line to collapse