For me, none really, just as a toy. I don't get much use out of the online ones either. There was a Kaggle competition to find issues with OpenAI's open-weights model, but because my RTX GPU didn't have enough memory, I had to run it very slowly with partial offload to CPU/RAM.
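For reference, a minimal sketch of what that partial offload looks like with llama-cpp-python; the model filename and layer count are placeholders, and it assumes you have a quantized GGUF build of the model:

```python
# Minimal sketch: partial GPU offload with llama-cpp-python.
# The model path and n_gpu_layers value are placeholders -- tune the
# layer count to whatever fits in your card's VRAM; the remaining
# layers run (slowly) on CPU/RAM.
from llama_cpp import Llama

llm = Llama(
    model_path="gpt-oss-20b.Q4_K_M.gguf",  # hypothetical quantized GGUF file
    n_gpu_layers=20,  # offload as many layers as VRAM allows
    n_ctx=4096,       # context window
)

out = llm("Explain what partial GPU offload does.", max_tokens=128)
print(out["choices"][0]["text"])
```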
Maybe other people have actual uses, but I don't.
No, they can run quantized versions of those models, which are dumber than the base 30B models, which in turn are much dumber than the 400B+ models (in my experience).
> They are a little bit dumber than the big cloud models but not by much.
If this were true, we wouldn't see people paying the premiums for the bigger models (like Claude).
For every use case I've thrown at them, it's not a question of being "a little dumber": the smaller models are simply incapable of doing what I need with any sort of consistency, and they hallucinate at extreme rates.
What's the actual use case for these local models?
If anyone has a gaming GPU with gobs of VRAM, I highly encourage them to experiment with creating long-running local-LLM apps. We need more independent tinkering in this space.
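As a starting point, here's a minimal sketch of one such long-running app: a loop that watches a directory and summarizes new files via a locally served model. It assumes an Ollama server on its default port; the model tag and the watched directory are placeholders:

```python
# Minimal sketch of a long-running local-LLM app: poll a directory for
# new text files and summarize each one with a locally served model.
# Assumes an Ollama server at its default endpoint; the model tag and
# the watched directory are placeholders.
import json
import time
import urllib.request
from pathlib import Path

WATCH_DIR = Path("inbox")  # hypothetical directory to watch
OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama default endpoint

def generate(prompt: str) -> str:
    payload = json.dumps({
        "model": "llama3.1:8b",  # placeholder model tag
        "prompt": prompt,
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

seen: set[Path] = set()
while True:
    for f in WATCH_DIR.glob("*.txt"):
        if f not in seen:
            seen.add(f)
            summary = generate(f"Summarize this:\n\n{f.read_text()}")
            print(f"{f.name}: {summary}")
    time.sleep(30)  # poll every 30s; a local model costs nothing per call
```

The nice part of local models for this kind of app is exactly the long-running aspect: you can let it poll all day without worrying about per-token API costs.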