
segmondy · 1d ago

I run Q4_K_XL. All it takes to get about 6 tk/sec is 512 GB of RAM and two 3090 GPUs with llama.cpp -cmoe. I have crappy DDR4 at 2400 MHz; 3200 MHz would bring that speed up to about 9 tk/sec. I also have an OK 32-core Epyc CPU; a better 64-core one would bring it up to about 11 tk/sec. I did a budget build before hardware costs went crazy, and I regret it every day. Nevertheless, it's fantastic being able to run this model at home. It's great for planning, and for one-shot prompting once you have a plan or all the context you need. The entire build cost $2400 at the time. If you're willing to be resourceful, you can find ways to run these models at home. I often get the silly question of why, along with suggestions about how much I could save using a cloud API, but the Fable drama has opened eyes to why it's good for us to be independent. Thanks, team Unsloth: Q4_K_XL is solid. If you are going to grab a quant, make sure to get the K_XL variant if it fits.

46 comments · 12 top-level

discordance · 1d ago · 19 in thread

Running that at full load draws at least 600 W, so ~14 kWh in a day. At $0.20 a kWh, that would be $2.80/day or about $1k a year of opex in electricity.
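
For anyone who wants to plug in their own numbers, the estimate above can be sketched as follows. The 600 W draw and the $0.20/kWh rate are the figures from this comment; `electricity_cost` is a made-up helper name, and a constant load is assumed.

```python
def electricity_cost(watts: float, hours_per_day: float, usd_per_kwh: float) -> tuple[float, float, float]:
    """Return (kWh per day, USD per day, USD per year) for a constant load."""
    kwh_per_day = watts / 1000 * hours_per_day
    usd_per_day = kwh_per_day * usd_per_kwh
    return kwh_per_day, usd_per_day, usd_per_day * 365

# Figures from the comment: 600 W, running 24 hours a day, at $0.20 per kWh.
kwh, per_day, per_year = electricity_cost(600, 24, 0.20)
print(f"{kwh:.1f} kWh/day, ${per_day:.2f}/day, ${per_year:.0f}/year")
```

Without rounding to 14 kWh first, this comes to 14.4 kWh and about $2.88 a day, roughly $1,050 a year, which is in line with the rounded figures above.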

Unless you really want privacy or the fuzzy feeling of owning your own hardware, it’s cheaper, more convenient, and much faster in tok/s to pay a hyperscaler.

That said, I do like the direction we are heading and look forward to seeing what host-your-own hardware we get in 2 years.

segmondy (OP) · 1d ago

No one runs at full load locally all day. The only way to see that is if you're training. We are talking about inference. I limit my GPU to 300 W. You can limit them down to 200 W. Since not everything is on the GPU and the bottleneck is between the CPU and system RAM, the GPUs don't even get to spike; I see 160-180 W for each GPU during inference. So redo your calculation. Figure about 6 hrs of daily inference, and we are down to roughly $125 a year. Thanks again for your speculation.
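
As a rough sanity check, the $125 figure can be worked backwards into an average power draw. This is only a sketch: the comment does not state an electricity rate, so the parent's $0.20/kWh is assumed here, and `implied_average_watts` is a hypothetical helper name.

```python
def implied_average_watts(usd_per_year: float, hours_per_day: float, usd_per_kwh: float) -> float:
    """Average draw (in watts) that would produce the given yearly bill."""
    kwh_per_year = usd_per_year / usd_per_kwh
    hours_per_year = hours_per_day * 365
    return kwh_per_year / hours_per_year * 1000

# $125/year and 6 h/day come from the comment; $0.20/kWh is an assumed rate.
watts = implied_average_watts(125, 6, 0.20)
print(f"implied average draw: {watts:.0f} W")
```

A different electricity rate changes the implied draw proportionally.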

walrus01 · 1d ago

Not everyone lives in a place where electricity is $0.20 a kWh. For instance BC Hydro residential rates are $0.11 (CAD) for the first tier and $0.14 for the second tier of consumption in a month. At current exchange rate $0.14 CAD is $0.099 USD a kWh. Hydro Quebec is even cheaper.

At a theoretical 6 tok/s and 86,400 seconds in a day, approx 500,000 tokens of GLM 5.2 output for 2 bucks a day seems like a pretty good bargain to me. Of course, that's not counting the one-time cost of the hardware to run it. But I see people dropping $4000-5000 on all kinds of much less useful stuff.
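
The token math above is easy to reproduce. A minimal sketch, using the 6 tok/s and $2/day figures from this comment and assuming the machine generates nonstop for the full day; the function names are made up for illustration.

```python
SECONDS_PER_DAY = 86_400

def tokens_per_day(tok_per_sec: float) -> float:
    """Tokens generated in 24 hours of uninterrupted output."""
    return tok_per_sec * SECONDS_PER_DAY

def usd_per_million_tokens(usd_per_day: float, tok_per_sec: float) -> float:
    """Electricity cost per million output tokens at a given daily bill."""
    return usd_per_day / tokens_per_day(tok_per_sec) * 1_000_000

daily = tokens_per_day(6)
cost = usd_per_million_tokens(2.0, 6)
print(f"{daily:,.0f} tokens/day, about ${cost:.2f} per million tokens")
```

That is 518,400 tokens a day, so "approx 500,000" holds, and the electricity alone works out to a little under $4 per million output tokens.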

Additionally, in a place where people use electric baseboard heating, electric in-floor radiant heating, or really any other heating-element-based system in winter that's less efficient than a heat pump, the additional electrical load from computing is basically "free", since you would otherwise be spending that same money to heat your house. If a computer with 512 GB of RAM is dumping its waste heat into your room, it accomplishes a portion of the same thing as a baseboard heater.

Not to mention there is a whole other less measurable benefit of having a locally hosted model that can't be turned off or arbitrarily restricted by a service provider, and where all of your queries and context cache aren't subject to surveillance by any third party.

discordance · 1d ago

Where I live prices are often higher than 20c/kWh, but let's take your example and halve it (10c/kWh), so it's ~$1.40/day or ~$500/year.

On OpenRouter, the cheapest GLM 5.2 provider costs $3/MTok (at 44 tps). Assuming most use is output tokens, that's still the equivalent of 450k tokens/day, so we're in the same ballpark, but without the capex for two 3090s and the machine.
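
The comparison above can be sketched the other way round: how many hosted tokens the same daily electricity budget buys. The $1.40/day, $3/MTok and 44 tps figures are from this comment; `tokens_for_budget` is a hypothetical helper, and all usage is assumed to be billed at the output-token price.

```python
def tokens_for_budget(usd: float, usd_per_mtok: float) -> float:
    """Number of tokens a budget buys at a given price per million tokens."""
    return usd / usd_per_mtok * 1_000_000

# $1.40/day of electricity spent instead at $3 per million tokens.
budget_tokens = tokens_for_budget(1.40, 3.0)

# Wall-clock time the hosted provider needs to produce that many tokens at 44 tok/s.
hours_at_44_tps = budget_tokens / 44 / 3600

print(f"{budget_tokens:,.0f} tokens/day, generated in about {hours_at_44_tps:.1f} hours")
```

This gives roughly 467k tokens a day, close to the 450k quoted, produced in about three hours instead of a full day.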

Self-hosting only makes economic sense if your priority is being in control / avoiding surveillance.

Incipient · 23h ago

Unless the token estimates I get from using Claude are wayyy out, I burn through 5m+ tokens/day, and I'm not doing a lot of time. 500k tokens in a 24h period for $5k of hardware seems quite poor?

tmountain · 1d ago

Lots of people have solar. Green AI, imagine that!

cultofmetatron · 1d ago

If only there were a magical place where geothermal and hydroelectric power are ubiquitous and the weather is cold enough that no one is going to complain about free heating.

matheusmoreira · 1d ago

We do want privacy, and we also want to own the hardware so the US can't just turn it off whenever it feels like it.

I think the main reason not to run locally is to get the full models instead of quantized versions.

traceroute66 · 1d ago

> We do want privacy, and we also want to own the hardware so the US can't just turn it off whenever it feels like it.

I agree, and I prefer on-prem where possible. The Apple Mac Studios have been great for that, although I don't have enough of them to run GLM-5.2 without heavy quantization. I'm also waiting for Apple's next product refresh, which I hope will enable me to do more with less.

Meanwhile there are hosted privacy-conscious options out there. Two names to look at are Tinfoil[1] and Privatemode (from Edgeless Systems)[2].

Tinfoil[1] is, sadly, US-based. An EU-sovereignty option is on their long-term radar. But they do have GLM-5.2 today.

Privatemode[2] is a German company (Edgeless Systems) with EU-based servers. But sadly no GLM-5.2 today, though it is on their mid-to-long-term radar.

Both Tinfoil and Privatemode work on the same concept: the LLM runs in a secure enclave, and you get end-to-end attestation and encryption.

Tinfoil have not been independently audited; it is somewhere on their long-term radar.

Privatemode have been thoroughly independently audited with documentation available on request.

Both of them are API-tokens-only. So if you're currently one of those people throwing $200 a month down the pan at Anthropic/OpenAI for a so-called-alleged 'unlimited' plan, then neither Tinfoil nor Privatemode will be the place for you.

[1] https://tinfoil.sh/ [2] https://www.privatemode.ai/

SXX · 1d ago

I guess you missed the recent news. The problem is that a cloud LLM might just silently sabotage your work by downgrading the output model with no notice.

Or a cloud LLM might just refuse to sell to you because it doesn't like your passport.

yorwba · 1d ago

So you're buying expensive hardware as insurance in case your cloud provider turns against you and you have to switch to another of the twenty offering the same model (https://openrouter.ai/z-ai/glm-5.2), or in the worst case buy the same hardware later? How does that make sense?

swiftcoder · 1d ago

This is not really a problem for the open-weight models; you can always give your money to an inference provider in a different jurisdiction.

eptcyka · 18h ago

Isn't that still cheaper than the $100 or $200 plan that Anthropic wants from you?

throwawayffffas · 1d ago

In my experience with two 7900 XTs and models that sit fully in VRAM, it's more like 400 W; the GPUs spend a lot of time waiting for each other.

DrScientist · 1d ago

Depends on whether you've also gone for self-hosted electricity generation or not.

bawana · 22h ago

Even on a Mac Studio with 512 GB of memory?

downut · 22h ago

I have rooftop solar and I have been building credit with my electric utility even though the daily high temperature is well over 100F outside and a comfortable 75F inside. That includes running three AMD 12 thread 128GB systems with obsolete GPUs 24x7x365. I'm not a gamer, so 6 years ago I went with low-end, low-power GPUs. Boy am I dumb. Currently running the qwen3.6:27b, 35b, and gemma4:31b models just fine.

As soon as VRAM prices drop to sanity I'm going to load up, and I couldn't care less about the power draw.

Some parts of the future are absolutely great.

downut · 15h ago

I am fascinated that I got downvoted. I mean, isn't what I'm doing here nearly ideal? Or maybe not: why? My solar panels shade my roof under the incessant sun of the Sonoran Desert and turn a fraction of the insolation into electrical power that allows me to do almost SOTA local LLM stuff inside my house for free[1] that the parent commenter thought to be economically infeasible. Of course it's slow! So what! Right now I'm transcribing a podcast to text with whisper.cpp, and it will take about as much time as the original podcast duration, but I will be able to read it in 1/20th of the time.

Alternative interpretation of a downvote: we should all be enslaved to corporate electrical generation provided to "local" electric utility corporations so that we are economically incentivized to use cloud LLM providers. That's weird, no?

Teach me.

[1] It's a small nice house that cost ~$330K not too far off from my city center. This isn't rich privilege boasting.

poulpy123 · 1d ago

Which hyperscaler would you suggest?

dzjkb · 1d ago

How do you rent two 3090s for $2.80/day?

fsuts · 1d ago · 4 in thread

6 tokens per second?

Can you put up with that? It seems very slow. I aim for 40 t/s on a laptop and choose models that deliver that speed over larger, slower ones.

segmondy (OP) · 1d ago

I have been putting up with it forever. We are spoiled by mixture-of-experts models. Folks were delighted to run llama3-70B at such speed. We were happy with 15-20 tk/sec with 8B models, and if you could run llama3-405B at 1 tk/sec you were a god. To each their own. I can live with 6 high-quality tokens per second. If I could get a Fable-equivalent model, I'd gladly take 2 tk/sec if that's what it took to run it locally.

manmal · 1d ago

But what is it doing for you that you couldn’t do yourself at that speed? I'm really curious and on the fence about partly going local.

froh · 1d ago

Do you use caveman or similar?

walrus01 · 1d ago

I get a lot done with something that's also approximately 6 tokens/second, if you're willing to give it a well-defined set of prompts and projects to work on, leave it for an hour or two, then come back and check what it's done. And I often remember to give it something of more consequence to do, for at least 3-4 hours of wall-clock runtime, before heading to bed.

nextaccountic · 1d ago · 4 in thread

How can you combine CPU cores and multiple GPUs? Are you running some layers on the CPU, others on GPU #1, and others on GPU #2? What about the bandwidth and latency between them?

Or maybe the model itself runs only on the GPUs, and the CPU memory only stores the weights for experts not currently activated? If so, then what are the 32 or 64 CPU cores for?

I'm a big fan of fully utilizing one's hardware, and it's kinda sad that it's not the norm for everyday software to run on the GPU, the CPU, or both, choosing dynamically at runtime.

nodja · 1d ago

Pipeline parallelism: instead of splitting layers by row/column, you split at the layer boundaries. So instead of a huge bandwidth bottleneck, you only need to transfer about 4 KB per token when changing devices on a model like Qwen 3 30B-A3B.
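
One way to arrive at a figure like 4 KB: when the split is at a layer boundary, the only thing that crosses devices for each token is that token's hidden-state vector. The sketch below assumes a hidden size of 2048 for the model mentioned and 16-bit activations; both are assumptions, not something stated in the comment, and a runtime that passes 32-bit activations would double the number.

```python
def boundary_bytes_per_token(hidden_size: int, bytes_per_value: int) -> int:
    """Bytes handed to the next device for one token at a layer boundary."""
    return hidden_size * bytes_per_value

HIDDEN_SIZE = 2048  # assumed hidden dimension of the model
BYTES_FP16 = 2      # assumed 16-bit activations

per_token = boundary_bytes_per_token(HIDDEN_SIZE, BYTES_FP16)
print(f"{per_token} bytes ({per_token / 1024:.0f} KiB) per token per device boundary")

# Even at an illustrative 1000 tokens/s, that is only about 4 MB/s across the link.
print(f"{per_token * 1000 / 1e6:.1f} MB/s at 1000 tok/s")
```

Splitting inside a layer, by contrast, needs the partial results to be combined between devices within every layer, which is where the much larger bandwidth demand comes from.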

xrd · 1d ago

This is a good place to start reading about dual gpus.

https://github.com/noonghunna/club-3090/blob/master/docs/DUA...

nextaccountic · 1d ago

But in this case he used a CPU too.

segmondy (OP) · 1d ago

Check out llama.cpp; the entire point of the project is to serve us mere mortals and the GPU-poor.

edg5000 · 1d ago · 3 in thread

Very cool. So it's not just about GPU VRAM, as I incorrectly thought; I thought you'd need 512 GB of GPU VRAM. I don't think it cost only $2400, though; 512 GB of RAM would have been more expensive even back in the day. But it's not the mortgage-grade 200,000 I estimated myself (which assumed running 100% in VRAM; probably overkill for a single user).

segmondy (OP) · 1d ago

You can use system RAM with something like llama.cpp, which offloads to system RAM. Token generation is a function of memory bandwidth: the faster the bandwidth, the better. I'm on 8-channel DDR4 at 2400 MHz. If I had 12 memory channels, I would get 1.5x the speed at 2400 MHz. Of course DDR5 is much faster, so 12 channels at 4800 MHz would provide 3x the speed for token generation, or roughly 18 tk/sec. Prompt processing is all about compute, so the more CPU cores you have, the faster it can do PP.
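
The scaling described above can be sketched with theoretical peak bandwidth: channels times transfer rate times 8 bytes per transfer (a DDR channel is 64 bits wide). This assumes the "MHz" figures are transfer rates in MT/s, that token generation scales linearly with peak bandwidth as the comment suggests, and it uses the 6 tk/sec baseline from the original post. The function names are made up for illustration.

```python
DDR_BUS_BYTES = 8  # one DDR channel is 64 bits wide, i.e. 8 bytes per transfer

def peak_bandwidth_gbs(channels: int, mega_transfers_per_sec: int) -> float:
    """Theoretical peak memory bandwidth in GB/s."""
    return channels * mega_transfers_per_sec * DDR_BUS_BYTES / 1000

def scaled_tok_per_sec(baseline_tok_per_sec: float, baseline_gbs: float, new_gbs: float) -> float:
    """Assumes token generation scales linearly with memory bandwidth."""
    return baseline_tok_per_sec * new_gbs / baseline_gbs

BASELINE_TOK_PER_SEC = 6.0  # figure from the original post

current = peak_bandwidth_gbs(8, 2400)   # 8 channels at 2400
wider = peak_bandwidth_gbs(12, 2400)    # 12 channels at the same speed
ddr5 = peak_bandwidth_gbs(12, 4800)     # 12 channels at 4800

for label, gbs in [("8ch @ 2400", current), ("12ch @ 2400", wider), ("12ch @ 4800", ddr5)]:
    est = scaled_tok_per_sec(BASELINE_TOK_PER_SEC, current, gbs)
    print(f"{label}: {gbs:.1f} GB/s, ~{est:.0f} tok/s")
```

The three configurations come out at 153.6, 230.4 and 460.8 GB/s, i.e. 1x, 1.5x and 3x, which is where the roughly 18 tk/sec estimate comes from. Real-world throughput will sit below these theoretical peaks.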

nijave · 1d ago

Well, it's about GPU VRAM if you want something competitive with cloud-hosted offerings at the performance levels shown in benchmarks. This is a heavy quant with quality degradation and significantly lower performance.

Cloud offerings are 80-200 tk/sec versus single-digit tk/sec.

That said, I'm also surprised it runs at all locally. I do think it'd be painfully slow for anything interactive so you're relying on another model for a comprehensive design or you're hoping a one-shot with somewhat degraded quality turns out correctly.

edg5000 · 1d ago

I see. So not quite usable apart from specific use cases. Maybe in a few years we'll see new hardware players and better prices.

zozbot234 · 1d ago · 2 in thread

AIUI the llama.cpp implementation for this model is still quite half-baked because it lacks support for the DSA sparse attention mechanism. As a result, the model runs with a different mechanism than the one it was trained for, which has been shown to lead to lower quality and performance.

Anyway, I think GLM 5.2 is in many ways not as interesting as the DeepSeek V4 series, which uses an even more advanced attention mechanism and can save a lot of memory capacity for the KV cache, especially at larger contexts. That in turn opens up wide batching, especially on consumer platforms. GLM doesn't have that; in some ways it feels broadly similar to Kimi 2.6 with respect to the underlying performance architecture. Both are a bit too heavy to run reasonably at full quality on ordinary hardware.

trollbridge · 1d ago

Particularly DeepSeek 4.1, which they appear to be A/B testing on the API and which also seems available on the free chat interface.

It also has an input image modality, which is a game changer. The cheap Sinofrontier models have generally been lacking in this regard.

Basically, Chinese competition is fierce - DeepSeek set the pricing tier, and the question for each lab now is how to justify charging a little more.

MiMo-2.5-Pro has gone with UltraSpeed, pumping out 1000 t/s for a 3X price hike.

GLM has gone with 5.2, hitting Opus levels of reasoning at a fraction of the cost.

DeepSeek will probably keep their pricing model and just keep getting better and better.

Qwen-3.7 is the dark horse. Some rumours say Alibaba is simply making these models because they need them internally.

The real question is why this level of innovation and competition isn’t happening in America or Europe. In particular I see no reason Europe doesn’t have a lab competing on these terms.

SalariedSlave · 1d ago

Competing and innovating in the fast moving SOTA end of the llm space requires a ruthless disregard for copyright, IP, bureaucracies, formalities, risk assurances and other slowdowns. It requires a risk tolerant, quick and large flowing investment of capital. It requires a scoped focus that is pragmatic and sharp about key concerns, and efficiently dismissive of meaningless details.

Europe can provide none of this. They will never be at the frontier of AI tech, for the same reason they were never at the frontier of any tech.

I say this as a software engineer from Europe.

dxuh · 1d ago · 2 in thread

"All it takes to run" might be fair if you paid $2400, but right now the total price is way closer to $10k (almost 5k for the RAM and 2k each for the GPUs). Today that is a lot of expensive hardware.

segmondy (OP) · 1d ago

512 GB of 2400 MHz DDR4 RAM = $1600, not $5000. https://www.ebay.com/itm/188284985172 You can get creative and source 2-3 22 GB 2080 Tis from China for about $250 apiece. You can either be resourceful and find a way or find a whole bunch of excuses.

officialchicken · 1d ago

> You can either be resourceful and find a way or find a whole bunch of excuses.

How about addressing this false dichotomy with the likelihood that someone who is new to or interested in a tech isn't willing to drop thousands of dollars on used hardware for a whim or learning exercise?

effisfor · 1d ago

I applaud all you tinkerers for pushing the state of the home-brewed art here. Like crypto, AI is drowned out by hucksters; very few people talk about developing resilience, or about the researchers who will push on open source models in efforts to cram them onto an electric toothbrush or tamagotchi. Bravo to you all.

pizza234 · 1d ago

LOL, sure this works if one has a time machine or a LOT of money to burn.

A 32-core Epyc (Epyc is required for faster memory access) + 32 GB VRAM + 512 GB RAM is stupid expensive nowadays, and in the best case it will just downgrade to "very" expensive at some point in the future.

This makes sense only if 1. one is paranoid about privacy, or 2. one has money to smoke, or 3. one needs to work around cloud model restrictions, AND has to do it routinely (because if not, a one-shot cloud bare-metal setup is way cheaper, faster, and allows more powerful models, due to the VRAM on offer).

I did spend stupid money as well and yet, the system is 2x slower than cloud providers for comparable performance on vision tasks (I still have to test coding). Oh, and it's hot as hell.

SwellJoe · 20h ago

6 tokens per second is not fit for interactive use. I find Gemma 4 (QAT 4-bit, MTP) to be tolerable at about 30 tokens per second on my old GPUs. Anything slower than 15 is annoying. I tried DS4 on my Strix Halo (a 1-bit quantization of DeepSeek V4 Flash, the biggest model that can realistically run on 128 GB right now), and it tops out at something like 10 or 11 with a long time to first response, and that's quite painful to use. I'd definitely rather spend money to use the big models on cloud infrastructure.

And, the several thousand dollars it costs to run these things unusably slowly buys a lot of tokens on the cheap Chinese models.

radku · 1d ago

I have pretty much this exact setup, with 2x3090s, slightly faster 512 GB DDR4, and a 64-core Epyc! [0] I've been enjoying it a lot. Can't wait to give this model a try.

Apart from running local models, I use this rig as my main remote development platform. All Claude Code sessions are running there in tmux now. And my fingers couldn't be happier not having to deal with a constantly hot laptop. Not to mention that Claude Code is such a battery hog.

[0] https://medium.com/@rathko/i-built-an-epyc-64-core-512gb-ram...

redox99 · 1d ago

That's crazy good for $2400.

ikari_pl · 1d ago

I can work out max 90GB to the agents. Advise. :)
