On top of that, you will still be heavily quantized.
You can cluster these beasts too. Two and three (with two IP subnets) is fairly obvious. Four or more might need a switch depending on how much network latency affects things.
Apple seem to have forgotten about M series with gobs of RAM. I can't get the Apple shop to show more than 96GB of unified RAM and that costs a kidney.
If you are training and doing research it's great, and if you want to cluster them it can't be beat. But if you just want local inference on a single box, buy a Mac or even a Strix Halo device.
(Still potentially very useful! But not magically ultra fast.)
I am very excited for local LLMs. I think we may have GPT 5.5-xhigh level of performance for under 2000 EUR.
This should put more pressure on the frontier models to avoid sitting on any fancy stuff and lower token prices as a whole.
Nothing beats a local LLM disconnected from the cloud.
The other way this could go is that Z.ai could decide to release a smaller model targeted towards coding. They've done that before (GLM-4.7-Flash had 30B params). It would be great if they decided to release something in the 80B-100B param range. Something that size would easily run in a current Strix Halo system.
We are maybe 10 years off that.
RAM prices are going to continue to increase for the next 2 years at least.
Even putting that aside, it currently costs around 40,000-70,000 EUR to run this with an FP8 quantization (which you need to get close to maximum performance).
To actually get GPT 5.5-xhigh performance in the real world you need more headroom to support things like subagents (which will fill up your KV cache).
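To put a rough number on the KV cache point, here is a back-of-the-envelope sketch. It assumes plain grouped-query attention with one key and one value vector per layer and KV head. The layer count, head count and head size are made-up placeholders, not GLM 5.2's real architecture, and models with compressed KV layouts will need less.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache needed to hold `tokens` cached tokens.

    Assumes standard attention storing one key and one value vector
    per layer and per KV head (no KV compression).
    """
    # Factor 2: a key vector and a value vector are both cached.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical architecture numbers, NOT taken from any model card.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
CONTEXT_TOKENS = 128_000

one_session = kv_cache_bytes(CONTEXT_TOKENS, LAYERS, KV_HEADS, HEAD_DIM)
# A main agent plus three subagents, each holding its own full context.
four_sessions = 4 * one_session

print(f"one session:   {one_session / 2**30:.1f} GiB")
print(f"four sessions: {four_sessions / 2**30:.1f} GiB")
```

The cache grows linearly with both context length and the number of concurrent contexts, which is why subagents eat headroom so quickly.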
I like local models but realism is important. The sweet spot for the next 3 years will continue to be ~35B MoE models. They might match GPT 5.5-xhigh for chat-style problems but not for coding.
Also I'm not sure where you are getting the under 2k value. I bought a Framework Desktop 128GB last year and my setup was around 2.7k. The same setup now sells for around 4.7k.
Do not mix the benchmark results of GLM 5.2 FP16/FP8 with FP4 or FP2.
* FP4 will mean an accuracy loss of about 3%. Not noticeable, but more chance for mistakes.
* FP2 ... which is what most people are able to run at home for a "reasonable" price. You're looking at over 17% loss in accuracy.
At that point, you're running at less than claude-sonnet-4.6, as the accuracy losses compound. And reasonably priced is still in the ~$5000 range (192GB + 32GB GPU for the active params / KV cache).
For that price you could be using a Codex / Claude Pro subscription for the next 4+ years with better models (by default), let alone compared with an FP2 GLM 5.2 version. And you're looking at < 10 tok/s. A Mac Studio with 512GB will net you 18 to 20+ tok/s with FP4, but ... I mean, those used to be $10,000.
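For a feel of why the bit width decides what hardware you need, here is a rough footprint calculation. The parameter count is an assumption: I picked 750B only because it makes the 16-bit size land at the ~1.5 TB figure quoted elsewhere in this thread. Real quant formats also store scales and keep some tensors at higher precision, so actual files come out somewhat larger.

```python
def weight_bytes(params, bits_per_weight):
    """Approximate bytes needed to store `params` weights at a given bit width."""
    return params * bits_per_weight / 8


# Hypothetical parameter count, chosen so 16-bit weights come to ~1.5 TB.
ASSUMED_PARAMS = 750_000_000_000

for name, bits in [("FP16", 16), ("FP8", 8), ("FP4", 4), ("FP2", 2)]:
    gb = weight_bytes(ASSUMED_PARAMS, bits) / 1e9
    print(f"{name}: ~{gb:,.1f} GB of weights")
```

Under that assumption the 2-bit weights alone come to a bit under 190 GB, before any KV cache.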
Unfortunately the local hardware cost is a major issue for running large models like that.
https://unsloth.ai/docs/models/glm-5.2#usage-guide
In a prior thread, someone said it would take $500k in hardware:
NVFP4 at reasonable speeds (~120 tok/s) and concurrency is possible for around $80-90k at today's prices, maybe even less. That buys you 6 RTX 6000 PRO Blackwells, a decent CPU and motherboard, and a power supply. 576GB of VRAM.
You could do it for under $50k if you're OK with 40 tok/s decode, ~1200 tok/s prefill.
Local inference really needs a system with 1.2-1.5 TB/s of memory bandwidth, 512GB of memory, and 2-3 times the GPU compute of a Mac Studio M3 Ultra, at an affordable 10-15k price point. A variant with 1TB of memory would also be welcome at a 20k price point.
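The reason bandwidth matters so much: during decode, every generated token has to stream the active weights from memory, so bandwidth divided by active bytes per token gives a ceiling on tok/s. A minimal sketch of that rule of thumb follows. The active parameter count is a placeholder I made up for an MoE model, and real throughput will be lower once KV cache reads and other overhead are counted.

```python
def max_decode_tok_per_s(bandwidth_bytes_per_s, active_params, bits_per_weight):
    """Upper bound on single-stream decode speed when memory-bandwidth bound."""
    # Each token must read every active weight once.
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_bytes_per_s / bytes_per_token


# Hypothetical number of parameters active per token (MoE), not a real spec.
ASSUMED_ACTIVE_PARAMS = 40e9

for tb_per_s in (1.2, 1.5):
    for bits in (8, 4):
        bound = max_decode_tok_per_s(tb_per_s * 1e12, ASSUMED_ACTIVE_PARAMS, bits)
        print(f"{tb_per_s} TB/s, {bits}-bit weights: at most ~{bound:.0f} tok/s")
```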
A model like GLM-5.2 being available as GGUF, usable through llama.cpp/Ollama/vLLM/SGLang/LM Studio, and wrapped for local agent workflows changes the category. It stops being "an impressive open model exists" and starts becoming "this is something a small team can actually put into its development stack."
For instance, a company buys an RX6000 setup for, say, $15k total. They could use this for data-heavy sifting that would otherwise burn a lot of Claude tokens.
It doesn't need to be as good as frontier-best. Just good enough.
I could see a business of people packaging this and handing it to companies who want Help Desk bots without any extra setup.
- new AI desktops with GB10s. They are relatively cheap and you can cluster them and load 1TB of VRAM
- Nvidia, AMD, Intel, Cerebras etc pushing new hardware
- OSS models getting crazy good, like GLM 5.2
- flash models getting very good, like DeepSeek V4 Flash
- quantizations
- harnesses being able to use different models (big for difficult stuff, small for grunt work)
So hopefully soon for the ones who want to break free from APIs, we will be able to host at home a cluster of AI desktops at a reasonable price with Opus-level capabilities, can't wait!!
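The harness point in the list above (big model for difficult stuff, small for grunt work) can be as simple as a routing function. This is only a toy sketch: the model names are placeholders and the difficulty label is something a hypothetical harness would have to assign itself.

```python
from dataclasses import dataclass

# Placeholder identifiers for whatever models are actually served locally.
BIG_MODEL = "big-local-model"
SMALL_MODEL = "small-flash-model"


@dataclass(frozen=True)
class Task:
    prompt: str
    difficulty: str  # "hard" or "easy"; hypothetical label set by the harness


def pick_model(task: Task) -> str:
    """Send hard tasks to the big model and everything else to the small one."""
    return BIG_MODEL if task.difficulty == "hard" else SMALL_MODEL


tasks = [
    Task("Refactor the auth module across 12 files", "hard"),
    Task("Rename this variable", "easy"),
]
for task in tasks:
    print(f"{pick_model(task):>18}  <-  {task.prompt}")
```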
Kinda shows they have a headstart rather than a magic moat
There should be more native 4-bit, 1.25-bit and similar models. Those actually work great while being much smaller in comparison. But I guess there is some reason for them being pretty niche.
Offloading too many layers to CPU is not going to work at all. I have tried this many times and had to rm -rf those heavy HF cache folders. Also I doubt 1-bit or 2-bit quants of GLM 5.2, running mostly outside of VRAM, can beat Q8_0 of Qwen3.6-27B fully loaded in VRAM on usefulness.
But I don't know how usable GLM 5.2 is vs the Big 2.
Do the runes make it smarter or just run faster (or both)?
...a bit of an odd question: how well do LLMs losslessly compress, as in for cold storage?
I definitely don't have the hardware to run this model at any kind of reasonable speed (and I don't want to use a super aggressive quantization that would kill performance). Even so, I think it would be cool to retain an offline copy, in case... I don't really know, a solar flare destroys the internet some day, or maybe a zombie apocalypse. It would just be cool to have.
But 1.5 TB is a bit too much! If it could be compressed down into something semi kind of reasonable, that would be fun!
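One way to get a feel for the lossless compression question without downloading anything is to split 16-bit values into their two bytes and compress each stream separately. The sketch below uses synthetic Gaussian numbers as a stand-in for real weights, which is an assumption: an actual checkpoint may behave differently. It truncates float32 to bfloat16 and runs zlib over the high bytes (sign plus most of the exponent) and the low bytes (last exponent bit plus mantissa).

```python
import random
import struct
import zlib


def bf16_bytes(value):
    """Return the (high, low) bytes of `value` truncated to bfloat16."""
    # bfloat16 is the top 16 bits of an IEEE float32 (truncated, not rounded).
    bits = struct.unpack(">I", struct.pack(">f", value))[0] >> 16
    return bits >> 8, bits & 0xFF


def ratio(buf):
    """Compressed size divided by original size (lower is better)."""
    return len(zlib.compress(bytes(buf), 9)) / len(buf)


def compression_ratios(n=200_000, sigma=0.02, seed=0):
    """Compress the high-byte and low-byte streams of n synthetic 'weights'."""
    rng = random.Random(seed)
    high, low = bytearray(), bytearray()
    for _ in range(n):
        h, l = bf16_bytes(rng.gauss(0.0, sigma))
        high.append(h)
        low.append(l)
    return ratio(high), ratio(low)


high_ratio, low_ratio = compression_ratios()
print(f"high bytes (sign/exponent): {high_ratio:.2f} of original size")
print(f"low bytes (mantissa):       {low_ratio:.2f} of original size")
```

On this synthetic data the exponent-heavy bytes shrink a lot while the mantissa bytes barely shrink at all, so the overall saving is real but well short of halving the file.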
But I do like Unsloth Studio, quite a lot. It's nicely designed.
Only a select few have the hardware required to run this to begin with, and even then the forecasted performance makes me wonder if it’s worth it at all.
And yet Apple won't sell them to you anymore. And I'm not too confident it will even be possible to hand them 10k to get one again.