If you throw the right GPUs at the problem, they become much better - but still not quite in the realm of Sonnet or DeepSeek 4 Flash, let alone Opus / DeepSeek Pro or Mythos/Fable/GPT-5.5.
Given enough budget, power, and cooling, you can run some pretty good data pipelines, but for code, I think it still makes sense to shell out to an API provider most of the time.
> but still not quite in the realm of Sonnet or DeepSeek 4 Flash
these are not mutually exclusive anymore. DS4 has set the bar for me these days. https://github.com/antirez/ds4
me thinks there's a lot of optimization strats we're currently leaving on the table just because the amount of things to explore and test are so expansive. but this one is super interesting targeting metal primarily and zeroing in on one model. instead of a one size fits all llama.cpp im very interested to see if theres a future where super tailor-made variants per model pans out to harnesses that can rapidly switch ultimately providing something akin to sonnet/early opus territory (that's my personal bench mark of good-enough i shall now cancel the hell out of this claude sub)
Even faster with the MLX builds.
Then when I need more heavy lifting I fire up a larger model.
IMHO the issue isn't the models. I've had OpenClaw give the same results as Claude using open models locally. Slower but does the job. Something that can do optimal model switching is what's needed.
You can do coding and agentic fine. For coding I use qwen3.6:35b-mlx and agentic granite4.1:3b works fine.
These are the models I use.
- granite4.1:3b
- granite4.1:30b
- gpt-oss:20b
- gpt-oss:120b (less so now)
- mistral-small3.2
- qwen3.6:35b-mlx
There will always be use cases that don't sit on your laptop, but most of what can be done can be done locally, it just requires a good framework to sit on it.
You're right that prefill kills perf, but shrug the GB10 has far more compute than it has memory bandwidth, so prefill isn't it's bottleneck.
I’m getting 40tk/s decode with 1000+tk/prefill with a 198B-A11B model on mine
The main reasons to use local models are:
1. Self-sovereignty & control
2. Data security
3. Offline availability
If none of those apply to you, then you should just use OpenAI or Anthropic.
llama-server.exe --host 0.0.0.0 --alias "Qwen3.6-27B-MTP" -m "F:\Qwen3.6-27B-UD-Q4_K_XL-MTP.gguf" -c 75000 -ngl 99 --metrics --temp 0.6 --top-p 0.95 --min-p 0.00 --top-k 20 --presence-penalty 0.0 --no-mmap -t 16 --spec-type draft-mtp --spec-draft-n-max 3 --reasoning on -fa on --parallel 1 -lv 4
Note that this does not use kv cache quants as in my case quants offload to CPU and tanks performance. Also keep in mind this almost maxes VRAM usage so any additional browsers or other programs that use VRAM should be closed. For chat go to http://localhost:8080/ and minimize the window to maximize perf as the web page UI draw itself consumes a lot of GPU perf via constant context switching.
Can try bigger than -c 75000 until perf gets lower than 100 tok/s - that means something is off as windows starts paging out memory or other issues. -c 50000 seems sweetspot if running browsers and stuff that consume 2GB VRAM. If wanting more than -c 140000 then likely need to use a bit smaller model quant.
CPU usage should be near zero, maybe 1 core load. If you see 8+ core load then settings are off and something is offloaded to CPU (for example kv cache). GPU load should be about 100%, meaning it utilizes work optimally in this case.
-t 16 can be omitted or set to the amount of physical cores, not important in this dense model that is 100% in GPU.
Can be pushed to 125 tok/s with that model if using --spec-draft-n-max 4 but VRAM usage also increases, so context needs to be smaller.
If speed is not important and want max context length then remove the draft-mtp parameters and also might need to use k and v quants like --cache-type-v q8_0, leave k f16 if possible to keep quality.
Prices will fall in the next few years. Maybe just play with the tiny toy models for now to learn how they work, then keep using API providers until they do.
Ultimately if you skip over the opportunity to play with these models on your own machine you are losing out on a lot of really interesting educational opportunities — it helps make a lot of stuff feel more concrete in a way that only tinkering can.
But then I think once I had an idea of something that I was building against Gemma 4 or Qwen 3.6 I would be looking at openrouter etc., to stabilise it for the next tier of experimentation (and to get back a kind of multi-device access without tailscale/lm link etc.).
Are they good enough to replace what people seem to want to do with Claude? Maybe not. But it's an unparalleled learning opportunity.
It's worse at general tasks, but in the precise domain of coding I actually prefer to use it over my claude subscription because it has 0 latency (and no privacy concerns whatsoever).