This setup is extremely optimized down to the last flag. Changing any param above the temp flag craters performance.
I don't have enough system RAM to properly handle the large context windows, so I don't use local models.
# 1,257 tokens 17s 72.18 t/s
$env:CUDA_DEVICE_SCHEDULE = "SPIN"
cd D:\src\llama.cpp\
.\build\bin\Release\llama-server.exe `
--port 8080 `
--host 127.0.0.1 `
-m "D:\LLM\Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf" `
-fitt 2048 `
-c 98304 `
-n 32768 `
-fa on `
-np 1 `
--kv-unified `
-ctk q8_0 `
-ctv q8_0 `
-ctkd q8_0 `
-ctvd q8_0 `
-ctxcp 64 `
--mlock `
--no-warmup `
--spec-type draft-mtp `
--spec-draft-n-max 2 `
--spec-draft-p-min 0.1 `
--chat-template-kwargs '{\"preserve_thinking\": true}' `
--temp 0.6 `
--top-p 0.95 `
--top-k 20 `
--min-p 0.0 `
--presence-penalty 0.0 `
--repeat-penalty 1.0

Also, funny lumping the M4 "and" the M5; I find a 15% to 45% performance difference between them, depending on the workload.
And for a good deal of work, an M3 Studio Ultra outpaces the M4 and ties the M5 when running one job at a time, and outpaces both when running several jobs at once.
Ahh no, I'm using the MLX version; it's about 5-10% faster than GGUFs in my experience.
The Q4_K_XL bit for those not in the know.
Local models do involve some context engineering to get them working okay, but it's not that rough.
This isn't really good enough. Many of us need to get things done in a pinch, and if our employers are already getting used to the idea of paying for enterprise subscriptions to cloud LLMs, then the local option needs to be good.
I bought the memory at the end of last year, and at the time I thought it might be excessive. No game will use that much memory for a decade or more.
Now I realize it was one of the best purchases ever. I run qwen3-coder-next on it for just the cost of the electricity, while the coding and agents and whatever else is done on a laptop. Yeah, it is slower, but I don't care. Infinite tokens are better than a few.
The cloud is another computer, but in this case it is mine =)