ggerganov · 9d ago

I haven't spent a dime on cloud inference, so I cannot make a direct comparison like you. But I can 100% attest to the fact that Qwen3.6-27B is a very capable local model for coding tasks. Over the last month and a half I've been using it almost daily, either on my M2 Ultra or on my RTX 5090 box. I use it for small mundane tasks at ggml-org [0] - nothing really impressive, but definitely a helpful tool for a maintainer. I think I would be using it much more if I didn't have to spend so much of my time reviewing PRs.

Currently, I have a very lightweight harness - the pi agent with everything stripped (`pi -nc --offline`) and a short system prompt [1] to align it a bit with my style.

About the generation speed: ~100-150 t/s on the RTX 5090 and ~40 t/s on the Mac. I definitely prefer running it on the RTX machine - it's so much faster. But for the sake of testing and getting wider experience with local configurations, I often run it on the Mac too.

[0] - https://github.com/search?q=%22Assisted-by%22+user%3Aggml-or...

[1] - https://github.com/ggml-org/llama.cpp/blob/master/.pi/gg/SYS...

6 comments · 6 top-level

trilogic · 9d ago

I can also confirm that local inference is on par with proprietary cloud services (with a bit of local setup, a simple agents.md and some utils skills). These local models come with tools, which is mind-blowing considering that some months ago we had to define tools in .md files ourselves. What makes a model worth even more is "Memory". We implemented that long ago. The last time I used proprietary services was 3 months ago; I don't really need them, and my subscription is going unused.

Gerganov, I hope you will consider developing the CLI further, because we are suffering with the server.

kpw94 · 9d ago

> About the generation speed: ~100-150 t/s on the RTX 5090 and ~40 t/s on the Mac

Curious if you can share the prefill speed too?

I run locally on a crappy desktop (some AMD iGPU with Vulkan llama.cpp, 32 GB DDR4 RAM) for experimentation. I get 15 tok/s on generation for the qwen & gemma4 MoE models. I get around 150 tok/s prefill speed.

The reason I'm asking about prefill is that, looking at my stats at work, I use between 20M and peaks of 300M input tokens daily. Some of those tokens are cached, but in general I seem to have roughly 500x more input tokens than output. So I'm interested in prefill tok/s stats.

Huge thank you for llama.cpp btw!!
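
For a rough sense of why prefill matters at this input/output ratio, here is a back-of-the-envelope sketch in Python. It plugs in the figures from the comment above (20M input tokens per day, roughly 500x more input than output, 150 tok/s prefill, 15 tok/s generation). The function name `inference_hours` is made up for illustration, and the estimate assumes no prompt caching and a single sequential stream, so it is closer to an upper bound than a measurement.

```python
def inference_hours(input_tokens, prefill_tps, output_tokens, gen_tps):
    """Return (prefill_hours, generation_hours) for a given token workload.

    Assumption: every input token is processed once at prefill_tps
    (no cache hits) and every output token is produced at gen_tps.
    """
    prefill_seconds = input_tokens / prefill_tps
    generation_seconds = output_tokens / gen_tps
    return prefill_seconds / 3600, generation_seconds / 3600


# Figures quoted in the comment above; the 500x ratio is approximate.
input_tokens = 20_000_000
output_tokens = input_tokens // 500  # 40,000 output tokens

prefill_h, gen_h = inference_hours(
    input_tokens, prefill_tps=150, output_tokens=output_tokens, gen_tps=15
)
print(f"prefill: {prefill_h:.1f} h, generation: {gen_h:.2f} h")
# prints "prefill: 37.0 h, generation: 0.74 h"
```

Under those assumptions a day's worth of input would take about 37 hours of prefill against well under an hour of generation, which is why prefill tok/s is the more relevant number for this kind of workload.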

girvo · 9d ago

> Currently, I have a very lightweight harness - the pi agent with everything stripped (`pi -nc --offline`) and a short system prompt [1] to align it a bit with my style

This really is the secret to getting the most out of these models IMO. Pi is so damned good. I have a strongly tuned Pi for running Step 3.7 Flash (IQ4_XS) and Qwen 3.6 27B (FP8)

Also, thank you for llama.cpp mate :)

celrod · 9d ago

What quant do you run it at? 32GB seems like cutting it close on the RTX 5090 if going 8-bit, but other commenters are saying 4-bit lobotomizes the model.
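
A quick way to sanity-check the quant question is to estimate the weight footprint alone. The sketch below is Python and assumes exactly 27e9 parameters, a uniform number of bits per weight, and 1 GB = 1e9 bytes; `weight_gb` is a hypothetical helper. It ignores the KV cache, activations, and per-block quantization metadata, all of which add to the real footprint.

```python
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 27e9  # assumption: "27B" taken as exactly 27 billion parameters
VRAM_GB = 32     # RTX 5090 memory as quoted in the thread

for bits in (8, 4):
    gb = weight_gb(N_PARAMS, bits)
    print(f"{bits}-bit: ~{gb:.1f} GB of weights, ~{VRAM_GB - gb:.1f} GB left over")
# prints:
# 8-bit: ~27.0 GB of weights, ~5.0 GB left over
# 4-bit: ~13.5 GB of weights, ~18.5 GB left over
```

So under these assumptions 8-bit leaves only about 5 GB for context and everything else on a 32GB card, which fits the "cutting it close" intuition, while 4-bit leaves plenty of room at the cost of whatever quality is lost.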

toddmorey · 9d ago

For the curious, it looks like a PC with an RTX 5090 32GB graphics card will run you about $6,000.

fridder · 9d ago

Not too shabby. I like the regular Qwen, but prompt prefill on my M1 Max is slow as hell.
