0 points · zozbot234 · 9d ago · 0 comments

Maybe we shouldn't be running these models on laptops with their thermally constrained form factor, and we shouldn't expect quick inference on a par with a large cloud-based platform either, at least not for near-SOTA model quality. It's still worth it to avoid becoming massively reliant on centralized services.

20 comments · 3 top-level

greenavocado · 9d ago · 16 in thread

I have a 5070 12 GB laptop GPU and can hit 72 tokens per second in the first couple thousand tokens before dropping to mid-high 50s after about 15k context.

This setup is extremely optimized down to the last flag. Changing any param above the temp flag craters performance.

I don't have enough system RAM to properly handle the large context windows so I don't use local models.

  # 1,257 tokens 17s 72.18 t/s

  $env:CUDA_DEVICE_SCHEDULE = "SPIN"
  cd D:\src\llama.cpp\
  .\build\bin\Release\llama-server.exe `
    --port 8080 `
    --host 127.0.0.1 `
    -m "D:\LLM\Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf" `
    -fitt 2048 `
    -c 98304 `
    -n 32768 `
    -fa on `
    -np 1 `
    --kv-unified `
    -ctk q8_0 `
    -ctv q8_0 `
    -ctkd q8_0 `
    -ctvd q8_0 `
    -ctxcp 64 `
    --mlock `
    --no-warmup `
    --spec-type draft-mtp `
    --spec-draft-n-max 2 `
    --spec-draft-p-min 0.1 `
    --chat-template-kwargs '{\"preserve_thinking\": true}' `
    --temp 0.6 `
    --top-p 0.95 `
    --top-k 20 `
    --min-p 0.0 `
    --presence-penalty 0.0 `
    --repeat-penalty 1.0

themanualstates · 9d ago

That’s useless without describing WHY you chose those flags, and how you did the optimisation…

halJordan · 9d ago

The switches are all in the -h of llama.cpp (although the maintainers have a tendency to use the word in its own definition). The actual values are essentially just what Alibaba recommends, so you just need their model card. I would not call it highly optimized; more appropriately, tuned.

greenavocado · 9d ago

I found every possible flag and its description, including CUDA-related environment variables, and went back and iterated with Claude Opus 4.8 High until every single flag above the temp one mattered.

nateb2022 · 9d ago

I get over 100 tok/s sustained on my M4 Max and M5 Max, both in MacBook Pros. LM Studio + MLX.

boguscoder · 8d ago

Same experience on M4 Max, but the quality of Qwen still leaves so much to be desired after getting used to virtually unlimited tokens at work. Many people in this (and similar) threads seem to believe local models will inevitably improve, and I want to believe this too, but I don’t see it ever happening without them growing in size.

Terretta · 9d ago

With Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf?

Also, funny lumping the M4 "and" the M5; I find them 15% to 45% apart in performance, depending.

And for a good deal of work, an M3 Studio Ultra outpaces the M4 and ties the M5 when running one job at a time, and outpaces both when running several at once.

nateb2022 · 8d ago

> With Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf?

Ahh no I'm using the MLX version, it's about 5-10% faster than GGUFs in my experience.

mattmanser · 9d ago

That's a quant 4 which the thread OP specifically called out as rubbish.

The Q4_K_XL bit for those not in the know.

greenavocado · 9d ago

I typically find myself using a context of between 150-500k with GPT models so local models are simply not enough and I stopped using them.

stymaar · 9d ago

That's way higher than their optimal ceiling (and absolutely suboptimal from a token cost point of view), why are you doing that?

c0rruptbytes · 9d ago

large contexts degrade the performance - attention doesn't work well for large windows like that and cloud models are kind of hacking it

local models do involve some context engineering to get it okay, but it's not that rough
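
For a sense of why long contexts are costly on local hardware, here is a rough sketch of KV-cache size for a plain grouped-query-attention transformer. The layer count, KV-head count, and head dimension are placeholders picked for illustration, not the actual Qwen3.6-35B-A3B configuration, and models with hybrid or linear-attention layers may need less than this formula suggests. The 98304 context and the q8_0 cache type come from the command earlier in the thread; q8_0 packs 32 int8 values plus one fp16 scale per block, which works out to 8.5 bits per element.

```python
# Rough KV-cache size estimate for a standard attention transformer.
# The architecture numbers below are HYPOTHETICAL placeholders,
# not the real config of any model mentioned in this thread.

Q8_0_BITS = 8.5   # 32 int8 values + one fp16 scale per 32-element block
F16_BITS = 16.0


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bits_per_element):
    """Bytes needed to hold the K and V caches for n_ctx tokens."""
    # Factor of 2: one K vector and one V vector per layer, head, and token.
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elements * bits_per_element / 8


if __name__ == "__main__":
    n_ctx = 98304                                   # from `-c 98304`
    n_layers, n_kv_heads, head_dim = 40, 4, 128     # placeholder values
    for name, bits in (("f16", F16_BITS), ("q8_0", Q8_0_BITS)):
        size = kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bits)
        print(f"{name}: {size / 2**30:.2f} GiB")
```

Under these placeholder numbers the q8_0 cache is a little over half the size of an f16 one, and either way the cost grows linearly with the context length, so a 150-500k window multiplies it accordingly.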

stymaar · 9d ago

Anyone calling Qwen3.6-35B-A3B-Q4_K_XL “rubbish” has no idea what they are talking about.

embedding-shape · 9d ago

I'd agree that the quality degrades a lot between Q8 and Q4, borderline unusable as they start to fail with tool calling syntax even. Personally I'd say Q8 is as low as you want to go.

greenavocado · 9d ago

He's probably calling me out for this comment https://news.ycombinator.com/item?id=48557579

c0rruptbytes · 9d ago

q4 isn't rubbish, but it's a compromise for a good value, q6 is essentially a no-compromise quantization and it's what i recommend for MoEs in my experience for agentic workflows
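
To make the Q4 / Q6 / Q8 trade-off in this subthread concrete, a back-of-the-envelope sketch of weight size at each level may help. It assumes every weight is stored in a single ggml block format (4.5, 6.5625, and 8.5 bits per weight for Q4_K, Q6_K, and Q8_0 respectively) and takes the 35B total parameter count from the model name. Real GGUF files, the UD-Q4_K_XL one included, mix tensor types, so actual file sizes will differ.

```python
# Approximate weight footprint at different quantization levels.
# Assumes a uniform block format for all tensors, which real GGUF
# files do not use; treat the results as ballpark figures only.

BITS_PER_WEIGHT = {
    "Q4_K": 4.5,      # 144 bytes per 256-weight block
    "Q6_K": 6.5625,   # 210 bytes per 256-weight block
    "Q8_0": 8.5,      # 34 bytes per 32-weight block
}


def approx_weight_gib(n_params, bits_per_weight):
    """Approximate size in GiB of n_params weights at the given bit width."""
    return n_params * bits_per_weight / 8 / 2**30


if __name__ == "__main__":
    n_params = 35e9  # "35B" from the model name; total, not active, parameters
    for name, bpw in BITS_PER_WEIGHT.items():
        print(f"{name}: ~{approx_weight_gib(n_params, bpw):.1f} GiB")
```

The point of the sketch is only the ratio: moving from Q4 to Q8 nearly doubles the memory needed for the weights, which is what makes Q4 attractive on a 12 GB GPU despite the quality concerns raised above.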

ridiculous_leke · 9d ago

Can you comment on the quality and accuracy of it? People have managed to run Gemma 26b without GPU on old CPUs but I don't think quality is anywhere close to what Gemma 12b offers.

stemlord · 8d ago · 1 in thread

> It's still worth it to avoid becoming massively reliant on centralized services.

This isn't really good enough. Many of us need to get things done in a pinch, and if our employers are already getting used to the idea of paying for enterprise subscriptions to cloud LLMs, then the local option needs to be good.

wolvoleo · 8d ago

For me I use only cloud for work. But I'd never trust any of my personal data to it.

Shorel · 7d ago

I have three laptops and a desktop. The desktop has 128GB of RAM.

I bought the memory at the end of last year, and I was thinking, maybe this is excessive. No game will use that much memory in a decade or more.

Now I realize it was one of the best purchases ever: I run qwen3-coder-next on it for just the cost of the electricity, while the coding and agents and whatever else is done on a laptop. Yeah, it is slower, I don't care. Infinite tokens is better than a few.

The cloud is another computer, but in this case it is mine =)
