undefined | Better HN

0 pointsbensyverson6h ago0 comments

The article is based on running Qwen 3.6 on a 128GB MacBook Pro. For reference, a 128GB MBP currently starts at $6699 USD [0]

Some people will be happy to pay that premium for privacy, but at roughly 10X the cost of a MacBook Neo, that money could also buy a lot of credits on OpenRouter or frontier labs.

[0]: https://www.apple.com/shop/buy-mac/macbook-pro/14-inch-space...

0 comments

99 comments · 25 top-level

dofm6h ago· 26 in thread

The maths there is pretty undeniable, but it is not where I'd make the split. Having a machine that can run some modest local LLMs, like the Gemma 4 12B, is really worth it.

I don't know how much serious hands-free agentic coding I will ever do on my MacBook alone, but I do know that I would not have got so far into understanding this without tinkering with local models, llama.cpp, LM Studio, and LM Studio and all that.

I totally struggled to find the right frame of mind to explore any of this stuff without feeling defeated and bamboozled. Because it's just huge, exhausting, jargon-drenched, unknowable, and I am over the hill at fifty-plus.

Until, that is, I could poke around with setting it up on my own (secondhand) machine, watching the API calls, understanding some of the terminology. I didn't even buy the machine for that; it's just adequate to the task.

The Neo is too small to really get much benefit from this opportunity to make it more visceral and knowable.

pizza2345h ago

> Having a machine that can run some modest local LLMs, like the Gemma 4 12B, is really worth it.

Cloud models are (much) faster, they don't consume so much power/generate heat, they have much bigger (LLM) context, they're much more precise and they have a much wider (engineering) context of the given problem.

Except privacy and use cases that are blocked by cloud models (e.g. reverse engineering), local LLMs are currently an expensive toy.

When I try to program with a local LLM (I'm on a 32/128 GB system), I end up wasting time compared to a cloud LLM.

dofm5h ago

Again, I would not argue against any of this.

And I can't say that I won't switch to openrouter (even just for the same models) at some point.

But one of the things I have found about my own process learning is that some lessons only come to you when you make yourself available to them. And if that means doing things the difficult way, that is what you should do.

1 more reply

sanderjd4h ago

> currently

The interesting question is whether that gap will narrow, and if so, how much, and on what timescale.

The exact answer to this question is not knowable, but if you are the kind of person who comes to a site called "hacker news", and you think there is a nonzero chance that the answer is that yes, the gap will narrow and this won't always be an expensive toy, then now seems like a pretty great time to get in the game and start exploring the capabilities.

bogeholm4h ago

> Cloud models […] don't consume so much power/generate heat

I do realize the cloud is just someone else’s computer right? Power goes in, tokens and heat come out - just in another place

1 more reply

AlpacaJones5h ago

The key word there is 'currently'.

1 more reply

psychoslave5h ago

Anything done local will likely come at higher cost and at scale with less energy efficiency and commodity, with less possibility to fine tune engineer deeply on wider horizon of issues.

That's never the point of keeping local alternatives though.

1 more reply

VerifiedReports4h ago

Exactly. The distinction between the various layers in "AI" systems is pretty vague to the newcomer. What is the "model" vs. the engine "running" it vs. weights?

I don't recall any previous tech stack that was barfed onto the scene with so little background or reference material, going from zero to endless undefined jargon... and no primer in sight.

For people who demand an understanding of their tools, it's a lot of work. I recognize the value of "AI" in performing the tasks I'd have to do manually; for example, keeping the data structures of my front- and back-ends in sync in a project. But do I want to interrupt my development and take weeks off to digest all of these tools?

And if I do, I want to run the show and fully understand it. And like you, I think that's best done locally.

Fr0styMatt884h ago

The most unexpected thing for me was kind of philosophical in a ‘holy shit’ way.

Cloud models still feel ‘magic’, like you send a request off and get something back, like it’s something ‘special’. I used to joke that ChatGPT might be some kind of mechanical turk underneath.

Watching a model run local on your own machine hits different — you realise that yes, it IS just a computer program. Which for me actually makes me appreciate the leap we’ve made MORE, not less. From an information-theoretic point of view, LLMs really are something special.

The fact that they are just programs, that I’ve now experienced first-hand that they’re just programs, makes all those questions around consciousness and intelligence much more interesting.

2 more replies

ricardobayes4h ago

For the most part you can just download LM Studio and go from there. It provides a chat interface and an easy-to-use interface to browse, load and use LLM models. The engine: it is abstracted away by LM Studio, if you want to dig deep it's llama.cpp as the runtime. Weights are the files what you download, they are the models for practical purposes.

1 more reply

codazoda5h ago

I agree with the learning aspect, but I have another motivation. I suspect that closed models might become too expensive to run for personal hobbyist use. I’ve been planning to buy a 64GB machine just to allow the limited local models this enables.

ricardobayes4h ago

I'd say give it some time for the dust to settle. This field badly needs standardized benchmarks even before the conversation around model goodness can start.

not_kurt_godel4h ago

> Having a machine that can run some modest local LLMs, like the Gemma 4 12B, is really worth it.

Agree having a powerful machine is really worth it in general for professionals, but strong disagree that running local LLMs has anything to do with it. It's hard enough as it is getting a good ROI on your time/money prompting/wrangling with frontier models. IMO leaning on the comparatively limited capabilities of local LLMs is best avoided in favor of keeping your own personal coding skills fresh and continuing to learn new ones.

dofm4h ago

I'm not that bothered about my coding skills, which are fine, and pretty up-to-date considering I'm now an old bloke. I am bothered about building an instinctive understanding that helps me deal with my anxieties and decide whether I want to carry on with this working life or quit.

I needed to do this, this way, in my own time, to put my brain back together. It has worked for me, which is why I recommend it.

YMMV.

1 more reply

sanderjd3h ago

Continuing to learn new ones, like what?

To me, "how do contemporary AI systems work and interact with contemporary hardware and how can I best take advantage of their capabilities?" is the set of skills that are worth learning at this moment.

What else is there? New / additional programming languages? New / additional database systems? frameworks? orchestrators? cloud provider / infra tooling? architectural patterns?

I dunno, all of this seems really boring and "been there done that" to me at this moment in time!

1 more reply

rusk6h ago

> I totally struggled to find the right frame of mind to explore any of this stuff without feeling defeated and bamboozled.

I found LM studio to be a nice starting point. Frindlier and more featureful than Ollama and not as intimidating as llama.cpp (though you will want to use that eventually)

dofm5h ago

LM Studio is also nice because of the way the interface explains things; parameters have explanations and hints. It has been designed by people who really care about making it understandable.

I tried Ollama but I've settled on Unsloth Studio generally; once things really settle down I'll just run the llama-server UI, which is pretty nice.

A friend is tinkering with LLMs for amusement on a 16GB Raspberry Pi 5, and when I explained that llama.cpp now had a typical web chat interface he was so happy — it's amazing what the "table stakes" are now.

oceanplexian4h ago

Honestly your best bet is to buy a $20 Claude subscription, ask Claude to set it all up with Pi and llama.cpp and come back in 20 minutes after a cup of coffee. This is also a good idea because it will help set expectations of what a local model can do vs. a frontier model.

mullen4h ago

This is what I did after struggling to get llama.cpp working at a decent speed on my M1 Macbook. The secret is to very specific with your needs and targeted in what you are using llama.cpp for. Mine setup is just about strictly for qwen3-coder and now, I get a fairly decent speed out of it. I also installed Cursor to check Claude and it all worked out well.

cyanydeez6h ago

I've setup to local paradigms for local coding:

- opencode with it's webui

- deer-flow with it's research/powered front end

They both run websites so you don't have to baby sit them (eg, keep your mac open). I've build a pdf compressor over a few days by first having deer flow try and research the frameworks and pipeline. It stalls out because its not really a fluid programmer. Once it stalls out, I transferred it (manually for now) to opencode and it's refactoring it because it's just a collective bundle of sticks and it needs a lot of testing to tweak out the limited scop context. LLMs can't really hold large scopes (locally anyway, from what I've read from HN, it's possible with longer context).

It'll complete in a few days with maybe 3-4 hours of full attention interaction, but it's running 3x that without my attention. Obviously, if I paid more attention it'd run quicker, but since it's local, it's not pumping out large volumes of code, it's mostly looping over tests and capabilities as observed.

It's running Qwen3.6 35B MoE on a AMD 128GB strix halo. If I switched to the dense models, perhaps it'd be smarter, but the trade off seems to be much slower gen.

dofm5h ago

> - opencode with it's webui

Have you tried Paseo?

I have opencode in a VM, and the paseo daemon running in the VM, and then the Paseo Mac app. Really nice.

(You can also use the Opencode GUI to frame a remote opencode web interface)

1 more reply

bsder2h ago

> I totally struggled to find the right frame of mind to explore any of this stuff without feeling defeated and bamboozled. Because it's just huge, exhausting, jargon-drenched, unknowable, and I am over the hill at fifty-plus.

Hello, my brother, just know that you have a fellow passenger in life at the same age who thinks the same thing. I agree that the local stuff is helping my understanding a LOT.

However, my gut feel as someone who got to experience the TeleBomb after the DotBomb is that the obfuscation is INTENTIONAL--it's neither you nor your age. I remember asking people to explain to me what the OC-768 startup endgame was when roughly 10 OC-768 links could carry the world's traffic at the time--and everybody giving me blank looks. The AI Bubble has the EXACT same feel as the Telecom Bubble--just bigger.

What I really wish is that I could find a VPS-type provider where I could toss things into their NVIDIA/AMD machines for an hour or two. Alas, all of the providers seem to want massive paperwork and huge minimum purchases.

I can't wait for the bubble to pop so that we mere mortals can finally build with this stuff.

ddalex6h ago

I just got Claude to download and install all the models and servers and agents and prepare all the launch scripts for me... no need to learn, just ask it to do it for you

dofm5h ago

Right, but I am a middle-aged bloke who is experiencing existential angst about whether I can carry on in this industry.

I have a pretty deep, maybe paranoid need to be confident I have an intrinsic understanding, and I have found in my life that lessons come to you when you make yourself open to learning.

So I need to build on top of what I know, taking as much of the hard way as I can bear to take at any one time — it has to be not quite difficult enough to put me off.

I can't really explain what I have learned this way that is different, but I feel it in a way that I wouldn't if I'd simply pushed a button.

For the same reason, I have a really basic 3D printer that I've set up myself, set up Klipper, configured how I want it, learned how to calibrate, all that. And now I can say that I feel I have an understanding of 3D printing. I could hold my head above water in a discussion with a real expert, maybe find work in an adjacent field where my insights would keep me grounded.

I can afford a really good printer that has all that set up, and more, has no problems. But I'd just be someone who has a 3D printer.

(Also who am I kidding about the existence of a printer with no problems)

2 more replies

swiftcoder5h ago

I don't necessarily think your answer is wrong for all people, but if you work in software... how do you plan to differentiate yourself from everyone else out there, if the depth of your understanding is "Claude can do it for me"?

1 more reply

coldtea5h ago

>no need to learn, just ask it to do it for you

And that's how skills die.

2 more replies

sorokod5h ago

Then what is the point of ddalex?

1 more reply

Catloafdev6h ago· 11 in thread

The model they reference can be easily run with 24gb+ of VRAM, and there are other similar models capable of running easily on 16gb of VRAM. It's not like 128gb is a requirement here.

bitexploder5h ago

For a MBP I have 48 GB of RAM M5 Pro. It runs at about 12-14 t/s at Q4, you could probably optimize it further. RAM is not a limitation but overall memory bandwidth. Q8 is slower. 35B A3B Qwen is quite speedy, but a little less accurate. With Qwen 3.6 27B dense I can squeeze a 9B parameter model and use that for fast analysis or code scanning while 27B is churning on a task in the background. It is tight, but totally reasonable.

The real sweet spot for Qwen 27B is getting it on something like a Dual 3090 system or some other config where it can blaze at 50-80 t/s and that costs well under 6K currently. It is a surprisingly capable model. Using something like GLM for orchestration, specs, task farming and then letting Qwen churn is relatively inexpensive.

Overall I recommend people try models of this class out using OpenCode and some for pay service to experiment with them and understand how they work. I find they are very useful.

Long term, I am convinced enough that if I wanted to use local models for any number of reasons I would be okay investing in a dual GPU box. The Mac is not fast enough for me and M5 Max is just too expensive relative to GPU linux box. Still, it is nice to have the models local ON the laptop and it is useful for what I care about locally.

coder5431h ago

> For a MBP I have 48 GB of RAM M5 Pro. It runs at about 12-14 t/s at Q4

Are you running with MTP enabled? I have seen some people on M5 hardware report 20+ t/s on Qwen3.6-27B using MTP... and I think that was a regular M5, not even M5 Pro.

1 more reply

aunty_helen4h ago

I was doing some benchmarking last night on 2 3090s. The systems but old but I’m seeing 11tks 27b, 15tks 35b MoE.

The limited context is problematic. I’m not exactly sure what it’s got available but hermes was hit and miss on a prospecting job.

It does seem to be doing useful work but it’s not API call level quality

1 more reply

CMay4h ago

At 24GB, Gemma 4 31B QAT will be better and give more concise answers. This post is mostly about unquantized results, so it's less relevant and I can't say much about as I haven't tested Qwen or Gemma via cloud API or unquantized locally. All I can say is locally, quantized in a 24GB scenario, Gemma 4 31B is better in my tests which are mostly reasoning or C programming related.

Gemma 4 is the only model series at this parameter scale I've seen correctly answer some of these. One of the answers even made me re-evaluate what I thought the correct answer was, which I did not expect.

When I look at the Artificial Analysis numbers, I can see that some things about Qwen 3.6 look inflated as a result of either metrics that weren't measured yet for Gemma 4 31B, or for metrics that just aren't going to be relevant in a lot of the essential tasks. In a lot of the relevant metrics, Gemma 4 is either better or on par.

Then once it's all quantized all those benchmark results will be hurt, and Gemma 4 QAT has better quantized performance. I think it's more competitive unquantized than people give it credit for and way better quantized than people give it credit for.

Qwen 3.6 clearly isn't legitimately bad and maybe it's quite nice at fp16, but it was a disaster quantized in a 24GB scenario by comparison.

thewebguyd6h ago

I'd go for at least 32GB+. It'll fit in 24GB but leaves you little to no room for context, and that's at 4-bit quantization.

If you want to run unquantized, you definitely need 128GB.

Catloafdev6h ago

Nobody runs unquantized, there's literally no reason to. Q8 would be the largest anyone actually runs on consumer hardware for inference.

2 more replies

bitexploder5h ago

It also comes down to inference speed, not "can I run this". 8-bit quant is quite a bit slower on an M5 Pro.

Numerlor5h ago

And if you go for actual GPUs it'll run much faster, I'd say 24gb may be pushing it for context, but my 5090 with 32GB VRAM is usually somewhere between 60 to 100 tok/s with mtp and 2-3k tok/s for prompt processing. I'm not sure what they cost now but it's definitely still quite far from the macbook, and there's also some other 32GB GPUs that are considerably more affordable

nok22kon5h ago

a computer with 24 GB VRAM is at least $3000

sleepyeldrazi5h ago

I can't speak for the US, but in Germany (where hardware is usually more expensive, not less), I got my 3090 3 months ago for 750 euro and have been running the iq4_nl 27B using q4 kv (which after recent patches in llama.cpp is in my xp indistinguishably accurate from q8 of f16) at full ctx, with MTP at 2, peaking around 70 t/s on small ctx, around 50 t/s when im around 64k and ends around 40 t/s near the cap. The rest of the PC is a 50 euro ddr3 16gb i5 4th gen box, absolutely nothing special. And this setup is often more useful than dsv4pro (and sometimes kimi, but not glm) for research and ML work.

1 more reply

daemonologist4h ago

A 7900 XTX is about $850, and the rest of the computer basically just needs to boot Linux. You could easily build such a machine for $1500.

Even that isn't strictly necessary - you can get perfectly acceptable performance by splitting a model between multiple older 12 or 16 GB cards.

Insanity6h ago· 8 in thread

But you have to factor in that this device will last you 5-10 years. That said, I wouldn't spend almost $7k USD on this macbook lol.

petilon6h ago

Memory requirements of newer models will increase, so while the hardware may last 10 years it won't be able to run the latest models for 10 years.

roadside_picnic6h ago

My experience working in the open model space pretty deeply (both LLMs and diffusion models) for years now is that it is not quite as simple as that.

In the open model space an insane amount of effort goes into getting more powerful models to run with the same or less RAM. For example in the diffusion world many things that could not be run on easily under 24GB of VRAM actually run much better today with much less VRAM than they did a few years ago. You can do many things today with 8-16GB of VRAM that would not have been possible. At the same time the most advanced open models, like LTX 2.3 for video gen, still seem to respect 24GB of VRAM as the upper bound.

Similarly the standard "big" but localish open model for LLMs back in the day was Llama 3 70B, this was both a much worse and much larger model than Qwen 3.6 27B

So in two different spaces I've witnessed the "RAM required to run the best" decreasing or at least remaining stable, while the performance being achieved in both areas is astounding (LTX 2.3 is faster, better and more capable than the Wan 2.2 model that held popularity before it).

The biggest thing to watch out for is not just RAM/VRAM but memory bandwidth. You can try to "future proof" yourself with lots of RAM, but if it's 400 GB/S you're still constrained to smaller models.

2 more replies

Insanity6h ago

You raise a fair point, but I'm not convinced it'll offer a meaningful difference in performance as long as we're stuck with the current AI paradigm.

bluGill6h ago

Will they? Or will we find ways to optimize models and need less? Only time will tell.

naikrovek1h ago

Available models aren’t really trending upward in size. Not like I thought they would, anyway.

They’re trending to be the right size to be good.

Qwen3.6-35B is not as good as Qwen3.6-27B. The larger model is faster, but a lot dumber; it gets caught in loops, makes crazy mistakes, and is just not as good. It’s bigger, but it is nowhere near as good as the 27B variant.

cyanydeez6h ago

I think you have too much faith in context AGI.

at 128GB, you can find almost it's entire context for Qwen3.6 35B MoE.

Again, I think you have too much faith in extrapolation. It's like you got a baby at 0 months, then measured it at 12 months and expect it to be a giant.

simonw6h ago

It can't run the latest models today - GLM-5.2 class models already need 1TB+ of RAM.

... but, the models that WILL run on 128GB (or 64GB or even 32GB) models today are a huge improvement on the best models that would run in the same amount of memory six months ago.

2 more replies

someperson6h ago

In 5-10 years, incremental cloud tokens will be far cheaper (likely but not guaranteed).

porphyra6h ago· 4 in thread

You can also run Qwen 3.6 27B dense model on DGX Spark with comparable performance [1][2] for about $4000 (Asus Ascent GX10 is $3999 at various retailers).

In theory you can also get 48GB of VRAM with, say, two 3090s, but it will take up a lot of space and generate a lot of heat compared to the Macbook Pro and GB10.

[1] https://x.com/MiaAI_lab/status/2070859135399182444

[2] https://github.com/MiaAI-Lab/Qwen3.6-27B-NVFP4-vLLM

esperent6h ago

> 48GB of VRAM with, say, two 3090s

So like... $2000+ just for the used GPUs? Plus I assume it's considerably more effort to get it working.

fluoridation6h ago

>Plus I assume it's considerably more effort to get it working.

Nah, not really. It is a little annoying in terms of space and power, though. Not every case and motherboard can support cards that big.

lee_ars3h ago

The tweet you link shows "Qwen 3.6 35b NVFP4 - 256k ctx, 110 tok/s", but I'm getting only half that, around 50 tok/sec, on a DGX Spark with Qwen3.6-35B-A3B-NVFP4 (via vLLM) plus speculative decode w/EAGLE3. I'd be ecstatic to see 110 tok/sec and I wish they had some more sourcing for the exact config, because it's double what I'm getting.

edit - after actually reading the tweets (had to use xcancel) and visiting the source git repo, switching to MTP for speculative decode makes things a hell of a lot faster, and the abliterated model plus dflash makes it even faster! I'm now seeing 70-90 tok/sec for most stuff. I like!

porphyrarecent14m ago

I think Atlas might also be slightly faster than vLLM:

https://flowtivity.ai/blog/120-tok-s-1m-context-private-ai-d...

throw12345678915h ago· 4 in thread

But the tokens or credits are gone. MacBook stays. You can run other models on the same MacBook. What I read people burn every month on saas… for that money you break even on that MacBook in 5 months.

Edit: it’s not just “data privacy”, when you are using Claude, you are shipping EVERYTHING to Anthropic. It’s crazy.

wilsonnb35h ago

Companies are already shipping everything to Microsoft or Google and 17 other companies, just the cost of doing business.

throw12345678915h ago

Sure, but no one gets everything. Just that one.

DANmode5h ago

That’s at today-prices.

If the cost doubles, or 4x, which is seems to need to for them to go profitable, what then?

wahnfrieden5h ago

It's much slower, and often quantized

stymaar6h ago· 3 in thread

> The article is based on running Qwen 3.6 on a 128GB MacBook Pro. For reference, a 128GB MBP currently starts at $6699 USD [0]

Qwen3.6-27B would be faster on a 3090 that costs around $1000-1200 though so I don't think it's a good counter-argument.

Op just happened to have that MacBook, but it doesn't mean it's necessary to run the model.

boutell5h ago

That 3090 is going to burn 750W and it will still cap you at a 4 bit quant and ~48K context. Here's someone who worked through it:

https://github.com/noonghunna/qwen36-27b-single-3090

Flies though (50-70tps is impressive for a model this smart)

I went through roughly the same process to get it working on my M2 Macbook Pro... at awful speeds of course, since models like this one are mostly bound by memory bandwidth.

stymaar5h ago

> That 3090 is going to burn 750W

The 3090's TPD is 350W, but given that LLM's token generation isn't compute bound, people usually undervolt these cards to reduce power consumption. IIRC you can get as low as 200-250W without any degradation. Caveat these figures are without speculative decoding and at batch size =1.

1 more reply

hughw4h ago

My eyes glaze over reading all the AI produced verbiage.

I did find a few useful parameter settings I've already discovered using my single 3090 and ollama.

I'm just remarking that the LLMs overwhelm me with minutiae, especially as I'm working on code design. I frequently ask it to restate concisely, and that helps.

[edited to mention ollama as a nice alt]

dannyw6h ago· 3 in thread

I’m running the same model on a 48GB MBP with a q4 quant and it’s pretty decent. You definitely don’t 128GB. That’s the scale for 70B models at q8 or something.

dom966h ago

I've been running it on my 48GB MBP too and it's not particularly great. Super slow and not near enough to the quality provided by even Claude Sonnet.

doodlesdev6h ago

How much does one of those cost in the US? Here in Brazil, your notebook is worth as much as a used Honda Fit, which seems absolutely insane. For comparison, the ThinkPad I'm currently running cost me 1/20 of how much this MBP costs here, leaving me with over $8.000 to spend with LLM inference (if I actually spent money with that).

dannyw6h ago

I purchased mine for approximately $4400 AUD before the price hikes. That unit is now ~$5100 AUD.

I use my MBP essentially as my workstation, it's almost always plugged in. I have a MBA (M4, 24GB RAM) that I picked up for ~A$1500 or so, and that's an amazing daily driver. I don't do local LLM inference on that unit, I can just hit my own APIs (via LM Studio) on the MBP over Tailscale.

organsnyder6h ago· 3 in thread

I run Qwen 3.6 on my Framework Desktop 128GB, and it's very performant. I know Framework has had to raise the price since I preordered mine, but they're still well under half the cost of that Macbook.

andy996h ago

I get ~55 Tok/s on my framework desktop with the 35B A3B q8 model, and so far am also very happy with the coding performance.

cyanydeez6h ago

did you upgrade to MTP?

bityard3h ago

There are several variants of Qwen 3.6, the MoE models are performant on Strix Halo, but the 27B dense model (the one spoken about in TFA, and generally regarded as the best of the group in terms of quality) is not so performant: https://kyuz0.github.io/amd-strix-halo-toolboxes/

h4ny6h ago· 3 in thread

[flagged]

dang3h ago

Yikes, you broke the site guidelines badly with this post. Could you please review https://news.ycombinator.com/newsguidelines.html and stick to them?

You're welcome to make your substantive points thoughtfully, just not aggressively.

kllrnohj6h ago

> maybe tell us how much a non-Apple system that you can run that (probably similarly or faster) would cost?

Ryzen AI Max 395+ with 128GB of unified memory can be found around $3-4k.

But 27B isn't that large, either, especially if you are ok with the quantized models. So this laptop choice seems to more be a "because they had it" rather than "this is what's necessary for this particular workflow"

h4ny6h ago

That's my point. You can run Qwen3.6 27B with MTP and whatever else you want to bolt onto it at 256k context for much less than even a Ryzen AI Max 395+ with 128GB would cost. Even unquantized you don't need 128 GB so given your comment and the downvotes maybe I didn't word my original comment properly for this?

cyanydeez6h ago· 2 in thread

AMD started their 128GB Halo Strix at a pretty damn good point at ~2.5k; I got mine after the first memory bump at $3k.

I think you might be a little to into the stew here.

zdragnar5h ago

I got mine at the same price point, and I've been pretty pleased with it. Tailscale lets me use it from my ultrabook / lightweight laptop, no burning lap or crazy fan noises. Desktops with the amd ai+ 395 are still fairly affordable for what they can do.

I haven't tried it with https://lemonade-server.ai/ yet but I just might give it a shot.

organsnyder5h ago

I'm running Lemonade on Nixos on my Framework Desktop. I had been trying other tools out before finding Lemonade, but Lemonade really made it plug-and-play.

colinsane5h ago· 2 in thread

i like that people are taking the privacy argument seriously, after however many decades. i think there are other arguments to be made for running these locally which are less settled, but IMO the Fable debacle drives it home: the surest way to embrace this technology without worry that it will be taken away from you down the road is to physically own the compute.

r_lee4h ago

if you need to ensure that, then just back up the model and buy hardware if the need arises

colinsane4h ago

that's somewhere between saying "use Android, just switch to Graphene if/when they lock it down", and saying "just switch to postmarketOS/Ubuntu Touch/whatever flavor of Linux takes off".

i've watched friends try that route; i've been through this before. taking a downgrade is never fun: if it's a thing you're likely to care about in the future, then sometimes it's better to place yourself in the right ecosystem early.

1 more reply

AnimalMuppet6h ago· 2 in thread

How many credits would it buy? How long would it take to use them up? What's the payback period?

From what I understand, for a developer, $5000/month is maybe the high end, but $5000/year is fairly standard. (Is that accurate?) So if it pays back in 15 months, that's pretty decent. If it pays back in two months, that's spectacular.

dminik5h ago

Using some rough napkin (well, spreadsheet) math, if you ran Qwen 27B for every minute every day at the current price of $0.195/$1.56 with a 2:1 input to output ratio (eg. agentic coding) at the advertised 22 tps it would take you just about 11 years to get to ~$5000 spent.

Disclaimer: There's a 35% sale from Alibaba right now. And I'm not accounting for input tokens going faster than output tokens.

eli6h ago

Are you comparing the cost of hosted Opus to running Qwen 3.6 locally? That doesn't really seem fair.

acchow4h ago· 1 in thread

That $6700 is a $5000 upgrade over a base model Macbook Pro.

$5000 in US Treasuries (currently at 4.89%) yields $244.5/yr. That's more than enough to cover the annual Claude Pro subscription ($200/yr) which includes Claude Code with lots of Sonnet usage (far better than Qwen 3.6)

neonstatic3h ago

I think the argument isn't that local is cheaper - it's that local is doable and delivers unparalleled privacy.

jeffybefffy5191h ago· 1 in thread

I still dont trust the Anthopic and OpenAI are not training on my code. I even just thinking keeping track of what code you have received in prompts and to train/not train on it seems like an impossibly difficult task.

andrekandre33m ago

am i right in assuming your code is closed-source?

i'd expect anything on github for example to be already in their training set or is training on actual usage more useful to them?

georgeven6h ago· 1 in thread

I have a 1500 dollar machine that can run it at 50 tok/s (3 V100s)

Dig1t6h ago

How did you buy 3 V100's for $1500??

nozzlegear6h ago

Just putting it out there: I run Qwen 3.6 on my M1 Mac Studio with 64gb. It's quantized and all that, but I agree with TFA: it's the sweet spot for local development right now.

montebicyclelo5h ago

Isn't the directionality important. I.e. it is currently possible to run useful / great models locally, but on high end machines; and in a few years we will likely be able to run even better models on standard machines.

dmayle6h ago

For that price you can put together a PC with 128GB of ram ($2000) and an RTX 5090 ($3600) and get 70-100 tokens per second instead of 45

stared2h ago

All experiments with Qwen 3.6 required no more than 48GB Apple Silicon. I believe you can go even further with more aggressive quantizations - one can go down even further.

In any cases, from the economic point of view, running models on laptops make little sense. Even at the pure cost of energy consumption, it might be hard to beat pricing at tokens generated at scale.

At the same time, it is a breaktrough, that will change the game. Previously such vibe coding on consumer device was not hard or costly - it was impossible.

redox994h ago

I bought 2 used 3090s some years ago for $500 each. They're probably a bit more expensive now, but I guess for something like $2000 you can build a barebones 2x3090 PC which will be way faster than a Macbook. (you're fine with very basic hardware outside the GPUs)

dvduval6h ago

Absolutely for the average developer the token speed is just going to be too slow for it to be workable. I think we’re looking at 2028 when memory becomes cheaper again and they’ll be a lot more people using local models.

elorant5h ago

You can get an AMD Strix Halo with half that price even after hardware price adjustments. Besides you don't need 128GB of RAM to run a 27B model.

ricardobayes4h ago

Oh definitely. I've seen GLM 5.2 go for around $4 per million output tokens.

oldfuture6h ago

a lot of credits? we can’t predict any price change for them

trentor5h ago

Runs fine on 2x4080s or on two 5060/5070s with 16GBVRAM... and faster than on the mac.

j / k navigate · click thread line to collapse

0 comments

99 comments · 25 top-level

dofm6h ago· 26 in thread

The maths there is pretty undeniable, but it is not where I'd make the split. Having a machine that can run some modest local LLMs, like the Gemma 4 12B, is really worth it.

The Neo is too small to really get much benefit from this opportunity to make it more visceral and knowable.

pizza2345h ago

> Having a machine that can run some modest local LLMs, like the Gemma 4 12B, is really worth it.

Except privacy and use cases that are blocked by cloud models (e.g. reverse engineering), local LLMs are currently an expensive toy.

When I try to program with a local LLM (I'm on a 32/128 GB system), I end up wasting time compared to a cloud LLM.

dofm5h ago

Again, I would not argue against any of this.

And I can't say that I won't switch to openrouter (even just for the same models) at some point.

1 more reply

sanderjd4h ago

> currently

The interesting question is whether that gap will narrow, and if so, how much, and on what timescale.

bogeholm4h ago

> Cloud models […] don't consume so much power/generate heat

I do realize the cloud is just someone else’s computer right? Power goes in, tokens and heat come out - just in another place

1 more reply

AlpacaJones5h ago

The key word there is 'currently'.

1 more reply

psychoslave5h ago

Anything done local will likely come at higher cost and at scale with less energy efficiency and commodity, with less possibility to fine tune engineer deeply on wider horizon of issues.

That's never the point of keeping local alternatives though.

1 more reply

VerifiedReports4h ago

Exactly. The distinction between the various layers in "AI" systems is pretty vague to the newcomer. What is the "model" vs. the engine "running" it vs. weights?

I don't recall any previous tech stack that was barfed onto the scene with so little background or reference material, going from zero to endless undefined jargon... and no primer in sight.

And if I do, I want to run the show and fully understand it. And like you, I think that's best done locally.

Fr0styMatt884h ago

The most unexpected thing for me was kind of philosophical in a ‘holy shit’ way.

The fact that they are just programs, that I’ve now experienced first-hand that they’re just programs, makes all those questions around consciousness and intelligence much more interesting.

2 more replies

ricardobayes4h ago

1 more reply

codazoda5h ago

ricardobayes4h ago

I'd say give it some time for the dust to settle. This field badly needs standardized benchmarks even before the conversation around model goodness can start.

not_kurt_godel4h ago

> Having a machine that can run some modest local LLMs, like the Gemma 4 12B, is really worth it.

dofm4h ago

I needed to do this, this way, in my own time, to put my brain back together. It has worked for me, which is why I recommend it.

YMMV.

1 more reply

sanderjd3h ago

Continuing to learn new ones, like what?

What else is there? New / additional programming languages? New / additional database systems? frameworks? orchestrators? cloud provider / infra tooling? architectural patterns?

I dunno, all of this seems really boring and "been there done that" to me at this moment in time!

1 more reply

rusk6h ago

> I totally struggled to find the right frame of mind to explore any of this stuff without feeling defeated and bamboozled.

I found LM studio to be a nice starting point. Frindlier and more featureful than Ollama and not as intimidating as llama.cpp (though you will want to use that eventually)

dofm5h ago

LM Studio is also nice because of the way the interface explains things; parameters have explanations and hints. It has been designed by people who really care about making it understandable.

I tried Ollama but I've settled on Unsloth Studio generally; once things really settle down I'll just run the llama-server UI, which is pretty nice.

oceanplexian4h ago

mullen4h ago

cyanydeez6h ago

I've setup to local paradigms for local coding:

- opencode with it's webui

- deer-flow with it's research/powered front end

It's running Qwen3.6 35B MoE on a AMD 128GB strix halo. If I switched to the dense models, perhaps it'd be smarter, but the trade off seems to be much slower gen.

dofm5h ago

> - opencode with it's webui

Have you tried Paseo?

I have opencode in a VM, and the paseo daemon running in the VM, and then the Paseo Mac app. Really nice.

(You can also use the Opencode GUI to frame a remote opencode web interface)

1 more reply

bsder2h ago

Hello, my brother, just know that you have a fellow passenger in life at the same age who thinks the same thing. I agree that the local stuff is helping my understanding a LOT.

I can't wait for the bubble to pop so that we mere mortals can finally build with this stuff.

ddalex6h ago

I just got Claude to download and install all the models and servers and agents and prepare all the launch scripts for me... no need to learn, just ask it to do it for you

dofm5h ago

Right, but I am a middle-aged bloke who is experiencing existential angst about whether I can carry on in this industry.

I have a pretty deep, maybe paranoid need to be confident I have an intrinsic understanding, and I have found in my life that lessons come to you when you make yourself open to learning.

So I need to build on top of what I know, taking as much of the hard way as I can bear to take at any one time — it has to be not quite difficult enough to put me off.

I can't really explain what I have learned this way that is different, but I feel it in a way that I wouldn't if I'd simply pushed a button.

I can afford a really good printer that has all that set up, and more, has no problems. But I'd just be someone who has a 3D printer.

(Also who am I kidding about the existence of a printer with no problems)

2 more replies

swiftcoder5h ago

1 more reply

coldtea5h ago

>no need to learn, just ask it to do it for you

And that's how skills die.

2 more replies

sorokod5h ago

Then what is the point of ddalex?

1 more reply

Catloafdev6h ago· 11 in thread

The model they reference can be easily run with 24gb+ of VRAM, and there are other similar models capable of running easily on 16gb of VRAM. It's not like 128gb is a requirement here.

bitexploder5h ago

Overall I recommend people try models of this class out using OpenCode and some for pay service to experiment with them and understand how they work. I find they are very useful.

coder5431h ago

> For a MBP I have 48 GB of RAM M5 Pro. It runs at about 12-14 t/s at Q4

Are you running with MTP enabled? I have seen some people on M5 hardware report 20+ t/s on Qwen3.6-27B using MTP... and I think that was a regular M5, not even M5 Pro.

1 more reply

aunty_helen4h ago

I was doing some benchmarking last night on 2 3090s. The systems but old but I’m seeing 11tks 27b, 15tks 35b MoE.

The limited context is problematic. I’m not exactly sure what it’s got available but hermes was hit and miss on a prospecting job.

It does seem to be doing useful work but it’s not API call level quality

1 more reply

CMay4h ago

Qwen 3.6 clearly isn't legitimately bad and maybe it's quite nice at fp16, but it was a disaster quantized in a 24GB scenario by comparison.

thewebguyd6h ago

I'd go for at least 32GB+. It'll fit in 24GB but leaves you little to no room for context, and that's at 4-bit quantization.

If you want to run unquantized, you definitely need 128GB.

Catloafdev6h ago

Nobody runs unquantized, there's literally no reason to. Q8 would be the largest anyone actually runs on consumer hardware for inference.

2 more replies

bitexploder5h ago

It also comes down to inference speed, not "can I run this". 8-bit quant is quite a bit slower on an M5 Pro.

Numerlor5h ago

nok22kon5h ago

a computer with 24 GB VRAM is at least $3000

sleepyeldrazi5h ago

1 more reply

daemonologist4h ago

A 7900 XTX is about $850, and the rest of the computer basically just needs to boot Linux. You could easily build such a machine for $1500.

Even that isn't strictly necessary - you can get perfectly acceptable performance by splitting a model between multiple older 12 or 16 GB cards.

Insanity6h ago· 8 in thread

But you have to factor in that this device will last you 5-10 years. That said, I wouldn't spend almost $7k USD on this macbook lol.

petilon6h ago

Memory requirements of newer models will increase, so while the hardware may last 10 years it won't be able to run the latest models for 10 years.

roadside_picnic6h ago

My experience working in the open model space pretty deeply (both LLMs and diffusion models) for years now is that it is not quite as simple as that.

Similarly the standard "big" but localish open model for LLMs back in the day was Llama 3 70B, this was both a much worse and much larger model than Qwen 3.6 27B

The biggest thing to watch out for is not just RAM/VRAM but memory bandwidth. You can try to "future proof" yourself with lots of RAM, but if it's 400 GB/S you're still constrained to smaller models.

2 more replies

Insanity6h ago

You raise a fair point, but I'm not convinced it'll offer a meaningful difference in performance as long as we're stuck with the current AI paradigm.

bluGill6h ago

Will they? Or will we find ways to optimize models and need less? Only time will tell.

naikrovek1h ago

Available models aren’t really trending upward in size. Not like I thought they would, anyway.

They’re trending to be the right size to be good.

cyanydeez6h ago

I think you have too much faith in context AGI.

at 128GB, you can find almost it's entire context for Qwen3.6 35B MoE.

Again, I think you have too much faith in extrapolation. It's like you got a baby at 0 months, then measured it at 12 months and expect it to be a giant.

simonw6h ago

It can't run the latest models today - GLM-5.2 class models already need 1TB+ of RAM.

... but, the models that WILL run on 128GB (or 64GB or even 32GB) models today are a huge improvement on the best models that would run in the same amount of memory six months ago.

2 more replies

someperson6h ago

In 5-10 years, incremental cloud tokens will be far cheaper (likely but not guaranteed).

porphyra6h ago· 4 in thread

You can also run Qwen 3.6 27B dense model on DGX Spark with comparable performance [1][2] for about $4000 (Asus Ascent GX10 is $3999 at various retailers).

In theory you can also get 48GB of VRAM with, say, two 3090s, but it will take up a lot of space and generate a lot of heat compared to the Macbook Pro and GB10.

[1] https://x.com/MiaAI_lab/status/2070859135399182444

[2] https://github.com/MiaAI-Lab/Qwen3.6-27B-NVFP4-vLLM

esperent6h ago

> 48GB of VRAM with, say, two 3090s

So like... $2000+ just for the used GPUs? Plus I assume it's considerably more effort to get it working.

fluoridation6h ago

>Plus I assume it's considerably more effort to get it working.

Nah, not really. It is a little annoying in terms of space and power, though. Not every case and motherboard can support cards that big.

lee_ars3h ago

porphyrarecent14m ago

I think Atlas might also be slightly faster than vLLM:

https://flowtivity.ai/blog/120-tok-s-1m-context-private-ai-d...

throw12345678915h ago· 4 in thread

Edit: it’s not just “data privacy”, when you are using Claude, you are shipping EVERYTHING to Anthropic. It’s crazy.

wilsonnb35h ago

Companies are already shipping everything to Microsoft or Google and 17 other companies, just the cost of doing business.

throw12345678915h ago

Sure, but no one gets everything. Just that one.

DANmode5h ago

That’s at today-prices.

If the cost doubles, or 4x, which is seems to need to for them to go profitable, what then?

wahnfrieden5h ago

It's much slower, and often quantized

stymaar6h ago· 3 in thread

> The article is based on running Qwen 3.6 on a 128GB MacBook Pro. For reference, a 128GB MBP currently starts at $6699 USD [0]

Qwen3.6-27B would be faster on a 3090 that costs around $1000-1200 though so I don't think it's a good counter-argument.

Op just happened to have that MacBook, but it doesn't mean it's necessary to run the model.

boutell5h ago

That 3090 is going to burn 750W and it will still cap you at a 4 bit quant and ~48K context. Here's someone who worked through it:

https://github.com/noonghunna/qwen36-27b-single-3090

Flies though (50-70tps is impressive for a model this smart)

I went through roughly the same process to get it working on my M2 Macbook Pro... at awful speeds of course, since models like this one are mostly bound by memory bandwidth.

stymaar5h ago

> That 3090 is going to burn 750W

1 more reply

hughw4h ago

My eyes glaze over reading all the AI produced verbiage.

I did find a few useful parameter settings I've already discovered using my single 3090 and ollama.

I'm just remarking that the LLMs overwhelm me with minutiae, especially as I'm working on code design. I frequently ask it to restate concisely, and that helps.

[edited to mention ollama as a nice alt]

dannyw6h ago· 3 in thread

I’m running the same model on a 48GB MBP with a q4 quant and it’s pretty decent. You definitely don’t 128GB. That’s the scale for 70B models at q8 or something.

dom966h ago

I've been running it on my 48GB MBP too and it's not particularly great. Super slow and not near enough to the quality provided by even Claude Sonnet.

doodlesdev6h ago

dannyw6h ago

I purchased mine for approximately $4400 AUD before the price hikes. That unit is now ~$5100 AUD.

organsnyder6h ago· 3 in thread

andy996h ago

I get ~55 Tok/s on my framework desktop with the 35B A3B q8 model, and so far am also very happy with the coding performance.

cyanydeez6h ago

did you upgrade to MTP?

bityard3h ago

h4ny6h ago· 3 in thread

[flagged]

dang3h ago

Yikes, you broke the site guidelines badly with this post. Could you please review https://news.ycombinator.com/newsguidelines.html and stick to them?

You're welcome to make your substantive points thoughtfully, just not aggressively.

kllrnohj6h ago

> maybe tell us how much a non-Apple system that you can run that (probably similarly or faster) would cost?

Ryzen AI Max 395+ with 128GB of unified memory can be found around $3-4k.

h4ny6h ago

cyanydeez6h ago· 2 in thread

AMD started their 128GB Halo Strix at a pretty damn good point at ~2.5k; I got mine after the first memory bump at $3k.

I think you might be a little to into the stew here.

zdragnar5h ago

I haven't tried it with https://lemonade-server.ai/ yet but I just might give it a shot.

organsnyder5h ago

I'm running Lemonade on Nixos on my Framework Desktop. I had been trying other tools out before finding Lemonade, but Lemonade really made it plug-and-play.

colinsane5h ago· 2 in thread

r_lee4h ago

if you need to ensure that, then just back up the model and buy hardware if the need arises

colinsane4h ago

that's somewhere between saying "use Android, just switch to Graphene if/when they lock it down", and saying "just switch to postmarketOS/Ubuntu Touch/whatever flavor of Linux takes off".

1 more reply

AnimalMuppet6h ago· 2 in thread

How many credits would it buy? How long would it take to use them up? What's the payback period?

dminik5h ago

Disclaimer: There's a 35% sale from Alibaba right now. And I'm not accounting for input tokens going faster than output tokens.

eli6h ago

Are you comparing the cost of hosted Opus to running Qwen 3.6 locally? That doesn't really seem fair.

acchow4h ago· 1 in thread

That $6700 is a $5000 upgrade over a base model Macbook Pro.

neonstatic3h ago

I think the argument isn't that local is cheaper - it's that local is doable and delivers unparalleled privacy.

jeffybefffy5191h ago· 1 in thread

andrekandre33m ago

am i right in assuming your code is closed-source?

i'd expect anything on github for example to be already in their training set or is training on actual usage more useful to them?

georgeven6h ago· 1 in thread

I have a 1500 dollar machine that can run it at 50 tok/s (3 V100s)

Dig1t6h ago

How did you buy 3 V100's for $1500??

nozzlegear6h ago

Just putting it out there: I run Qwen 3.6 on my M1 Mac Studio with 64gb. It's quantized and all that, but I agree with TFA: it's the sweet spot for local development right now.

montebicyclelo5h ago

dmayle6h ago

For that price you can put together a PC with 128GB of ram ($2000) and an RTX 5090 ($3600) and get 70-100 tokens per second instead of 45

stared2h ago

All experiments with Qwen 3.6 required no more than 48GB Apple Silicon. I believe you can go even further with more aggressive quantizations - one can go down even further.

In any cases, from the economic point of view, running models on laptops make little sense. Even at the pure cost of energy consumption, it might be hard to beat pricing at tokens generated at scale.

At the same time, it is a breaktrough, that will change the game. Previously such vibe coding on consumer device was not hard or costly - it was impossible.

redox994h ago

dvduval6h ago

elorant5h ago

You can get an AMD Strix Halo with half that price even after hardware price adjustments. Besides you don't need 128GB of RAM to run a 27B model.

ricardobayes4h ago

Oh definitely. I've seen GLM 5.2 go for around $4 per million output tokens.

oldfuture6h ago

a lot of credits? we can’t predict any price change for them

trentor5h ago

Runs fine on 2x4080s or on two 5060/5070s with 16GBVRAM... and faster than on the mac.

j / k navigate · click thread line to collapse