undefined | Better HN

0 pointsiagooar5h ago0 comments

I love my MacBook Pro M5 128GB RAM and I love qwen3.6.

BUT DO NOT buy this MacBook if you plan on doing serious coding using local LLMs with it. The reason is simple: your fingers will burn and your head will explode from the noise.

Running any kind of sophisticated job on the very laptop you are using is just not viable. Sure you can use it in clamshell mode, but forget touching it while working with AI coding or agents.

If you want to run Qwen3.6 27B / 35B at its best, get a MacMini M4 with 64GB of RAM and put it in the basement - or at least a few meters from your desk. Connect to it over LAN or Tailscale. The MacMini will also cost you almost 1/3 of the MacBook Pro.

Thank me later.

0 comments

87 comments · 33 top-level

acters5h ago· 9 in thread

Would the new upcoming AMD AI ryzen halo desktop be a better value offer? or dgx spark?

You would have to get a third party reseller/scalper or refurbished mac mini to get 64gb of ram ever since apple stopped selling it.

c7b3h ago

My 2c: you don't need the Strix Halo desktop, the chip comes in many rigs, most of them cheaper, the performance difference isn't worth it. It used to be half the price of a DGX Spark or a Mac with 128GB RAM. If you can still find it at that price I'd say it's the best bang for your buck. Otherwise, Macs have 2-3x the memory bandwidth of the DGX Spark, depending on the chip, so I'd prefer them. Unless you're planning on building a cluster. The DGX Spark has two 100GB/s connectors, ideal for clustering. But I haven't checked what else you could get for the price of two DGX Sparks.

girvo3h ago

My GB10 Spark-alike is absolutely amazingly fun… but it is not cost effective. Step 3.7 Flash is shockingly capable (IQ4_XS and used for web dev mainly), but it cost me $6800 AUD. They’re even more expensive now. The numbers just don’t make sense: with proper triple head MTP I can get it up to ~40tk/s decode and it runs at around 1000+ tk/s prefill.

$6800 is a lot of API credits for GLM, for example, on any provider you want to use.

Now being able to run models uncensored and with privacy has value! But the cost for these is rough today.

I still am going to buy a second one haha

lee_ars4h ago

I'm currently fiddling with a DGX Spark and Qwen3.6-35B-A3B (specifically Qwen3.6-35B-A3B-NVFP4 under vLLM, with EAGLE3 speculative decoding via eagle3-dogacel-vllm), and it's pretty okay in terms of smarts. The speed is relatively usable at about 50 tok/sec with a 256k context window, and it's definitely smart enough to one-shot some basic coding tasks. I had it doing reverse engineering/disassembly of some ancient MS-DOS assembly language games from the 80s and it handled the task well and produced good outputs.

But it's also really easy to trip up. I fed it some of my Ars pieces and asked it to analyze themes and composition, and it got into a looping argument with me over how it was unable to analyze "my" writing because "the user cannot be the article author, the user is the user, the user did not write the article, the article author wrote the article." I was utterly unable to convince it that I was in fact me.

Qwen3.6-35B-A3B hums along at about 50GB of RAM used with --gpu-memory-utilization=0.42. I haven't tried Qwen3.6-27B (I'd likely grab Qwen3.6-27B-FP8, I think), but I'm curious to see if it makes much of a difference.

coder5431h ago

Compared to a dynamic quant like Unsloth's UD-Q4_K_XL, which keeps some important parameters in higher precision, a basic NVFP4 quant seems to do a lot more damage to the model unless it is carefully calibrated.

I would recommend using llama-server if you're just on a single Spark. You get access to dynamic quants like that more easily, the performance is not that different from vLLM most of the time these days, and it is much faster and easier to switch between models.

As far as intelligence goes, Qwen3.6-27B is much smarter than the 35B-A3B model, but that's also not the sort of thing to argue with an AI model about in the first place. Just open a new chat and try again.

Gemma-4-31B is not as good at agentic use cases as Qwen3.6-27B, but it is a fairly balanced model overall, and worth trying out too. Its MTP can nearly triple the performance of the model, where the benefits of MTP or Eagle seem more limited for Qwen3.6-27B in my testing, maybe doubling the speed.

cpburns20091h ago

Looping is a common problem with the Qwen models. I've had good luck using --repeat-penalty=1.1 with llama.cpp and 27B. vLLM should have a similar option.

rnxrx3h ago

There are also nvfp4 quants of Qwen 3.6 27/35 floating around. I've done benchmarks of both and the quality difference vs fp8/bf16 was barely notable. Honestly the nvfp4 capability is the most interesting feature of the Spark (at least for me).

anon3738392h ago

I use Qwen 3.6 35B-A3B constantly, but I don’t see the type of behavior you mentioned. I’m using Unsloth’s Q8_K_XL quant.

gnerd0041m ago

`llama-server` looping mitigations --repeat-penalty something greater than 1.0, set reasoning/thinking OFF explicitly, prefer a gguf with more than 4bit quant

pkroll4h ago

Check the LLM benchmarks once it's out: it's such a common use case for these kinds of machines, you won't be waiting long.

oceanplexian5h ago· 8 in thread

If you want to do coding with a local LLM your best bet is a 6 year old Nvidia 3090 which is substantially more powerful than the highest end overhyped Apple product for 1/5th the price.

chorizo4h ago

That’s 24GB VRAM. Not enough to run a 27B model at a useful quant+context size.

nsbk3h ago

I beg to differ. Have a look at this repo with single/double 3090 optimized configs for Qwen and Gema models: https://github.com/noonghunna/club-3090

sanderjd4h ago

Yeah seems to me like the mac studios with the unified memory architecture are genuinely good bang for the buck at the moment, because of this memory size consideration?

SkitterKherpi4h ago

You can run 8bit 27B models at 24GB, it's definitely enough for the model size.

4 more replies

iagooarOP4h ago

My problem is I won't accept anything lower than the 96GB the RTX Pro 6000 Blackwell has. My dream is a workstation with 2x Pro 6000 to run DeepSeek v4 Flash comfortably, possibly qwen 3.6 / ornith on turbo speed.

But man, I have never purchased a computer which is more expensive than a decent family car.

jnovek4h ago

An M1 Ultra has 800gbps unified memory. It’s nothing to do with Apple, it’s their microarchitecture. They’re just about the only game in town with high-bandwidth memory if you want >24GB (for less than $10k, anyway).

murderfs2h ago

A 5090 gets you 32GB with 1.8 TB/s of memory bandwidth for ~$4k, RTX A6000 gets you 48GB at 768 GB/s for ~$3.5k, 2x 3090 gets you 48GB for $2000 or so, and if you're willing to go into the wilderness, there are much cheaper options like the AMD MI50.

dheera3h ago

32GB V100

SkitterKherpi5h ago· 7 in thread

I am considering getting something like NVIDIA's RTX Spark when it comes out, though even that will be limited to 128GB.

jazzyjackson4h ago

They’ll sell you a bundle, either a pair or a quartet so you can have 256 or 512GB over a 400GB/s network link

I can’t figure out when it makes sense to pay 10k up front for a quantized Llama 3.1 but it’s an interesting option

c7b3h ago

You could fit a Q4 GLM5.2 in 512GB and still have some space for context (372-475GB for the model): https://unsloth.ai/docs/models/glm-5.2

But yeah, there's a bit of a dearth of models that could fully utilize memory in the 128-256GB bracket at the moment. But things move so fast in this space, I wouldn't base my decision on a generation of models that's just a few months old.

1 more reply

girvo3h ago

Not Llama 3.1, but Step 3.7 Flash is one of the few new high quality models in this size bracket. DeepSeek v4 Flash too

SkitterKherpi4h ago

10k is rather a lot yes. For LLMs you can use a lot of tokens with 10k with less hassle without the machine (and also it's not like electricity is free), but for some other things like video models 10k would get burned very fast. I am looking for something more in the 5k range though.

awesomeusername5h ago

It's out, I'm daily driving one. It's great

SkitterKherpi4h ago

I assume you have the dgx spark? At this point I am not 100% on the difference other than Linux and Windows. The RTX spark should come around Q4, unless I am mistaken.

vikingcat4h ago

Are you running a local LLM on it? Did you buy a whole laptop?

SwellJoe3h ago· 3 in thread

I opted to buy a normal 32GB laptop for this very reason. I know how loud and hot the GPUs in my desktop run when running even smallish models like Qwen 27B or Gemma 4 31B (which is a better model for most than Qwen 3.6, despite the benchmarks). I also have a Strix Halo which doesn't get loud, because it has a single huge fan, but it does get hot. So, there's no way a laptop could work as hard as models make them work, and not be unbearable. Tiny fans trying to remove all that heat? They gotta be screaming. No reason to spend all that money on a laptop that I couldn't realistically make use of. I do run a lot of VMs on my desktop, but I can get to those on a VPN.

It's a nice idea to run a model on a laptop so you can work anywhere...but, that's a job for models in the cloud. Not much data has to traverse the network, so it's not a big deal. Or one could also setup a VPN so you can reach a self-hosted model on a big box at home for things that require data privacy.

All that said, there are models that work great on very small devices for some tasks and won't work it to death. Gemma 4 12B QAT 4-bit runs on a 16GB device, maybe even smaller, including a tablet. It's the best self-hostable vision model I've tested for my purposes (categorization, identification, labeling, type stuff), beating much larger models. It's also a decent conversationalist with good prose but it doesn't know much of anything (not a lot of the world fits in 7GB), so it needs search if you want to use it for research. It's a pretty good tool user. I definitely wouldn't want to use it for code, though, beyond very simple stuff.

girvo3h ago

Gemma is better than Qwen at everything except coding, in all my evaluations. Which is a shame because that is what I use them for!

UncleOxidant1h ago

It would be great if the Gemma folks would release a code-focused model. Probably won't happen, but it's fun to dream.

1 more reply

ekianjo42m ago

gemma is also worse for tool calling. not just coding

jarjoura4h ago· 3 in thread

TBF, I just recently picked up this same model, and it's reminding me of the last gen Intel i9 MBP. Just visiting any non-basic website spins up the fans and battery life isn't great either. Yes, this thing is fast, but damn it gets hot just using it for normal tasks.

Still, I don't agree. I think this machine is meant to use local models. You just have to wear pants if you want to keep it directly on your lap. I rarely use it that way anyway. I prefer it plugged into an external display and comfortably sitting on a laptop stand.

y1n02h ago

Is there something wrong with the m5s? I have an m4 pro and I’ve never heard the fan on it. I don’t do much with local llms, but I naturally use the web and play games (windows games at that with wine/crossover).

inventor77772h ago

That seems very unusual for modern Apple Silicon. Our family has:

- M3 Pro MacBook Pro 36GB

- M2 Pro MacBook Pro 16GB

- Mac Studio M4 Max 48GB

and I have not heard the fans on any of them with normal use. The only time I've ever heard automatic fans was when I was using a local 12B model on the M3 MacBook Pro, and when running 70B models on the Studio.

You should consider checking Activity Monitor and making sure that the usual suspects are not causing issues with sustained high CPU. And you can use an app like [Stats](https://mac-stats.com) if you want to see that info while actively using the computer.

lowbloodsugarrecent13m ago

This is not normal. You have a broken Mac. Make an appointment.

andai3h ago· 2 in thread

> The reason is simple: your fingers will burn and your head will explode from the noise.

So, just buy a mac mini and put it in the other room? ( Like everyone was doing in February? :)

I've been running coding agents on my laptop in yolo mode for the past half year or so (though mostly not local ones, laptop too slow!) and the way I'm doing that without terror is that I just gave them their own Linux user "agent". They're free to nuke their homedir /agent, and they can't touch (or even read) mine.

There's some slight ergonomics issues (I need to sudo into the user to do anything, but I set up an alias for it), sometimes I get issues with permissions or ownership (gave up on "sticky bits" and just made a function I can run once a day when it breaks).

There's enough hassle that I wish I just had a dedicated machine for it, and then I'd just give them root on it. (For giggles I gave claude root on a $3 VPS and that's going just fine...)

But yeah after months of trial and error I reinvented "just buy a mac mini" from first principles...

iagooarOP3h ago

Just buy a Mac Mini really is good advice if you want to get into real, always-on convenient agentic work.

Soon it is going to be good even for coding using local LLMs. Until then, just run API models on it for coding, local LLMs for "knowledge" work or daily driver agent like Hermes.

marcuskaz3h ago

Except they're not available, 3-4 month wait time.

2 more replies

somewhatrandom92h ago· 2 in thread

Try using DwarfStar 4 and use the --power flag: https://github.com/antirez/ds4#reducing-heat-power-usage-and...

boomskats2h ago

Can you run Qwen 3.6 27B on antirez/ds4 now? I thought it was all about the DeepSeek models.

somewhatrandom92h ago

No, I don't think Qwen, but I believe he may try and put some version of GLM in it.

xd19364h ago· 2 in thread

Apple does not currently sell a Mac Mini with 64GB RAM.

iagooarOP4h ago

Get a 2nd hand one. I was lucky enough to get a new one first, last week I get a 2nd hand one in order to run one of my Hermes minions at work.

stevenaenns4h ago

how many tokens/s generation do you get?

1 more reply

Arubis4h ago· 2 in thread

Don't forget that your OLED screen will start to color-shift as the heat cooks the panel!

manmal4h ago

There is no MacBook Pro with OLED (yet).

Arubis4h ago

My mistake on tech; it’s a beautiful display. Alas, I speak from experience when it comes to the thermally-caused color shift. Hopefully it’ll be AppleCare covered.

Matl4h ago· 2 in thread

> If you want to run Qwen3.6 27B / 35B at its best, get a MacMini M4 with 64GB of RAM and put it in the basement - or at least a few meters from your desk.

Can confirm this works rather well, most things that integrate with LLMs, (agents, editors), support providing a remote (LAN) URL for Ollama, LM Studio etc.

But you do need a fast LAN connection, otherwise working with agents will be a pain.

Retr0id3h ago

> you do need a fast LAN connection

Huh, how come? Low-latency I can understand, but I was under the impression that token throughputs were still barely exceeding dialup bandwidths.

iagooarOP3h ago

I disagree LAN connection is the bottleneck. I do even work with it remotely via Tailscale on shaky hotel WIFI and it works fine (or as fine as any other API-based model).

codazoda3h ago· 2 in thread

Today the Mini tops out at 48GB. Gotta go to the Studio to get 64GB.

aurareturn3h ago

Don't buy the Mini or Studio. Both have the M4 which lacks the Neural Accelerators, making prompt processing ~3-4x slower.

mortenjorck3h ago

I assume those don't just work automatically with an off-the-shelf gguf. What do you need in your local inference stack to take advantage of M5's neural accelerators?

1 more reply

seanmcdirmid4h ago· 2 in thread

What sort of M5 are you running? A max? MacMini's don't offer max CPUs.

iagooarOP4h ago

M5 Max. But I also have a MacMini M4 Pro 64GB. Qwen3.6 runs on the M4 just fine - sure the M5 is at least 2x the speed. If Apple launches a MacMini with an M5, I will be the 1st one to get it.

kristianp3h ago

You're only going to get an incremental improvement with an M5 Pro mini compared to an M4 Pro mini. Memory bandwidth goes from 273GB/s to 307GB/s, about 12.5% improvement for LLMs.

2 more replies

ActorNightly3h ago· 2 in thread

>If you want to run Qwen3.6 27B / 35B at its best, get a MacMini M4 with 64GB of RAM and put it in the basement

Im sorry, but its time to start calling Apple sycophants out. Stop trying to push your tech jewelry on other people. You only buy those computers because they are Apple, you don't know anything about computing or running LLMs, you don't do any real work, so you should probably not give advice on what to buy.

A single 3090 will run Qwen3.6 27b fine, and its VRAM speed is twice of what the best Mac has. And the build will be cheaper. Decent CPU/Motherboard, 32gb of DDR4 ram, an SSD and a Single 3090 should run max about $4grand. Mac m4 mini is 6grand.

Then, when gpu prices come down (or you find one on a deal), you can upgrade the card, or stick a second one, and benefit from more speed. You can't do that with the trash Apple produces.

Flag me if you want, I don't care. Its embarrasing for the tech community to give advice this bad.

iagooarOP3h ago

I am not going to flag you, I am much OK with having good arguments.

I just purchased a Mac Mini M4 Pro 64GB for $3k - 2nd hand of course.

I am not a hater of Nvidia and I am planning on building a workstation based on RTX cards. You clearly do not seem to understand how convenient the MacMini actually IS - the form factor, how quiet it is, how durable it is, how well it integrates with other Macs, how well it works as a bridge to a personal agent like Hermes (integration with iMessage, Calendar, Reminders, iCloud, etc).

I am pretty sure I know a thing or two about computing, I have been in the trenches for many, many years and I have had machines of all kinds, shapes and colors. It just so happens that Macs are very capable, very convenient machines that happen to work great in the era of LLMs, too.

But you do you.

ActorNightly2h ago

>You clearly do not seem to understand how convenient the MacMini actually IS - the form factor, how quiet it is, how durable it is, how well it integrates with other Macs, how well it works as a bridge to a personal agent like Hermes (integration with iMessage, Calendar, Reminders, iCloud, etc).

If you are that locked in to Apple, its pretty easy to buy a used Mac Mini older gen for all the non AI stuff.

But this is a discussion about inference. Buying a Mac anything for any sort of local inference is a COLOSSAL waste of money.

astrostl1h ago· 1 in thread

> MacBook Pro M5 128GB RAM

614 GB/s of memory bandwidth

> MacMini M4 with 64GB of RAM

273 GB/s of memory bandwidth (also only currently available with 48GB)

When it comes to inference speed, you want your model to fit in memory, and then to have as much memory bandwidth as possible. In this case a hypothetical Mini with 1TB of memory would still be over 2x slower with 27-35B models.

And FWIW I have an M4 Max MBP 128GB that I keep on a Roost laptop stand, with a separate keyboard/mouse/video. It does fire up the cooling jets when running local LLMs, but stays within tolerance for me on noise. I haven't heat-tested it on longer runs, but I imagine the risen airflow helps a ton.

bigyabairecent10m ago

> When it comes to inference speed, you want your model to fit in memory, and then to have as much memory bandwidth as possible.

This is only true when your GPU isn't bottlenecked building a KV cache, which it usually will be on Apple Silicon. The Achilles heel of the M-series chips are their weak, SOC-grade GPU that holds back the Max and Ultra models from having interactive TTFTs on larger models and contexts.

Arch-TK1h ago· 1 in thread

It's okay, completely wrong thread for this statement, but I wouldn't voluntarily use current MacOS (no idea if the older variants weren't terrible) over anything but ssh. Worse than Windows 11.

braeborecent4m ago

I could not disagree more.

swang4h ago· 1 in thread

I have an M4 Max and when I was trying out local LLM work with pi it has probably felt like the hottest I've ever felt any kind of Macbook be. I could feel the radiated heat off it even a few inches away. Honestly felt hotter than any Intel Macbook I've used. Because of that I stopped as I didn't want to harm my laptop in case I need to hold it for 10 years due to all the supply issues/price increases.

dimitrios14h ago

I tried to run it on a M4 Air for shits and giggles.

After about 1 minute the entire machine basically bricked and I had to hard reset :D

bilekas2h ago· 1 in thread

Can you define "serious programming"? Because I use it to implement things I COULD go and figure out like algorithms or test generation or evaluations etc, the "serious" programming I tend to do myself. That is what I'm paid for.

overgardrecent1m ago

Serious programming is using as many agents and loops as possible because anthropic needs you to spend more on tokens

cosmic_cheese4h ago· 1 in thread

They really need to release those updated Studios already.

DennisP2h ago

Since they've reduced the max RAM on current Studios from 512GB to 96GB, I'm not holding my breath.

c7b3h ago· 1 in thread

This. Do consider local LLMs, but set aside a dedicated machine for it. Connect via VPN or reverse proxy. If it's not a Mac them I'd also put a server distro on it. No need for a desktop environment, save your RAM.

tedivm3h ago

I have a Linux box with two 3090s and it's been great for running Qwen3.6 27b. I lowered the power on each card down to 250w, and then built a small ducting/fan system to vent the waste heat outside. The machine is pretty much silent, and I'm still getting 110 tokens per second out of it for coding tasks.

https://github.com/tedivm/qwen36-27b-docker

Fr0styMatt884h ago· 1 in thread

What kind of speed in tk/s do you get with the MacBook?

iagooarOP3h ago

qwen3.6 27B MLX 8bit -> 15 tok / sec. A bit slow but it is a delightful model to use, and smart too.

qwen3.6 35B A3B MLX 8bit -> 85-90 tok / sec! It is impressively fast and roughly 90% as good as 27B (in my opinion).

dzonga3h ago· 1 in thread

why not buy one of those "a.i" desktop kits being sold by Nvidia/AMD and just connect to them via network ?

to me that's cheaper than paying an LLM provider such as Anthropic spreading FUD around open weight models & more sustainable too.

Gigachad1h ago

It's still currently way cheaper to pay open router to run qwen for you. And you have the option to use much bigger better models like DeepSeek v4 flash.

jtbaker43m ago

Nope, have both these machines, can confirm the M5 max blows the M4 mini away. It does get hot, but I use it mostly with an external monitor and keyboard. Conceptually I like the headless model better with a workstation, but work was buying the M5 and can't get it in any other form factor at the monute.

roadside_picnic1h ago

In general if you're setting up a local LLM you should assume it's going to be primarily working as a server and talking to various clients. I use my MBP, but that's because I don't travel much anymore so it can happily work as a server at all times. With the right agent setup you can probably manage most things from your phone even if you don't have a seperate machine to use as a client.

I have an older laptop I run a hermes agent on backed by an API based open (non-local) model and Macbook Pro M4 for running another model locally (also using hermes). The agents have a Mattermost (open source version of slack) server they run and I run Mattermost on my phone so I can talk to them and task them with things. In fact, it was through the hermes WhatsApp endpoint that I got the first agent (non-local) to setup the Mattermost server and unboard the second agent (local mbp).

Then I can just chat with them through Mattermost when I need work done. Whenever I need something done I just hope on the Mattermost server and chat with them. I've had them build me multiple research reports (the fully local agent did awesome at this), learn how to use Stable Diffusion on my desktop to generate images, install and perform maintenance on various local services I run (including Open WebUI).

geophile4h ago

That's exactly what I'm doing -- Mini M4 Pro 64GB, qwen3.6.

My hearing is not great, but I think I would have noticed the fan, and I have never heard it. In fact, I had to google to find out if it even has a fan.

toephu23h ago

I just checked apple's website and configured them:

Mac Studio: Ships: 16–18 weeks

Mac mini: Ships: 10–12 weeks

overgard3h ago

I'm running an M5 Max 128GB with Qwen 3.6 and unreal engine in the background and it seems to be ok for me. Quite a power drain if it's not plugged in but I haven't seen any thermal issues.

stared2h ago

Yes, it gets really hot really fast.

As much as I was tempted to use it on longer projects, I had some reservations about whether it would put too much strain on my MacBook.

cmgbhm4h ago

A local model on my m2 made me come to that conclusion but I definitely was having “that config is $2k more” regret. Thanks for posting this!

samtheprogram3h ago

Are you sure you're running it with MLX?

busymom04h ago

Also look into buying the Mac mini refurbished from Apple. They come almost brand new, same warranty and you save money.

gigatexal1h ago

Same. And your M5 has acceleration that I don’t with my M3 max. I can’t do anything local it gets hotter than an Intel Mac trying to run docker from back in the day.

singpolyma33h ago

With 128 you can run 122b ;)

verdverm5h ago

Get an OEM Spark instead, mine are silent and can fit 2 qwen/gemma at 8bit or give you room for a bunch of other, smaller models (embed,rerank,etc)

j / k navigate · click thread line to collapse

0 comments

87 comments · 33 top-level

acters5h ago· 9 in thread

Would the new upcoming AMD AI ryzen halo desktop be a better value offer? or dgx spark?

You would have to get a third party reseller/scalper or refurbished mac mini to get 64gb of ram ever since apple stopped selling it.

c7b3h ago

girvo3h ago

$6800 is a lot of API credits for GLM, for example, on any provider you want to use.

Now being able to run models uncensored and with privacy has value! But the cost for these is rough today.

I still am going to buy a second one haha

lee_ars4h ago

coder5431h ago

cpburns20091h ago

Looping is a common problem with the Qwen models. I've had good luck using --repeat-penalty=1.1 with llama.cpp and 27B. vLLM should have a similar option.

rnxrx3h ago

anon3738392h ago

I use Qwen 3.6 35B-A3B constantly, but I don’t see the type of behavior you mentioned. I’m using Unsloth’s Q8_K_XL quant.

gnerd0041m ago

`llama-server` looping mitigations --repeat-penalty something greater than 1.0, set reasoning/thinking OFF explicitly, prefer a gguf with more than 4bit quant

pkroll4h ago

Check the LLM benchmarks once it's out: it's such a common use case for these kinds of machines, you won't be waiting long.

oceanplexian5h ago· 8 in thread

If you want to do coding with a local LLM your best bet is a 6 year old Nvidia 3090 which is substantially more powerful than the highest end overhyped Apple product for 1/5th the price.

chorizo4h ago

That’s 24GB VRAM. Not enough to run a 27B model at a useful quant+context size.

nsbk3h ago

I beg to differ. Have a look at this repo with single/double 3090 optimized configs for Qwen and Gema models: https://github.com/noonghunna/club-3090

sanderjd4h ago

Yeah seems to me like the mac studios with the unified memory architecture are genuinely good bang for the buck at the moment, because of this memory size consideration?

SkitterKherpi4h ago

You can run 8bit 27B models at 24GB, it's definitely enough for the model size.

4 more replies

iagooarOP4h ago

But man, I have never purchased a computer which is more expensive than a decent family car.

jnovek4h ago

murderfs2h ago

dheera3h ago

32GB V100

SkitterKherpi5h ago· 7 in thread

I am considering getting something like NVIDIA's RTX Spark when it comes out, though even that will be limited to 128GB.

jazzyjackson4h ago

They’ll sell you a bundle, either a pair or a quartet so you can have 256 or 512GB over a 400GB/s network link

I can’t figure out when it makes sense to pay 10k up front for a quantized Llama 3.1 but it’s an interesting option

c7b3h ago

You could fit a Q4 GLM5.2 in 512GB and still have some space for context (372-475GB for the model): https://unsloth.ai/docs/models/glm-5.2

1 more reply

girvo3h ago

Not Llama 3.1, but Step 3.7 Flash is one of the few new high quality models in this size bracket. DeepSeek v4 Flash too

SkitterKherpi4h ago

awesomeusername5h ago

It's out, I'm daily driving one. It's great

SkitterKherpi4h ago

I assume you have the dgx spark? At this point I am not 100% on the difference other than Linux and Windows. The RTX spark should come around Q4, unless I am mistaken.

vikingcat4h ago

Are you running a local LLM on it? Did you buy a whole laptop?

SwellJoe3h ago· 3 in thread

girvo3h ago

Gemma is better than Qwen at everything except coding, in all my evaluations. Which is a shame because that is what I use them for!

UncleOxidant1h ago

It would be great if the Gemma folks would release a code-focused model. Probably won't happen, but it's fun to dream.

1 more reply

ekianjo42m ago

gemma is also worse for tool calling. not just coding

jarjoura4h ago· 3 in thread

y1n02h ago

inventor77772h ago

That seems very unusual for modern Apple Silicon. Our family has:

- M3 Pro MacBook Pro 36GB

- M2 Pro MacBook Pro 16GB

- Mac Studio M4 Max 48GB

lowbloodsugarrecent13m ago

This is not normal. You have a broken Mac. Make an appointment.

andai3h ago· 2 in thread

> The reason is simple: your fingers will burn and your head will explode from the noise.

So, just buy a mac mini and put it in the other room? ( Like everyone was doing in February? :)

There's enough hassle that I wish I just had a dedicated machine for it, and then I'd just give them root on it. (For giggles I gave claude root on a $3 VPS and that's going just fine...)

But yeah after months of trial and error I reinvented "just buy a mac mini" from first principles...

iagooarOP3h ago

Just buy a Mac Mini really is good advice if you want to get into real, always-on convenient agentic work.

Soon it is going to be good even for coding using local LLMs. Until then, just run API models on it for coding, local LLMs for "knowledge" work or daily driver agent like Hermes.

marcuskaz3h ago

Except they're not available, 3-4 month wait time.

2 more replies

somewhatrandom92h ago· 2 in thread

Try using DwarfStar 4 and use the --power flag: https://github.com/antirez/ds4#reducing-heat-power-usage-and...

boomskats2h ago

Can you run Qwen 3.6 27B on antirez/ds4 now? I thought it was all about the DeepSeek models.

somewhatrandom92h ago

No, I don't think Qwen, but I believe he may try and put some version of GLM in it.

xd19364h ago· 2 in thread

Apple does not currently sell a Mac Mini with 64GB RAM.

iagooarOP4h ago

Get a 2nd hand one. I was lucky enough to get a new one first, last week I get a 2nd hand one in order to run one of my Hermes minions at work.

stevenaenns4h ago

how many tokens/s generation do you get?

1 more reply

Arubis4h ago· 2 in thread

Don't forget that your OLED screen will start to color-shift as the heat cooks the panel!

manmal4h ago

There is no MacBook Pro with OLED (yet).

Arubis4h ago

My mistake on tech; it’s a beautiful display. Alas, I speak from experience when it comes to the thermally-caused color shift. Hopefully it’ll be AppleCare covered.

Matl4h ago· 2 in thread

> If you want to run Qwen3.6 27B / 35B at its best, get a MacMini M4 with 64GB of RAM and put it in the basement - or at least a few meters from your desk.

Can confirm this works rather well, most things that integrate with LLMs, (agents, editors), support providing a remote (LAN) URL for Ollama, LM Studio etc.

But you do need a fast LAN connection, otherwise working with agents will be a pain.

Retr0id3h ago

> you do need a fast LAN connection

Huh, how come? Low-latency I can understand, but I was under the impression that token throughputs were still barely exceeding dialup bandwidths.

iagooarOP3h ago

I disagree LAN connection is the bottleneck. I do even work with it remotely via Tailscale on shaky hotel WIFI and it works fine (or as fine as any other API-based model).

codazoda3h ago· 2 in thread

Today the Mini tops out at 48GB. Gotta go to the Studio to get 64GB.

aurareturn3h ago

Don't buy the Mini or Studio. Both have the M4 which lacks the Neural Accelerators, making prompt processing ~3-4x slower.

mortenjorck3h ago

I assume those don't just work automatically with an off-the-shelf gguf. What do you need in your local inference stack to take advantage of M5's neural accelerators?

1 more reply

seanmcdirmid4h ago· 2 in thread

What sort of M5 are you running? A max? MacMini's don't offer max CPUs.

iagooarOP4h ago

M5 Max. But I also have a MacMini M4 Pro 64GB. Qwen3.6 runs on the M4 just fine - sure the M5 is at least 2x the speed. If Apple launches a MacMini with an M5, I will be the 1st one to get it.

kristianp3h ago

You're only going to get an incremental improvement with an M5 Pro mini compared to an M4 Pro mini. Memory bandwidth goes from 273GB/s to 307GB/s, about 12.5% improvement for LLMs.

2 more replies

ActorNightly3h ago· 2 in thread

>If you want to run Qwen3.6 27B / 35B at its best, get a MacMini M4 with 64GB of RAM and put it in the basement

Then, when gpu prices come down (or you find one on a deal), you can upgrade the card, or stick a second one, and benefit from more speed. You can't do that with the trash Apple produces.

Flag me if you want, I don't care. Its embarrasing for the tech community to give advice this bad.

iagooarOP3h ago

I am not going to flag you, I am much OK with having good arguments.

I just purchased a Mac Mini M4 Pro 64GB for $3k - 2nd hand of course.

But you do you.

ActorNightly2h ago

If you are that locked in to Apple, its pretty easy to buy a used Mac Mini older gen for all the non AI stuff.

But this is a discussion about inference. Buying a Mac anything for any sort of local inference is a COLOSSAL waste of money.

astrostl1h ago· 1 in thread

> MacBook Pro M5 128GB RAM

614 GB/s of memory bandwidth

> MacMini M4 with 64GB of RAM

273 GB/s of memory bandwidth (also only currently available with 48GB)

bigyabairecent10m ago

> When it comes to inference speed, you want your model to fit in memory, and then to have as much memory bandwidth as possible.

Arch-TK1h ago· 1 in thread

It's okay, completely wrong thread for this statement, but I wouldn't voluntarily use current MacOS (no idea if the older variants weren't terrible) over anything but ssh. Worse than Windows 11.

braeborecent4m ago

I could not disagree more.

swang4h ago· 1 in thread

dimitrios14h ago

I tried to run it on a M4 Air for shits and giggles.

After about 1 minute the entire machine basically bricked and I had to hard reset :D

bilekas2h ago· 1 in thread

overgardrecent1m ago

Serious programming is using as many agents and loops as possible because anthropic needs you to spend more on tokens

cosmic_cheese4h ago· 1 in thread

They really need to release those updated Studios already.

DennisP2h ago

Since they've reduced the max RAM on current Studios from 512GB to 96GB, I'm not holding my breath.

c7b3h ago· 1 in thread

tedivm3h ago

https://github.com/tedivm/qwen36-27b-docker

Fr0styMatt884h ago· 1 in thread

What kind of speed in tk/s do you get with the MacBook?

iagooarOP3h ago

qwen3.6 27B MLX 8bit -> 15 tok / sec. A bit slow but it is a delightful model to use, and smart too.

qwen3.6 35B A3B MLX 8bit -> 85-90 tok / sec! It is impressively fast and roughly 90% as good as 27B (in my opinion).

dzonga3h ago· 1 in thread

why not buy one of those "a.i" desktop kits being sold by Nvidia/AMD and just connect to them via network ?

to me that's cheaper than paying an LLM provider such as Anthropic spreading FUD around open weight models & more sustainable too.

Gigachad1h ago

It's still currently way cheaper to pay open router to run qwen for you. And you have the option to use much bigger better models like DeepSeek v4 flash.

jtbaker43m ago

roadside_picnic1h ago

geophile4h ago

That's exactly what I'm doing -- Mini M4 Pro 64GB, qwen3.6.

My hearing is not great, but I think I would have noticed the fan, and I have never heard it. In fact, I had to google to find out if it even has a fan.

toephu23h ago

I just checked apple's website and configured them:

Mac Studio: Ships: 16–18 weeks

Mac mini: Ships: 10–12 weeks

overgard3h ago

I'm running an M5 Max 128GB with Qwen 3.6 and unreal engine in the background and it seems to be ok for me. Quite a power drain if it's not plugged in but I haven't seen any thermal issues.

stared2h ago

Yes, it gets really hot really fast.

As much as I was tempted to use it on longer projects, I had some reservations about whether it would put too much strain on my MacBook.

cmgbhm4h ago

A local model on my m2 made me come to that conclusion but I definitely was having “that config is $2k more” regret. Thanks for posting this!

samtheprogram3h ago

Are you sure you're running it with MLX?

busymom04h ago

Also look into buying the Mac mini refurbished from Apple. They come almost brand new, same warranty and you save money.

gigatexal1h ago

Same. And your M5 has acceleration that I don’t with my M3 max. I can’t do anything local it gets hotter than an Intel Mac trying to run docker from back in the day.

singpolyma33h ago

With 128 you can run 122b ;)

verdverm5h ago

Get an OEM Spark instead, mine are silent and can fit 2 qwen/gemma at 8bit or give you room for a bunch of other, smaller models (embed,rerank,etc)

j / k navigate · click thread line to collapse