You have dense models (qwen 27b, gemma 31b) who are pretty smart, but pretty slow
You have MoE models (gemma 26b, qwen 35b, north mini code 30b) who are pretty fast, but make a lot of mistakes
You need a lot of memory to run these well, quantization makes tool calling weaker, so most run at 4 bit quants and are wondering why it kinda sucks and that's because you've essentially lobotomized the model (I recommend unsloth quants, i recommend 6bit for MoEs and 5bit for dense)
So you need a lot of compute to make the pre-fill fast, you need bandwidth to make the decode fast, you need a lot of memory to hold everything - lot of ifs
On top of that, your laptop becomes a loud hot churning machine, it's uncomfortable to work with.
So are they good? not really. Do they work? yes
edit: just wanna clarify - i think open models are the future, i think they're super important, i'm contributing constantly to the ecosystem - i think people should play around with these models, i think people should use `pi` and learn how it all works - but don't download a model expecting it to be good out of the box, you will have to tune and configure a lot of stuff to replace a "coding agent" that most people are using models for
The best "free" experience I've found is using OpenCode with Big Pickle. It's not especially smart, so it often won't produce the correct result the first time, but the free tier is generous enough that I don't think I've hit the limit more than twice over around a month with frequent multi-hour sessions. If running locally is truly the goal, it's not going to fit the bill, but if the goal is just "get the best experience without having to pay for a sub or tokens", it's the least bad option I've found so far.
If you throw the right GPUs at the problem, they become much better - but still not quite in the realm of Sonnet or DeepSeek 4 Flash, let alone Opus / DeepSeek Pro or Mythos/Fable/GPT-5.5.
Given enough budget, power, and cooling, you can run some pretty good data pipelines, but for code, I think it still makes sense to shell out to an API provider most of the time.
It outperforms all the Qwen models (even 100B+) for rule following/automation style tasks in my experience. Its image interpretation is also very good, and out-benchmarks Opus.
Qwen seems to ignore instructions and consistently outputs incorrect formats (when token generation format is not explicitly constrained)
But yes, on the DGX Spark Gemma 31B Q4 with MTP runs around 20 tok/s and Gemma 26B A4B around 60 tok/s. Still quite slow. But on a high end Nvidia card would run significantly faster and still fit in memory.
I'd recommend for anyone getting into local models to focus on memory bandwidth over RAM. Models under 100B parameters are now sufficient and hugely useful for automation.
I agree that for coding/creation use cases, there's still not a compelling argument for local models.
But e.g. if you want to scan a list of stocks and interpret news/high pass filtering, interpreting logs, interpreting screenshots, the local models are more than sufficient already.
This is sadly also my experience. I wish we had some MoE models with a higher ratio of active parameters per total. My experience is that the newer MoE models that can run in a 64b laptop have too few active parameters to be useful outside narrower, specific tasks. Mixtral 8x7b was a 14b active parameter (56b total) MoE model a few years ago and was probably the best model one could run in that range for some time, but it is too old now.
I have been using the qwen 27b and it is great, but running a dense model like this in a macbook is a bit suboptimal, and i wish I could run sth faster than 15 tok/s.
Laptop?
OK, I've made that mistake before. I understand modern laptops are powerful, but nobody wanting to do serious AI/ML work should be using a laptop for anything other than SSH or similar low-performance access into a proper system.
Years ago I fried two laptops just doing finite element analysis work running 18+ hours per day. It was one of those "I'm giving you all she's got, Captain!" workloads. They fried, even with powerful fans cooling them. I should have known better. Such workloads belong on purpose built systems.
You are somehow assuming cloud-based models are not painful.
I can tell you my past experience. I was using GPT 5.5 and Claude Opus interchangeably and I prompted them to implement a feature. I paid attention to the agent window and it was literally screwing up implementations, causing tests to fail, and going into test-fail-fix loops to clean up after itself. After a few minutes, it finally called it done. That run cost $0.60.
I went to review the code and only half of the source files complied with the instruction files. I prompted the model to clarify why it failed to comply with the instruction file. The model outputs "you are right, I should have complied with the instruction files. That prompt cost $0.30.
I prompted the model to proceed and apply the instruction file prompts. It went ahead and applied changes. Success. It cost $0.16.
I reviewed the code again. Only half of the sloppy code was touched up. I prompted it to fix the whole mess, not just a couple of files. It complied. One coin less in my purse.
So, around a third of the cost of a feature is spent on the model cleaning the mess it left in it's wake.
And this was a tiny feature with a plan, a solid set of instruction files.
Very expensive.
Are costs going down? I doubt so. OpenAI seems to still be spending 3 times it's revenue already.
In comparison, local models sound very good.
In fact it really feels like OpenAI models have taken a nose dive this week compared with Claude. At least for my specific workloads (these things are so variable it’s like trying to compare Google results…)
This really depends on how and what you're using. e.g. I can't suffer through slowness of inference on Macbook but I have gaming rig with quite powerful GPU and I squeeze ~130 t/s on Gemma or ~70t/s on Qwen.
Tuning is not optional as well. Qwen on temperatures > 0.5 is unusable for coding and I found sweet spot around 0.32 for coding. Speculative decoding on Gemma4 26B is a 30t/s difference between non-speculative.
The worst thing with local models is that I can't just give you a recipe, because what's the best params depends on your use case.
In the nutshell I'd compare local models to running game rig on Windows vs Linux. Linux works great if not better than Windows gaming, but you need to embrace some tweaking in order to get there. Is it there? It's not SOTA, that's for sure, but it's working reasonably well.
1) a "programming desktop" with a $500 upper mid range Ryzen (idr exact), 8GB VRAM Radeon card I bought solely for RuneScape, and 64GB ram
2) a maxed out Alienware 16 Area51, so it's a 5090 with 24GB vram and 64GB system ram. I bought it for gaming, of course.
I run qwen 3.6 35B A3B Q6 with 200k context window. I compare this to Claude pro max or whatever that I use at work.
The main difference between the machines is that the one with the RuneScape gpu does 10 TPS while the Alienware does 30-40tps. Both are fine though the 30-40tps is obviously a lot snappier.
I find with both models that:
- they do really well at "be a 30GB zip file of reddit and stackoverflow answers"
- they do really well at point fixing random bullshit errors that would otherwise waste my time (this is related to above of course)
- they do quite well at, given a pretty good specification of what you want, figuring it out, even if you've specified several steps needed
- they both cannot really be given a large ish task and left to just drive it on their own
The main difference between the two is with that last one, Claude is somewhat better and figuring SOMETHING out, but if Claude is having to figure it out, it's probably because I don't know what I want and it's very likely to not make a sane choice, and will generally produce slop given even the slightest amount of leash still.
I've also found that the boundary between "well specified small to medium thing" and "idk just do thing and figure it out" is the difference between you keeping control of the code and losing control. There's an "escape velocity" of AI use that, when you hit it, you're doomed to slop forever. (Or you have to deorbit... enjoy that). And while claude might have slightly higher velocity allowed while remaining suborbital, it's very diminishing returns.
So, are these models "worse" than Claude? Yeah. Am I looking forward to continued improvements? Yeah. But I now also have no desire to pay anthropic any amount of money, which has the nice side effect that i won't be helping them end up with so much money that they can distort our democracy.
slowness doesn't matter a lot to me, at home. I will type up a prompt and submit it and let it run while I do other things around the house. I have all kinds of things to do, and most of them do not require sitting in front of a computer.
of course faster would be better, but it's not always a requirement. smart and slow is far better than dumb and fast or even nothing at all.
- Maximum intelligence per VRAM (you dont have much)
- Dense models can benefit from MTP to get an almost 2x speedup in decode (ie, a 27b dense model with mtp decodes at about the same speed as a MoE model with 14b active param model would). This is important because local llm rarely has parallel streams to batch together.
When running on large unified memory like Strix Halo or Spark Dgx, MoE models are usually best:
- You can get similar intelligence as a smaller dense model with fewer active params (to compensate for the slower memory) by throwing ram at the problem.
Our GPU computer server cost $110k.
If I can't customise it then I won't waste my time using it it getting use to it.
Claude code is trash, it's customisability is extremely shallow, open code, codex, copilot, Kiro, etc etc... all trash. Yes even open code..
If open code was so awesome then open claw would have been based on it... But it wasn't. That's should tell you everything you need to know.
i use it usecases like that latter and they are fine.
As painful as it is to admit, the future might be cloud inference from a trusted provider.
It is such a downgrade. I don't understand how that's even possible. The thing has so many strongly-held opinions I did not ever ask it for, talking just way too much and generally feeling somehow dumber.
Of course, being significantly larger, it will encode more knowledge, but that doesn't help me when I hate talking to it. And all that on top of the fact that talking with it costs real money.
I wonder what it might be that makes me hate it so much. Maybe because it doesn't see itself as a tool but almost an equal? As if its opinions would have weight.
Qwen too can act like an overeager intern, but if you tell it that it is an idiot, it will drop that ego. Not so much with Claude. In my experience, anyway.
Anyway, point is: full ack on that headline.
[0] - https://github.com/search?q=%22Assisted-by%22+user%3Aggml-or...
[1] - https://github.com/ggml-org/llama.cpp/blob/master/.pi/gg/SYS...
Plus, I never have to worry about rate limits, quotas, or sitting in a queue during peak time. And I can always see its full thoughts, don't have to worry about where my data is getting sent, and know it can't get secretly nerfed.
Running on 2x 3090, 500-1000tok/s prefill and 60tok/s output at Q6_K_XL with MTP on llama.cpp, 220k tokens context window (starts to get a bit dumb above 160k ish), no KV quantization
OMG this is such an annoying property, just shut the hell up please, and be concise.
I suspect that this is an artifact of the thinking property, but please just summarize the thinking process far more concisely, where a single sentence answer is more than sufficient the frontier models seem devoted to going on to a minimum of 5 paragraphs and offering 3-5 new directions.
And requests to please only offer a single step at once, or single option at once, or to even stop eagerly offering future directions is really hard to prompt correctly.
And look, there I did exactly what I was complaining about...
This is likely due to a combination of mass funding for the AI companies, but also they are trying to governmentally restrict which countries get access to these cards so certain countries get a head start. The only way to lock that down is to have them literally locked in their own GPU prisons (data centers). Third reason is it does make it possible to train the models faster by having them in the same data center connected directly. Having them distributed to everyone would slow down training considerably.
The current way to 'own' decent RAM and GPUs right now is through the stock market it seems.
FWIW Codex/GPT models are way less this way. Maybe to a fault.
I'm setting up my DGX Spark to try Qwen 3.6 27B again, as I'm hearing a lot of good reviews. When I tried it some time ago it was still early for support in llama.cpp.
[0] https://twelvetables.blog/comparing-claude-fable-5s-system-p...
It won't happen with AI models either.
It's almost ingrained in the American business model now. Outsource everything. Nobody wants to manage a room full of servers when they can spend 2-3x as much and outsource that headache along with the responsibility for it.
Same will happen with AI. Whether that means paying Anthropic that premium or paying AWS.
I'm in a relatively small business, we recently had an outage related to our local infrastructure.
I got pressure from the CEO saying it wasn't reliable to host our own infrastructure anymore even though our total internal down time over the last 5 years is significantly less than even a single of the larger recent AWS outages.
Everyone wants to shuck the chore and the responsibility.
If things change to token usage billing for everyone, maybe I'll be singing a different tune but on a subscription, I don't think it makes sense financially.
Fun? Yes. Financially sound? No.
What's interesting/exciting is that local models are _already_ quite good at tasks we never imagined AI _ever_ doing before ChatGPT hit the scene just a few short years ago.
We're also in an interesting point in time where companies are releasing the fruits of their research/labor (the LLMs) to the general public for free. For now, I think they see it in their best interest to gain mindshare and rapport, as well as advancing the state of the art in smaller LLMs ("a rising tide lifts all boats") but I fear and expect that these will dry up as the major players buy the minor players, and all will seek a return on their considerable investments in AI research.
And those are going to all be big enterprise companies that probably will set up LLM services entirely in-house, because they've got the headcount to utilize servers at 100%.
I wonder if there will be (or is currently) business in selling their compute while they're not working, to opposite time zones, etc.
What's left for the big providers will be the dregs of individual subscriptions and small businesses that at their least paranoid might let employees just use their own subscriptions for work.
The other question is how the middle ground (hetzner etc) is shaping up, because obviously so many orgs won't want to run servers.
so long as there's no algorithm breakthrough
LOL - some of us have a budget
If you're a professional that's confident in a positive return on the investment (optimal or not), or just a hobbyist with the luxury budget for a "shop" that cost is well within norms.
That's not everybody, of course, but it's not some inconceivable fantasy. A lot of people in the tech community here on HN, specifically, end up with pretty high discretionary budgets that they pour into stuff like this.
Besides those with effectively unlimited budgets for their personal compute, local models are still a long ways off.
Though, that shouldn't be conflated with the value of open-source models, which can be used by cloud providers to significantly reduce cost of intelligence.
In a way, it's absolutely amazing that we've went from "Playing 'Set a Timer' on Apple Music" intelligence to something that may pass the Turing Test, but in practical terms the small models are still far from what I'd call "good" for more than a tech demo.
To me, 7B models are just a fuzzy echo of Wikipedia. Gemma models at 4 bit are too clumsy to even reliably generate JSON for tool calls or copy a line of code to apply a patch.
Qwen needs so much detail and babysitting to stop it from doom looping or losing the plot, that the instructions that I need to give are usually longer than the code I end up keeping.
Is there some magic prompt that I don't know? Do other people just have a lot more patience, or way lower expectations?
qwen/gemma in the 27/35B range @fp8 are better than gemini-2.5, but less than gemini-3.1, you can run DS4-flash @fp8 on two DGX spark, and things keep becoming better. DiffusionGemma came out recently with 4x token gen speeds.
tl;dr - the models you appear to be trying with are too small or too quant'd
We aren’t wealthy enough to have the hardware that would make this good.
The people who have the money to buy a spare maxed out Mac mini just don’t get it. I see lots of folks with RTX 6000’s in threads like these. Or any RTX card that ends in “90”.
Cloud AI is what allows the proles to participate in the broader AI conversation, but not these AI conversations.
Why do you say 'may'? Just curious. Surely you've got something
The 27B is the smarter, more reliable one - but it is slower. The 35B is faster, still very smart but below 27B, a bit less reliable. The reason is the MoE - Mixture of Experts architecture, which only activates a subset of parameters, making the model much much faster.
I run the 27B on a MacBook Pro M5 Max + 40 GPU cores + 128GB RAM (well, on this beast I can have 27B + 35B in memory at the same time with headroom for all the other stuff). But because this is a laptop, it is not possible to run local LLMs all the time - it just gets too hot and too loud.
What excites me more: I run the 35B model on a MacMini M4 with 64GB RAM. It is fast, it gets a lot of work done (e.g. it scans, extracts and classifies my emails, it watches the mailbox all the time and does work). I also use it as my private Hermes assistant ("when is the next Starship launch?", "who is playing today at the World Cup? Give me some trivia").
Next step I am planning is a RTX Pro 6000 Blackwell workstation I can put in my basement. I want to run qwen really fast, with multiple threads / prompts / agents at once. And MAYBE if the budget allows, a 2x RTX Pro 6000 setup in order to run DeepSeek v4 flash on it (to run research on it).
But, for smaller more well-defined workflows, or as straight "edit this part to be like this exact" edits, they seem more than enough. Still waiting for them to become mature enough to be able to replace what we have as SOTA today, I'd say it's ready to be switched over then.
Speaking of local models, DiffusionGemma (and diffusion models in general) should not be slept on for local usage! Usually the problem locally is that the LLMs aren't efficiently making use of your hardware, unless you start batching requests and run many at the same time, but that require different approaches in general. Instead, diffusion models work much faster for individual prompts, and not by a small margin either.
Today I finally finished porting diffusiongemma-26B-A4B-it support from Transformers into Candle, and together with some optimizations I now have it basically flying with ~450 tok/s (~19 it/s) in Candle during inference, instead of ~180 tok/s (~11 it/s) from HF's Transformers library. Even using vLLM with similar sized LLMs, I don't think I've ever gotten past the ~250 tok/s threshold for single prompts, exciting stuff for local models :)
Diffusion models can't really be trained beyond low-to-mid size and have lower quality than an equally sized, plain one-token-at-a-time model.
It is quite astonishing to see how far local models have progressed, and I think that if you enjoy tinkering a bit, you can save a good bit of money (if you happen to have the hardware lying around anyways). Overall it’s still hard to beat the the cost/convenience combination of a cloud based model provider though.
[0] https://deepclause.substack.com/p/how-to-make-small-models-p...
Currently maxing out two Claude code accounts every x hours when working on large code migrations or setting up new iOS apps etc - most of time it’s fine but occasionally it’s mega frustrating!
With 96GB I'd start with the Gemma 4 and Qwen 3.6 models. Any of those should work fine.
Most other trades need to invest significantly in tools. If you want good tooling, you really want 64GB of GPU memory (e.g. 2x 5090) and 96GB of RAM. If I'm paying $200k for an expert engineer then $50k every other year for tooling seems pretty reasonable.
I posted this yesterday https://github.com/day50-dev/petsitter
I use it with https://github.com/day50-dev/simple-llm-cli
And modify the "tricks" until my evals get to good numbers. It's a model by model basis.
This is what the larger firms are doing - they have custom prompts per model
You can be far more ambiguous with your tasks with the larger proprietary models as opposed to the local models. You can achieve the similar results with local models but you need to be much more detailed in your prompt.
One of the biggest things about running these local models is that the harness matters almost just as much as the model too. Codex is optimized for GPT models, CC is optimized for Claude, Cursor has a great harness that works very well across these providers. It took me a couple of iterations of the different harnesses to find one that would work well with the smaller Qwen models to do local coding.
If you're resourceful, you can even run SOTA models. KimiK2.7, MiMo-V2.5/V2.5-Pro, MiniMax2.5/2.7/3, DeepSeekV3.1/v3.2/V4-Flash/V4Pro, GLM5.1, Step3.7-Flash, Qwen3.5-397B, Qwen3.5-122B, gpt-oss-120B
do you find Qwen3.5-122B to be SOTA-level? I moved from it to Qwen3.6-27B (both Q8), and I prefer 3.6-27B, and it leaves me room to spare for other small models
The cloud-based models are fine for big and complex tasks, but the pricing is ridiculous for small stuff—like summarizing a discussion or fixing a small bug. And cloud and privacy have never been a good match.
As an example, this comment itself was written with the help of Qwen3.5-4B running locally with an extension on top of llama.cpp default web UI [1]. The extension injects my browser's context directly into the conversation, which allows me to summarize things and draft up comments quickly. Speed is pretty acceptable for the size: ~5s TTFT and ~100 t/s generation, all running on a Macbook M5.
And when I want to run bigger tasks, I don't just stick to one provider. Apart from well-known closed-weight providers like OpenAI or Anthropic, I also experiment with open-weight models like GLM-5.1, DeepSeek V4, and Qwen3.6-27B, which provide quite good results for the price.
I'd argue both have value, and I don't see why anyone needs to choose one exclusively. Anyone else doing this?
These models are very capable, and use around 20-30GB of RAM while they are running.
Provided you have 64GB of RAM that leaves space for running other applications at the same time.
In theory, other countries should be able to replicate that effort and improve it.
Claude code supports this by setting the model to "opusplan"- it will automatically use Opus for planning and sonnet for implementation. This was completely necessary with the fable release. I was able to do this with fable and it was necessary to avoid getting quickly rate limited. In settings.json:
"env": { "ANTHROPIC_DEFAULT_OPUS_MODEL": "claude-fable-5" },
Obviously have that set to "claude-opus-4-8" now.
The most "affordable" option is red v2 with 64GB GPU ram and costs $12,000. This is only ("only") 1.5x-3x the price of a beefy desktop (https://pcpartpicker.com/builds/), and could crush inference work even on bigger models. It could support coding tasks for a small team of developers, or run an AI agent for every person in your household...
If you have $12K to spend, you may be better off with DGX Spark or a Mac with 128GB VRAM. That can (barely) fit DeepSeek V4 Flash.
> 64 GB RAM and 1TB storage
Ah ok, not something regular joe and jane happen to have lying around at home.
Additionally the whole configuration is still very much low level, bunch of CLI commands, and if the model doesn't fit for the task at hand, it starts allucinating, generating gibberish, whatever.
Is there any truth to this claim? LM Studio uses llama.cpp to run the models. I guess the overhead of LM Studio should be minimal.
After all LM Studio is a really easy way to host models, are there really major drawbacks?
> Because AI generates pixels based on visual patterns rather than mathematical geometry, it creates the illusion of a sewing pattern without any of the functional blueprints required to actually drape and construct a real garment.
If you want the illusion of a meaningful research question then sure, local models will give you that.
I've often wondered why the hype around apple neural core when 99% of software doesn't use them.
https://github.com/ml-explore/mlx-lm
Having used half the systems that Vicki mentioned, mlx was the best balance between power and ease of use. Just a pip install away.
Make some tests, and its 8 bit version runs at 30tok/s when using llama.cpp with MTP and run on Macbook Max M5. I have 128 GB, but but 64 GB is well enough. https://github.com/stared/benching-local-llms-on-apple-silic...
When using benchmarks, it gives more-or-less the level of SotA mid-late 2025.
I don't vibe-code, but I do decide what to implement and what patterns to use (perhaps asking the model to analyze and give advice on this first), then I have it handle the nitty-gritty of the implementation itself. For this usage style, the latest local models are as good as having Claude at home.
I won't say it's been _easy_ (I ended up implementing my own harness to accommodate the idiosyncrasies of local models), but I will say that for the effort, having a coding agent that's essentially free to query as much as I want has been life-changing as a dev, especially when it comes to working on side projects. Knowing that my agent will never get worse in quality, suddenly cost more than it does now, or be suddenly made unavailable by external factors, was absolutely worth the trouble. And on top of all that, I can't believe it's as good as it is.
The other thing that people tend to gloss over is that you really do need to spend some $$$ on decent hardware. Yeah, you CAN run some 4-bit quant with heavily quantized cache on your 16GB card, but it's not going to be a great experience (I think this is where a lot of the "if you think it's gonna be any good, you're going to be disappointed" stuff comes from). Yes it's a lot of $$$ upfront but it's very much unknown when hardware prices are going to come back to reality. There's a lot of hopes and dreams that any minute now an H100 will be worth pennies because "that's how it's always been" w.r.t. computer hardware, but we are living in interesting times. So you can't just make the tired old assumptions that a Claude subscription over three years time will work out to be dramatically less than the value of some card three years from now. We STILL have basically anything with >=24GB VRAM appreciating in value, which is absolutely wild. What I'm saying is, the depreciation curve may very well be a lot less dramatic and fast than it used to be, going forward.
Edit: Obviously you'll be using more tokens but this is the trade off for running a smaller model and running locally. Similar to time memory trade off but in token economics. Sorry I need more coffee
But for coding in a harness? In my experience it's unusable even for small projects. It just gets hard stuck at every little problem, wasting hundreds of thousands of tokens trying to make a convoluted solution work instead of doing the obvious thing. Or it will spend hours trying to reason through a fairly simple code flow, incrementally adding debug print statements, only to get confused by the output and then editing completely unrelated code that it convinced itself is the problem.
I've tried instead giving Sonnet the problem description and code and have it come up with a detailed plan that Qwen should implement, but doing that actually consumes a significant amount of tokens compared to just telling it to implement everything, and the results are honestly not that much better. There are just too often subtle issues with the plan that Qwen doesn't recognize when implementing, but make the resulting solution it comes up with unusable.
There’s a lot of things we could use even quite small models for, which would not need an insane amount of computing power and memory, but too few of us is really researching them.
Specs: qwen3_17b_base.Q6_K.gguf selora-v047-answer.f16.gguf selora-v047-automation.f16.gguf selora-v047-clarification.f16.gguf selora-v047-command.f16.gguf
The full base model and LoRA adapters are only 3.5GB
Capabilities include configuring for smart home setup to help with answers, clarifications, commands, and creating automations in Home Assistant. The models with the LoRA adapters were made with lean scripted data made specifically for Home Assistant. A lot of work was put into this, feel free to give it a try and happy for any feedback!
How does that work? The script in the post references the file "docker-compose.sandbox.yml", but I don't anything about what that file does.
The post that this one links to, that it's based on, says that Pi doesn't do proper sandboxing.
Presumably bash can still execute other binaries, otherwise it would be fairly useless. What stops it from executing Python? Or opening a network connection and downloading Python?
ROCm stack is not for people though who aren’t willing to dig in and patch things themselves.
I really wonder when companies will start hosting theire model for everday tasks on prem, cause its good enough (and realative cheap), instead of paying subscriptions for all devs.
In an ideal world, yea you can run local models, but I need a powerful always on device for that, or the latest gear, and it will never be as fast as what I can use from google, anthropic, or through an API call. I really wish it was different but I have to shell out a ton of money for that, and I guess it's usecase specific right. Maybe if my phone was super powerful and could run models that would be great, but then I have this issue with cloud sync and using things anywhere else. There will be a world in which local models and self deployed models make sense, this is going to be a core experience, but I personally can't run them.
The good news might be: opensource models are now good (enough) for day2day usage. But is it really? I feel that companies will always naturally strive for the best and use the SOTA (as long it is not too expensive).
I see OSS models being a good backbone for companies in the future that have validated workflows and could use those for privacy or to spare costs.
IDK, might have gone a little bit off-topic here.
Running the same prompt on both with the same .md memory state...
Gemini3.5 is more "intelligent" but Antigravity gets it to decide to go on tangents that are quite time and token-consuming I think. Nice casino machine.
Pi+Qwen3 (~80GB, llama.cpp) is like vibecoding about 1.5 years ago, when you had to babysit, structure your program to have self-contained chunks, and keep an eye on all the cross-cutting concerns to not trip it up. When it works it works fine and when it fails it's my job to ensure it fails fast.
The code is about 10'000 lines of Kotlin in total so it already takes some effort to keep it simple for the AI. It's not a slopped quantity of code, i got solid feature creep :^)
https://play.google.com/store/apps/details?id=com.sixteenam.... ...hat tip to the recent copycat squatter btw it's an honor!
1. Memory bandwidth
2. VRAM size, which limits the size of a model you can use effectively. Yes you can swap but then you're taking a performance hit;
3. Raw FLOPS, including quantization.
Apple here is interesting because they have a shared memory model and you can buy Macs currently with up to 128GB of RAM (previously 256/612GB on Mac Studios, both discontinued). New M5 Mac Studios are expected in Q3 but that's not guaranteed. It may take until next year
Depending on the chip, Macs top out at ~900GB/s. A 5090 or 6000 Pro has 1800GB/s. A B100 is at like 3.2TB/s. A 5090 has, depending on how you count, 5-7x the FLOPS of a M5 Pro so a 5090 is still better than any current Max... except for the 32GB limit.
NVidia aggressively segment the market by limiting VRAM. The RTX 6000 Pro is basically a 5090 with slightly more CUDA cores and 96GB of VRAM instead of 32GB for $10-11k instead of $3k.
So let's project this into the future a little. The M6 Ultra/Max may well be 1TB+/s memory bandwidth with much higher FLOPS and thus actually be competitive for larger models. A 6090 in the current market will probably still have 32GB of VRAM if I had to guess. Maybe it goes up to 48GB.
But anyway I think we're only 2-3 years away from sub-$5000 hardware that does 100-300+tok/s on models larger than 31B. And that's going to be a game changer.
It really is better than I would expect it to be. But it requires a special treatment. Since the model is smaller it needs a smaller and simpler tasks. I use smarter model to decompose the task into primitive subtasks, write good description, submit to worker with qwen3.6, review completion and create new task to fix if required (20% of cases). This workflow works fine.
What is true is that it gets easier and faster to run local models. With QAT (quantization aware training), turboquant (or similar) K/V compression; what used to be impossible to run is now fairly easy.
I can run gemma4:26b-a4b-qat on my laptop with 20-30 tokens/s with a 256k context window. That was unthinkable just 6 months ago.
So the local models are "OK" for small'ish projects.
But it does not at all(!) compare to the frontier models. For a large project Claude's Opus 4.6+ just work, whereas local gemma tangles itself up, makes weird mistakes, and just can't handle it (for those cases it is faster if I do it myself).
If the trends continues, with 1.58bit QAT models, even better K/V compression, faster multi-token prediction et al, maybe soon it will be comparable.
To be fair, I think the labs are also interested in this (e.g OpenAI parameter golf). But the incentives are tricky. When the subsidies and tokenmaxxing era ends, local models will be essential.
Most of those models are also available via Openrouter and many other platforms. Dirt cheap, and much faster than on consumer GPUs. Perfect to try and compare the different options.
However, like many commenters, I don't really believe in vibe-coding, long-horizon agentic one-shot agentic coding, etc. and do not use LLMs for huge generation tasks that involve designing things end-to-end.
I also have an MBP with 128 GB of unified memory and do quite a bit of Qwen3.6-35B-A3B. No, it's not as smart as the aforementioned models, to say nothing of frontier, but many people seem pleasantly shocked by the number of banal tasks that do not require these.
Plenty fast for coding work and for sharing with my OpenClaw setup.
Currently in the process of adding another external GPU (RTX 4090 with pipeline parallelism) via thunderbolt 5 to the Olares One box, for higher quantization, possibly 8-bit, larger context, better concurrency, more kv cache.
The good old butt dyno!
I’ve been eyeing local models more and more with Anthropic squeezing more and more on the subscriptions. A few comments on HN had me waiting until they improved more but this article makes me wonder if I should reconsider that.
I’ve been doing some pretty niche development using a game and a script extender for said game. If these models can handle that, I’d feel good about switching.
Having a local Qwen check another Qwen's work increases the accuracy quite a bit at the cost of more latency. You can't have your cake and eat it too.
In benchmarking local models, I'm having success increasing even a 9B qwen's score on terminal-bench adjacent problems, just by asking it to plan and handing the plan back to qwen with a fresh context. Try it with Qwen3.5, unsloth Q4+, and a thinking budget of around 1024 tokens.
Gemma 4 and Qwen3.6 27B aren't perfect, yet they are such a step forward from the previous generation that it's both feasible to get stuff done locally with patience and very likely that future releases will subvert cloud capabilities entirely.
Plus, they have definite reliability advantages over cloud models that can be wiped out by a government order or lobotomized to handle traffic surges.
One caveat, I have absolutely no patience for a lot of subagent systems, like opencode, where the subagent is walled off and incommunicatable. My subagents really should be their own session, that i can deal with as I please, with some MessageChannel like offerings/tools available to them. Ideally with modes where messages auto-flow in and out, and modes where I can be a gate-monitor. https://developer.mozilla.org/en-US/docs/Web/API/MessageChan...
Not really super related but MCP has been working on Events for a while. That ability to respond fast would be great. https://github.com/modelcontextprotocol/experimental-ext-tri...
Asking local to be fast feels like an obvious folly, but given how much better small models have got, and seeing these models tune themselves for speed: I want to hope!
> “Our goal is to deliver unmetered intelligence to every home and every desk with Windows,” said Satya Nadella, chairman and CEO of Microsoft. “RTX Spark marks a real breakthrough towards that vision.”
Makes me optimistic that those two companies are going to keep investing in quality local models.
I have been considering getting the 58gb Mac Mini but that is a decent amount of money to spend without confirmation on a) how fast is it and b) will it work for well-defined tasks.
I’d love it if model providers just let old models run and let us pay less, but the deprecation makes me want to look into local models.
Running locally is the bar; it's hard to make these things a service which scales.
I'd assume a Mac with 32-64GB memory would get some reasonable results.
After the recent changes to usage, I've spent an annoyingly long number of hours trying to get this to work.
Does it really needs a GPU at 300Watts to do all that tasks?
I wouldn't rely on it for large stuff like codex though. I haven't tried out deepseek/kimi, if we could run those locally it would be great.
I closed the article after that.
The author has no idea what a privilege it is to have a machine like that for personal use, and how 99% of the population are not going to afford a setup like that.
Just some back-of-the-envelope maths will tell you that a $20/month Claude subscription makes much more sense financially.
Larger models just do more complex reasoning. But if you want them to be really good, you need a beefy Mac. They have the best combination of memory bandwidth and RAM to allow medium-sized models to run at speed. GPUs have less memory but more bandwidth, and AMD iGPUs have more memory but less bandwidth. The Mac is the best compromise on the market today.
Once you do have a beefy Mac, you want to run a dense model. This gives you the best possible result with the system you have. You can go MoE for faster results, use cutting-edge inference techniques, parameter tweaks, etc. But a basic dense model (at Q6 quant) on a big-ass mac will serve 90% of your coding needs.
You do need to use sutable hardware.
I get 50tok/s from Qwen 3.6 27b with Q8 & MTP (I can get more aggregate tok/s in parallel rather than using MOE, but don't have enough memory for too many full sized contexts) and 100 tok/s with 35B-A3b Q8 (no MTP as it's not that useful with MOE) on a single workstation gpu that I spent 3k on a couple years ago.
These speeds are somewhat faster than what I've seen from commercial SOTA models, they're plenty fast for many applications.
His program uses quantization, but is very optimised and has builds that can fit into 96GB of memory with great results.
DS4 Flash is usually my go-to for a lot of things these days, and I don't have to worry about a cloud model stopping or telling me it's concerned about my usage.
So i've tested a couple, and the speed is finally impressive. My colleague uses paid tiers of claude and GPT, and the speed is comparable. Maybe even slightly faster on my end.
The problem is: i'm running the model on my work laptop, a 12th gen i5 with 16GB of RAM (which, you know, i asked to upgrade to 64, but that was right at the time of the great RAM shortage of the '20s) so i'm pretty limited in what i can use. And this is running alongside the usual suspects: Web browser hugging 1.5GB, MPLABX hugging 3, windows taking at least 5 just to sit idle, thermal throttled to 1GHz ... And yet its speed is comparable to a paid service. A lunch's worth of tokens vs a few cents of power.
So, what i found, what i fount... What i found is that i need AT LEAST 16k of context window, otherwise they will halt when i pass a small C file for analysis. And coding models will shit the bed with 4k. But we all know that, context size is King.
I found out that Qwen will keep looping while thinking, but that's not a surprise to you, either. But give it enough time and you will get an useful answer. I was hoping to using it as a better warning system for some languages, but i fear i need muuuch more context size, because i tried to feed a file that had a function with an endless loop:
At 4k context it almost shit the bed if i gave it just the offending function, then told it where to look at. At 16k context, with the whole file, it needed some guidance to what the problem was, and after 10-15 minutes of thinking it found the issue. Problem is, it kept second guessing itself for another 20 minutes on the same unrelated thing before giving the output. For which the fix was wrong, but the semanthic was correct. Good enough. Maybe it will be faster if i don't ask for a fix (which i didn't i just asked to look for a specific issue)
Wish i had 3 times the RAM so i can see what happens with more context.
Then i gave it the task to analyze a C file to make an API document. It took half an hour, but then i had a good starting point, which i had to keep changing because it would confuse commands with IDs and things like that.
This was the Qwen 3.5 9B model.
I then tested Gemma 4, being impressed at the tokens per second it gives on my Pixel 8A. Same tasks: same issues with short context, with long context it gave absolutely useless answers when looking at code, but it took 1/3 the time of qwen.
In producing documentation, instead, it was much faster, and it never hallucinated data. Good. in 15 minutes i had everything done.
Not bad for stuff running on a business laptop, while doing actual work.
Tomorrow i will try Qwen 3.6, let's see how it goes..