Running local models is good now (opens in new tab)

(vickiboykis.com)

1551 pointsjfb7d ago595 comments

595 comments

231 comments · 119 top-level

c0rruptbytes7d ago· 32 in thread

I don't know about good, I use a lot of local models and they're still pretty painful to run locally

You have dense models (qwen 27b, gemma 31b) who are pretty smart, but pretty slow

You have MoE models (gemma 26b, qwen 35b, north mini code 30b) who are pretty fast, but make a lot of mistakes

You need a lot of memory to run these well, quantization makes tool calling weaker, so most run at 4 bit quants and are wondering why it kinda sucks and that's because you've essentially lobotomized the model (I recommend unsloth quants, i recommend 6bit for MoEs and 5bit for dense)

So you need a lot of compute to make the pre-fill fast, you need bandwidth to make the decode fast, you need a lot of memory to hold everything - lot of ifs

On top of that, your laptop becomes a loud hot churning machine, it's uncomfortable to work with.

So are they good? not really. Do they work? yes

edit: just wanna clarify - i think open models are the future, i think they're super important, i'm contributing constantly to the ecosystem - i think people should play around with these models, i think people should use `pi` and learn how it all works - but don't download a model expecting it to be good out of the box, you will have to tune and configure a lot of stuff to replace a "coding agent" that most people are using models for

saghm7d ago

This is basically my experience as well. I have a moderately recent but high spec desktop (Radeon 6900 XT with 16 GB VRAM, Ryzen 9 7900X 12-core, 64 GB system RAM), and I tried out some recommended models with ollama a month or two ago. Anything not geared specifically towards coding seemed to struggled with actually making tool calls instead of just stating the actions they would take without making them (and trying to get help from them to explain what I needed to configure to change that behavior was useless; qwen refused to believe that it was running in ollama and insisted that it was running from the Alibaba cloud without access to my local system), and the models intended for coding were barely thinking faster than I could type (if they had any ability to show thinking at all).

The best "free" experience I've found is using OpenCode with Big Pickle. It's not especially smart, so it often won't produce the correct result the first time, but the free tier is generous enough that I don't think I've hit the limit more than twice over around a month with frequent multi-hour sessions. If running locally is truly the goal, it's not going to fit the bill, but if the goal is just "get the best experience without having to pay for a sub or tokens", it's the least bad option I've found so far.

8 more replies

aftbit7d ago

IMO running local models "well" still requires an expensive hardware investment. You really want 96GB of VRAM on a modern Blackwell arch to run these models with decent KV cache. Trying to run them on a unified memory Mac, an AI Max AMD processor, or a DGX Spark-alike is really just asking for trouble. Prefill kills perf.

If you throw the right GPUs at the problem, they become much better - but still not quite in the realm of Sonnet or DeepSeek 4 Flash, let alone Opus / DeepSeek Pro or Mythos/Fable/GPT-5.5.

Given enough budget, power, and cooling, you can run some pretty good data pipelines, but for code, I think it still makes sense to shell out to an API provider most of the time.

10 more replies

zozbot2347d ago

Maybe we shouldn't be running these models on laptops with their thermally constrained form factor, and we shouldn't expect quick inference on a par with a large cloud-based platform either, at least not for near-SOTA model quality. It's still worth it to avoid becoming massively reliant on centralized services.

4 more replies

adam_arthur7d ago

Gemma 4 is particularly good at pipeline/automation tasks.

It outperforms all the Qwen models (even 100B+) for rule following/automation style tasks in my experience. Its image interpretation is also very good, and out-benchmarks Opus.

Qwen seems to ignore instructions and consistently outputs incorrect formats (when token generation format is not explicitly constrained)

But yes, on the DGX Spark Gemma 31B Q4 with MTP runs around 20 tok/s and Gemma 26B A4B around 60 tok/s. Still quite slow. But on a high end Nvidia card would run significantly faster and still fit in memory.

I'd recommend for anyone getting into local models to focus on memory bandwidth over RAM. Models under 100B parameters are now sufficient and hugely useful for automation.

I agree that for coding/creation use cases, there's still not a compelling argument for local models.

But e.g. if you want to scan a list of stocks and interpret news/high pass filtering, interpreting logs, interpreting screenshots, the local models are more than sufficient already.

5 more replies

freehorse7d ago

> You have MoE models (gemma 26b, qwen 35b, north mini code 30b) who are pretty fast, but make a lot of mistakes

This is sadly also my experience. I wish we had some MoE models with a higher ratio of active parameters per total. My experience is that the newer MoE models that can run in a 64b laptop have too few active parameters to be useful outside narrower, specific tasks. Mixtral 8x7b was a 14b active parameter (56b total) MoE model a few years ago and was probably the best model one could run in that range for some time, but it is too old now.

I have been using the qwen 27b and it is great, but running a dense model like this in a macbook is a bit suboptimal, and i wish I could run sth faster than 15 tok/s.

1 more reply

robomartin7d ago

> On top of that, your laptop becomes a loud hot churning machine, it's uncomfortable to work with.

Laptop?

OK, I've made that mistake before. I understand modern laptops are powerful, but nobody wanting to do serious AI/ML work should be using a laptop for anything other than SSH or similar low-performance access into a proper system.

Years ago I fried two laptops just doing finite element analysis work running 18+ hours per day. It was one of those "I'm giving you all she's got, Captain!" workloads. They fried, even with powerful fans cooling them. I should have known better. Such workloads belong on purpose built systems.

locknitpicker6d ago

> I don't know about good, I use a lot of local models and they're still pretty painful to run locally

You are somehow assuming cloud-based models are not painful.

I can tell you my past experience. I was using GPT 5.5 and Claude Opus interchangeably and I prompted them to implement a feature. I paid attention to the agent window and it was literally screwing up implementations, causing tests to fail, and going into test-fail-fix loops to clean up after itself. After a few minutes, it finally called it done. That run cost $0.60.

I went to review the code and only half of the source files complied with the instruction files. I prompted the model to clarify why it failed to comply with the instruction file. The model outputs "you are right, I should have complied with the instruction files. That prompt cost $0.30.

I prompted the model to proceed and apply the instruction file prompts. It went ahead and applied changes. Success. It cost $0.16.

I reviewed the code again. Only half of the sloppy code was touched up. I prompted it to fix the whole mess, not just a couple of files. It complied. One coin less in my purse.

So, around a third of the cost of a feature is spent on the model cleaning the mess it left in it's wake.

And this was a tiny feature with a plan, a solid set of instruction files.

Very expensive.

Are costs going down? I doubt so. OpenAI seems to still be spending 3 times it's revenue already.

In comparison, local models sound very good.

chrsw6d ago

The very understandable desire to not have to rely on huge, centralized companies or powers for tokens has clouded people's judgment on how well these local models actually perform. They've improving, which is great, but for real work I use the best models available right now because they're so much better than local models.

hnlmorg7d ago

To be honest even the cloud models are a hot mess at times. This week I’ve spent more time rejected code from OpenAI models than I have approving it.

In fact it really feels like OpenAI models have taken a nose dive this week compared with Claude. At least for my specific workloads (these things are so variable it’s like trying to compare Google results…)

segmondy6d ago

I run 27B at Q8 with fp16 KV cache at 50tk/sec on 2 3090s. Not 4090, Not 5090. 6 years old GPUs.

Stagnant7d ago

I've been using unsloth/gemma-4-31B-it-qat-GGUF daily for various small parsing and programming tasks using opencode and llama-server's front end. The past couple of weeks have made a big difference after google released the QAT variant and llama.cpp got support for MTP which means it is possible to now get 60-80 Tok/s with RTX 4090. The model fits in VRAM comfortably enough to keep it loaded even while browsing and having multiple programs.

2 more replies

xlii6d ago

> I use a lot of local models and they're still pretty painful to run locally.

This really depends on how and what you're using. e.g. I can't suffer through slowness of inference on Macbook but I have gaming rig with quite powerful GPU and I squeeze ~130 t/s on Gemma or ~70t/s on Qwen.

Tuning is not optional as well. Qwen on temperatures > 0.5 is unusable for coding and I found sweet spot around 0.32 for coding. Speculative decoding on Gemma4 26B is a 30t/s difference between non-speculative.

The worst thing with local models is that I can't just give you a recipe, because what's the best params depends on your use case.

In the nutshell I'd compare local models to running game rig on Windows vs Linux. Linux works great if not better than Windows gaming, but you need to embrace some tweaking in order to get there. Is it there? It's not SOTA, that's for sure, but it's working reasonably well.

atomicnumber37d ago

I largely don't disagree with you but come to a different conclusion. I have two systems:

1) a "programming desktop" with a $500 upper mid range Ryzen (idr exact), 8GB VRAM Radeon card I bought solely for RuneScape, and 64GB ram

2) a maxed out Alienware 16 Area51, so it's a 5090 with 24GB vram and 64GB system ram. I bought it for gaming, of course.

I run qwen 3.6 35B A3B Q6 with 200k context window. I compare this to Claude pro max or whatever that I use at work.

The main difference between the machines is that the one with the RuneScape gpu does 10 TPS while the Alienware does 30-40tps. Both are fine though the 30-40tps is obviously a lot snappier.

I find with both models that:

- they do really well at "be a 30GB zip file of reddit and stackoverflow answers"

- they do really well at point fixing random bullshit errors that would otherwise waste my time (this is related to above of course)

- they do quite well at, given a pretty good specification of what you want, figuring it out, even if you've specified several steps needed

- they both cannot really be given a large ish task and left to just drive it on their own

The main difference between the two is with that last one, Claude is somewhat better and figuring SOMETHING out, but if Claude is having to figure it out, it's probably because I don't know what I want and it's very likely to not make a sane choice, and will generally produce slop given even the slightest amount of leash still.

I've also found that the boundary between "well specified small to medium thing" and "idk just do thing and figure it out" is the difference between you keeping control of the code and losing control. There's an "escape velocity" of AI use that, when you hit it, you're doomed to slop forever. (Or you have to deorbit... enjoy that). And while claude might have slightly higher velocity allowed while remaining suborbital, it's very diminishing returns.

So, are these models "worse" than Claude? Yeah. Am I looking forward to continued improvements? Yeah. But I now also have no desire to pay anthropic any amount of money, which has the nice side effect that i won't be helping them end up with so much money that they can distort our democracy.

andy_ppp7d ago

I wonder if it is better to have a machine somewhere running a model for you maybe shared with a few others. I could probably justify a M6 Mac Studio with hopefully 256gb RAM and have a few people all with access to one agreed upon model. I think maybe laptops are too warm and clunky for this.

1 more reply

iwontberude7d ago

They are good if you were clever enough to buy a powerful enough rig before memory went up. For everyone else I say just wait. M1 Ultra 128GB and higher is sufficient to run gemma4:31b-mlx or qwen3.6:35b-mlx with subagents. It’s only slow if you don’t know how to plan your work effectively.

heipei7d ago

Depends on what you mean by "local". On your Macbook, large dense models like Qwen 3.6 27B will be slow, sure. On a local workstation with a dedicated RTX card you can get > 100 tps, which is more than good enough to work with it, and faster than cloud models in many cases.

2 more replies

devilsdata7d ago

Just to piggyback onto this comment; has anyone tried running multiple of these in conjunction? For example, having a Python script that has one of these orchestrate others, and offloads certain tasks to better/more powerful models, or even cloud models?

1 more reply

naikrovek5d ago

> You have dense models (qwen 27b, gemma 31b) who are pretty smart, but pretty slow

slowness doesn't matter a lot to me, at home. I will type up a prompt and submit it and let it run while I do other things around the house. I have all kinds of things to do, and most of them do not require sitting in front of a computer.

of course faster would be better, but it's not always a requirement. smart and slow is far better than dumb and fast or even nothing at all.

EnPissant6d ago

When running on a GPU, dense models are shaping up to be the best way due to two things:

- Maximum intelligence per VRAM (you dont have much)

- Dense models can benefit from MTP to get an almost 2x speedup in decode (ie, a 27b dense model with mtp decodes at about the same speed as a MoE model with 14b active param model would). This is important because local llm rarely has parallel streams to batch together.

When running on large unified memory like Strix Halo or Spark Dgx, MoE models are usually best:

- You can get similar intelligence as a smaller dense model with fewer active params (to compensate for the slower memory) by throwing ram at the problem.

2 more replies

not_kurt_godel7d ago

I had some local model FOMO, trialed for a few days, and tentatively arrived at the same conclusion. I can get a better ROI on the time I spent waiting and dealing with poor quality by just programming by hand myself instead.

FuriouslyAdrift7d ago

Kimi 2.6 or 2.8 is what we are playing with locally. They need 512GB to 1TB to run with full capabilities so that's not exactly "desktop"

Our GPU computer server cost $110k.

1 more reply

smcleod7d ago

Those dense models are pretty fast with MTP now. 40-70TK/s depending on your machine, that's faster than cloud models (although not as smart obviously).

NamlchakKhandro6d ago

Pi mono is king. Everything else is hypetrash.

If I can't customise it then I won't waste my time using it it getting use to it.

Claude code is trash, it's customisability is extremely shallow, open code, codex, copilot, Kiro, etc etc... all trash. Yes even open code..

If open code was so awesome then open claw would have been based on it... But it wasn't. That's should tell you everything you need to know.

greenavocado7d ago

4 bit unsloth quants are good if you never ask for more than 20k context, use it as autocomplete on steroids, and never delegate serious questions to it

beadw7d ago

I think you’re spot on. In my experience people confuse a models ability to solve some benchmark as a sign of its usefulness. Token throughput is often just as important from my personal usage. I am excited for more diffusion models to see how progress happens there.

1 more reply

ridiculous_leke7d ago

A median laptop is no bueno for running a reliable model(which will be qwen 27b as per my reading here and r/localllama). Powerful macs would be prevalent in certain areas of the world but in rest of the world personal machines aren't always that powerful.

dominotw7d ago

maybe painful if you are using it like a chatbot. you are sitting there waiting for response. vs ambient ai like automatically classifying your family pics and discarding random things like parking floor number pic.

i use it usecases like that latter and they are fine.

markdog126d ago

100% agree. I've spent many hours testing out local models/harnesses. So far, they're very much not worth the tradeoff. Obviously, I hope that changes.

everdrive7d ago

What counts as a lot of memory? What could someone do with 16 GB of RAM?

6 more replies

onel6d ago

Agree with this. Open models are the future but currently they are a pain to run locally.

As painful as it is to admit, the future might be cloud inference from a trusted provider.

adam_patarino6d ago

I always find it amusing when people would rather spend $200 / mo than let their laptop fan turn on.

citizenpaul7d ago

They are still terrible at tool usage which loses 99% of the effectiveness of the agent. I've had to concede and use paid frontier models that can use tools or its not worth using agents....copy...paste....copy....paste....

1 more reply

hypfer7d ago· 21 in thread

After having been a happy user of Qwen3.6-27B for a few weeks, due to being away from the hardware, I'm currently forced to use Claude Sonnet 4.6

It is such a downgrade. I don't understand how that's even possible. The thing has so many strongly-held opinions I did not ever ask it for, talking just way too much and generally feeling somehow dumber.

Of course, being significantly larger, it will encode more knowledge, but that doesn't help me when I hate talking to it. And all that on top of the fact that talking with it costs real money.

I wonder what it might be that makes me hate it so much. Maybe because it doesn't see itself as a tool but almost an equal? As if its opinions would have weight.

Qwen too can act like an overeager intern, but if you tell it that it is an idiot, it will drop that ego. Not so much with Claude. In my experience, anyway.

Anyway, point is: full ack on that headline.

ggerganov7d ago

I haven't spent a dime on cloud inference, so cannot make a direct comparison like you. But I can 100% attest to the fact that Qwen3.6-27B is a very capable local model for coding tasks. Over the last month and a half I've been using it almost daily, either on my M2 Ultra or on my RTX 5090 box. I use it for small mundane tasks at ggml-org [0] - nothing really impressive, but definitely a helpful tool for a maintainer. I think I would be using it much more, if I didn't have to spend a lot of my time on reviewing PRs. Currently, I have a very lightweight harness - the pi agent with everything stripped (`pi -nc --offline`) and a short system prompt [1] to align it a bit with my style. About the generation speed: ~100-150 t/s on the RTX 5090 and ~40 t/s on the Mac. I definitely prefer running it on the RTX machine - it's so much faster. But for the sake of testing and getting wider experience with local configurations, I often run it on the Mac too.

[0] - https://github.com/search?q=%22Assisted-by%22+user%3Aggml-or...

[1] - https://github.com/ggml-org/llama.cpp/blob/master/.pi/gg/SYS...

7 more replies

StevenWaterman7d ago

Yep, I daily drive Qwen3.6-27B (including for work), have done pretty much since it came out. IMO it's the only (small-ish, local) model worth using, if you can run it. It might not be as good as Opus at "add X large feature" but I don't want that in a model. I want to do the thinking while it does the typing. And Qwen 3.6 27B is perfectly good at that (while in my experience models like the 35A3B and gemma are significant downgrades)

Plus, I never have to worry about rate limits, quotas, or sitting in a queue during peak time. And I can always see its full thoughts, don't have to worry about where my data is getting sent, and know it can't get secretly nerfed.

Running on 2x 3090, 500-1000tok/s prefill and 60tok/s output at Q6_K_XL with MTP on llama.cpp, 220k tokens context window (starts to get a bit dumb above 160k ish), no KV quantization

6 more replies

epistasis7d ago

> talking just way too much

OMG this is such an annoying property, just shut the hell up please, and be concise.

I suspect that this is an artifact of the thinking property, but please just summarize the thinking process far more concisely, where a single sentence answer is more than sufficient the frontier models seem devoted to going on to a minimum of 5 paragraphs and offering 3-5 new directions.

And requests to please only offer a single step at once, or single option at once, or to even stop eagerly offering future directions is really hard to prompt correctly.

And look, there I did exactly what I was complaining about...

3 more replies

derethanhausen7d ago

I would not generalize based on experiences with Sonnet. The flagship models (Opus being the claude equivalent) are dramatically better.

1 more reply

kitd7d ago

Funny that coding agents have personalities, including "that colleague" you want to avoid even if you know they're probably quite good at what they do!

1 more reply

radium3d7d ago

If you think about it, they're splitting the power across millions of users. Essentially, these AI companies have YOUR hardware that YOU are paying (them) for in a cabinet at some data center. This means the hardware could easily be run locally for inference for these 'big' models. It's just a problem of dynamics-- RAM is being bought in bulk by these companies through these B200 style cards, instead of sold slowly through the open public markets.

This is likely due to a combination of mass funding for the AI companies, but also they are trying to governmentally restrict which countries get access to these cards so certain countries get a head start. The only way to lock that down is to have them literally locked in their own GPU prisons (data centers). Third reason is it does make it possible to train the models faster by having them in the same data center connected directly. Having them distributed to everyone would slow down training considerably.

The current way to 'own' decent RAM and GPUs right now is through the stock market it seems.

giancarlostoro7d ago

There's a model on Huggingface where someone takes Qwen and makes it think Opus style, and that one seems to be decent, not sure if they have the 27B variant in that style. I do wonder if you can tweak your system prompt to force Qwen to behave better?

2 more replies

MostlyStable7d ago

Curious if you have tried custom instructions. I was never quite as unhappy with Claude's voice as you appear to be, but there were several things I didn't like. A custom prompt fixed almost all of them.

1 more reply

bjackman6d ago

Re being away from the HW: with Tailscale and llama-server it's now super easy to just run an inference server at home and use it from wherever you are.

1 more reply

linuxhansl6d ago

How qwen3.6:27b compare to qwen3.6:35b-a3b (MoE) in your experience (if you tried). I find the dense models are way too slow on my H/W.

2 more replies

andix7d ago

Sonnet is extremely overpriced. It's a good model, but not worth the money Anthropic charges for it.

dyauspitr7d ago

Why would I want some half assed coding assist tool. I want something that takes in a requirement and spits out a finished product. It’s not your equal, it’s better than you.

Shorel5d ago

I use tailscale to have remote access to my local models when on the move.

dackdel7d ago

what kind of hardware do you need in order to run qwen3.6-27b

4 more replies

indoordin0saur7d ago

Very curious what hardware you're running this on!

1 more reply

cmrdporcupine7d ago

The Anthropic models have always been annoying this way -- chatty/opinionated and Dunning-Krugerish. And love to run away and do things unprompted with me jamming my ESC ESC ESC key over and over so I can get a word in edgewise.

FWIW Codex/GPT models are way less this way. Maybe to a fault.

I'm setting up my DGX Spark to try Qwen 3.6 27B again, as I'm hearing a lot of good reviews. When I tried it some time ago it was still early for support in llama.cpp.

zerd7d ago

I noticed Fable was quite a bit terser, and I think it's due to changes in the system prompt [0]. They're literally saying "just give me the TLDR" and "give brief updates". You can tweak a lot of that with an AGENTS.md.

[0] https://twelvetables.blog/comparing-claude-fable-5s-system-p...

mik096d ago

try qwen 3.7 or glm 5.2 or one of the larger gemma models

chrisweekly7d ago

Why Sonnet 4.6 not Opus?

ltononro7d ago

Well but comparing with sonnet 4.6 instead of opus 4.6,.7 or .8 doesnt make a real point I mean, pay 200 USD/month (if you have that cash, or your company has it), might not justify using local at all (unless you have some reason to suspect about data leakage)

calebm7d ago

sync/ack

rmunn7d ago· 15 in thread

This is the kind of thing that Anthropic et al should be worried about. As it becomes easier and easier to run local models, the ceiling of what they'll be able to charge will get lower and lower. Not that nobody will be willing to pay $$$$$ per month, but a lot of people are going to multiply the per-month charge by 12 or 24 and say "Could I set up a local model for less than that, and have it pay for itself within a year or two?" And if a significant portion of customers decide to buy instead of rent, the companies whose business model is entirely centered around renting will suddenly find themselves hurting for customers.

sathackr7d ago

The opposite of that has been happening for 20 years now with cloud compute.

It won't happen with AI models either.

It's almost ingrained in the American business model now. Outsource everything. Nobody wants to manage a room full of servers when they can spend 2-3x as much and outsource that headache along with the responsibility for it.

Same will happen with AI. Whether that means paying Anthropic that premium or paying AWS.

I'm in a relatively small business, we recently had an outage related to our local infrastructure.

I got pressure from the CEO saying it wasn't reliable to host our own infrastructure anymore even though our total internal down time over the last 5 years is significantly less than even a single of the larger recent AWS outages.

Everyone wants to shuck the chore and the responsibility.

14 more replies

indoordin0saur7d ago

I'm curious when coding-heavy companies will start running their own on-prem AI clusters. Has anyone had the idea to sell something like 4 GPU machine an engineering team could throw in a closet somewhere and run whatever they want on it? I imagine this won't appeal to everybody but with the trust issues the hyperscalers have developed hoovering up people's data and using it to train their models, I imagine some will find value in a machine and model they have transparent control over including the option to walk over and unplug the thing.

1 more reply

starshadowx27d ago

Earlier I was thinking it's maybe comparable to paying for Netflix vs torrenting and running Plex or something. For the majority of normal, mainstream users I feel like most would just pay for the thing that is already setup and ready for them. There'll still be all the more techy or determined types who will do it themselves, I just wonder what the percentages of both groups will be.

2 more replies

storus7d ago

They are working hard on you not being able to run a thing locally. OpenAI buys all RAM on the spot market, causing the rise of RAM/VRAM prices 6x, making GPUs and decent computers unreachable for the majority of the population. OK, some richer folks might be able to get a 512GB MacStudio or a single RTX Pro 6000 for 13k and be able to run some decent local models, but the vast majority will need to use API. And at some point Nvidia might say: "We don't sell that many 6000s, so let's just cancel them altogether as we can gain 4x profit on datacenter-only GPUs" and then they'll become unobtainium and no private person would ever be able to run anything decent (~1 year behind the frontier) locally.

1 more reply

wuliwong7d ago

These local models can do some of the work the non-frontier models can do but for me, that's not worth much. If I am just using Sonnet 4.6, I can pretty much work all day on the $20/month plan. And Sonnet is still a way more powerful model than a one you could self host on an M2 mac.

If things change to token usage billing for everyone, maybe I'll be singing a different tune but on a subscription, I don't think it makes sense financially.

Fun? Yes. Financially sound? No.

1 more reply

bityard7d ago

The general consensus is that local models will continue to improve drastically, but hosted models will as well. There will _always_ be a pretty big gulf of capability between what you can do with a desk full of hardware at home vs a few racks of hardware in a datacenter. That seems to be the real "moat" of hosted models at this point in time: access to capital.

What's interesting/exciting is that local models are _already_ quite good at tasks we never imagined AI _ever_ doing before ChatGPT hit the scene just a few short years ago.

We're also in an interesting point in time where companies are releasing the fruits of their research/labor (the LLMs) to the general public for free. For now, I think they see it in their best interest to gain mindshare and rapport, as well as advancing the state of the art in smaller LLMs ("a rising tide lifts all boats") but I fear and expect that these will dry up as the major players buy the minor players, and all will seek a return on their considerable investments in AI research.

5 more replies

xdertz6d ago

AI usage is very spiky and good models require very expensive hardware. Running locally would just result in it sitting idle ~90% of the time. I think renting will always be cheaper, for comparable performance at least.

pessimizer7d ago

> but a lot of people are going to multiply the per-month charge by 12 or 24 and say "Could I set up a local model for less than that, and have it pay for itself within a year or two?" And if a significant portion of customers decide to buy instead of rent, the companies whose business model is entirely centered around renting will suddenly find themselves hurting for customers.

And those are going to all be big enterprise companies that probably will set up LLM services entirely in-house, because they've got the headcount to utilize servers at 100%.

I wonder if there will be (or is currently) business in selling their compute while they're not working, to opposite time zones, etc.

What's left for the big providers will be the dregs of individual subscriptions and small businesses that at their least paranoid might let employees just use their own subscriptions for work.

icoder7d ago

What I don't understand is that on one hand we read 'what they charge is much less than it costs them' and on the other hand this thread seems to suggest that 'what they charge is more than it would cost me'.

4 more replies

spopejoy6d ago

I know coding is the killer app thus far, but if businesses are seeing any kind of significant cost for other LLM usecases, seems like at least a decent consultancy opportunity to set up medium-sized businesses with in-house kit.

The other question is how the middle ground (hetzner etc) is shaping up, because obviously so many orgs won't want to run servers.

sbmthakur7d ago

Someone was able to run gemma-4-26B-A4B on an i5-8500 with 32 gb ram with NO GPU. Granted this is an extreme example these MoE models are value for money for a lot of use cases.

https://www.reddit.com/r/LocalLLaMA/s/YontVNVRbL

themaninthedark7d ago

Maybe that is why they are buying up as much hardware as they can? If their service is the only game in town.

1 more reply

frollogaston7d ago

Anthropic isn't just renting out compute, they're renting out a closed model that's better than anything you can download for free. So they're rightfully focused on preventing others from distilling their model.

1 more reply

ActorNightly7d ago

Local models will never achieve "real" performance (i.e actual usage, not benchmarks) compared to frontier models.

mik096d ago

i think in the long term the problem is going to be this: a great small model always come from a greater large model, but the larger base model keep getting larger and more closed sourced

so long as there's no algorithm breakthrough

_doctor_love7d ago· 7 in thread

"Just get a 64GB Mac with 1TB of storage!"

LOL - some of us have a budget

swatcoder7d ago

Sure, but it's also not really out of scale with the cost of a shop tool in other trades.

If you're a professional that's confident in a positive return on the investment (optimal or not), or just a hobbyist with the luxury budget for a "shop" that cost is well within norms.

That's not everybody, of course, but it's not some inconceivable fantasy. A lot of people in the tech community here on HN, specifically, end up with pretty high discretionary budgets that they pour into stuff like this.

1 more reply

amalcon7d ago

A Strix Halo with similar RAM is considerably cheaper. Still not cheap, mind, but performance is OK (not great) and it will run more or less the same models.

1 more reply

p-e-w7d ago

No need. You can run the Gemma 4 and Qwen3.5 MoE models with as little as 12 GB of VRAM at 30-40 tps (Q4/Q5), and they both blow GPT-4o and DeepSeek R1 out of the water.

anarticle7d ago

Pros buy their own tools. This is why working for yourself is better than working for a corpo, you get to choose your weapon.

1 more reply

techscruggs7d ago

He is using a 2022 M2, which you can get that for about $2k used. That is beyond reasonable.

2 more replies

tjwebbnorfolk7d ago

AI and budgets don't mix well at the moment

themythfable7d ago

Yeah, I never had a computer that cost north of $800 until recently. While that is far from the typical HN user's budget, my bet is that it is much closer to average.

Besides those with effectively unlimited budgets for their personal compute, local models are still a long ways off.

Though, that shouldn't be conflated with the value of open-source models, which can be used by cloud providers to significantly reduce cost of intelligence.

2 more replies

pornel7d ago· 5 in thread

[meta] I wonder why people have such wildly different bar for what is "good" agentic coding?

In a way, it's absolutely amazing that we've went from "Playing 'Set a Timer' on Apple Music" intelligence to something that may pass the Turing Test, but in practical terms the small models are still far from what I'd call "good" for more than a tech demo.

To me, 7B models are just a fuzzy echo of Wikipedia. Gemma models at 4 bit are too clumsy to even reliably generate JSON for tool calls or copy a line of code to apply a patch.

Qwen needs so much detail and babysitting to stop it from doom looping or losing the plot, that the instructions that I need to give are usually longer than the code I end up keeping.

Is there some magic prompt that I don't know? Do other people just have a lot more patience, or way lower expectations?

papersail7d ago

I had similar doubts. I think expectations differ because the workload differs. For small scripts, glue code, or simple CRUD changes, smaller models such as Qwen3.6-27B can work wonders than they do on a larger, messier code base.

palisade6d ago

Those who have never known anything better are okay with much less. For example, anyone who used Fable when it came out are saying that it is very difficult to go back to lesser models now. Even our strongest aren't good enough in comparison.

1 more reply

verdverm7d ago

There is a lower bar (that gets lower over time), but ime, the config you are describing is too low still.

qwen/gemma in the 27/35B range @fp8 are better than gemini-2.5, but less than gemini-3.1, you can run DS4-flash @fp8 on two DGX spark, and things keep becoming better. DiffusionGemma came out recently with 4x token gen speeds.

tl;dr - the models you appear to be trying with are too small or too quant'd

cheschire6d ago

Haves and have nots.

We aren’t wealthy enough to have the hardware that would make this good.

The people who have the money to buy a spare maxed out Mac mini just don’t get it. I see lots of folks with RTX 6000’s in threads like these. Or any RTX card that ends in “90”.

Cloud AI is what allows the proles to participate in the broader AI conversation, but not these AI conversations.

2 more replies

towledev6d ago

> may pass the Turing Test

Why do you say 'may'? Just curious. Surely you've got something

1 more reply

iagooar7d ago· 3 in thread

I love running two models locally: qwen3.6 27B 8bit (dense) and qwen3.6 35B 4bit (MoE).

The 27B is the smarter, more reliable one - but it is slower. The 35B is faster, still very smart but below 27B, a bit less reliable. The reason is the MoE - Mixture of Experts architecture, which only activates a subset of parameters, making the model much much faster.

I run the 27B on a MacBook Pro M5 Max + 40 GPU cores + 128GB RAM (well, on this beast I can have 27B + 35B in memory at the same time with headroom for all the other stuff). But because this is a laptop, it is not possible to run local LLMs all the time - it just gets too hot and too loud.

What excites me more: I run the 35B model on a MacMini M4 with 64GB RAM. It is fast, it gets a lot of work done (e.g. it scans, extracts and classifies my emails, it watches the mailbox all the time and does work). I also use it as my private Hermes assistant ("when is the next Starship launch?", "who is playing today at the World Cup? Give me some trivia").

Next step I am planning is a RTX Pro 6000 Blackwell workstation I can put in my basement. I want to run qwen really fast, with multiple threads / prompts / agents at once. And MAYBE if the budget allows, a 2x RTX Pro 6000 setup in order to run DeepSeek v4 flash on it (to run research on it).

Barbing7d ago

Did you get a Brave search API key or something for that “Hermes”?

4 more replies

zerd7d ago

I'd love an RTX 6000 Pro, but how can you justify it when it costs 10 years worth of Claude Max?

2 more replies

jnaina6d ago

how are you connecting the 35B model to your mailbox, for email classification?

1 more reply

embedding-shape7d ago· 2 in thread

Show us the resulting code of using them! :) I want to use local models, I have the hardware for it, but while trying them out as replacements for GPT 5.5 xhigh or Opus or other SOTA models, they aren't quite ready to be replaced yet, sadly. The quality and bumps they encounter just slows down the workflow so much, even screwing up tool call syntax sometimes.

But, for smaller more well-defined workflows, or as straight "edit this part to be like this exact" edits, they seem more than enough. Still waiting for them to become mature enough to be able to replace what we have as SOTA today, I'd say it's ready to be switched over then.

Speaking of local models, DiffusionGemma (and diffusion models in general) should not be slept on for local usage! Usually the problem locally is that the LLMs aren't efficiently making use of your hardware, unless you start batching requests and run many at the same time, but that require different approaches in general. Instead, diffusion models work much faster for individual prompts, and not by a small margin either.

Today I finally finished porting diffusiongemma-26B-A4B-it support from Transformers into Candle, and together with some optimizations I now have it basically flying with ~450 tok/s (~19 it/s) in Candle during inference, instead of ~180 tok/s (~11 it/s) from HF's Transformers library. Even using vLLM with similar sized LLMs, I don't think I've ever gotten past the ~250 tok/s threshold for single prompts, exciting stuff for local models :)

zozbot2347d ago

> Instead, diffusion models work much faster for individual prompts, and not by a small margin either.

Diffusion models can't really be trained beyond low-to-mid size and have lower quality than an equally sized, plain one-token-at-a-time model.

1 more reply

an0malous6d ago

No one ever shows the resulting code of using frontier models either. Curious.

schmuhblaster6d ago· 2 in thread

I’ve been playing around with qwen3.6-35b-a3b and managed to boost it significantly by leveraging my own custom harness [0].

It is quite astonishing to see how far local models have progressed, and I think that if you enjoy tinkering a bit, you can save a good bit of money (if you happen to have the hardware lying around anyways). Overall it’s still hard to beat the the cost/convenience combination of a cloud based model provider though.

[0] https://deepclause.substack.com/p/how-to-make-small-models-p...

phunterlau6d ago

Cool, so the determinstic harness can boost the agent pretty much!

edg50006d ago

Harness engineering is very interesting stuff. Thanks for sharing.

richbradshaw7d ago· 2 in thread

I’m keen to understand speed here etc etc. if I bought a Mac studio with 96GB - what can I realistically run, how’s it compare to fable/opus etc and how fast is it?

Currently maxing out two Claude code accounts every x hours when working on large code migrations or setting up new iOS apps etc - most of time it’s fine but occasionally it’s mega frustrating!

simonw7d ago

I strongly recommend trying LM Studio - it's the lowest friction way to try out models, you can browse https://lmstudio.ai/models and click "Get" and then "Run in LM Studio" to download and run a model.

With 96GB I'd start with the Gemma 4 and Qwen 3.6 models. Any of those should work fine.

AbsurdCensor7d ago

I think currently you can only get the M3 Ultra Studio with 96gb, and for coding tasks, say you rub Qwen Coder on it (which doesn't need that much ram), it's not the fastest, something like 30-40 tok/sec. Probably better with a MacBook Pro with the M5 chip. There is a website for comparing different configurations and models: https://llmcheck.net/benchmarks

angry_octet7d ago· 1 in thread

Programmers are used to paying nothing for tools. A basic laptop (SSD, multi core, 16GB of RAM) is hugely powerful if you are building in C/C++/Rust, even python. But all of a sudden it's no good, and we're back to using someone else's computer, hiring our tools every day. Worse, we get a different model every day, and maybe we aren't allowed to borrow the good tools some days because some mafioso are shaking down the manufacturer.

Most other trades need to invest significantly in tools. If you want good tooling, you really want 64GB of GPU memory (e.g. 2x 5090) and 96GB of RAM. If I'm paying $200k for an expert engineer then $50k every other year for tooling seems pretty reasonable.

rsanek7d ago

Who's paying the $50k? I don't see how it makes sense to pay that much for a home-grown setup when I could pay <$5k/year total for both of the two best frontier models at effectively unlimited usage.

3 more replies

sosodev7d ago· 1 in thread

I think this is overselling their capabilities. I've used Gemma 4 and Qwen 3.6 quite a bit on my strix halo home server. They're great models and the dense variants are significantly better, but they're still very far behind the frontier. If you boot up Gemma 4 MoE and OpenCode/Pi and expect to perform anything like Claude Code or Codex you're going to be very disappointed.

kristopolous6d ago

You need to switch out the prompts and work with it differently.

I posted this yesterday https://github.com/day50-dev/petsitter

I use it with https://github.com/day50-dev/simple-llm-cli

And modify the "tricks" until my evals get to good numbers. It's a model by model basis.

This is what the larger firms are doing - they have custom prompts per model

1 more reply

0xc0c0c07d ago· 1 in thread

I have used local models (around 128 gb) and the big proprietary models, and while I do want local models to win, it's important we keep the expectations of local models realistic. There are many blog posts about how local models today can fully replace some of the proprietary models and in some cases its true for the much smaller proprietary models, its very clearly much more behind the larger models.

You can be far more ambiguous with your tasks with the larger proprietary models as opposed to the local models. You can achieve the similar results with local models but you need to be much more detailed in your prompt.

One of the biggest things about running these local models is that the harness matters almost just as much as the model too. Codex is optimized for GPT models, CC is optimized for Claude, Cursor has a great harness that works very well across these providers. It took me a couple of iterations of the different harnesses to find one that would work well with the smaller Qwen models to do local coding.

failbuffer7d ago

So which harness did you end up choosing?

segmondy7d ago· 1 in thread

It's more than good. As of today, it's great. Those models listed in the blog are horrible compared to what you can run today, There's absolutely no reason to run those, you have Qwen3.6, Gemma4, and plenty other sized comparable models.

If you're resourceful, you can even run SOTA models. KimiK2.7, MiMo-V2.5/V2.5-Pro, MiniMax2.5/2.7/3, DeepSeekV3.1/v3.2/V4-Flash/V4Pro, GLM5.1, Step3.7-Flash, Qwen3.5-397B, Qwen3.5-122B, gpt-oss-120B

agile-gift02626d ago

> Qwen3.5-122B

do you find Qwen3.5-122B to be SOTA-level? I moved from it to Qwen3.6-27B (both Q8), and I prefer 3.6-27B, and it leaves me room to spare for other small models

ngxson7d ago· 1 in thread

My 2c: I think the "cloud vs local" debate is (maybe) a false dichotomy. In my experience, I use a hybrid approach and I've seen a huge productivity boost from it.

The cloud-based models are fine for big and complex tasks, but the pricing is ridiculous for small stuff—like summarizing a discussion or fixing a small bug. And cloud and privacy have never been a good match.

As an example, this comment itself was written with the help of Qwen3.5-4B running locally with an extension on top of llama.cpp default web UI [1]. The extension injects my browser's context directly into the conversation, which allows me to summarize things and draft up comments quickly. Speed is pretty acceptable for the size: ~5s TTFT and ~100 t/s generation, all running on a Macbook M5.

And when I want to run bigger tasks, I don't just stick to one provider. Apart from well-known closed-weight providers like OpenAI or Anthropic, I also experiment with open-weight models like GLM-5.1, DeepSeek V4, and Qwen3.6-27B, which provide quite good results for the price.

I'd argue both have value, and I don't see why anyone needs to choose one exclusively. Anyone else doing this?

[1]: https://github.com/ngxson/llama-companion

phainopepla27d ago

Why not just use DS V4 Flash for the small stuff? Very fast and extremely cheap.

1 more reply

simonw7d ago· 1 in thread

I think gemma-4-26b-a4b and Qwen3.6-35B-A3B show that there's something very interesting about a local model that does mixture-of-experts (which helps a lot with performance) and has in the order of 30 billion parameters.

These models are very capable, and use around 20-30GB of RAM while they are running.

Provided you have 64GB of RAM that leaves space for running other applications at the same time.

chrisweekly7d ago

Obtaining that 64GB RAM is a meaningful obstacle for many.

2 more replies

minton7d ago· 1 in thread

I’m glad people are looking into this because I do think it’s the future. However, why would you not take advantage of the heavily subsidized frontier models while you can. It’s obvious that they’re gonna have to raise prices at which point it might make sense to consider local models, but not today.

fendy30027d ago

Curiosity or anticipation I think. I have tried it in the name of those 2 factors, because when the frontier model price increase happens and we don't know anything about local models, we're screwed

sieste6d ago· 1 in thread

The "middle powers" (cf Carney) should invest in local models, rather than relying on US and China allowing them to rent their AI models. It takes a single executive order to cut the rest of the world off of American AI tools. "I'm happy to pay whatever to rent frontier models from hyperscalers" makes sense if you're citizen of a superpower, but it's risky, naive, bordering on irresponsible to adopt this mindset otherwise, especially when your business or career depend on the tool.

k__6d ago

Training DeepSeek was magnitudes cheaper than training the SOTA models it relied upon.

In theory, other countries should be able to replicate that effort and improve it.

1 more reply

gregwebs7d ago· 1 in thread

All these conversations seem like they are missing talking about planning vs execution. I want the best possible frontier model to plan out my changes. I also have a 2nd agent that is a frontier model check the plan. Then at that point the implementation can be done by a lesser and possibly local model. The frontier model can still do a final code review on the implementation of the changes.

Claude code supports this by setting the model to "opusplan"- it will automatically use Opus for planning and sonnet for implementation. This was completely necessary with the fable release. I was able to do this with fable and it was necessary to avoid getting quickly rate limited. In settings.json:

"env": { "ANTHROPIC_DEFAULT_OPUS_MODEL": "claude-fable-5" },

Obviously have that set to "claude-opus-4-8" now.

noveltyaccount7d ago

I do this with Codex 5.5 for planning (specs, technical design, and task list); and Qwen 3.5-35B for task by task build out. It requires more hand holding and makes more mistakes than using Codex for everything, but it helps me spread my $20 chatGPT subscription pretty far.

infogulch7d ago· 1 in thread

Anybody used a tinybox? https://tinygrad.org/#tinybox

The most "affordable" option is red v2 with 64GB GPU ram and costs $12,000. This is only ("only") 1.5x-3x the price of a beefy desktop (https://pcpartpicker.com/builds/), and could crush inference work even on bigger models. It could support coding tasks for a small team of developers, or run an AI agent for every person in your household...

pornel7d ago

64GB VRAM is too little to run good coding models IMHO. May be useful if you need voice models or run some slightly-smarter-regex batch processing or RAG workflows. Perhaps you're supposed to buy 4 or 8 of these and split inference across them.

If you have $12K to spend, you may be better off with DGX Spark or a Mac with 128GB VRAM. That can (barely) fit DeepSeek V4 Flash.

valisvalis7d ago· 1 in thread

There are good use cases for them for sure, the Gemma 4 Good hackathon a while ago showed how local models can solve problems in health and education in areas with low connectivity or small infrastructure.

LolWolf6d ago

what were your favorite projects?

pjmlp7d ago· 1 in thread

Only if blessed with enough RAM and disk space,

> 64 GB RAM and 1TB storage

Ah ok, not something regular joe and jane happen to have lying around at home.

Additionally the whole configuration is still very much low level, bunch of CLI commands, and if the model doesn't fit for the task at hand, it starts allucinating, generating gibberish, whatever.

sparkling7d ago

Even if i had such a machine, im not sure i would be willing to sacrifice 80% of my RAM and 50% of my disk to run a semi-okay model locally.

ramaseshanms6d ago· 1 in thread

Local models Inference never really took with non tech people. Everyday people who dont know the difference between autocorrect and GPTs. But thanks to recent hardware launches from Nvidia and AMD's in response to the MacMini series, it is quite evident that local AI will replace the conventional laptop market completely one day. Current laptops will be what Nokia represents to a iPhone or Android user. Huge Leaps ahead.

red__dragon6d ago

Are you referring to Spark that has 128 GB of Unified Memory? It would be still expensive.

tpurves6d ago· 1 in thread

I do think local models are huge pending market opportunity for Apple. An M5 Ultra Mac Studio (if that exists) could be decent local AI machine, though so expensive as to stay niche. But by the M6/M7 generations and a recovery in DRAM affordability, the future could be interesting moment for them to deliver a compelling local AI platform that 'just works'. But I do think that a mini-pc that is easy to configure, can be always plugged-in, always on, higher power envelope than a laptop, but not obnoxiously loud and hot, is the right form-factor

BenRacicot6d ago

Agreed, this is what caused me to build. This thesis exactly.

andix6d ago· 1 in thread

> For my local setup, I’m currently [..] and LM Studio as the inference server, although it would likely be faster if I just used llama.cpp directly

Is there any truth to this claim? LM Studio uses llama.cpp to run the models. I guess the overhead of LM Studio should be minimal.

After all LM Studio is a really easy way to host models, are there really major drawbacks?

cuvinny6d ago

LM Studio has a lot less tuning options when you launch it. Also it is precompiled so you don't have the latest (and sometimes buggy) releases so you may have to wait a few days to try out a MTP or a new model. LM Studio is easier though.

aliljet7d ago· 1 in thread

The problem here is always the cost-benefit. For $200/mo, you're receiving subsidized best of breed access. There's no model competing for that price anywhere. If a 27B param model is what you choose, show me your hardware! I would love to be wrong...

rsolva7d ago

But for how long? The subsidized phase is probably short, and then what? I run Qwen 3.5 27 Dense om my old AMD RX7900XTX at about 45 t/s and barely use my Claude Code subscription anymore.

huydotnet7d ago· 1 in thread

I love that local LLMs are being discussed more often on HN recently. But for the post, I find it strange that the author claimed they were working with local models from day 1, but wrote a post that still links to Qwen2.5 and Qwen3 in mid June 2026.

zahlman6d ago

Why shouldn't the author mention models that people might not have to buy a new computer to use?

1 more reply

cube007d ago· 1 in thread

The challenge I have is getting a large enough context window so tool calls work reliably, the local models easily slip into hallucinated JSON tool responses and won't trigger the tools as a result.

glaslong7d ago

Same here. I'm curious what others loving Qwen are doing differently, because it constantly hits this issue for me. It's been great for autofilling blocks, but difficult for me to use agentically.

hamburgererror6d ago· 1 in thread

Do you all use local models only for coding? What about using them as decision assistant? For instance, I work in science and sometimes I have many scattered ideas that I'd like to feed into an LLM so I can refine them and extract a meaningful research question. Are local LLM suited for that task?

probably_wrong6d ago

After predictably failing at generating a sewing pattern, Gemini gave me yesterday this excuse:

> Because AI generates pixels based on visual patterns rather than mathematical geometry, it creates the illusion of a sewing pattern without any of the functional blueprints required to actually drape and construct a real garment.

If you want the illusion of a meaningful research question then sure, local models will give you that.

ios-contractor6d ago· 1 in thread

I subscribe to this guy on youtube for local model stuff if anyone is interested https://www.youtube.com/@AZisk. I'm not affiliated and I'm not even a paying subscriber. But I like all stuff local.

andwhatisthis6d ago

I clicked and immediately subscribed, but then checked out his latest videos and was so put off by the stereotypical clickbait stuff (stupid faces on thumbnails, "I tried (...) and then THIS happened" etc) that I unsubscribed. I understand that it must be what one needs to do to maximize views and brown nose the recommendation algorithm but I just find it incredibly off putting

1 more reply

shunia_huang5d ago· 1 in thread

fav-ed this post and will check again when another one poped on front page and says "local models is perfect now"

Natalia7245d ago

"Good now" feels like the right bar. I still wouldn't use local models for everything, but for quick edits, grep-like code questions, and private notes they are already useful.

blobbers7d ago· 1 in thread

Have you tried optimizing for MLX? It seems like a waste to have neural cores and not use them.

I've often wondered why the hype around apple neural core when 99% of software doesn't use them.

genxy6d ago

Yeah, first think I looked for on the post was MLX and it wasn't there.

https://github.com/ml-explore/mlx-lm

Having used half the systems that Vicki mentioned, mlx was the best balance between power and ease of use. Just a pip install away.

stared7d ago· 1 in thread

I really recommend Qwen3.6 27B.

Make some tests, and its 8 bit version runs at 30tok/s when using llama.cpp with MTP and run on Macbook Max M5. I have 128 GB, but but 64 GB is well enough. https://github.com/stared/benching-local-llms-on-apple-silic...

When using benchmarks, it gives more-or-less the level of SotA mid-late 2025.

iagooar7d ago

I run the exact same model, on the exact same hardware - amazing results. Pair it with good search skills (Tavily, Brave, Exa) and you have a near-SOTA model on your desk.

dejawu7d ago

If vibe-coding is hopping into a self-driving car and telling it to take you anywhere you can get a coffee, then I use coding agents more like a bicycle - they let me get further faster than if I'd walked, but I still have to decide where to go and how to get there, and I still have to pedal.

I don't vibe-code, but I do decide what to implement and what patterns to use (perhaps asking the model to analyze and give advice on this first), then I have it handle the nitty-gritty of the implementation itself. For this usage style, the latest local models are as good as having Claude at home.

I won't say it's been _easy_ (I ended up implementing my own harness to accommodate the idiosyncrasies of local models), but I will say that for the effort, having a coding agent that's essentially free to query as much as I want has been life-changing as a dev, especially when it comes to working on side projects. Knowing that my agent will never get worse in quality, suddenly cost more than it does now, or be suddenly made unavailable by external factors, was absolutely worth the trouble. And on top of all that, I can't believe it's as good as it is.

xienze7d ago

The big caveat here is that these local models require you to invest some time tweaking your harness, AGENTS.md, and skills in order to get things roughly to the level you'd expect. But something like Qwen3.6-27B with web search capabilities and a good set of skills really is impressive! Especially considering that you can go wild and not worry about token costs.

The other thing that people tend to gloss over is that you really do need to spend some $$$ on decent hardware. Yeah, you CAN run some 4-bit quant with heavily quantized cache on your 16GB card, but it's not going to be a great experience (I think this is where a lot of the "if you think it's gonna be any good, you're going to be disappointed" stuff comes from). Yes it's a lot of $$$ upfront but it's very much unknown when hardware prices are going to come back to reality. There's a lot of hopes and dreams that any minute now an H100 will be worth pennies because "that's how it's always been" w.r.t. computer hardware, but we are living in interesting times. So you can't just make the tired old assumptions that a Claude subscription over three years time will work out to be dramatically less than the value of some card three years from now. We STILL have basically anything with >=24GB VRAM appreciating in value, which is absolutely wild. What I'm saying is, the depreciation curve may very well be a lot less dramatic and fast than it used to be, going forward.

chrismarlow97d ago

You can use a frontier model to create a plan that's specific enough for a local model of a very small size to execute on. The more specific you are and compartmentalize tasks the "dumber" the local model can be.

Edit: Obviously you'll be using more tokens but this is the trade off for running a smaller model and running locally. Similar to time memory trade off but in token economics. Sorry I need more coffee

Tharre7d ago

I've been running Qwen3.6-35B-A3B (and 3.5 previously) locally and it's a great model for many small tasks, probably a significant chunk of what most normal people are using LLMs for right now.

But for coding in a harness? In my experience it's unusable even for small projects. It just gets hard stuck at every little problem, wasting hundreds of thousands of tokens trying to make a convoluted solution work instead of doing the obvious thing. Or it will spend hours trying to reason through a fairly simple code flow, incrementally adding debug print statements, only to get confused by the output and then editing completely unrelated code that it convinced itself is the problem.

I've tried instead giving Sonnet the problem description and code and have it come up with a detailed plan that Qwen should implement, but doing that actually consumes a significant amount of tokens compared to just telling it to implement everything, and the results are honestly not that much better. There are just too often subtle issues with the plan that Qwen doesn't recognize when implementing, but make the resulting solution it comes up with unusable.

delis-thumbs-7e7d ago

Nobody asked, but I don’t think any of us should be using SoA models to code or to do pretty much anything at all. Instead we should develop open models to work on specific tasks and learn to code, write, draw etc. using fingers made of bones and brains made of flesh. Big corporations and research facilities can run them to generate code or math or whatever, with a bunch of specialists to check the output to be correct. Then again, even that might not be worth the costs (e.g. OpenAI’s 36B$ net loss last year), when the open models are so close and the whole AI scheme is running out of scams to pull.

There’s a lot of things we could use even quite small models for, which would not need an insane amount of computing power and memory, but too few of us is really researching them.

bayshark7d ago

Hey everyone, made a local LLM, configured for Home Assistant called Selora AI.

Specs: qwen3_17b_base.Q6_K.gguf selora-v047-answer.f16.gguf selora-v047-automation.f16.gguf selora-v047-clarification.f16.gguf selora-v047-command.f16.gguf

The full base model and LoRA adapters are only 3.5GB

Capabilities include configuring for smart home setup to help with answers, clarifications, commands, and creating automations in Home Assistant. The models with the LoRA adapters were made with lean scripted data made specifically for Home Assistant. A lot of work was put into this, feel free to give it a try and happy for any feedback!

https://huggingface.co/selorahomes/Selora-AI

anubhav2007d ago

I have been using qwen and glm based models from last 2 years, ended up buying mutiple machines for the same. Overall i feel 24vram is a must have to get get performance (speed wise) to match hosted soln. I have 2 machines a 12gb vram one and a 24gb one. On 12gb vram i get around 50tps generation and 500tps prompt processing and on 24gb one i get 180tps generation and 3500tps prompt processing. I have different configs for different scenarios and I also use llama cpp manager manage all my configs (https://github.com/anubhavgupta/llama-cpp-manager)

ptx7d ago

> Security: I run every Pi session in a Docker container and give it permissions only to bash so that it can’t run Python code or do web browsing

How does that work? The script in the post references the file "docker-compose.sandbox.yml", but I don't anything about what that file does.

The post that this one links to, that it's based on, says that Pi doesn't do proper sandboxing.

Presumably bash can still execute other binaries, otherwise it would be fairly useless. What stops it from executing Python? Or opening a network connection and downloading Python?

androiddrew7d ago

$2600 will buy you two AMD 9700 gpus with 32Gb ram per card running about 285 Watts per card. Less than a 5090 in both cost and power. A VLLM build patched for AITER and you can run Qwen3.6 27B FP8 at roughly 45-50TPS during real coding sessions with Opencode or PI with a full context window. I really hope more 30B dense models continue to be released, but Qwen3.6 should get you a lot of agentic mileage.

ROCm stack is not for people though who aren’t willing to dig in and patch things themselves.

K0IN7d ago

In a day to day base i host Qwen3.6:27b, but i *Really* want to host deepseekv4 flash, its such a "good" model for its size/speed/price.

I really wonder when companies will start hosting theire model for everday tasks on prem, cause its good enough (and realative cheap), instead of paying subscriptions for all devs.

asim6d ago

I don't run local models, my devices are 5-6 years old and not powerful enough. It's a bit counter intuitive and different to what a lot of engineers are doing but I don't have a mac mini, I don't have a powerful laptop, a lot of my dev work has always been cloud based, on github, on a VM, I'm mostly using SSH from my laptop and now Claude Code on my phone (exe.dev is hands down the best experience I had on this front when the agent is literally on the VM).

In an ideal world, yea you can run local models, but I need a powerful always on device for that, or the latest gear, and it will never be as fast as what I can use from google, anthropic, or through an API call. I really wish it was different but I have to shell out a ton of money for that, and I guess it's usecase specific right. Maybe if my phone was super powerful and could run models that would be great, but then I have this issue with cloud sync and using things anywhere else. There will be a world in which local models and self deployed models make sense, this is going to be a core experience, but I personally can't run them.

ltononro7d ago

Good depends a lot. If you are in the token maxxing hype you will probably find these models very bad comparing to SOTA, unfortunately.

The good news might be: opensource models are now good (enough) for day2day usage. But is it really? I feel that companies will always naturally strive for the best and use the SOTA (as long it is not too expensive).

I see OSS models being a good backbone for companies in the future that have validated workflows and could use those for privacy or to spare costs.

IDK, might have gone a little bit off-topic here.

polotics7d ago

So I've made this [me+vibe+tests]-coded Android alarm app called Promptly, and as Gemini-CLI on the Google Pro subscription is getting google-killed on June 18th, I set up two branches, one for Antigravity+Gemini3.5 and one for Pi-coding-agent with Qwen3-Coder-Next...

Running the same prompt on both with the same .md memory state...

Gemini3.5 is more "intelligent" but Antigravity gets it to decide to go on tangents that are quite time and token-consuming I think. Nice casino machine.

Pi+Qwen3 (~80GB, llama.cpp) is like vibecoding about 1.5 years ago, when you had to babysit, structure your program to have self-contained chunks, and keep an eye on all the cross-cutting concerns to not trip it up. When it works it works fine and when it fails it's my job to ensure it fails fast.

The code is about 10'000 lines of Kotlin in total so it already takes some effort to keep it simple for the AI. It's not a slopped quantity of code, i got solid feature creep :^)

https://play.google.com/store/apps/details?id=com.sixteenam.... ...hat tip to the recent copycat squatter btw it's an honor!

jmyeet7d ago

It's not "good". A more accurate description would be "sometimes useful and not far from being good". The author is using pretty small models. There have been a lot of improvements that scale in any case (eg MTP) but ultimately this is still hardware limited by 3 factors:

1. Memory bandwidth

2. VRAM size, which limits the size of a model you can use effectively. Yes you can swap but then you're taking a performance hit;

3. Raw FLOPS, including quantization.

Apple here is interesting because they have a shared memory model and you can buy Macs currently with up to 128GB of RAM (previously 256/612GB on Mac Studios, both discontinued). New M5 Mac Studios are expected in Q3 but that's not guaranteed. It may take until next year

Depending on the chip, Macs top out at ~900GB/s. A 5090 or 6000 Pro has 1800GB/s. A B100 is at like 3.2TB/s. A 5090 has, depending on how you count, 5-7x the FLOPS of a M5 Pro so a 5090 is still better than any current Max... except for the 32GB limit.

NVidia aggressively segment the market by limiting VRAM. The RTX 6000 Pro is basically a 5090 with slightly more CUDA cores and 96GB of VRAM instead of 32GB for $10-11k instead of $3k.

So let's project this into the future a little. The M6 Ultra/Max may well be 1TB+/s memory bandwidth with much higher FLOPS and thus actually be competitive for larger models. A 6090 in the current market will probably still have 32GB of VRAM if I had to guess. Maybe it goes up to 48GB.

But anyway I think we're only 2-3 years away from sub-$5000 hardware that does 100-300+tok/s on models larger than 31B. And that's going to be a game changer.

sermakarevich6d ago

I am running an experiment with local qwen3.6:36B for a week: https://news.ycombinator.com/item?id=48520757

It really is better than I would expect it to be. But it requires a special treatment. Since the model is smaller it needs a smaller and simpler tasks. I use smarter model to decompose the task into primitive subtasks, write good description, submit to worker with qwen3.6, review completion and create new task to fix if required (20% of cases). This workflow works fine.

linuxhansl6d ago

I soooo wish that to be true. Alas, in my experience it is not... Yet.

What is true is that it gets easier and faster to run local models. With QAT (quantization aware training), turboquant (or similar) K/V compression; what used to be impossible to run is now fairly easy.

I can run gemma4:26b-a4b-qat on my laptop with 20-30 tokens/s with a 256k context window. That was unthinkable just 6 months ago.

So the local models are "OK" for small'ish projects.

But it does not at all(!) compare to the frontier models. For a large project Claude's Opus 4.6+ just work, whereas local gemma tangles itself up, makes weird mistakes, and just can't handle it (for those cases it is faster if I do it myself).

If the trends continues, with 1.58bit QAT models, even better K/V compression, faster multi-token prediction et al, maybe soon it will be comparable.

wxw7d ago

> “if we are constrained by performance and price, what architectural tradeoffs do we need to make?” a question that so far has not really been asked in the mad token gold rush.

To be fair, I think the labs are also interested in this (e.g OpenAI parameter golf). But the incentives are tricky. When the subsidies and tokenmaxxing era ends, local models will be essential.

aquarious_7d ago

I support local models and enjoy playing around with them, but even for personally development it is just more viable for me to pay $200 a month to Anthropic for the latest models. It seems to me with the cost of hardware needed to run local models that, for now, it is pure hobbyist and exploratory (which is fun in its own right)

andix7d ago

Because I've seen too many people spending a lot of money on expensive hardware, without really using it in the end:

Most of those models are also available via Openrouter and many other platforms. Dirt cheap, and much faster than on consumer GPUs. Perfect to try and compare the different options.

jlengrand7d ago

Just wanna say it's always fun and nostalgic to see authors pass by here who I was reading back when I started my career. I was reading Vicki's blogs way back, even remember learning some email parsing in python from her over 10 years ago. TY!

abalashov7d ago

And if you want to dial in a setting in between: I've switched to Kimi K2.6 (now K2.7) and DeepSeek through OpenRouter and Reasonix for pretty much everything, with no discernible loss of analytical quality or utility.

However, like many commenters, I don't really believe in vibe-coding, long-horizon agentic one-shot agentic coding, etc. and do not use LLMs for huge generation tasks that involve designing things end-to-end.

I also have an MBP with 128 GB of unified memory and do quite a bit of Qwen3.6-35B-A3B. No, it's not as smart as the aforementioned models, to say nothing of frontier, but many people seem pleasantly shocked by the number of banal tasks that do not require these.

jnaina6d ago

Running Qwen3-30B-A3B-Instruct-2507-AWQ-4bit on an Olares One with NVIDIA GeForce RTX 5090 Mobile GPU (24GB GDDR7 VRAM) and an Intel Core Ultra 9 275HX processor.

Plenty fast for coding work and for sharing with my OpenClaw setup.

Currently in the process of adding another external GPU (RTX 4090 with pipeline parallelism) via thunderbolt 5 to the Olares One box, for higher quantization, possibly 8-bit, larger context, better concurrency, more kv cache.

ronef6d ago

What's the best practice right now for setting these up? We've been primarily using Nix/Flox to set up the models pretty quickly and at least with minimized amount of commands(biased Nix/Floxer) here and found it useful

cautiouscat7d ago

> I have no concrete scientific evidence of this - my own personal vibe metric of “is a model good enough” is, “do I have to double-check it against an API model”, and GPT-OSS was the first one where I started doing that a lot less often.

The good old butt dyno!

I’ve been eyeing local models more and more with Anthropic squeezing more and more on the subscriptions. A few comments on HN had me waiting until they improved more but this article makes me wonder if I should reconsider that.

I’ve been doing some pretty niche development using a game and a script extender for said game. If these models can handle that, I’d feel good about switching.

robertkarl7d ago

You can trade off latency / accuracy / cost for any ML task. And with the local models.... the cost is free.

Having a local Qwen check another Qwen's work increases the accuracy quite a bit at the cost of more latency. You can't have your cake and eat it too.

In benchmarking local models, I'm having success increasing even a 9B qwen's score on terminal-bench adjacent problems, just by asking it to plan and handing the plan back to qwen with a fresh context. Try it with Qwen3.5, unsloth Q4+, and a thinking budget of around 1024 tokens.

Mr_Eri_Atlov7d ago

I think this is a pivotal moment for LLMs.

Gemma 4 and Qwen3.6 27B aren't perfect, yet they are such a step forward from the previous generation that it's both feasible to get stuff done locally with patience and very likely that future releases will subvert cloud capabilities entirely.

Plus, they have definite reliability advantages over cloud models that can be wiped out by a government order or lobotomized to handle traffic surges.

acb126d ago

What are the minimal hardware requirements to run a reasonable good local model for real SWE work these days? And have a reasonable inference speed? Seems like the requirement are pretty high and the inference speed is not ideal.

b3ing7d ago

They are ok for simple stuff, coding is weak, chat is alright, writing is ok. But I had many of them write stories for ideas and they kept using the same names regardless of what the story was about. I can’t complain, it’s free. Can’t wait till they get even better, but for local image generation they are good, slow but just create a bunch in the background while you do other things otherwise it’s like 14.4k modems

jauntywundrkind7d ago

i'd love to get to a point where big models can launch subagents that are fast and local. there's a lot of focus on token rate, but just as much, the way cloud providers have other latencies & processing styles not optimized for latency (running large batches all at once), and i think local might have some real wins. Gemma 4 seems already on the right track. lfm2.5-8b-a1b (https://www.liquid.ai/blog/lfm2-5-8b-a1b) and DiffusionGemma seem to both be very high token rate. but getting that latency down, so that a series of tool calls can happen faster, would be a real win. I think especially with good prompting that becomes much more possible.

One caveat, I have absolutely no patience for a lot of subagent systems, like opencode, where the subagent is walled off and incommunicatable. My subagents really should be their own session, that i can deal with as I please, with some MessageChannel like offerings/tools available to them. Ideally with modes where messages auto-flow in and out, and modes where I can be a gate-monitor. https://developer.mozilla.org/en-US/docs/Web/API/MessageChan...

Not really super related but MCP has been working on Events for a while. That ability to respond fast would be great. https://github.com/modelcontextprotocol/experimental-ext-tri...

Asking local to be fast feels like an obvious folly, but given how much better small models have got, and seeing these models tune themselves for speed: I want to hope!

noveltyaccount7d ago

From the recent Nvidia & Microsoft announcement about new chips for consumers:

> “Our goal is to deliver unmetered intelligence to every home and every desk with Windows,” said Satya Nadella, chairman and CEO of Microsoft. “RTX Spark marks a real breakthrough towards that vision.”

Makes me optimistic that those two companies are going to keep investing in quality local models.

hank8087d ago

Local models are good? Or are we saying that open source/open weights models are good? What I'm asking is, are they good because they are "local" or are they good because you can install and run them yourself, wherever you want? Same node, different node, different cluster, way out in the ether/cloud...

daniban7d ago

With Apple silicon and now the RTX Spark there are real discussions whether local AI is the future. The only problem is Western open source models are so far behind. I genuinely feel there's a push to fix this. Gemma is getting more frequent releases and Nvdia is quietly creating very cool small models. I hope both the hardware and models catch up and local really does emerge.

restlake6d ago

my most mind blowing recent development here was testing the Gemma 4 models at release for vision and image recognition vs some benchmarks I had from using Gemma 3 for the same tasks. Gemma 4 is significantly faster and massively more accurate, to a level where I fundamentally can finally turn off my wifi and run a batch of my photos through the local model and trust the results for the extensive classification that Gemma seems well suited to handle. incredible times for local LLMs

jotato7d ago

I currently have a desktop with a 4060 ti (16gb of vram). Most models I have tested that fit within that are not good enough for anything other then type completion (in regards to coding tasks)

I have been considering getting the 58gb Mac Mini but that is a decent amount of money to spend without confirmation on a) how fast is it and b) will it work for well-defined tasks.

jszymborski7d ago

I run local models and they work fine for me, but specifically for use in coding harnesses, I'm having a hard time. Tools tend to end up in the same loop, trying to `ls` the same folder or `grep` the same file, over and over and eating up the whole context. Super hard to get it to do anything but that. Any tips?

throwarayes7d ago

I am happy to pay OpenAI for a cheaper model a few generations behind. But they deprecate models aggressively. They push you to bigger and smarter models, when 95% of my work doesn’t need it.

I’d love it if model providers just let old models run and let us pay less, but the deprecation makes me want to look into local models.

fridder7d ago

Is there a local harness designed around the local model use case that is claude code like? Opencode has been problematic at times, pi works for one off for me but not back and forth conversations with the LLM. Considering I only use Qwen or Gemma models I'm close to just writing my own at this point

anax327d ago

I've just made a milestone on my project, moving away from AWS (budget) to self-hosted and the local models are so much faster than in the past. Beyond LLMs, having embeddings, image, video, audio gen available is crazy.

Running locally is the bar; it's hard to make these things a service which scales.

MrKoby077d ago

I think a lot of people just don't have specs like that, making it still painful.

k__7d ago

I tried some smaller Gemma4 and Qwen3.6 quants on my MBA with M5/16GB and had like 20-60 tokens per second. At 60 it felt pretty okay and that hardware is on the lower end.

I'd assume a Mac with 32-64GB memory would get some reasonable results.

WASDx7d ago

Looking at some benchmarks, the latest ~30B Gemma/Qwen score similar as Claude or GPT versions that were released just one year earlier. That's crazy progress. I can't imagine how it will be in a few years.

ta-run7d ago

Not related, but, I can't seem to get my copilot-cli (office is an MS shop) use qwen3.5:27b on ollama for some odd reason.

After the recent changes to usage, I've spent an annoyingly long number of hours trying to get this to work.

wasimxyz7d ago

https://canirun.ai

skittleson7d ago

i've been running qwen 3.6 35B A3B with llama.cpp on a 3090ti. i have found it better then sonnet in many ways. Speed and iterations was key. here is the gist of my current configuration: https://gist.github.com/spencerkittleson/5e44b6895a17ca45161... I use this with tailscale so all my devices have full access to it. That machine get toasty....

xbmcuser6d ago

Running local models might be good but until the virtual hardware monopolies of tsmc and others is broken they will out of reach for most people.

mohamedkoubaa7d ago

I wonder when a cheaper consumer grade inference chip will hit the market. The general purpose GPUs have much more silicon and complex firmware than what's strictly needed for inference

prlin7d ago

If you wanted to do some research or learn about post training and agent harnesses, is that a good option with these local models? What hardware is recommended, or easiest to go with a Mac Studio with 64GB+ RAM?

wrxd7d ago

I wonder how much local models hallucinate. I am getting almost daily an "Honest answers: I made that up." reply from Claude Opus when I challenge some silly thing it's trying to do.

pinstripes6d ago

I very much enjoy scrolling r/homelabs ever so often, so many cool local rigs there

zx80806d ago

> None of these are groundbreaking tasks (again, a lot of personalized Google/docs lookups)

Does it really needs a GPU at 300Watts to do all that tasks?

malkosta7d ago

The problem with QWEN is that it just can't edit files reliably, I had to hack Pi all over to reduce the pain, but still far from perfect...does Gemma 4 strugle on this?

aidenn06d ago

Can anybody recommend sub $10k hardware that can run the models mentioned in TFA at something faster than a snails-pace?

1 more reply

lthi7477d ago

Maybe it is good but it is very difficult, or at least with regular computer. For users like me with 16GB laptop it is almost impossible task.

ibizaman7d ago

Tangential but reading on mobile, the font size in the code snippets are all over the place. I actually have the same issue on my blog. Anyone knows why?

lanycrost6d ago

I'm crazy for gemma and Qwen, really hope we will be able to run LLMS everywhere like a Doom

ricardobayes6d ago

They are good, and yesterday's release GLM 5.2 even benchmarks really close to Opus.

walmas7d ago

Maybe the future isn't Data Centers, climate crisis, drought, and endless subscription and token fees.

nikagrawal1217d ago

I tried for my legal AI application that I'm building and it was able to do majority of the tasks. I used gemma4:26B

bthornbury7d ago

the qwopus 27b model is good for grunt work style tasks, even across multiple files. Piping a bunch of things through, small factoring changes, stuff that just takes time to type out.

I wouldn't rely on it for large stuff like codex though. I haven't tried out deepseek/kimi, if we could run those locally it would be great.

Muaz_Ashraf6d ago

for the past few days I am building things via local models and review and fix the bugs from opus. Its working okay but still local models are Just HYPE and Irritating.

ridruejo7d ago

Local models are one of the main drivers for our installer / Desktop app for OpenClaw https://holaclaw.ai (disclaimer I am one of the founders). The smaller models are really only suitable for the most basic tasks, but if you have 32gb-64gb you can get real work done (ie complex web workflows) without third party hosted models

AgentMasterRace6d ago

If you have an extra PC and enjoy 5 tokens a second... Sure

fl4regun7d ago

In my experience, with a system of 32GB RAM and 24GB VRAM, no, they aren't that good.

osigurdson7d ago

Running AI on timesharing mainframes does seem like an odd final state for the world.

fg1377d ago

> I have a 2022 M2 Mac with 64 GB RAM

I closed the article after that.

The author has no idea what a privilege it is to have a machine like that for personal use, and how 99% of the population are not going to afford a setup like that.

Just some back-of-the-envelope maths will tell you that a $20/month Claude subscription makes much more sense financially.

1 more reply

holoduke7d ago

Good? My Macbook m3 with 36gb locked up after it filled all memory with Gemma4. A bit useful yes. But it eats all resources. For local models to be useful we need at least 128gb of system memory and 512gb of video memory. Plus 8 times the compute of a single 5090/h200

0xbadcafebee7d ago

Local models have been good for a while. But this being the HN echo chamber, people here think that local models can only be used for coding, and are expecting Opus 4.8 on their iPhone. Turns out AI can be used for things other than just coding. Even tiny models (<4B parameters) can do tons of useful things on local devices. Search, index, summarization, troubleshooting, crafting documents/formatting, image analysis, transcription, object identification, robot navigation, text-to-speech, speech-to-text, browser/window control, MCP/tool calls, and much more.

Larger models just do more complex reasoning. But if you want them to be really good, you need a beefy Mac. They have the best combination of memory bandwidth and RAM to allow medium-sized models to run at speed. GPUs have less memory but more bandwidth, and AMD iGPUs have more memory but less bandwidth. The Mac is the best compromise on the market today.

Once you do have a beefy Mac, you want to run a dense model. This gives you the best possible result with the system you have. You can go MoE for faster results, use cutting-edge inference techniques, parameter tweaks, etc. But a basic dense model (at Q6 quant) on a big-ass mac will serve 90% of your coding needs.

frollogaston7d ago

"Good" refers to the speed and not the quality. There's so much hype about Macs being great for LLMs, but nobody seems to be seriously using them for that because the open models are unfortunately so far behind.

drchaim7d ago

really want to try local models, but I don't have the hardware yet. Probably I'm the only one here still using a Mac Mini m1 8gb 2020. :/

1 more reply

ZionBoggan7d ago

This is actually a really insightful post !

Patchistry6d ago

do you run you local models along side some of your "paid" models?

henryoman6d ago

Will there be a gemma4n

sn0n6d ago

Qwen 3? Qwen 2.5 coder?? Is this an llm article written on an outdated model?? LoL

atulmy7d ago

Exact reason I'm building csuite.so, do check it out and let me know if you need early access!

dbg314156d ago

They misspelled “Better than before. But… yeah.”

nullc6d ago

I'm a little mystified at people taking about qwen 3.6 27b/ gemma 31b being slow in one breath and then saying they're using a 16GB gpu in the next.

You do need to use sutable hardware.

I get 50tok/s from Qwen 3.6 27b with Q8 & MTP (I can get more aggregate tok/s in parallel rather than using MOE, but don't have enough memory for too many full sized contexts) and 100 tok/s with 35B-A3b Q8 (no MTP as it's not that useful with MOE) on a single workstation gpu that I spent 3k on a couple years ago.

These speeds are somewhat faster than what I've seen from commercial SOTA models, they're plenty fast for many applications.

teknologist6d ago

I found a tool that makes it easy to run Salvatore Sanfilippo's (Redis creator) ds4.c on a Mac: https://github.com/notatestuser/ds4-control

His program uses quantization, but is very optimised and has builds that can fit into 96GB of memory with great results.

DS4 Flash is usually my go-to for a lot of things these days, and I don't have to worry about a cloud model stopping or telling me it's concerned about my usage.

matrix127d ago

gemma:12b at 75% of frontier? Yeah....

1 more reply

dakolli6d ago

It doesn't make sense, if your small local model is 75% as effective as a frontier model and frontier models are still what.. 50% effective maybe slightly more, with tons of downsides.. Why would I spend 5k on hardware to run these mediocre models. I don't really see the point in the frontier model either.

dakolli6d ago

Imagine spending $5k to run a 32B param llm locally.. You could run much more capable open source models through Openrouter for years running 24/7 at 50tps. This will never make sense to me.

Computer07d ago

I have 16GB VRAM and 96GB Ram on all my computers and I do enjoy local models. I would not use them for coding, though I have experimented with it, it is largely a waste of time on my hardware. I love local chat with different models however, when using the model in this way it is much easier to experiment with the largest models near the limit of your hardware, and I do find it useful on the airplane somewhat. I have also used local models for data classification tasks and let it run over the weekend etc and the results were acceptable.

jingw2227d ago

open source must win

pauljeba6d ago

How do I beleive thi? you wrote this blog by hand.lol

monegator7d ago

I've been trying local models for the boring stuff you might be thinking about: writing small docs.

So i've tested a couple, and the speed is finally impressive. My colleague uses paid tiers of claude and GPT, and the speed is comparable. Maybe even slightly faster on my end.

The problem is: i'm running the model on my work laptop, a 12th gen i5 with 16GB of RAM (which, you know, i asked to upgrade to 64, but that was right at the time of the great RAM shortage of the '20s) so i'm pretty limited in what i can use. And this is running alongside the usual suspects: Web browser hugging 1.5GB, MPLABX hugging 3, windows taking at least 5 just to sit idle, thermal throttled to 1GHz ... And yet its speed is comparable to a paid service. A lunch's worth of tokens vs a few cents of power.

So, what i found, what i fount... What i found is that i need AT LEAST 16k of context window, otherwise they will halt when i pass a small C file for analysis. And coding models will shit the bed with 4k. But we all know that, context size is King.

I found out that Qwen will keep looping while thinking, but that's not a surprise to you, either. But give it enough time and you will get an useful answer. I was hoping to using it as a better warning system for some languages, but i fear i need muuuch more context size, because i tried to feed a file that had a function with an endless loop:

At 4k context it almost shit the bed if i gave it just the offending function, then told it where to look at. At 16k context, with the whole file, it needed some guidance to what the problem was, and after 10-15 minutes of thinking it found the issue. Problem is, it kept second guessing itself for another 20 minutes on the same unrelated thing before giving the output. For which the fix was wrong, but the semanthic was correct. Good enough. Maybe it will be faster if i don't ask for a fix (which i didn't i just asked to look for a specific issue)

Wish i had 3 times the RAM so i can see what happens with more context.

Then i gave it the task to analyze a C file to make an API document. It took half an hour, but then i had a good starting point, which i had to keep changing because it would confuse commands with IDs and things like that.

This was the Qwen 3.5 9B model.

I then tested Gemma 4, being impressed at the tokens per second it gives on my Pixel 8A. Same tasks: same issues with short context, with long context it gave absolutely useless answers when looking at code, but it took 1/3 the time of qwen.

In producing documentation, instead, it was much faster, and it never hallucinated data. Good. in 15 minutes i had everything done.

Not bad for stuff running on a business laptop, while doing actual work.

Tomorrow i will try Qwen 3.6, let's see how it goes..

pauljeba6d ago

How do I beleive you? You wrote this post by hand. lol

zrg6d ago

tldr it is not

aleksandrm7d ago

Clickbait title, because running local models is still not good now.

j / k navigate · click thread line to collapse

595 comments

231 comments · 119 top-level

c0rruptbytes7d ago· 32 in thread

I don't know about good, I use a lot of local models and they're still pretty painful to run locally

You have dense models (qwen 27b, gemma 31b) who are pretty smart, but pretty slow

You have MoE models (gemma 26b, qwen 35b, north mini code 30b) who are pretty fast, but make a lot of mistakes

So you need a lot of compute to make the pre-fill fast, you need bandwidth to make the decode fast, you need a lot of memory to hold everything - lot of ifs

On top of that, your laptop becomes a loud hot churning machine, it's uncomfortable to work with.

So are they good? not really. Do they work? yes

saghm7d ago

8 more replies

aftbit7d ago

If you throw the right GPUs at the problem, they become much better - but still not quite in the realm of Sonnet or DeepSeek 4 Flash, let alone Opus / DeepSeek Pro or Mythos/Fable/GPT-5.5.

Given enough budget, power, and cooling, you can run some pretty good data pipelines, but for code, I think it still makes sense to shell out to an API provider most of the time.

10 more replies

zozbot2347d ago

4 more replies

adam_arthur7d ago

Gemma 4 is particularly good at pipeline/automation tasks.

It outperforms all the Qwen models (even 100B+) for rule following/automation style tasks in my experience. Its image interpretation is also very good, and out-benchmarks Opus.

Qwen seems to ignore instructions and consistently outputs incorrect formats (when token generation format is not explicitly constrained)

I'd recommend for anyone getting into local models to focus on memory bandwidth over RAM. Models under 100B parameters are now sufficient and hugely useful for automation.

I agree that for coding/creation use cases, there's still not a compelling argument for local models.

But e.g. if you want to scan a list of stocks and interpret news/high pass filtering, interpreting logs, interpreting screenshots, the local models are more than sufficient already.

5 more replies

freehorse7d ago

> You have MoE models (gemma 26b, qwen 35b, north mini code 30b) who are pretty fast, but make a lot of mistakes

I have been using the qwen 27b and it is great, but running a dense model like this in a macbook is a bit suboptimal, and i wish I could run sth faster than 15 tok/s.

1 more reply

robomartin7d ago

> On top of that, your laptop becomes a loud hot churning machine, it's uncomfortable to work with.

Laptop?

locknitpicker6d ago

> I don't know about good, I use a lot of local models and they're still pretty painful to run locally

You are somehow assuming cloud-based models are not painful.

I prompted the model to proceed and apply the instruction file prompts. It went ahead and applied changes. Success. It cost $0.16.

I reviewed the code again. Only half of the sloppy code was touched up. I prompted it to fix the whole mess, not just a couple of files. It complied. One coin less in my purse.

So, around a third of the cost of a feature is spent on the model cleaning the mess it left in it's wake.

And this was a tiny feature with a plan, a solid set of instruction files.

Very expensive.

Are costs going down? I doubt so. OpenAI seems to still be spending 3 times it's revenue already.

In comparison, local models sound very good.

chrsw6d ago

hnlmorg7d ago

To be honest even the cloud models are a hot mess at times. This week I’ve spent more time rejected code from OpenAI models than I have approving it.

segmondy6d ago

I run 27B at Q8 with fp16 KV cache at 50tk/sec on 2 3090s. Not 4090, Not 5090. 6 years old GPUs.

Stagnant7d ago

2 more replies

xlii6d ago

> I use a lot of local models and they're still pretty painful to run locally.

The worst thing with local models is that I can't just give you a recipe, because what's the best params depends on your use case.

atomicnumber37d ago

I largely don't disagree with you but come to a different conclusion. I have two systems:

1) a "programming desktop" with a $500 upper mid range Ryzen (idr exact), 8GB VRAM Radeon card I bought solely for RuneScape, and 64GB ram

2) a maxed out Alienware 16 Area51, so it's a 5090 with 24GB vram and 64GB system ram. I bought it for gaming, of course.

I run qwen 3.6 35B A3B Q6 with 200k context window. I compare this to Claude pro max or whatever that I use at work.

The main difference between the machines is that the one with the RuneScape gpu does 10 TPS while the Alienware does 30-40tps. Both are fine though the 30-40tps is obviously a lot snappier.

I find with both models that:

- they do really well at "be a 30GB zip file of reddit and stackoverflow answers"

- they do really well at point fixing random bullshit errors that would otherwise waste my time (this is related to above of course)

- they do quite well at, given a pretty good specification of what you want, figuring it out, even if you've specified several steps needed

- they both cannot really be given a large ish task and left to just drive it on their own

andy_ppp7d ago

1 more reply

iwontberude7d ago

heipei7d ago

2 more replies

devilsdata7d ago

1 more reply

naikrovek5d ago

> You have dense models (qwen 27b, gemma 31b) who are pretty smart, but pretty slow

of course faster would be better, but it's not always a requirement. smart and slow is far better than dumb and fast or even nothing at all.

EnPissant6d ago

When running on a GPU, dense models are shaping up to be the best way due to two things:

- Maximum intelligence per VRAM (you dont have much)

When running on large unified memory like Strix Halo or Spark Dgx, MoE models are usually best:

- You can get similar intelligence as a smaller dense model with fewer active params (to compensate for the slower memory) by throwing ram at the problem.

2 more replies

not_kurt_godel7d ago

FuriouslyAdrift7d ago

Kimi 2.6 or 2.8 is what we are playing with locally. They need 512GB to 1TB to run with full capabilities so that's not exactly "desktop"

Our GPU computer server cost $110k.

1 more reply

smcleod7d ago

Those dense models are pretty fast with MTP now. 40-70TK/s depending on your machine, that's faster than cloud models (although not as smart obviously).

NamlchakKhandro6d ago

Pi mono is king. Everything else is hypetrash.

If I can't customise it then I won't waste my time using it it getting use to it.

Claude code is trash, it's customisability is extremely shallow, open code, codex, copilot, Kiro, etc etc... all trash. Yes even open code..

If open code was so awesome then open claw would have been based on it... But it wasn't. That's should tell you everything you need to know.

greenavocado7d ago

4 bit unsloth quants are good if you never ask for more than 20k context, use it as autocomplete on steroids, and never delegate serious questions to it

beadw7d ago

1 more reply

ridiculous_leke7d ago

dominotw7d ago

i use it usecases like that latter and they are fine.

markdog126d ago

100% agree. I've spent many hours testing out local models/harnesses. So far, they're very much not worth the tradeoff. Obviously, I hope that changes.

everdrive7d ago

What counts as a lot of memory? What could someone do with 16 GB of RAM?

6 more replies

onel6d ago

Agree with this. Open models are the future but currently they are a pain to run locally.

As painful as it is to admit, the future might be cloud inference from a trusted provider.

adam_patarino6d ago

I always find it amusing when people would rather spend $200 / mo than let their laptop fan turn on.

citizenpaul7d ago

1 more reply

hypfer7d ago· 21 in thread

After having been a happy user of Qwen3.6-27B for a few weeks, due to being away from the hardware, I'm currently forced to use Claude Sonnet 4.6

Of course, being significantly larger, it will encode more knowledge, but that doesn't help me when I hate talking to it. And all that on top of the fact that talking with it costs real money.

I wonder what it might be that makes me hate it so much. Maybe because it doesn't see itself as a tool but almost an equal? As if its opinions would have weight.

Qwen too can act like an overeager intern, but if you tell it that it is an idiot, it will drop that ego. Not so much with Claude. In my experience, anyway.

Anyway, point is: full ack on that headline.

ggerganov7d ago

[0] - https://github.com/search?q=%22Assisted-by%22+user%3Aggml-or...

[1] - https://github.com/ggml-org/llama.cpp/blob/master/.pi/gg/SYS...

7 more replies

StevenWaterman7d ago

Running on 2x 3090, 500-1000tok/s prefill and 60tok/s output at Q6_K_XL with MTP on llama.cpp, 220k tokens context window (starts to get a bit dumb above 160k ish), no KV quantization

6 more replies

epistasis7d ago

> talking just way too much

OMG this is such an annoying property, just shut the hell up please, and be concise.

And requests to please only offer a single step at once, or single option at once, or to even stop eagerly offering future directions is really hard to prompt correctly.

And look, there I did exactly what I was complaining about...

3 more replies

derethanhausen7d ago

I would not generalize based on experiences with Sonnet. The flagship models (Opus being the claude equivalent) are dramatically better.

1 more reply

kitd7d ago

Funny that coding agents have personalities, including "that colleague" you want to avoid even if you know they're probably quite good at what they do!

1 more reply

radium3d7d ago

The current way to 'own' decent RAM and GPUs right now is through the stock market it seems.

giancarlostoro7d ago

2 more replies

MostlyStable7d ago

1 more reply

bjackman6d ago

Re being away from the HW: with Tailscale and llama-server it's now super easy to just run an inference server at home and use it from wherever you are.

1 more reply

linuxhansl6d ago

How qwen3.6:27b compare to qwen3.6:35b-a3b (MoE) in your experience (if you tried). I find the dense models are way too slow on my H/W.

2 more replies

andix7d ago

Sonnet is extremely overpriced. It's a good model, but not worth the money Anthropic charges for it.

dyauspitr7d ago

Why would I want some half assed coding assist tool. I want something that takes in a requirement and spits out a finished product. It’s not your equal, it’s better than you.

Shorel5d ago

I use tailscale to have remote access to my local models when on the move.

dackdel7d ago

what kind of hardware do you need in order to run qwen3.6-27b

4 more replies

indoordin0saur7d ago

Very curious what hardware you're running this on!

1 more reply

cmrdporcupine7d ago

FWIW Codex/GPT models are way less this way. Maybe to a fault.

I'm setting up my DGX Spark to try Qwen 3.6 27B again, as I'm hearing a lot of good reviews. When I tried it some time ago it was still early for support in llama.cpp.

zerd7d ago

[0] https://twelvetables.blog/comparing-claude-fable-5s-system-p...

mik096d ago

try qwen 3.7 or glm 5.2 or one of the larger gemma models

chrisweekly7d ago

Why Sonnet 4.6 not Opus?

ltononro7d ago

calebm7d ago

sync/ack

rmunn7d ago· 15 in thread

sathackr7d ago

The opposite of that has been happening for 20 years now with cloud compute.

It won't happen with AI models either.

Same will happen with AI. Whether that means paying Anthropic that premium or paying AWS.

I'm in a relatively small business, we recently had an outage related to our local infrastructure.

Everyone wants to shuck the chore and the responsibility.

indoordin0saur7d ago

starshadowx27d ago

storus7d ago

wuliwong7d ago

If things change to token usage billing for everyone, maybe I'll be singing a different tune but on a subscription, I don't think it makes sense financially.

Fun? Yes. Financially sound? No.

1 more reply

bityard7d ago

What's interesting/exciting is that local models are _already_ quite good at tasks we never imagined AI _ever_ doing before ChatGPT hit the scene just a few short years ago.

5 more replies

xdertz6d ago

pessimizer7d ago

And those are going to all be big enterprise companies that probably will set up LLM services entirely in-house, because they've got the headcount to utilize servers at 100%.

I wonder if there will be (or is currently) business in selling their compute while they're not working, to opposite time zones, etc.

What's left for the big providers will be the dregs of individual subscriptions and small businesses that at their least paranoid might let employees just use their own subscriptions for work.

icoder7d ago

4 more replies

spopejoy6d ago

The other question is how the middle ground (hetzner etc) is shaping up, because obviously so many orgs won't want to run servers.

sbmthakur7d ago

Someone was able to run gemma-4-26B-A4B on an i5-8500 with 32 gb ram with NO GPU. Granted this is an extreme example these MoE models are value for money for a lot of use cases.

https://www.reddit.com/r/LocalLLaMA/s/YontVNVRbL

themaninthedark7d ago

Maybe that is why they are buying up as much hardware as they can? If their service is the only game in town.

1 more reply

frollogaston7d ago

1 more reply

ActorNightly7d ago

Local models will never achieve "real" performance (i.e actual usage, not benchmarks) compared to frontier models.

mik096d ago

i think in the long term the problem is going to be this: a great small model always come from a greater large model, but the larger base model keep getting larger and more closed sourced

so long as there's no algorithm breakthrough

_doctor_love7d ago· 7 in thread

"Just get a 64GB Mac with 1TB of storage!"

LOL - some of us have a budget

swatcoder7d ago

Sure, but it's also not really out of scale with the cost of a shop tool in other trades.

If you're a professional that's confident in a positive return on the investment (optimal or not), or just a hobbyist with the luxury budget for a "shop" that cost is well within norms.

1 more reply

amalcon7d ago

A Strix Halo with similar RAM is considerably cheaper. Still not cheap, mind, but performance is OK (not great) and it will run more or less the same models.

1 more reply

p-e-w7d ago

No need. You can run the Gemma 4 and Qwen3.5 MoE models with as little as 12 GB of VRAM at 30-40 tps (Q4/Q5), and they both blow GPT-4o and DeepSeek R1 out of the water.

anarticle7d ago

Pros buy their own tools. This is why working for yourself is better than working for a corpo, you get to choose your weapon.

1 more reply

techscruggs7d ago

He is using a 2022 M2, which you can get that for about $2k used. That is beyond reasonable.

2 more replies

tjwebbnorfolk7d ago

AI and budgets don't mix well at the moment

themythfable7d ago

Yeah, I never had a computer that cost north of $800 until recently. While that is far from the typical HN user's budget, my bet is that it is much closer to average.

Besides those with effectively unlimited budgets for their personal compute, local models are still a long ways off.

Though, that shouldn't be conflated with the value of open-source models, which can be used by cloud providers to significantly reduce cost of intelligence.

2 more replies

pornel7d ago· 5 in thread

[meta] I wonder why people have such wildly different bar for what is "good" agentic coding?

To me, 7B models are just a fuzzy echo of Wikipedia. Gemma models at 4 bit are too clumsy to even reliably generate JSON for tool calls or copy a line of code to apply a patch.

Qwen needs so much detail and babysitting to stop it from doom looping or losing the plot, that the instructions that I need to give are usually longer than the code I end up keeping.

Is there some magic prompt that I don't know? Do other people just have a lot more patience, or way lower expectations?

papersail7d ago

palisade6d ago

1 more reply

verdverm7d ago

There is a lower bar (that gets lower over time), but ime, the config you are describing is too low still.

tl;dr - the models you appear to be trying with are too small or too quant'd

cheschire6d ago

Haves and have nots.

We aren’t wealthy enough to have the hardware that would make this good.

The people who have the money to buy a spare maxed out Mac mini just don’t get it. I see lots of folks with RTX 6000’s in threads like these. Or any RTX card that ends in “90”.

Cloud AI is what allows the proles to participate in the broader AI conversation, but not these AI conversations.

2 more replies

towledev6d ago

> may pass the Turing Test

Why do you say 'may'? Just curious. Surely you've got something

1 more reply

iagooar7d ago· 3 in thread

I love running two models locally: qwen3.6 27B 8bit (dense) and qwen3.6 35B 4bit (MoE).

Barbing7d ago

Did you get a Brave search API key or something for that “Hermes”?

4 more replies

zerd7d ago

I'd love an RTX 6000 Pro, but how can you justify it when it costs 10 years worth of Claude Max?

2 more replies

jnaina6d ago

how are you connecting the 35B model to your mailbox, for email classification?

1 more reply

embedding-shape7d ago· 2 in thread

zozbot2347d ago

> Instead, diffusion models work much faster for individual prompts, and not by a small margin either.

Diffusion models can't really be trained beyond low-to-mid size and have lower quality than an equally sized, plain one-token-at-a-time model.

1 more reply

an0malous6d ago

No one ever shows the resulting code of using frontier models either. Curious.

schmuhblaster6d ago· 2 in thread

I’ve been playing around with qwen3.6-35b-a3b and managed to boost it significantly by leveraging my own custom harness [0].

[0] https://deepclause.substack.com/p/how-to-make-small-models-p...

phunterlau6d ago

Cool, so the determinstic harness can boost the agent pretty much!

edg50006d ago

Harness engineering is very interesting stuff. Thanks for sharing.

richbradshaw7d ago· 2 in thread

I’m keen to understand speed here etc etc. if I bought a Mac studio with 96GB - what can I realistically run, how’s it compare to fable/opus etc and how fast is it?

Currently maxing out two Claude code accounts every x hours when working on large code migrations or setting up new iOS apps etc - most of time it’s fine but occasionally it’s mega frustrating!

simonw7d ago

With 96GB I'd start with the Gemma 4 and Qwen 3.6 models. Any of those should work fine.

AbsurdCensor7d ago

angry_octet7d ago· 1 in thread

rsanek7d ago

Who's paying the $50k? I don't see how it makes sense to pay that much for a home-grown setup when I could pay <$5k/year total for both of the two best frontier models at effectively unlimited usage.

3 more replies

sosodev7d ago· 1 in thread

kristopolous6d ago

You need to switch out the prompts and work with it differently.

I posted this yesterday https://github.com/day50-dev/petsitter

I use it with https://github.com/day50-dev/simple-llm-cli

And modify the "tricks" until my evals get to good numbers. It's a model by model basis.

This is what the larger firms are doing - they have custom prompts per model

1 more reply

0xc0c0c07d ago· 1 in thread

failbuffer7d ago

So which harness did you end up choosing?

segmondy7d ago· 1 in thread

If you're resourceful, you can even run SOTA models. KimiK2.7, MiMo-V2.5/V2.5-Pro, MiniMax2.5/2.7/3, DeepSeekV3.1/v3.2/V4-Flash/V4Pro, GLM5.1, Step3.7-Flash, Qwen3.5-397B, Qwen3.5-122B, gpt-oss-120B

agile-gift02626d ago

> Qwen3.5-122B

do you find Qwen3.5-122B to be SOTA-level? I moved from it to Qwen3.6-27B (both Q8), and I prefer 3.6-27B, and it leaves me room to spare for other small models

ngxson7d ago· 1 in thread

My 2c: I think the "cloud vs local" debate is (maybe) a false dichotomy. In my experience, I use a hybrid approach and I've seen a huge productivity boost from it.

I'd argue both have value, and I don't see why anyone needs to choose one exclusively. Anyone else doing this?

[1]: https://github.com/ngxson/llama-companion

phainopepla27d ago

Why not just use DS V4 Flash for the small stuff? Very fast and extremely cheap.

1 more reply

simonw7d ago· 1 in thread

These models are very capable, and use around 20-30GB of RAM while they are running.

Provided you have 64GB of RAM that leaves space for running other applications at the same time.

chrisweekly7d ago

Obtaining that 64GB RAM is a meaningful obstacle for many.

2 more replies

minton7d ago· 1 in thread

fendy30027d ago

Curiosity or anticipation I think. I have tried it in the name of those 2 factors, because when the frontier model price increase happens and we don't know anything about local models, we're screwed

sieste6d ago· 1 in thread

k__6d ago

Training DeepSeek was magnitudes cheaper than training the SOTA models it relied upon.

In theory, other countries should be able to replicate that effort and improve it.

1 more reply

gregwebs7d ago· 1 in thread

"env": { "ANTHROPIC_DEFAULT_OPUS_MODEL": "claude-fable-5" },

Obviously have that set to "claude-opus-4-8" now.

noveltyaccount7d ago

infogulch7d ago· 1 in thread

Anybody used a tinybox? https://tinygrad.org/#tinybox

pornel7d ago

If you have $12K to spend, you may be better off with DGX Spark or a Mac with 128GB VRAM. That can (barely) fit DeepSeek V4 Flash.

valisvalis7d ago· 1 in thread

LolWolf6d ago

what were your favorite projects?

pjmlp7d ago· 1 in thread

Only if blessed with enough RAM and disk space,

> 64 GB RAM and 1TB storage

Ah ok, not something regular joe and jane happen to have lying around at home.

Additionally the whole configuration is still very much low level, bunch of CLI commands, and if the model doesn't fit for the task at hand, it starts allucinating, generating gibberish, whatever.

sparkling7d ago

Even if i had such a machine, im not sure i would be willing to sacrifice 80% of my RAM and 50% of my disk to run a semi-okay model locally.

ramaseshanms6d ago· 1 in thread

red__dragon6d ago

Are you referring to Spark that has 128 GB of Unified Memory? It would be still expensive.

tpurves6d ago· 1 in thread

BenRacicot6d ago

Agreed, this is what caused me to build. This thesis exactly.

andix6d ago· 1 in thread

> For my local setup, I’m currently [..] and LM Studio as the inference server, although it would likely be faster if I just used llama.cpp directly

Is there any truth to this claim? LM Studio uses llama.cpp to run the models. I guess the overhead of LM Studio should be minimal.

After all LM Studio is a really easy way to host models, are there really major drawbacks?

cuvinny6d ago

aliljet7d ago· 1 in thread

rsolva7d ago

But for how long? The subsidized phase is probably short, and then what? I run Qwen 3.5 27 Dense om my old AMD RX7900XTX at about 45 t/s and barely use my Claude Code subscription anymore.

huydotnet7d ago· 1 in thread

zahlman6d ago

Why shouldn't the author mention models that people might not have to buy a new computer to use?

1 more reply

cube007d ago· 1 in thread

The challenge I have is getting a large enough context window so tool calls work reliably, the local models easily slip into hallucinated JSON tool responses and won't trigger the tools as a result.

glaslong7d ago

Same here. I'm curious what others loving Qwen are doing differently, because it constantly hits this issue for me. It's been great for autofilling blocks, but difficult for me to use agentically.

hamburgererror6d ago· 1 in thread

probably_wrong6d ago

After predictably failing at generating a sewing pattern, Gemini gave me yesterday this excuse:

If you want the illusion of a meaningful research question then sure, local models will give you that.

ios-contractor6d ago· 1 in thread

I subscribe to this guy on youtube for local model stuff if anyone is interested https://www.youtube.com/@AZisk. I'm not affiliated and I'm not even a paying subscriber. But I like all stuff local.

andwhatisthis6d ago

1 more reply

shunia_huang5d ago· 1 in thread

fav-ed this post and will check again when another one poped on front page and says "local models is perfect now"

Natalia7245d ago

"Good now" feels like the right bar. I still wouldn't use local models for everything, but for quick edits, grep-like code questions, and private notes they are already useful.

blobbers7d ago· 1 in thread

Have you tried optimizing for MLX? It seems like a waste to have neural cores and not use them.

I've often wondered why the hype around apple neural core when 99% of software doesn't use them.

genxy6d ago

Yeah, first think I looked for on the post was MLX and it wasn't there.

https://github.com/ml-explore/mlx-lm

Having used half the systems that Vicki mentioned, mlx was the best balance between power and ease of use. Just a pip install away.

stared7d ago· 1 in thread

I really recommend Qwen3.6 27B.

When using benchmarks, it gives more-or-less the level of SotA mid-late 2025.

iagooar7d ago

I run the exact same model, on the exact same hardware - amazing results. Pair it with good search skills (Tavily, Brave, Exa) and you have a near-SOTA model on your desk.

dejawu7d ago

xienze7d ago

chrismarlow97d ago

Tharre7d ago

I've been running Qwen3.6-35B-A3B (and 3.5 previously) locally and it's a great model for many small tasks, probably a significant chunk of what most normal people are using LLMs for right now.

delis-thumbs-7e7d ago

There’s a lot of things we could use even quite small models for, which would not need an insane amount of computing power and memory, but too few of us is really researching them.

bayshark7d ago

Hey everyone, made a local LLM, configured for Home Assistant called Selora AI.

Specs: qwen3_17b_base.Q6_K.gguf selora-v047-answer.f16.gguf selora-v047-automation.f16.gguf selora-v047-clarification.f16.gguf selora-v047-command.f16.gguf

The full base model and LoRA adapters are only 3.5GB

https://huggingface.co/selorahomes/Selora-AI

anubhav2007d ago

ptx7d ago

> Security: I run every Pi session in a Docker container and give it permissions only to bash so that it can’t run Python code or do web browsing

How does that work? The script in the post references the file "docker-compose.sandbox.yml", but I don't anything about what that file does.

The post that this one links to, that it's based on, says that Pi doesn't do proper sandboxing.

Presumably bash can still execute other binaries, otherwise it would be fairly useless. What stops it from executing Python? Or opening a network connection and downloading Python?

androiddrew7d ago

ROCm stack is not for people though who aren’t willing to dig in and patch things themselves.

K0IN7d ago

In a day to day base i host Qwen3.6:27b, but i *Really* want to host deepseekv4 flash, its such a "good" model for its size/speed/price.

I really wonder when companies will start hosting theire model for everday tasks on prem, cause its good enough (and realative cheap), instead of paying subscriptions for all devs.

asim6d ago

ltononro7d ago

Good depends a lot. If you are in the token maxxing hype you will probably find these models very bad comparing to SOTA, unfortunately.

I see OSS models being a good backbone for companies in the future that have validated workflows and could use those for privacy or to spare costs.

IDK, might have gone a little bit off-topic here.

polotics7d ago

Running the same prompt on both with the same .md memory state...

Gemini3.5 is more "intelligent" but Antigravity gets it to decide to go on tangents that are quite time and token-consuming I think. Nice casino machine.

The code is about 10'000 lines of Kotlin in total so it already takes some effort to keep it simple for the AI. It's not a slopped quantity of code, i got solid feature creep :^)

https://play.google.com/store/apps/details?id=com.sixteenam.... ...hat tip to the recent copycat squatter btw it's an honor!

jmyeet7d ago

1. Memory bandwidth

2. VRAM size, which limits the size of a model you can use effectively. Yes you can swap but then you're taking a performance hit;

3. Raw FLOPS, including quantization.

NVidia aggressively segment the market by limiting VRAM. The RTX 6000 Pro is basically a 5090 with slightly more CUDA cores and 96GB of VRAM instead of 32GB for $10-11k instead of $3k.

But anyway I think we're only 2-3 years away from sub-$5000 hardware that does 100-300+tok/s on models larger than 31B. And that's going to be a game changer.

sermakarevich6d ago

I am running an experiment with local qwen3.6:36B for a week: https://news.ycombinator.com/item?id=48520757

linuxhansl6d ago

I soooo wish that to be true. Alas, in my experience it is not... Yet.

I can run gemma4:26b-a4b-qat on my laptop with 20-30 tokens/s with a 256k context window. That was unthinkable just 6 months ago.

So the local models are "OK" for small'ish projects.

If the trends continues, with 1.58bit QAT models, even better K/V compression, faster multi-token prediction et al, maybe soon it will be comparable.

wxw7d ago

> “if we are constrained by performance and price, what architectural tradeoffs do we need to make?” a question that so far has not really been asked in the mad token gold rush.

To be fair, I think the labs are also interested in this (e.g OpenAI parameter golf). But the incentives are tricky. When the subsidies and tokenmaxxing era ends, local models will be essential.

aquarious_7d ago

andix7d ago

Because I've seen too many people spending a lot of money on expensive hardware, without really using it in the end:

Most of those models are also available via Openrouter and many other platforms. Dirt cheap, and much faster than on consumer GPUs. Perfect to try and compare the different options.

jlengrand7d ago

abalashov7d ago

jnaina6d ago

Running Qwen3-30B-A3B-Instruct-2507-AWQ-4bit on an Olares One with NVIDIA GeForce RTX 5090 Mobile GPU (24GB GDDR7 VRAM) and an Intel Core Ultra 9 275HX processor.

Plenty fast for coding work and for sharing with my OpenClaw setup.

ronef6d ago

cautiouscat7d ago

The good old butt dyno!

I’ve been doing some pretty niche development using a game and a script extender for said game. If these models can handle that, I’d feel good about switching.

robertkarl7d ago

You can trade off latency / accuracy / cost for any ML task. And with the local models.... the cost is free.

Having a local Qwen check another Qwen's work increases the accuracy quite a bit at the cost of more latency. You can't have your cake and eat it too.

Mr_Eri_Atlov7d ago

I think this is a pivotal moment for LLMs.

Plus, they have definite reliability advantages over cloud models that can be wiped out by a government order or lobotomized to handle traffic surges.

acb126d ago

b3ing7d ago

jauntywundrkind7d ago

Not really super related but MCP has been working on Events for a while. That ability to respond fast would be great. https://github.com/modelcontextprotocol/experimental-ext-tri...

Asking local to be fast feels like an obvious folly, but given how much better small models have got, and seeing these models tune themselves for speed: I want to hope!

noveltyaccount7d ago

From the recent Nvidia & Microsoft announcement about new chips for consumers:

Makes me optimistic that those two companies are going to keep investing in quality local models.

hank8087d ago

daniban7d ago

restlake6d ago

jotato7d ago

I currently have a desktop with a 4060 ti (16gb of vram). Most models I have tested that fit within that are not good enough for anything other then type completion (in regards to coding tasks)

I have been considering getting the 58gb Mac Mini but that is a decent amount of money to spend without confirmation on a) how fast is it and b) will it work for well-defined tasks.

jszymborski7d ago

throwarayes7d ago

I am happy to pay OpenAI for a cheaper model a few generations behind. But they deprecate models aggressively. They push you to bigger and smarter models, when 95% of my work doesn’t need it.

I’d love it if model providers just let old models run and let us pay less, but the deprecation makes me want to look into local models.

fridder7d ago

anax327d ago

Running locally is the bar; it's hard to make these things a service which scales.

MrKoby077d ago

I think a lot of people just don't have specs like that, making it still painful.

k__7d ago

I tried some smaller Gemma4 and Qwen3.6 quants on my MBA with M5/16GB and had like 20-60 tokens per second. At 60 it felt pretty okay and that hardware is on the lower end.

I'd assume a Mac with 32-64GB memory would get some reasonable results.

WASDx7d ago

ta-run7d ago

Not related, but, I can't seem to get my copilot-cli (office is an MS shop) use qwen3.5:27b on ollama for some odd reason.

After the recent changes to usage, I've spent an annoyingly long number of hours trying to get this to work.

wasimxyz7d ago

https://canirun.ai

skittleson7d ago

xbmcuser6d ago

Running local models might be good but until the virtual hardware monopolies of tsmc and others is broken they will out of reach for most people.

mohamedkoubaa7d ago

I wonder when a cheaper consumer grade inference chip will hit the market. The general purpose GPUs have much more silicon and complex firmware than what's strictly needed for inference

prlin7d ago

wrxd7d ago

I wonder how much local models hallucinate. I am getting almost daily an "Honest answers: I made that up." reply from Claude Opus when I challenge some silly thing it's trying to do.

pinstripes6d ago

I very much enjoy scrolling r/homelabs ever so often, so many cool local rigs there

zx80806d ago

> None of these are groundbreaking tasks (again, a lot of personalized Google/docs lookups)

Does it really needs a GPU at 300Watts to do all that tasks?

malkosta7d ago

The problem with QWEN is that it just can't edit files reliably, I had to hack Pi all over to reduce the pain, but still far from perfect...does Gemma 4 strugle on this?

aidenn06d ago

Can anybody recommend sub $10k hardware that can run the models mentioned in TFA at something faster than a snails-pace?

1 more reply

lthi7477d ago

Maybe it is good but it is very difficult, or at least with regular computer. For users like me with 16GB laptop it is almost impossible task.

ibizaman7d ago

Tangential but reading on mobile, the font size in the code snippets are all over the place. I actually have the same issue on my blog. Anyone knows why?

lanycrost6d ago

I'm crazy for gemma and Qwen, really hope we will be able to run LLMS everywhere like a Doom

ricardobayes6d ago

They are good, and yesterday's release GLM 5.2 even benchmarks really close to Opus.

walmas7d ago

Maybe the future isn't Data Centers, climate crisis, drought, and endless subscription and token fees.

nikagrawal1217d ago

I tried for my legal AI application that I'm building and it was able to do majority of the tasks. I used gemma4:26B

bthornbury7d ago

the qwopus 27b model is good for grunt work style tasks, even across multiple files. Piping a bunch of things through, small factoring changes, stuff that just takes time to type out.

I wouldn't rely on it for large stuff like codex though. I haven't tried out deepseek/kimi, if we could run those locally it would be great.

Muaz_Ashraf6d ago

for the past few days I am building things via local models and review and fix the bugs from opus. Its working okay but still local models are Just HYPE and Irritating.

ridruejo7d ago

AgentMasterRace6d ago

If you have an extra PC and enjoy 5 tokens a second... Sure

fl4regun7d ago

In my experience, with a system of 32GB RAM and 24GB VRAM, no, they aren't that good.

osigurdson7d ago

Running AI on timesharing mainframes does seem like an odd final state for the world.

fg1377d ago

> I have a 2022 M2 Mac with 64 GB RAM

I closed the article after that.

The author has no idea what a privilege it is to have a machine like that for personal use, and how 99% of the population are not going to afford a setup like that.

Just some back-of-the-envelope maths will tell you that a $20/month Claude subscription makes much more sense financially.

1 more reply

holoduke7d ago

0xbadcafebee7d ago

frollogaston7d ago

drchaim7d ago

really want to try local models, but I don't have the hardware yet. Probably I'm the only one here still using a Mac Mini m1 8gb 2020. :/

1 more reply

ZionBoggan7d ago

This is actually a really insightful post !

Patchistry6d ago

do you run you local models along side some of your "paid" models?

henryoman6d ago

Will there be a gemma4n

sn0n6d ago

Qwen 3? Qwen 2.5 coder?? Is this an llm article written on an outdated model?? LoL

atulmy7d ago

Exact reason I'm building csuite.so, do check it out and let me know if you need early access!

dbg314156d ago

They misspelled “Better than before. But… yeah.”

nullc6d ago

I'm a little mystified at people taking about qwen 3.6 27b/ gemma 31b being slow in one breath and then saying they're using a 16GB gpu in the next.

You do need to use sutable hardware.

These speeds are somewhat faster than what I've seen from commercial SOTA models, they're plenty fast for many applications.

teknologist6d ago

I found a tool that makes it easy to run Salvatore Sanfilippo's (Redis creator) ds4.c on a Mac: https://github.com/notatestuser/ds4-control

His program uses quantization, but is very optimised and has builds that can fit into 96GB of memory with great results.

DS4 Flash is usually my go-to for a lot of things these days, and I don't have to worry about a cloud model stopping or telling me it's concerned about my usage.

matrix127d ago

gemma:12b at 75% of frontier? Yeah....

1 more reply

dakolli6d ago

Imagine spending $5k to run a 32B param llm locally.. You could run much more capable open source models through Openrouter for years running 24/7 at 50tps. This will never make sense to me.

Computer07d ago

jingw2227d ago

open source must win

pauljeba6d ago

How do I beleive thi? you wrote this blog by hand.lol

monegator7d ago

I've been trying local models for the boring stuff you might be thinking about: writing small docs.

So i've tested a couple, and the speed is finally impressive. My colleague uses paid tiers of claude and GPT, and the speed is comparable. Maybe even slightly faster on my end.

Wish i had 3 times the RAM so i can see what happens with more context.

This was the Qwen 3.5 9B model.

In producing documentation, instead, it was much faster, and it never hallucinated data. Good. in 15 minutes i had everything done.

Not bad for stuff running on a business laptop, while doing actual work.

Tomorrow i will try Qwen 3.6, let's see how it goes..

pauljeba6d ago

How do I beleive you? You wrote this post by hand. lol

zrg6d ago

tldr it is not

aleksandrm7d ago

Clickbait title, because running local models is still not good now.

j / k navigate · click thread line to collapse