Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?

1318 pointscloudking12d ago562 comments

Has anyone here fully swapped Claude/GPT for a local model as their main coding tool, not just for side experiments? If so, please share your setup and performance (e.g tok/s)

562 comments

132 comments · 6 top-level

Greenpants12d ago· 99 in thread

I have! I care about data privacy and LLMs being free. I'm using the Pi coding harness but containerized and sandboxed, to make sure it's running completely offline. On my Mac Studio with 128GB RAM (or MacBook with 36GB RAM) I'm using Qwen3.6 35b, with only 3b active parameters so that it runs really fast. I've done a complete redesign for my website's homepage and blog with Django + Wagtail. The latter is interesting, because Wagtail is a bit less well-known, so the agent, without giving it internet access, doesn't always know how to develop for Wagtail. I've used Qwen3.5 122b for when things get more complex. At 10b active parameters, it's significantly slower though.

I've noticed a few things compared to large models like Claude. For starters, you really need to know what you're asking, and be precise; it doesn't do much thinking for you. Any assumptions left open, and it'll take the easiest route to reach the goal (e.g. CSS in HTML), often not the best in terms of architecture.

It gets into loops quite often, and surprisingly often gets the edit tool call wrong, after which it will spend lots of thinking tokens and re-read files instead of retrying (despite the system prompt suggesting so).

Comparing agentic Qwen3.6 35b to Claude Opus is like a junior with knowledge across the board, that you really need to guide, versus a senior that thinks with you on architecture. If Opus gives a 15x speedup, local and fully offline Qwen gives a 5x speedup. Which, given that it's completely free, is still mind-boggling to me :)

lambda12d ago

This is very similar to my setup. Pi in a container (I do let it have network access, just no access to creds or anything, only the one directory that I'm working on at the time and my ~/.pi directory), talking to llama.cpp in another container. I'm on a Strix Halo 128 GiB unified memory laptop.

I've never used the frontier models in earnest, I don't believe in using proprietary tools for my programming, so I can't really compare.

And I'm still a AI skeptic, so I'm doing more testing and kicking the tires than I am actually using it. That means I spend a lot of time trying to break various models, probe them for strengths and weaknesses, etc.

But I find that when I do try to use it for real for agentic coding, Qwen 3.6 35B-A3B is definitely the one I reach for the most often.

For other chat tasks and translation, I'll frequently use Gemma 4 31B.

For audio, I'll use Gemma 4 12B.

I keep a bunch of other models around to try out every once in a while (Qwen 3.5 122B-A10B, Qwen 3.6 27B, Nemotron 3 Super 122B-A12B, Step 3.7 Flash and Minimax M2.7 both at somewhat more aggressive quants, and GPT-OSS 120B if I want super fast but not terribly smart), but so far Qwen 3.6 35B-A3B is really the sweet spot for coding on a setup like this.

chakspak12d ago

Hopefully this isn't off-topic, but your setup sounds just like mine, Strix Halo and (I'm assuming) llama.cpp on ROCm, and I'm finding that the Qwen hybrid models don't handle prompt caching and instead re-process the context in full on every turn. I'm wondering if you were able to solve this and how?

lambda12d ago

I use Vulkan mostly instead of ROCm. Vulkan is actually a bit faster, paradoxically. I do switch out and try them both out, and it's not a huge difference, but I've been mostly saying on Vulkan.

The re-processing context every turn problem is definitely something I've hit. Some of the causes have been solved upstream in llama.cpp; make sure you're up to date.

But another cause of the issue that has a big effect is that older Qwen models didn't support preserving thinking. This means that each time you have a long sequence of tool calls with interleaved thinkging, as soon as you had your next turn in the chat, it would have to re-process all of that as it would drop all of the reasoning.

Qwen 3.6, however, now supports preserving thinking. This can use a bit more context, becasue you're not dropping the thinking every turn, but it re-uses the cache better, not causing you to have to reprocess a whole turn at a time each time.

In my models.ini, I have this for the Qwen3.6 models:

  chat-template-kwargs = {"preserve_thinking": true}

There are still occasional issues I hit where it will have to re-process, but getting up to date and enabling preserve_thinking has helped a ton.

2 more replies

havfo11d ago

I was able to solve this for my setup, 7900XTX and llama.cpp on ROCM in the oh-my-pi fork of pi.dev harness. I documented my setup on github, check under my username/omp-config, but the important thing is making sure the context is strictly append-only, and starting llama.cpp with

  --chat-template-kwargs '{"preserve_thinking":true}'

anaisbetts11d ago

If you're hitting this you have a bug, this is not related to the model. Either your harness is editing the messages between turns incorrectly (i.e. it is not append-only), or sometimes this is because of llama.cpp bugs, but bet on the former. Setting up something like Tailscale's Aperture will let you capture the requests and then you can diff them.

LoganDark12d ago

What harness are you using? Some of them (e.g. OpenCode) mutate the system prompt every turn, and therefore can't work with a KV cache.

I've had the best luck with Pi so far, but it comes without some bells and whistles you might be used to (e.g. plan mode, subagents, MCP client support)

2 more replies

dnautics12d ago

> Qwen hybrid models don't handle prompt caching and instead re-process the context in full on every turn. I'm wondering if you were able to solve this and how?

Isn't this the nature of how LLMs work? Or do you mean that it recalculates the entire KV cache instead of saving the old KV cache, in which case the problem is likely in your executor (llama.cpp, vllm, e.g.) configuration or capabilities?

1 more reply

verdverm12d ago

There is a bug in llama-cpp for qwen/gemma models, use vLLM instead

1 more reply

fjdjshsh12d ago

>I'm still a AI skeptic

What does this mean in June 2026 wrt coding?

To me it sounds like being a "rice cooker skeptic". Some people don't like using rice cookers, some do.

svantana11d ago

I'm a housekeeper skeptic. While I concede that a professional housekeeper would probably do a better job than me on most domestic tasks, I still think everyone should clean their own home, cook their own dinner, and write their own code.

femto11312d ago

For me the distinction is that your rice only needs to be edible once, while your code may need to last for decades. Using AI to code anything I could comfortably throw away if needed is a lot less fraught than letting it make choices that I and anybody who inherits the code is gonna have to live with, especially if by outsourcing those choices I reduce my understanding of the implications of those choices.

4 more replies

HWR_1412d ago

I assume it means they are not sure it gives them a speed up. Which, since I don't know what they are trying to do, may be reasonable.

1 more reply

lambda10d ago

It means that even if it works for certain tasks, I think that the problems caused by use of LLMs outweigh their benefits. I think it's a bad idea to generate large piles of code that you don't understand, but due to competitive pressures, it's too tempting for people to pass up, leading to a world in which software is getting worse by the day, while pumping CO2 into the atmosphere and boiling scarce water supplies to do so, DDOSing websites to scrape the data, and polluting the internet with mountains of slop.

This isn't about using rice cookers or not, that's a personal choice for how you cook your food, and choosing to do so or not really only affects the person cooking and cleaning. A rice cooker probably uses a similar amount of energy as cooking it by hand, possibly even less.

But when people using LLMs are causing active harm, and are making it more difficult to collaborate on a team, it's a lot harder to accept that it's just a personal preference.

If you wanted to use the rice cooker analogy, imagine if rice cookers let you cook rice in just one minute. Faster, don't have to wait for the rice to be done, great! But in order to do so, you have to cook 50 pounts of rice, but throw out the majority of it, and use a thousand kilowatt hours of energy to do so. You'd better believe I'm going to be skeptical of everyone deciding that they suddenly have to use these 1-minute rice cookers that burn so much energy and generate so much waste.

Iolaum11d ago

Haven't used for actual coding but was testing locally - for example running some swebench instances - whether qwen-3.6-35b-a3b@Q8 was better than qwen-3.5-122b-a10b@Q4. With MTP the former runs at around 55t/s and the latter at around 30t/s meaning the latter is also usable. It looked like qwen-3.5-122b-a10b@Q4 performed a bit better.

mahadevank12d ago

Thanks a lot for your comment. I was using Qwen3 but asn't aware ofo the A3B Mixture-of-experts model. Works much better, thanks

adyavanapalli12d ago

For the edit tool, you should consider implementing a hash-based approach where each line of code is hashed and referenced by it when doing replacements. You can read up on the approach here: https://blog.can.ac/2026/02/12/the-harness-problem/

I didn't do much benchmarking, but anecdotally, I found it to be making less edit errors. YMMV

pieterk12d ago

Yup, I used this for a while and IME it may get you a few percentages more of useful context initially, so quality feels a bit higher, but things start breaking down in funnier ways when you do run out of that quality for any reason later, so definitely caveat emptor.

ojr12d ago

I can use Gemini 3 Flash with the harness I built for around 8 years and still not exceed the cost of a Mac Studio with 128GB, the price for privacy is very high. Agentic flows that get stuck can be worked around but I prefer developer velocity.

ClikeX11d ago

> the price for privacy is very high

Not sure if you intended this to be this philosophical, but this is basically the slogan for modern life now.

gwerbin11d ago

Yeah but the price for, say, private email is a lot less.

disqard12d ago

Under-rated take, thanks for stating this!

Not everyone can plough $$$$ into hardware right now (more power to those who can), so choosing to rent is an A-Ok strategy.

tpm12d ago

It's ok if you can send your code and data to the provider. Some of us can't.

1 more reply

danans12d ago

> I can use Gemini 3 Flash with the harness I built for around 8 years and still not exceed the cost of a Mac Studio with 128GB

And sounds like you haven't factored in the cost of electricity to run that Mac Studio as an LLM machine. Probably get a few more years.

ihateolives11d ago

Sure, but Gemini subscription gives you just that - Gemini subscription, but new computer allows you to do other stuff with it as well. When you're upgrading anyway for other reasons then it's not fair to compare full Studio price to just one subscription.

electronsoup12d ago

> It gets into loops quite often, and surprisingly often gets the edit tool call wrong

I find that running better quantization, like Q8 tend to prevent this even though its a bit slower to run, it saves overall time with less churn

Using 3.6-27b is even slower again than 3.6-35b, but I find the accuracy really pays off

girvo12d ago

Right. Tokens/s decode isn't the most important thing to me: wall clock time for task completion is. And tracking all of that, on my GB10-based Asus box, Step 3.7 Flash at IQ4_XS beats Qwen 3.6 27B despite the latter having MTP, on all of my actual coding task evaluations in real codebases.

Qwen seems better at one-shotting things based on vague prompts to an acceptable degree, but thats literally not what I use these things for!

One thing if people do play with it, is it seems very very sensitive to quantisation of the K part of the KV cache. F16 K and Q8 V got rid of a lot of the loops that it was otherwise hitting.

There's also a regression in llama.cpp wrt. Step Flash, where quantisation is getting worse KLD and Perplexity than it otherwise was previously, for the exact same quants. Very odd, but it's being looked into at least!

gwerbin11d ago

Do you think the choice of quantization matters that much for other models? I've seen a lot of discussion about different quantization and FP formats but I feel totally unequipped to make an informed decision about what to try.

What's your evaluation setup like? It sounds like maybe the best thing to do is have a realistic evaluation that resembles your actual intended workload and workflow, and then just try everything.

2 more replies

ttoinou11d ago

I tried Step 3.7 Flash on my mac 128GB and it seemed very dumb. antirez ds4 flash is much better !

1 more reply

kristopolous11d ago

I've got a tool that sits in between the harness and inference engine called petsitter. It is a middleman validator to avoid just these kinds of issues. You can stack the fixes as needed (they're called tricks in the petsitter parlance)

It's what I use. Fixes the problem

https://github.com/day50-dev/petsitter

westoque12d ago

> Comparing agentic Qwen3.6 35b to Claude Opus is like a junior with knowledge across the board, that you really need to guide, versus a senior that thinks with you on architecture.

that's why i use the frontier models because its a senior co-worker vs a junior. if you use the junior for the sake of privacy i think you're missing out on the best insights for a specific task.

physix12d ago

The dilemma I am facing is cost.

Consumer-grade subscriptions of the frontier models give you superb capabilities per dollar, them being heavily subsidized. But if you're working in an enterprise setting, that won't work. You need to upgrade, and that gets significantly more expensive.

Furthermore, basing the SDLC on leveraging the bargain subscriptions risks falling apart in the future, both from a cost perspective as well as the question of availability (e.g. Mythos).

So from a strategic perspective, going local on the LLM and still achieving great results with the right approach is very relevant.

willisrocks12d ago

Or you can get the best of both worlds--use frontier models to build a spec/plan, and use cheap models (open source or not) for implementation. Your max or team plan can go a lot further this way without giving up much for quality. Play with something like Superpowers to make this really approachable.

bxk7612d ago

Best insights can be over rated due to bandwith limitation of the brain. Even if Einstein is sitting next to you the whole day and helping out Theory of Bounded Rationality applies.

ltononro12d ago

What kind of coding do you do? Do you keep track of frontier models to vibe check the differences and re-evaluate constantly or are you ok with having a nerfed model forever? (not being judmental, just really wanto to know your framework here)

Greenpants12d ago

Some of the work I do, I do for an (EU) organisation that doesn't have clear rules or guidelines on the use of AI yet. Though I have seen colleague-developers blatantly putting source code into external Claude-like models, I stay true to my principles and don't. I know for certain that everything that I run through my local, offline Pi Container Sandbox cannot leave the machine, and thus can't result in a data breach. I do this for the peace of mind.

I do (unscientifically) experiment whenever a new capable local LLM (<=130b) releases with a license that permits commercial use. As for knowing my models require more work than Opus, I don't mind still having to puzzle on getting the architecture right. In any case, it forces me to stay in the loop of what's being built, which is a good thing.

psychoslave12d ago

Could you give more details on how to make such a set up?

I'm not familiar with Pi, and not sure which kind of container you are referring to. Something mainstream like docker, or more classic like a BSD jail?

I started to experiment with locale LLMs, through ollama and Lemonade. Enough to throw simple prompts with code excerpts and get small scope code refactors. Though I still struggled to make them work with external tools, like my IDE, so they can be leveraged on to an agentic level with access to a full repository.

That's mainly for work, as they push for using LLMs, though with the new copilote license they provide it doesn't take me even a week to burn the whole token credit.

The tool can be useful, but in my experience without heavy guard rails and loops over tests. I suspect late models to also burn many token into rabbit hole of nonsense hypothesis, instead of doing straight forward correct implemention as you would expect from any entity with such a huge cumulated resources eaten and experimental playground to leverage on. Maybe incentives don't help model provider to minimize sold token, maybe it's just so hard to tame the beast all these bright minds with virtually infinite resources are not good enough.

Anyway, sorry for digression, but I would be extremely interested with a step by step tutorial to make a local LLM work in agentic level, including which kind of hardware is required to make it work properly.

kordlessagain12d ago

I'm adding Pi to Nemesis8 right now because I saw your comment, so thank you!

https://github.com/DeepBlueDynamics/nemesis8

ltononro11d ago

So you don't really trust the data policy (non-retention) of the big companies like Anthropic/OpenAI + regulations in EU. This is very interesting. I myself have been blindly trusting these organizations with my data and still not sure if I am trading code/trajectories for productivity.

Another POV is that most of the code written in most of my codebases were generated by Codex/Claude, so they would be "stealing data from themselves" in a sense.

I've been working with Transformers/LLM training in 2018-2021 and then now, more recently again. Things are far different. I think they would be more interested in the "how" you got your code to be satisfactory with your guidance than the actual code generated. But mostly I personally trust that they are not really using my trajectories for that (unless I explicitly allow it in the configs)

dumbfounder11d ago

It’s just a SaaS service like any other. They all want to use your data, but there are terms to make sure they don’t.

stared11d ago

Why I do like Qwen 3.6 35B A3B, I have found that the difference improvement of Qwen 3.6 27B is massive. Sure, it is 3x slower (https://github.com/stared/benching-local-llms-on-apple-silic...), but for the total development time it felt that still 27B is faster to get the goal.

Is it that in your case is it different?

geophile12d ago

My experience is almost identical. I have found that I need to be very careful with planning, breaking things down into small isolated steps (I can have qwen do this); and also (me) writing a very clear design. Relying on qwen to fill in a lot of those precise details results in those about-to-write loops.

Yeah, that edit inability is weird. I’ve updated AGENTS.md to limit editing (as opposed to rewriting) and that helps a little.

pieterk12d ago

Yup, it's fantastically useful.

Maybe even more useful than Opus when I have all the constraints to an issue. There is less "knowledge" in the model (I get by with 48GB of RAM allocated to an 8b quant), so it has fewer things to hallucinate about.

I've been getting to know its limits pretty well over the last few weeks and would say it's an excellent code search/replacement/generation* engine.

It's got the "in-context script generation" flow down as well, so it will easily help automate tasks that you describe with text and perhaps example commands, or tools, or skills* that you provide.

*Think of it + Pi as an NLP abstraction layer over grep, or a shell, rather than a jack of all trades + world knowledge all-in-one.

robertlagrant11d ago

How are you sandboxing your Pi coding harness? Directly only mounting certain folders, using capabilities to kill the network and not giving it all your shell env vars, that sort of thing? Or do you use a tool?

fidelramos9d ago

I'm using firejail to sandbox Opencode, for security and to keep the agents from personal data. I documented it in my blog [1].

[1] https://blog.fidelramos.net/software/how-i-sandbox-ai-agents...

throw1092011d ago

And, is the sandboxing for security (avoid RCE on the host) or merely guardrails for the models?

I've wanted the latter quite a bit for Pi, because weaker models like Deepseek V4 have extreme issues with obeying prompts (e.g. I'll instruct it to find a bug but not fix it, and it'll "helpfully" try to fix it anyway), so having a "read-only mode" actually backed by the OS would be very useful.

SeriousM11d ago

Haha, yes! Last time I asked it for options how to tackle a task and only do the research without touching any code. With xhigh rasoning, it echoed the options that many times until it was convinced that option A is the better choice and started implementing it.

0xbadcafebee12d ago

The harness and the LLM parameters are pretty essential to getting better results and reducing loops. Tweak the parameters and you can mostly eliminate loops without negatively affecting performance (it's a bit complex but ask a SOTA AI to guide you and it's not hard). The harness should also react more intelligently to failures; it can do things like return additional context or hints as it tracks error rates and avg duration of calls. Pi can be easily extended, and it's suggested by the author you modify it to perform better for your use case.

gwerbin12d ago

I've noticed the same about the edit tool, in both Gemma and Qwen. Maybe I'm not running them with the right sampler settings, but I'm happy to hear I'm not the only one. Lots of mismatched whitespace and stuff, the model ends up doing hex dumps and maybe 5 or 6 attempts at editing a 5-line function into a 250-line Python file.

All of these models also seem to get stuck in long thinking loops, sometimes tripling the tokens of a frontier closed model which is really painful when inference is already on the slow side (on my Macbook).

hparadiz12d ago

I am right there with you. Mind-boggling. It's a indistinguishable from magic technology!! I tried running some basic tasks through Qwen with Opencode on a 10 year old dual Xeon server for shits and giggles. I gave it a simple task like "use ffprobe first but convert this webm to mp4" and it was able to complete the task with zero network calls outside my network. On 10 year old hardware. It took about 3 minutes to complete the task. Now you may be saying 3 minutes? pfft. But I dare you to do it yourself. You're gonna be googling the CLI switches for at least 10 minutes and setting up your command. I had it actually optimize all the switches on the fly for me based on an initial ffprobe to see what is optimal.

bluerooibos12d ago

> 10 year old dual Xeon server...On 10 year old hardware.

Hold on, what are the specs of your rig? How much RAM?

I've been considering getting an old refurbished 2018 Mac Mini with 64Gb of DDR4 RAM but everything I've read suggests this will be way slower than my 16Gb M1 Pro Macbook.

hparadiz12d ago

I inherited a box with dual Xeons and 256 GB of DDR4. I then ran several tests and benchmarks of the hardware with several models.

I've been meaning to write a blog post but well whatever here's the md.

https://gist.github.com/hparadiz/f3596d00a62d8ebb2dadcc46ee5...

Qwen3.5 9B performed best.

You can absolutely still use this to do some basic stuff like tell opencode to convert a video file from one format to another. But frankly you're better off getting two AMD GPUs. Say a dual 7900XT would get way better performance.

linzhangrun10d ago

No need to touch the Macintosh from the X86 era

bandrami11d ago

> You're gonna be googling the CLI switches for at least 10 minutes

So there's this really amazing program called "man"

hparadiz11d ago

Yea there's something called a phone book too.

1 more reply

gmac11d ago

Which is generally slower than Googling, because it's paged content in a terminal which can search only for literal strings?

1 more reply

ololobus11d ago

You are right, but I think you miss the whole point of the agentic workflows that are being discussed in this post comments.

Yes, you surely can read man, docs, whatever, then DIY. The point is that in many areas people don’t really want to become an expert, like in ffmpeg cli arguments, they just want the work to be done. Above is an example of agent being able to do it locally, and I think it’s great

1 more reply

yieldcrv12d ago

> It gets into loops quite often

matches my experience and a deal breaker

also the context window sizes are too low. I can't operate in 65,000 windows any more because even just reading the code's file structure overruns it and gets me nowhere. Definitely its own art form.

200k context windows and above for me now

I saw a paper last night that should help this a lot though

Greenpants12d ago

I get that it's a deal breaker to some; it definitely requires patience.

In Pi, /new is my best friend and most-used command for sure. For simple tasks (I decompose complex ones anyway since I don't trust small local LLMs to do this for me), the model doesn't need much context, given that I'm proficient in my codebase myself: "I'd like Feature X. Look into files 1, 2 and 3 to make your edits."

kennywinker12d ago

Qwen3.6-35b handles 256k context fine if you’ve got room for it. I’m running it with 128k context with just 16gb vram.

dotancohen12d ago

  > you really need to know what you're asking, and be precise

Any chance that you could share some recent prompts to give other HNers a head start on his to approach Qwen? If you are uncomfortable posting them here, my Gmail username is the same as my HN username.

Thank you.

Greenpants12d ago

I'm glad you're asking. I already started writing a blog post on how to best make use of local models. I'll share it as soon as I have a complete enough list. If anyone else reading this would like to chime in with their tips & tricks, let us know!

For the time being, off the top of my head, I'd say:

- Prompt Engineering tips & tricks apply here (like being complete in the relevant context you provide in your question, and the specific task(s) the agent should do like reasoning, modifying one file, or trying to fix a complex task all at once (not recommended)).

- If you already know which files the agent should look into, mention them to save time and potentially context.

- In my personal workflow, I write down lots of atomic TODOs needed to solve a problem. As I write it down, I'll notice assumptions I'm making, or the fact that the TODO could still be decomposed further into (atomic) subtasks.

- It's best to get a feeling yourself for how Qwen handles your repository. I noticed if I don't specify an architecture for development, it'll make quick & dirty fixes. If I don't tell it to remove debug statements, it won't. This is what was meant with "be precise" – Claude Opus might think for you and act in your best interest. Smaller Qwen models will just do what you ask them to, and no more. They have design knowledge, but you have to explicitly ask them to "activate" that part of their knowledge.

thefossguy6911d ago

Is there a way to be notified of your blog post on this?

tsss11d ago

But if you have to write everything down in such detail, isn't it faster to just do the task yourself?

dotancohen11d ago

Thank you, that was extraordinarily helpful.

I look forward to that blog post!

jmuguy12d ago

Given your knowledge on this - do you think we'll see an open source model with Opus levels of capability? IMO if/when this happens - I would 100% stop using Anthropic.

Greenpants12d ago

Let me put it like this. I started with local LLMs when ChatGPT still used GPT-3.5. I was amazed how my MacBook with 8GB RAM could run openhermes2.5-mistral: a 7b parameter model that could generate short stories that sort of made sense. Incredible!

Two years later, and I'm running Qwen3.6 35b agentically to develop the start of a repository and automatically run tests to then improve on itself. I never thought we'd get here so quickly with LLMs back then.

I'm pretty sure in two years we'll have current Opus-like quality in the 30-100b parameter model range. But at that point, Opus 6.3 will reason along for us so much better still, that we'll still look at those models in awe. It's great to look ahead, but let's not forget to appreciate how effective the current local models already are :)

jmuguy12d ago

Haha well I ask because I don't really want/need anything beyond Opus most of the time. And I'm paranoid that Anthropic is going to be forced to charge the true cost of all this before too long.

1 more reply

lambda12d ago

If you believe the benchmarks, Qwen 3.6 35B-A3B already outperforms Claude 4 Opus.

Now, there's a bit of a degree to which some of the open source models do some benchmaxxing, and bigger models with more params may always feel like they have more depth. But anyhow, right now you have something that is arguably comparable to Claude 4 Opus on your laptop. I can't really compare myself because I never used it. It looks like Claude 4 Opus is still available on OpenRouter, so you could try it out and compare yourself if you're interested.

It will likely always be the case that there are proprietary cloud models that are more powerful than what you can run on a laptop. You can just do a whole lot more with terabytes of VRAM on multi-GPU clusters than you can do on a laptop. So for folks who must have the most capable, you're probably not going to want to leave Anthropic.

But right now, the models you can run on your laptop are comparable to the cloud models that were popular when vibecoding and Claude Code first took off.

MrScruff12d ago

You really need to take the benchmarks with a massive pinch of salt. I’ve been testing local LLMs since the original llama and there’s nothing I’ve tried that is in the same category as Opus.

1 more reply

make311d ago

There is no Claude 4 Opus model... It's a series of model, of which the strongest is Opus 4.8, and Qwen 3.6 35B-A3b gets 51.5% on Swe-bench pro to Opus 4.8's 69.2%

1 more reply

zozbot23412d ago

People can't seem to agree on what "Opus class" even means (the latest Opus is apparently pretty weak) but DeepSeek Pro, Kimi and GLM all are quite capable.

computerex12d ago

Nothing compares to Opus when it comes to "taste" in web design in my experience. Nothing compares to opus in very difficult HPC/model inference development. I worked on this with opus: https://github.com/computerex/dlgo

OpenAI was offering 2x usage at one point and I still used opus just because it's so much more effective.

1 more reply

rvnx12d ago

To me totally yes, even further, if they keep their existing route, over time people will stop using Anthropic.

More and more specialized and ultra-performant chips are going to flood the consumer market. Especially once new hardware foundries will start producing (well if we don't die from WW3 in the interval).

In 10 years from now, when even basic computers will have 128 GB of memory, and phones will have super optimized tuned models, then what will be the point of Anthropic ?

Just use Gemma/Gemini/Siri or whatever.

Pornography and uncensored models is also pushing toward local models.

It's not like needs of people grows exponentially, the needs follow an asymptote instead (they are capped).

The real revolution is offline robots and self-driving cars, but LLMs are already quite maxed.

For programmers, now, what Anthropic offers is like 3% improvement on a known test (like this pelican riding a bicycle), or on questions leaked from benchmark insiders.

It's ok but not like revolutionary (Fable was better but it was unusable, easy 20 minutes per one prompt due to overthinking).

spullara12d ago

This is the only setup that I think is reasonable to use locally right now. I had an agent set it up for me from this guys recipe:

https://ikyle.me/blog/2026/how-to-setup-a-local-coding-agent...

One thing I did change was the context length to 256k rather than 64k.

vizually8d ago

@greenpants, "Pi coding harness but containerized and sandboxed" care to address some specifics and/or reference implementation for this. may be a GITHub URL?

MoonWalk11d ago

This is good info, thanks. I want to do something similar, but know very little about how to set the components of LLMs up.

I've read a bit on what the various components are. What I don't see in your comment is what you're using to run your model locally. Ollama?

motbus312d ago

Try deepseek V4 flash

agnelnieves11d ago

there goes the rest of my night

calenti10d ago

What IDE do you use? How do you integrate it? I was using Continue but it exited its funding round to the Titler octopus and the Chat function in VSCode is choking on the Ollama responses.

Greenpants10d ago

That's precisely why my agent use is IDE-agnostic: I run Pi in any terminal. Often use it with the terminal inside VSCodium, though sometimes in a terminal outside an IDE if I don't expect to edit any files myself (e.g. for small one-shot projects).

awllau12d ago

Based on your explanation, it doesn't sound feasible for me, a complete non-engineer, to switch to fully offline? I do a lot of back and forth discussion with LLMs as someone who reads and writes 0 code.

Greenpants11d ago

I'm afraid I'd have to agree. That is, unless you have 512GB+ RAM sitting on a shelf and run the much larger SOTA-comparable local models.

nyxtom12d ago

Have you found that being much more spec driven helps guide it better?

nicman2312d ago

about the edit tool it is almost always trailing white spaces. if you give it a skill with a sed 's/( )*$//g' or something like that it speeds up things

GardenLetter2712d ago

Could the harness not check for a failed tool call and pass it to a small model for correction without clogging up the main context?

lambda12d ago

The thing is, to do a proper fix it would really need all of the context (maybe the tool call that failed was for an edit to a file that was last touched way at the beginning of the context), so you'd need to either keep that smaller model running doing prompt processing all the time, or have a very long wait while it does prompt processing on your whole session.

And then also, sometimes the tool call errors are because of something like a file was changed out from under it; the larger model is probably going to do a better job of figuring that out and fixing it up.

Finally, in Pi, you can always just use the /tree command to skip back to before a series of failed tool calls, with a summary if you want to let the model know what happened. The Pi /tree command is pretty powerful in managing your context

everforward12d ago

An illustrative example I've seen a lot is creating Jira tickets in projects with custom fields marked as mandatory. It tries to create the ticket without the field and the tool call fails. The LLM needs access to the full context so that it can generate text to put in the "Why couldn't this meeting be an email?" field.

Greenpants12d ago

I'm actually quite sure that directly retrying the tool call would often fix the edit-call already. But these models have been trained to "think" for a while for any problem solving, so they'll presume the problem of the edit is more fundamental and spend unnecessary tokens filling up the context.

I'll experiment more with the effectiveness of AGENTS.md rules for local Pi agents. I feel like smaller (local) LLMs just lack in attentiveness to elements in the context window, like precise instructions, compared to e.g. Claude models.

timmit12d ago

I got a 48GB Ram MacBook, somehow I cannot even run a 20b model, I was suprised that you get 35b model locally.

klardotsh12d ago

4-5 bit quants would probably fit pretty well on your rig. Check HuggingFace for Qwen3.6-35B-A3B-MTP-GGUF [1]. They've also got a cool UI thing these days to help indicate which quants of a model will run on your hardware.

Full octane isn't gonna fit on much of anything south of a 128GB machine once adding KV cache.

[1]: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF

amelius12d ago

Sounds super cool, don't get me wrong, but I suppose for most people the bar is higher than HTML/CSS.

q3k12d ago

I love to warm up a whole rack of servers just so that some shitass buggy TUI can generate a line of bash that comments out my test runner.

We truly live in the dumbest timeline.

nozzlegear12d ago

I use local LLMs on my Mac Studio to write and pass unit test suites in F#, among other boring project chores I don't want to do myself.

krainboltgreene12d ago

> is like a junior with knowledge across the board, that you really need to guide, versus a senior that thinks with you on architecture

I don't want to be rude, but your linkedin has a sumtotal (generous) of like 8 months of programming as a profession (job title is AI Engineer). The rest is at best programming adjacent. How would you know what either of these situations are really like?

SoftTalker12d ago

I haven't logged in to LinkedIn or looked at it since a former employer demanded that everyone create a profile. So mine is now about 20 years out of date.

krainboltgreene12d ago

His is very up to date. Not everyone is you.

underdeserver11d ago

Nit - it is not completely free.

You are paying for the extra power draw.

rjblackman12d ago

it might be worth trying oh-my-pi in your case as it claims to improve the edit calls by using a unique patching format.

p0w3n3d11d ago

which coding agent are you using?

codinhood12d ago· 18 in thread

I don't think you're going to get many "true" answers to this. The opportunity cost of not using the latest and best models is just too much right now.

Every month I research this and come to the same conclusion: the time, effort, and cost required to get local models (and the coding tools around them) to perform even close to Claude Code with sonnet/opus just not worth it right now. If it was, it would be distributive enough to be in the news.

Not that I'm discounting someone hasn't already solved this, just trying to Occam razor my way out of diving too deep down rabbit holes.

pyeri12d ago

At some point, there will come a saturation point for that "Opportunity cost FOMO train ride", and I think we are already past that point. Mythos class models are a whole different beasts and cutting edge on reasoning but not much use for the problem domains most developers are trying to solve.

The present Sonnet/Opus versions (~4.8) will likely be what everyone in the enterprise might end up using eventually. And even though local models aren't there yet, there are budget alternatives from the families of DeepSeek, Kimi, GPT, MiniMax, etc. available through APIs of NVidida, OpenRouter, Groq, etc. which are very much Sonnet grade.

codinhood12d ago

Yeah this is exactly what I'm waiting for.

Personally, I don't think we're at that point yet. While I do think model improvement is starting to plateau (reaching a local ceiling), I'm not convinced local models are as good as sonnet/opus yet. The gap is still too much. But I'm excited for those models to reach those levels.

jrm412d ago

But you're pretty much measuring opportunity cost in tokens per second, no?

I think it strongly remains to be seen whether e.g. tokens per second (multiplied or whatever by percieved quality of private model) actually means "better or more useful output."

I strongly suspect it does not. (though I also strongly suspect this will be very difficult to measure because the incentive to lie about metrics here will be so strong.)

codinhood12d ago

If you’re arguing that model metrics don’t necessarily translate into useful output, I agree. That’s not how I measure the success of a mode and not really the point I'm trying to make. I try to set things up and test it on my actual projects.

What I’m saying is that if local models were actually comparable to Claude Code in practice, we wouldn’t be having threads like this. It would be obvious to the people using them, and it would be massively disruptive. Why would individuals and companies pay hundreds or thousands for Claude Code if they could run something locally and consistently get similar results?

Every month I revisit the local ecosystem hoping the answer has changed. So far, my experience has been that it hasn’t.

jrm412d ago

Having, e.g. seen Microsoft maintain a monopoly for well over a decade, there's nothing in my experience that suggests that "quality always beats hype" is remotely true.

It's entirely possible Claude is just winning the hype game.

1 more reply

Rastonbury12d ago

I think they are referring to the opportunity cost of time saved on doing things a local model cannot do or fixing it's mistakes against the cost of a subscription

bob102911d ago

I've got a machine in a corner collecting dust that cost me $12k to build 2 years ago. It runs fine but it's wildly impractical to use as a daily driver (loud/hot). I keep it as a reminder to not do this again.

At my current pace it would take me until sometime late 2030 to spend the same amount in gpt5.5 tokens.

anonzzzies11d ago

You forget that, especially on HN, many people are scaremongering that prices will soon skyrocket. Then it will be another story... I easily run $4k+/mo on my claude sub; if I would have to pay that, I definitely would spend 12k on hardware instead and accept a dumber helper.

jonfw11d ago

You are not stuck between public API pricing for frontier models via Claude and self hosted.

Remember that there are other LLM providers, open models, and previous gen models, that are way cheaper that frontier Claude and still way better than what can realistically run locally

mark_l_watson12d ago

Sounds like a correct conclusion to me also. I am trying to transition to a layered system: local, then OpenCode with commercial vendor APIs for models like DeepSeek v4 flash, then DeepSeek v4 Pro.

With a layered approach we can slowly shift to running more locally and still get required work done. Really, my local setup is so much better than it was 2 months ago, and extremely better than 6 months ago - on the same hardware.

sakopov12d ago

This seems to be the answer. Building a rig with a decent graphics card will cost $2k+ and will produce sub-par results. Might as well milk the $100/m Claude sub until open-source alternatives reach parity with today's frontier models.

phyzix576112d ago

The opportunity cost to who? Its getting super expensive for businesses and engineers across the board to pay for frontier models.

Gigachad11d ago

The cost of the hardware to run local models is still massively more expensive than the subscriptions while offering worse models.

Eventually I think it will even out but right now the hosted stuff is very subsidised.

gunapologist9912d ago

Rather than Occam, consider Pareto?

If you truly believe that it WILL get there within the next couple of years, then you might as well start playing with it now (and, yes, you will be very surprised, especially for shorter/smaller projects or nicely modularized larger projects)

kristopolous11d ago

that's super contextually dependent. I use them just as essentially a decompress of what I already know that I'm doing. I legitimately use 4B models just fine. I've got a large number of tools that make this entirely feasible and a daily driver for me (like https://github.com/day50-dev/llm-manpage-tool) ...

It's not really a bitter lesson here, I can scale those 4B models easier than someone can scale their 1000B models.

NamlchakKhandro11d ago

Thinking Claude is leading edge... I really think you need to re-evaluate what you research you think you're doing.

Claude Code is not Claude Opus/Sonnet/Haiku.

reassess_blind11d ago

What is leading edge then?

MadrasThorn12d ago

It's great at accelerating hardware innovation however.

cuttysnark12d ago· 3 in thread

I've had some success with local models by chaining "agents" together in a workflow. Each agent has a different prompt and uses a different ollama model based on what their role is. The project manager, schema agent(qwen3:14b), etc. doesn't use the same model as the coding agent (qwen2.5-coder:7b). Between each step is an orchestrator and with a Playwright task which attempts to surface errors to the agent who introduced the previous code block. Only error-free blocks are forwarded to the next workflow step.

Probably the biggest improvement was including a backend-for-agents service definition which instructed the schema agent they were to only produce only a manifest based on the task, and to pass off that off to the next agent.

In short, I split tasks up into many pieces by defining a workflow where agents are only allowed to do very specific things before their work is passed along. This keeps them grounded and capable while also creating places for me to intervene if a workflow was say 25% or 90% successful.

pianopatrick12d ago

I wish someone would do a benchmark and competition for this kind of work flow so we could figure out what works well.

Like "Here's this consumer grade GPU. Using only this GPU but with whatever models and workflow you want, see how well you can do on xyz benchmark."

Contestants would be given like 1 hour max and scored based on % of questions answered, % of questions correct and total time to finish.

Like "The Local AI challenge"

Curiositry5d ago

Tell me if you find this! I was thinking the exact same thing.

sowbug12d ago

Have you (or anyone else) tried letting agents compete? For example, give the same coding task to two models, or to the same model with a different seed, and have the reviewer choose the better result.

Some think the human brain works similarly: thousands of mini-brain cortical columns, each with a slightly different take on the situation, voting in a majority-rules system.

moezd12d ago· 3 in thread

Not yet. Without pure Apple game or decent GPUs, even with a lot of RAM and threads, all you get is about 30-50 tokens/second, and that's thinking turned off. Without these optimizations your model will have a field day with your MCPs, skills and agent descriptions and you will watch the paint dry before seeing the first output token. Local model serving means you have to fight for every token in your context window, which is quite opposite of what Claude/GPT/Copilot are pushing the industry towards.

amarshall12d ago

Thinking doesn’t change output speed. Anthropic’s models are ~ 40–60 t/s median output speed.

moezd11d ago

Do you have access to Anthropic model weights to run them locally?

amarshall11d ago

No, and having that is not required to know output speed nor the effect of thinking, so I don’t see the point in such a superfluous, indirect question.

As for the question you’re likely asking: benchmarks that include speed across many models and providers available at various places e.g. https://artificialanalysis.ai/leaderboards/models

AH4oFVbPT4f812d ago· 2 in thread

Ollama + Hermes on M5 Max 128GB using .NET using Qwen 3.6:35b-a3b as the primary model to do the work. I might use 27b to plan what to do.

xeonax12d ago

Whats .NET doing in between?

AH4oFVbPT4f812d ago

Sorry, I meant to say I was writing .NET C# with the setup

boringg12d ago· 1 in thread

Will the AI labs always make sure there is at least a years worth of differential? I guess the underlying business premise is that each new release has a step function change that prevents this kind of behaviour..

snoman12d ago

If the government is going to gate access to frontier models from here on out, even if new releases are a step function change… which they’re not… then it may be even more comparable to what’s available with a subscription.

j / k navigate · click thread line to collapse

562 comments

132 comments · 6 top-level

Greenpants12d ago· 99 in thread

lambda12d ago

I've never used the frontier models in earnest, I don't believe in using proprietary tools for my programming, so I can't really compare.

But I find that when I do try to use it for real for agentic coding, Qwen 3.6 35B-A3B is definitely the one I reach for the most often.

For other chat tasks and translation, I'll frequently use Gemma 4 31B.

For audio, I'll use Gemma 4 12B.

chakspak12d ago

lambda12d ago

I use Vulkan mostly instead of ROCm. Vulkan is actually a bit faster, paradoxically. I do switch out and try them both out, and it's not a huge difference, but I've been mostly saying on Vulkan.

The re-processing context every turn problem is definitely something I've hit. Some of the causes have been solved upstream in llama.cpp; make sure you're up to date.

In my models.ini, I have this for the Qwen3.6 models:

  chat-template-kwargs = {"preserve_thinking": true}

There are still occasional issues I hit where it will have to re-process, but getting up to date and enabling preserve_thinking has helped a ton.

2 more replies

havfo11d ago

  --chat-template-kwargs '{"preserve_thinking":true}'

anaisbetts11d ago

LoganDark12d ago

What harness are you using? Some of them (e.g. OpenCode) mutate the system prompt every turn, and therefore can't work with a KV cache.

I've had the best luck with Pi so far, but it comes without some bells and whistles you might be used to (e.g. plan mode, subagents, MCP client support)

2 more replies

dnautics12d ago

> Qwen hybrid models don't handle prompt caching and instead re-process the context in full on every turn. I'm wondering if you were able to solve this and how?

1 more reply

verdverm12d ago

There is a bug in llama-cpp for qwen/gemma models, use vLLM instead

1 more reply

fjdjshsh12d ago

>I'm still a AI skeptic

What does this mean in June 2026 wrt coding?

To me it sounds like being a "rice cooker skeptic". Some people don't like using rice cookers, some do.

svantana11d ago

femto11312d ago

4 more replies

HWR_1412d ago

I assume it means they are not sure it gives them a speed up. Which, since I don't know what they are trying to do, may be reasonable.

1 more reply

lambda10d ago

But when people using LLMs are causing active harm, and are making it more difficult to collaborate on a team, it's a lot harder to accept that it's just a personal preference.

Iolaum11d ago

mahadevank12d ago

Thanks a lot for your comment. I was using Qwen3 but asn't aware ofo the A3B Mixture-of-experts model. Works much better, thanks

adyavanapalli12d ago

I didn't do much benchmarking, but anecdotally, I found it to be making less edit errors. YMMV

pieterk12d ago

ojr12d ago

ClikeX11d ago

> the price for privacy is very high

Not sure if you intended this to be this philosophical, but this is basically the slogan for modern life now.

gwerbin11d ago

Yeah but the price for, say, private email is a lot less.

disqard12d ago

Under-rated take, thanks for stating this!

Not everyone can plough $$$$ into hardware right now (more power to those who can), so choosing to rent is an A-Ok strategy.

tpm12d ago

It's ok if you can send your code and data to the provider. Some of us can't.

1 more reply

danans12d ago

> I can use Gemini 3 Flash with the harness I built for around 8 years and still not exceed the cost of a Mac Studio with 128GB

And sounds like you haven't factored in the cost of electricity to run that Mac Studio as an LLM machine. Probably get a few more years.

ihateolives11d ago

electronsoup12d ago

> It gets into loops quite often, and surprisingly often gets the edit tool call wrong

I find that running better quantization, like Q8 tend to prevent this even though its a bit slower to run, it saves overall time with less churn

Using 3.6-27b is even slower again than 3.6-35b, but I find the accuracy really pays off

girvo12d ago

Qwen seems better at one-shotting things based on vague prompts to an acceptable degree, but thats literally not what I use these things for!

One thing if people do play with it, is it seems very very sensitive to quantisation of the K part of the KV cache. F16 K and Q8 V got rid of a lot of the loops that it was otherwise hitting.

gwerbin11d ago

What's your evaluation setup like? It sounds like maybe the best thing to do is have a realistic evaluation that resembles your actual intended workload and workflow, and then just try everything.

2 more replies

ttoinou11d ago

I tried Step 3.7 Flash on my mac 128GB and it seemed very dumb. antirez ds4 flash is much better !

1 more reply

kristopolous11d ago

It's what I use. Fixes the problem

https://github.com/day50-dev/petsitter

westoque12d ago

> Comparing agentic Qwen3.6 35b to Claude Opus is like a junior with knowledge across the board, that you really need to guide, versus a senior that thinks with you on architecture.

that's why i use the frontier models because its a senior co-worker vs a junior. if you use the junior for the sake of privacy i think you're missing out on the best insights for a specific task.

physix12d ago

The dilemma I am facing is cost.

Furthermore, basing the SDLC on leveraging the bargain subscriptions risks falling apart in the future, both from a cost perspective as well as the question of availability (e.g. Mythos).

So from a strategic perspective, going local on the LLM and still achieving great results with the right approach is very relevant.

willisrocks12d ago

bxk7612d ago

Best insights can be over rated due to bandwith limitation of the brain. Even if Einstein is sitting next to you the whole day and helping out Theory of Bounded Rationality applies.

ltononro12d ago

Greenpants12d ago

psychoslave12d ago

Could you give more details on how to make such a set up?

I'm not familiar with Pi, and not sure which kind of container you are referring to. Something mainstream like docker, or more classic like a BSD jail?

That's mainly for work, as they push for using LLMs, though with the new copilote license they provide it doesn't take me even a week to burn the whole token credit.

kordlessagain12d ago

I'm adding Pi to Nemesis8 right now because I saw your comment, so thank you!

https://github.com/DeepBlueDynamics/nemesis8

ltononro11d ago

Another POV is that most of the code written in most of my codebases were generated by Codex/Claude, so they would be "stealing data from themselves" in a sense.

dumbfounder11d ago

It’s just a SaaS service like any other. They all want to use your data, but there are terms to make sure they don’t.

stared11d ago

Is it that in your case is it different?

geophile12d ago

Yeah, that edit inability is weird. I’ve updated AGENTS.md to limit editing (as opposed to rewriting) and that helps a little.

pieterk12d ago

Yup, it's fantastically useful.

I've been getting to know its limits pretty well over the last few weeks and would say it's an excellent code search/replacement/generation* engine.

It's got the "in-context script generation" flow down as well, so it will easily help automate tasks that you describe with text and perhaps example commands, or tools, or skills* that you provide.

*Think of it + Pi as an NLP abstraction layer over grep, or a shell, rather than a jack of all trades + world knowledge all-in-one.

robertlagrant11d ago

fidelramos9d ago

I'm using firejail to sandbox Opencode, for security and to keep the agents from personal data. I documented it in my blog [1].

[1] https://blog.fidelramos.net/software/how-i-sandbox-ai-agents...

throw1092011d ago

And, is the sandboxing for security (avoid RCE on the host) or merely guardrails for the models?

SeriousM11d ago

0xbadcafebee12d ago

gwerbin12d ago

hparadiz12d ago

bluerooibos12d ago

> 10 year old dual Xeon server...On 10 year old hardware.

Hold on, what are the specs of your rig? How much RAM?

I've been considering getting an old refurbished 2018 Mac Mini with 64Gb of DDR4 RAM but everything I've read suggests this will be way slower than my 16Gb M1 Pro Macbook.

hparadiz12d ago

I inherited a box with dual Xeons and 256 GB of DDR4. I then ran several tests and benchmarks of the hardware with several models.

I've been meaning to write a blog post but well whatever here's the md.

https://gist.github.com/hparadiz/f3596d00a62d8ebb2dadcc46ee5...

Qwen3.5 9B performed best.

linzhangrun10d ago

No need to touch the Macintosh from the X86 era

bandrami11d ago

> You're gonna be googling the CLI switches for at least 10 minutes

So there's this really amazing program called "man"

hparadiz11d ago

Yea there's something called a phone book too.

1 more reply

gmac11d ago

Which is generally slower than Googling, because it's paged content in a terminal which can search only for literal strings?

1 more reply

ololobus11d ago

You are right, but I think you miss the whole point of the agentic workflows that are being discussed in this post comments.

1 more reply

yieldcrv12d ago

> It gets into loops quite often

matches my experience and a deal breaker

also the context window sizes are too low. I can't operate in 65,000 windows any more because even just reading the code's file structure overruns it and gets me nowhere. Definitely its own art form.

200k context windows and above for me now

I saw a paper last night that should help this a lot though

Greenpants12d ago

I get that it's a deal breaker to some; it definitely requires patience.

kennywinker12d ago

Qwen3.6-35b handles 256k context fine if you’ve got room for it. I’m running it with 128k context with just 16gb vram.

dotancohen12d ago

  > you really need to know what you're asking, and be precise

Thank you.

Greenpants12d ago

For the time being, off the top of my head, I'd say:

- If you already know which files the agent should look into, mention them to save time and potentially context.

thefossguy6911d ago

Is there a way to be notified of your blog post on this?

tsss11d ago

But if you have to write everything down in such detail, isn't it faster to just do the task yourself?

dotancohen11d ago

Thank you, that was extraordinarily helpful.

I look forward to that blog post!

jmuguy12d ago

Given your knowledge on this - do you think we'll see an open source model with Opus levels of capability? IMO if/when this happens - I would 100% stop using Anthropic.

Greenpants12d ago

jmuguy12d ago

Haha well I ask because I don't really want/need anything beyond Opus most of the time. And I'm paranoid that Anthropic is going to be forced to charge the true cost of all this before too long.

1 more reply

lambda12d ago

If you believe the benchmarks, Qwen 3.6 35B-A3B already outperforms Claude 4 Opus.

But right now, the models you can run on your laptop are comparable to the cloud models that were popular when vibecoding and Claude Code first took off.

MrScruff12d ago

You really need to take the benchmarks with a massive pinch of salt. I’ve been testing local LLMs since the original llama and there’s nothing I’ve tried that is in the same category as Opus.

1 more reply

make311d ago

There is no Claude 4 Opus model... It's a series of model, of which the strongest is Opus 4.8, and Qwen 3.6 35B-A3b gets 51.5% on Swe-bench pro to Opus 4.8's 69.2%

1 more reply

zozbot23412d ago

People can't seem to agree on what "Opus class" even means (the latest Opus is apparently pretty weak) but DeepSeek Pro, Kimi and GLM all are quite capable.

computerex12d ago

OpenAI was offering 2x usage at one point and I still used opus just because it's so much more effective.

1 more reply

rvnx12d ago

To me totally yes, even further, if they keep their existing route, over time people will stop using Anthropic.

In 10 years from now, when even basic computers will have 128 GB of memory, and phones will have super optimized tuned models, then what will be the point of Anthropic ?

Just use Gemma/Gemini/Siri or whatever.

Pornography and uncensored models is also pushing toward local models.

It's not like needs of people grows exponentially, the needs follow an asymptote instead (they are capped).

The real revolution is offline robots and self-driving cars, but LLMs are already quite maxed.

For programmers, now, what Anthropic offers is like 3% improvement on a known test (like this pelican riding a bicycle), or on questions leaked from benchmark insiders.

It's ok but not like revolutionary (Fable was better but it was unusable, easy 20 minutes per one prompt due to overthinking).

spullara12d ago

This is the only setup that I think is reasonable to use locally right now. I had an agent set it up for me from this guys recipe:

https://ikyle.me/blog/2026/how-to-setup-a-local-coding-agent...

One thing I did change was the context length to 256k rather than 64k.

vizually8d ago

@greenpants, "Pi coding harness but containerized and sandboxed" care to address some specifics and/or reference implementation for this. may be a GITHub URL?

MoonWalk11d ago

This is good info, thanks. I want to do something similar, but know very little about how to set the components of LLMs up.

I've read a bit on what the various components are. What I don't see in your comment is what you're using to run your model locally. Ollama?

motbus312d ago

Try deepseek V4 flash

agnelnieves11d ago

there goes the rest of my night

calenti10d ago

What IDE do you use? How do you integrate it? I was using Continue but it exited its funding round to the Titler octopus and the Chat function in VSCode is choking on the Ollama responses.

Greenpants10d ago

awllau12d ago

Greenpants11d ago

I'm afraid I'd have to agree. That is, unless you have 512GB+ RAM sitting on a shelf and run the much larger SOTA-comparable local models.

nyxtom12d ago

Have you found that being much more spec driven helps guide it better?

nicman2312d ago

about the edit tool it is almost always trailing white spaces. if you give it a skill with a sed 's/( )*$//g' or something like that it speeds up things

GardenLetter2712d ago

Could the harness not check for a failed tool call and pass it to a small model for correction without clogging up the main context?

lambda12d ago

everforward12d ago

Greenpants12d ago

timmit12d ago

I got a 48GB Ram MacBook, somehow I cannot even run a 20b model, I was suprised that you get 35b model locally.

klardotsh12d ago

Full octane isn't gonna fit on much of anything south of a 128GB machine once adding KV cache.

[1]: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF

amelius12d ago

Sounds super cool, don't get me wrong, but I suppose for most people the bar is higher than HTML/CSS.

q3k12d ago

I love to warm up a whole rack of servers just so that some shitass buggy TUI can generate a line of bash that comments out my test runner.

We truly live in the dumbest timeline.

nozzlegear12d ago

I use local LLMs on my Mac Studio to write and pass unit test suites in F#, among other boring project chores I don't want to do myself.

krainboltgreene12d ago

> is like a junior with knowledge across the board, that you really need to guide, versus a senior that thinks with you on architecture

SoftTalker12d ago

I haven't logged in to LinkedIn or looked at it since a former employer demanded that everyone create a profile. So mine is now about 20 years out of date.

krainboltgreene12d ago

His is very up to date. Not everyone is you.

underdeserver11d ago

Nit - it is not completely free.

You are paying for the extra power draw.

rjblackman12d ago

it might be worth trying oh-my-pi in your case as it claims to improve the edit calls by using a unique patching format.

p0w3n3d11d ago

which coding agent are you using?

codinhood12d ago· 18 in thread

I don't think you're going to get many "true" answers to this. The opportunity cost of not using the latest and best models is just too much right now.

Not that I'm discounting someone hasn't already solved this, just trying to Occam razor my way out of diving too deep down rabbit holes.

pyeri12d ago

codinhood12d ago

Yeah this is exactly what I'm waiting for.

jrm412d ago

But you're pretty much measuring opportunity cost in tokens per second, no?

I think it strongly remains to be seen whether e.g. tokens per second (multiplied or whatever by percieved quality of private model) actually means "better or more useful output."

I strongly suspect it does not. (though I also strongly suspect this will be very difficult to measure because the incentive to lie about metrics here will be so strong.)

codinhood12d ago

Every month I revisit the local ecosystem hoping the answer has changed. So far, my experience has been that it hasn’t.

jrm412d ago

Having, e.g. seen Microsoft maintain a monopoly for well over a decade, there's nothing in my experience that suggests that "quality always beats hype" is remotely true.

It's entirely possible Claude is just winning the hype game.

1 more reply

Rastonbury12d ago

I think they are referring to the opportunity cost of time saved on doing things a local model cannot do or fixing it's mistakes against the cost of a subscription

bob102911d ago

At my current pace it would take me until sometime late 2030 to spend the same amount in gpt5.5 tokens.

anonzzzies11d ago

jonfw11d ago

You are not stuck between public API pricing for frontier models via Claude and self hosted.

Remember that there are other LLM providers, open models, and previous gen models, that are way cheaper that frontier Claude and still way better than what can realistically run locally

mark_l_watson12d ago

Sounds like a correct conclusion to me also. I am trying to transition to a layered system: local, then OpenCode with commercial vendor APIs for models like DeepSeek v4 flash, then DeepSeek v4 Pro.

sakopov12d ago

phyzix576112d ago

The opportunity cost to who? Its getting super expensive for businesses and engineers across the board to pay for frontier models.

Gigachad11d ago

The cost of the hardware to run local models is still massively more expensive than the subscriptions while offering worse models.

Eventually I think it will even out but right now the hosted stuff is very subsidised.

gunapologist9912d ago

Rather than Occam, consider Pareto?

kristopolous11d ago

It's not really a bitter lesson here, I can scale those 4B models easier than someone can scale their 1000B models.

NamlchakKhandro11d ago

Thinking Claude is leading edge... I really think you need to re-evaluate what you research you think you're doing.

Claude Code is not Claude Opus/Sonnet/Haiku.

reassess_blind11d ago

What is leading edge then?

MadrasThorn12d ago

It's great at accelerating hardware innovation however.

cuttysnark12d ago· 3 in thread

pianopatrick12d ago

I wish someone would do a benchmark and competition for this kind of work flow so we could figure out what works well.

Like "Here's this consumer grade GPU. Using only this GPU but with whatever models and workflow you want, see how well you can do on xyz benchmark."

Contestants would be given like 1 hour max and scored based on % of questions answered, % of questions correct and total time to finish.

Like "The Local AI challenge"

Curiositry5d ago

Tell me if you find this! I was thinking the exact same thing.

sowbug12d ago

Some think the human brain works similarly: thousands of mini-brain cortical columns, each with a slightly different take on the situation, voting in a majority-rules system.

moezd12d ago· 3 in thread

amarshall12d ago

Thinking doesn’t change output speed. Anthropic’s models are ~ 40–60 t/s median output speed.

moezd11d ago

Do you have access to Anthropic model weights to run them locally?

amarshall11d ago

No, and having that is not required to know output speed nor the effect of thinking, so I don’t see the point in such a superfluous, indirect question.

As for the question you’re likely asking: benchmarks that include speed across many models and providers available at various places e.g. https://artificialanalysis.ai/leaderboards/models

AH4oFVbPT4f812d ago· 2 in thread

Ollama + Hermes on M5 Max 128GB using .NET using Qwen 3.6:35b-a3b as the primary model to do the work. I might use 27b to plan what to do.

xeonax12d ago

Whats .NET doing in between?

AH4oFVbPT4f812d ago

Sorry, I meant to say I was writing .NET C# with the setup

boringg12d ago· 1 in thread

snoman12d ago

j / k navigate · click thread line to collapse