I've noticed a few things compared to large models like Claude. For starters, you really need to know what you're asking, and be precise; it doesn't do much thinking for you. Any assumptions left open, and it'll take the easiest route to reach the goal (e.g. CSS in HTML), often not the best in terms of architecture.
It gets into loops quite often, and surprisingly often gets the edit tool call wrong, after which it will spend lots of thinking tokens and re-read files instead of retrying (despite the system prompt suggesting so).
Comparing agentic Qwen3.6 35b to Claude Opus is like a junior with knowledge across the board, that you really need to guide, versus a senior that thinks with you on architecture. If Opus gives a 15x speedup, local and fully offline Qwen gives a 5x speedup. Which, given that it's completely free, is still mind-boggling to me :)
I've never used the frontier models in earnest, I don't believe in using proprietary tools for my programming, so I can't really compare.
And I'm still an AI skeptic, so I'm doing more testing and kicking the tires than I am actually using it. That means I spend a lot of time trying to break various models, probe them for strengths and weaknesses, etc.
But I find that when I do try to use it for real for agentic coding, Qwen 3.6 35B-A3B is definitely the one I reach for the most often.
For other chat tasks and translation, I'll frequently use Gemma 4 31B.
For audio, I'll use Gemma 4 12B.
I keep a bunch of other models around to try out every once in a while (Qwen 3.5 122B-A10B, Qwen 3.6 27B, Nemotron 3 Super 122B-A12B, Step 3.7 Flash and Minimax M2.7 both at somewhat more aggressive quants, and GPT-OSS 120B if I want super fast but not terribly smart), but so far Qwen 3.6 35B-A3B is really the sweet spot for coding on a setup like this.
The problem of re-processing the whole context every turn is definitely something I've hit. Some of the causes have been solved upstream in llama.cpp; make sure you're up to date.
But another cause with a big effect is that older Qwen models didn't support preserving thinking. This means that each time you had a long sequence of tool calls with interleaved thinking, as soon as you took your next turn in the chat, it would have to re-process all of that, because it dropped all of the reasoning.
Qwen 3.6, however, now supports preserving thinking. This can use a bit more context, because you're not dropping the thinking every turn, but it re-uses the cache better, so you don't have to reprocess a whole turn each time.
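To make the caching effect concrete, here is a toy sketch. It is not how llama.cpp implements its prompt cache; it only assumes that the server can reuse the longest shared prefix between what it already processed and the next prompt. The token strings and the helper names (`render`, `common_prefix_len`) are made up for illustration.

```python
def render(turns, preserve_thinking):
    """Flatten (kind, token) turns into the token list a chat template would emit.

    With preserve_thinking=False, "think" tokens are dropped, mimicking
    templates that strip reasoning from earlier turns.
    """
    return [tok for turn in turns for kind, tok in turn
            if preserve_thinking or kind != "think"]


def common_prefix_len(cached, prompt):
    """Count the leading tokens shared by the cached sequence and the new prompt."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n


# One agentic turn: thinking interleaved with tool calls (made-up tokens).
turn1 = [
    ("text", "user: fix the bug"),
    ("think", "<think 1>"),
    ("tool", "call: read file"),
    ("text", "result: file contents"),
    ("think", "<think 2>"),
    ("tool", "call: edit file"),
    ("text", "result: ok"),
    ("text", "assistant: done"),
]
next_user = ["user: now add tests"]

# What the server processed during turn 1, thinking included.
cached = render([turn1], preserve_thinking=True)

# The next request, with and without the earlier reasoning in the history.
prompt_dropped = render([turn1], preserve_thinking=False) + next_user
prompt_kept = render([turn1], preserve_thinking=True) + next_user

reused_dropped = common_prefix_len(cached, prompt_dropped)
reused_kept = common_prefix_len(cached, prompt_kept)
print(f"thinking dropped: reuse {reused_dropped} of {len(cached)} cached tokens")
print(f"thinking kept:    reuse {reused_kept} of {len(cached)} cached tokens")
```

In this toy model, dropping the thinking makes the new prompt diverge from the cache at the first reasoning token, so almost the whole turn is processed again; keeping it lets the entire previous turn be reused.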
In my models.ini, I have this for the Qwen3.6 models:
chat-template-kwargs = {"preserve_thinking": true}
There are still occasional issues I hit where it will have to re-process, but getting up to date and enabling preserve_thinking has helped a ton.
--chat-template-kwargs '{"preserve_thinking":true}'
I've had the best luck with Pi so far, but it comes without some bells and whistles you might be used to (e.g. plan mode, subagents, MCP client support).
Isn't this the nature of how LLMs work? Or do you mean that it recalculates the entire KV cache instead of saving the old KV cache, in which case the problem is likely in your executor (llama.cpp, vllm, e.g.) configuration or capabilities?
What does this mean in June 2026 wrt coding?
To me it sounds like being a "rice cooker skeptic". Some people don't like using rice cookers, some do.
This isn't about using rice cookers or not, that's a personal choice for how you cook your food, and choosing to do so or not really only affects the person cooking and cleaning. A rice cooker probably uses a similar amount of energy as cooking it by hand, possibly even less.
But when people using LLMs are causing active harm, and are making it more difficult to collaborate on a team, it's a lot harder to accept that it's just a personal preference.
If you wanted to use the rice cooker analogy, imagine if rice cookers let you cook rice in just one minute. Faster, don't have to wait for the rice to be done, great! But in order to do so, you have to cook 50 pounds of rice, throw out the majority of it, and use a thousand kilowatt hours of energy to do so. You'd better believe I'm going to be skeptical of everyone deciding that they suddenly have to use these 1-minute rice cookers that burn so much energy and generate so much waste.
I didn't do much benchmarking, but anecdotally, I found it to be making fewer edit errors. YMMV
Not sure if you intended this to be this philosophical, but this is basically the slogan for modern life now.
Not everyone can plough $$$$ into hardware right now (more power to those who can), so choosing to rent is an A-Ok strategy.
And sounds like you haven't factored in the cost of electricity to run that Mac Studio as an LLM machine. Probably get a few more years.
I find that running a better quantization, like Q8, tends to prevent this. Even though it's a bit slower to run, it saves time overall with less churn.
Using 3.6-27b is even slower again than 3.6-35b, but I find the accuracy really pays off
Qwen seems better at one-shotting things based on vague prompts to an acceptable degree, but that's literally not what I use these things for!
One thing if people do play with it, is it seems very very sensitive to quantisation of the K part of the KV cache. F16 K and Q8 V got rid of a lot of the loops that it was otherwise hitting.
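For anyone wanting to try that combination with llama.cpp, here is a minimal sketch that builds the command line. `--cache-type-k` and `--cache-type-v` are existing llama.cpp options, but the model path is a placeholder and the set of accepted type names below is an assumption that may differ by build, so check `llama-server --help` on your version.

```python
import shlex

# Cache types commonly accepted by llama.cpp builds (assumption: verify with --help).
ALLOWED_TYPES = {"f32", "f16", "bf16", "q8_0", "q4_0", "q4_1", "iq4_nl", "q5_0", "q5_1"}


def kv_cache_args(k_type="f16", v_type="q8_0"):
    """Return the llama-server arguments selecting the K and V cache types."""
    for label, value in (("K", k_type), ("V", v_type)):
        if value not in ALLOWED_TYPES:
            raise ValueError(f"unknown {label} cache type: {value}")
    return ["--cache-type-k", k_type, "--cache-type-v", v_type]


# "model.gguf" is a placeholder path, not a real file from this thread.
cmd = ["llama-server", "-m", "model.gguf", *kv_cache_args()]
print(shlex.join(cmd))
```

Keeping K at F16 and only quantising V trades some memory for stability, which matches the observation above that K is the sensitive half.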
There's also a regression in llama.cpp wrt. Step Flash, where quantisation is getting worse KLD and Perplexity than it otherwise was previously, for the exact same quants. Very odd, but it's being looked into at least!
What's your evaluation setup like? It sounds like maybe the best thing to do is have a realistic evaluation that resembles your actual intended workload and workflow, and then just try everything.
It's what I use. Fixes the problem
That's why I use the frontier models: it's a senior co-worker vs. a junior. If you use the junior for the sake of privacy, I think you're missing out on the best insights for a specific task.
Consumer-grade subscriptions of the frontier models give you superb capabilities per dollar, them being heavily subsidized. But if you're working in an enterprise setting, that won't work. You need to upgrade, and that gets significantly more expensive.
Furthermore, basing the SDLC on leveraging the bargain subscriptions risks falling apart in the future, both from a cost perspective as well as the question of availability (e.g. Mythos).
So from a strategic perspective, going local on the LLM and still achieving great results with the right approach is very relevant.
I do (unscientifically) experiment whenever a new capable local LLM (<=130b) releases with a license that permits commercial use. As for knowing my models require more work than Opus, I don't mind still having to puzzle on getting the architecture right. In any case, it forces me to stay in the loop of what's being built, which is a good thing.
I'm not familiar with Pi, and not sure which kind of container you are referring to. Something mainstream like docker, or more classic like a BSD jail?
I started to experiment with local LLMs through ollama and Lemonade. Enough to throw simple prompts with code excerpts at them and get small-scope code refactors. I still struggled, though, to make them work with external tools like my IDE, so that they can be used at an agentic level with access to a full repository.
That's mainly for work, as they push for using LLMs, though with the new Copilot license they provide it doesn't take me even a week to burn the whole token credit.
The tool can be useful, but in my experience not without heavy guard rails and loops over tests. I suspect recent models also burn many tokens going down rabbit holes of nonsense hypotheses, instead of doing the straightforward, correct implementation you would expect from any entity that has consumed such huge cumulative resources and has such an experimental playground to draw on. Maybe the incentives don't push model providers to minimize the tokens they sell; maybe the beast is just so hard to tame that all these bright minds with virtually infinite resources are not good enough.
Anyway, sorry for the digression, but I would be extremely interested in a step-by-step tutorial on making a local LLM work at an agentic level, including which kind of hardware is required to make it work properly.
Another POV is that most of the code written in most of my codebases was generated by Codex/Claude, so they would be "stealing data from themselves" in a sense.
I worked on Transformers/LLM training in 2018-2021, and again more recently. Things are far different now. I think they would be more interested in "how" you got your code to be satisfactory with your guidance than in the actual code generated. But mostly I personally trust that they are not really using my trajectories for that (unless I explicitly allow it in the configs).
Is it different in your case?
Yeah, that edit inability is weird. I’ve updated AGENTS.md to limit editing (as opposed to rewriting) and that helps a little.
Maybe even more useful than Opus when I have all the constraints to an issue. There is less "knowledge" in the model (I get by with 48GB of RAM allocated to an 8b quant), so it has fewer things to hallucinate about.
I've been getting to know its limits pretty well over the last few weeks and would say it's an excellent code search/replacement/generation* engine.
It's got the "in-context script generation" flow down as well, so it will easily help automate tasks that you describe with text and perhaps example commands, or tools, or skills* that you provide.
*Think of it + Pi as an NLP abstraction layer over grep, or a shell, rather than a jack of all trades + world knowledge all-in-one.
[1] https://blog.fidelramos.net/software/how-i-sandbox-ai-agents...
I've wanted the latter quite a bit for Pi, because weaker models like Deepseek V4 have extreme issues with obeying prompts (e.g. I'll instruct it to find a bug but not fix it, and it'll "helpfully" try to fix it anyway), so having a "read-only mode" actually backed by the OS would be very useful.
All of these models also seem to get stuck in long thinking loops, sometimes tripling the tokens of a frontier closed model which is really painful when inference is already on the slow side (on my Macbook).
Hold on, what are the specs of your rig? How much RAM?
I've been considering getting an old refurbished 2018 Mac Mini with 64GB of DDR4 RAM, but everything I've read suggests this will be way slower than my 16GB M1 Pro MacBook.
I've been meaning to write a blog post but well whatever here's the md.
https://gist.github.com/hparadiz/f3596d00a62d8ebb2dadcc46ee5...
Qwen3.5 9B performed best.
You can absolutely still use this to do some basic stuff like tell opencode to convert a video file from one format to another. But frankly you're better off getting two AMD GPUs. Say a dual 7900XT would get way better performance.
So there's this really amazing program called "man"
Yes, you surely can read man, docs, whatever, then DIY. The point is that in many areas people don’t really want to become an expert, like in ffmpeg CLI arguments; they just want the work to be done. Above is an example of an agent being able to do it locally, and I think it’s great.
matches my experience and a deal breaker
also the context window sizes are too low. I can't operate in 65,000-token windows any more because even just reading the code's file structure overruns it and gets me nowhere. Definitely its own art form.
200k context windows and above for me now
I saw a paper last night that should help this a lot though
In Pi, /new is my best friend and most-used command for sure. For simple tasks (I decompose complex ones anyway since I don't trust small local LLMs to do this for me), the model doesn't need much context, given that I'm proficient in my codebase myself: "I'd like Feature X. Look into files 1, 2 and 3 to make your edits."
> you really need to know what you're asking, and be precise
Any chance that you could share some recent prompts to give other HNers a head start on how to approach Qwen? If you are uncomfortable posting them here, my Gmail username is the same as my HN username. Thank you.
For the time being, off the top of my head, I'd say:
- Prompt engineering tips & tricks apply here: be complete in the relevant context you provide in your question, and be specific about the task(s) the agent should do, such as reasoning, modifying one file, or trying to fix a complex task all at once (not recommended).
- If you already know which files the agent should look into, mention them to save time and potentially context.
- In my personal workflow, I write down lots of atomic TODOs needed to solve a problem. As I write it down, I'll notice assumptions I'm making, or the fact that the TODO could still be decomposed further into (atomic) subtasks.
- It's best to get a feeling yourself for how Qwen handles your repository. I noticed if I don't specify an architecture for development, it'll make quick & dirty fixes. If I don't tell it to remove debug statements, it won't. This is what was meant with "be precise" – Claude Opus might think for you and act in your best interest. Smaller Qwen models will just do what you ask them to, and no more. They have design knowledge, but you have to explicitly ask them to "activate" that part of their knowledge.
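The tips above can be turned into a small template so that nothing is left implicit. This is only a sketch of one way to structure such a prompt; the function name, the section labels and the example file names are invented for illustration and are not something Pi or Qwen requires.

```python
def build_prompt(task, files=(), constraints=()):
    """Assemble one atomic, explicit request for a small local model."""
    lines = [f"Task: {task}"]
    if files:
        # Naming the files up front saves the agent a search and some context.
        lines.append("Look only at these files:")
        lines += [f"- {path}" for path in files]
    if constraints:
        # Spell out what a larger model might infer on its own.
        lines.append("Constraints:")
        lines += [f"- {rule}" for rule in constraints]
    return "\n".join(lines)


prompt = build_prompt(
    task="Add input validation to the signup form handler.",
    files=["handlers/signup.py", "tests/test_signup.py"],
    constraints=[
        "Follow the existing handler/service split; no quick fixes in the view.",
        "Remove any debug statements before finishing.",
    ],
)
print(prompt)
```

Each atomic TODO from the list becomes one call, which also makes it easy to start a fresh session per task.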
I look forward to that blog post!
Two years later, and I'm running Qwen3.6 35b agentically to develop the start of a repository and automatically run tests to then improve on itself. I never thought we'd get here so quickly with LLMs back then.
I'm pretty sure in two years we'll have current Opus-like quality in the 30-100b parameter model range. But at that point, Opus 6.3 will reason along for us so much better still, that we'll still look at those models in awe. It's great to look ahead, but let's not forget to appreciate how effective the current local models already are :)
Now, there's a bit of a degree to which some of the open source models do some benchmaxxing, and bigger models with more params may always feel like they have more depth. But anyhow, right now you have something that is arguably comparable to Claude 4 Opus on your laptop. I can't really compare myself because I never used it. It looks like Claude 4 Opus is still available on OpenRouter, so you could try it out and compare yourself if you're interested.
It will likely always be the case that there are proprietary cloud models that are more powerful than what you can run on a laptop. You can just do a whole lot more with terabytes of VRAM on multi-GPU clusters than you can do on a laptop. So for folks who must have the most capable, you're probably not going to want to leave Anthropic.
But right now, the models you can run on your laptop are comparable to the cloud models that were popular when vibecoding and Claude Code first took off.
OpenAI was offering 2x usage at one point and I still used opus just because it's so much more effective.
More and more specialized and ultra-performant chips are going to flood the consumer market, especially once new hardware foundries start producing (well, if we don't die from WW3 in the interval).
Ten years from now, when even basic computers have 128 GB of memory and phones have super-optimized tuned models, what will be the point of Anthropic?
Just use Gemma/Gemini/Siri or whatever.
Pornography and uncensored models are also pushing toward local models.
It's not like people's needs grow exponentially; they follow an asymptote instead (they are capped).
The real revolution is offline robots and self-driving cars, but LLMs are already quite maxed.
For programmers, now, what Anthropic offers is like 3% improvement on a known test (like this pelican riding a bicycle), or on questions leaked from benchmark insiders.
It's OK but not revolutionary (Fable was better but it was unusable, easily 20 minutes per prompt due to overthinking).
https://ikyle.me/blog/2026/how-to-setup-a-local-coding-agent...
One thing I did change was the context length to 256k rather than 64k.
I've read a bit on what the various components are. What I don't see in your comment is what you're using to run your model locally. Ollama?
And then also, sometimes the tool call errors are because of something like a file was changed out from under it; the larger model is probably going to do a better job of figuring that out and fixing it up.
Finally, in Pi, you can always just use the /tree command to skip back to before a series of failed tool calls, with a summary if you want to let the model know what happened. The Pi /tree command is pretty powerful in managing your context
I'll experiment more with the effectiveness of AGENTS.md rules for local Pi agents. I feel like smaller (local) LLMs just lack in attentiveness to elements in the context window, like precise instructions, compared to e.g. Claude models.
Full octane isn't gonna fit on much of anything south of a 128GB machine once adding KV cache.
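A rough way to see how much the KV cache adds is the usual estimate for standard attention: one K and one V vector per layer, per KV head, per token. The sketch below uses placeholder dimensions (48 layers, 8 KV heads, head size 128) that are assumptions, not the real values of any model in this thread, and models with hybrid or sliding-window attention will need less. The per-value sizes are 2 bytes for f16 and 34 bytes per 32 values for q8_0.

```python
# Bytes per stored value; q8_0 packs 32 int8 values plus one f16 scale.
BYTES_PER_VALUE = {"f16": 2.0, "q8_0": 34 / 32}


def kv_cache_gib(n_ctx, n_layers=48, n_kv_heads=8, head_dim=128,
                 k_type="f16", v_type="f16"):
    """Estimate KV cache size in GiB for a plain full-attention transformer.

    The default dimensions are hypothetical placeholders.
    """
    values_per_side = n_layers * n_kv_heads * head_dim * n_ctx
    total_bytes = values_per_side * (BYTES_PER_VALUE[k_type] + BYTES_PER_VALUE[v_type])
    return total_bytes / 2**30


for ctx in (65_536, 262_144):
    full = kv_cache_gib(ctx)
    mixed = kv_cache_gib(ctx, v_type="q8_0")
    print(f"{ctx:>7} tokens: f16/f16 = {full:.2f} GiB, f16 K + q8_0 V = {mixed:.2f} GiB")
```

Under these assumed dimensions the cache grows linearly with context, so going from 64k to 256k quadruples it, on top of the model weights.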
[1]: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF
We truly live in the dumbest timeline.
I don't want to be rude, but your LinkedIn has a sum total (generous) of like 8 months of programming as a profession (job title is AI Engineer). The rest is at best programming adjacent. How would you know what either of these situations are really like?
You are paying for the extra power draw.
Every month I research this and come to the same conclusion: the time, effort, and cost required to get local models (and the coding tools around them) to perform even close to Claude Code with Sonnet/Opus is just not worth it right now. If it was, it would be disruptive enough to be in the news.
Not that I'm discounting that someone has already solved this; I'm just trying to Occam's-razor my way out of diving too deep down rabbit holes.
The present Sonnet/Opus versions (~4.8) will likely be what everyone in the enterprise might end up using eventually. And even though local models aren't there yet, there are budget alternatives from the families of DeepSeek, Kimi, GPT, MiniMax, etc. available through the APIs of NVIDIA, OpenRouter, Groq, etc. which are very much Sonnet grade.
Personally, I don't think we're at that point yet. While I do think model improvement is starting to plateau (reaching a local ceiling), I'm not convinced local models are as good as sonnet/opus yet. The gap is still too much. But I'm excited for those models to reach those levels.
I think it very much remains to be seen whether e.g. tokens per second (multiplied or whatever by perceived quality of private model) actually means "better or more useful output."
I strongly suspect it does not. (though I also strongly suspect this will be very difficult to measure because the incentive to lie about metrics here will be so strong.)
What I’m saying is that if local models were actually comparable to Claude Code in practice, we wouldn’t be having threads like this. It would be obvious to the people using them, and it would be massively disruptive. Why would individuals and companies pay hundreds or thousands for Claude Code if they could run something locally and consistently get similar results?
Every month I revisit the local ecosystem hoping the answer has changed. So far, my experience has been that it hasn’t.
It's entirely possible Claude is just winning the hype game.
At my current pace it would take me until sometime late 2030 to spend the same amount in gpt5.5 tokens.
Remember that there are other LLM providers, open models, and previous-gen models that are way cheaper than frontier Claude and still way better than what can realistically run locally.
With a layered approach we can slowly shift to running more locally and still get required work done. Really, my local setup is so much better than it was 2 months ago, and extremely better than 6 months ago - on the same hardware.
Eventually I think it will even out but right now the hosted stuff is very subsidised.
If you truly believe that it WILL get there within the next couple of years, then you might as well start playing with it now (and, yes, you will be very surprised, especially for shorter/smaller projects or nicely modularized larger projects)
It's not really a bitter lesson here, I can scale those 4B models easier than someone can scale their 1000B models.
Claude Code is not Claude Opus/Sonnet/Haiku.
Probably the biggest improvement was including a backend-for-agents service definition which instructed the schema agent to produce only a manifest based on the task, and to pass that off to the next agent.
In short, I split tasks up into many pieces by defining a workflow where agents are only allowed to do very specific things before their work is passed along. This keeps them grounded and capable while also creating places for me to intervene if a workflow was say 25% or 90% successful.
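The workflow described here can be sketched as a chain of narrow stages that hand a manifest along. This is a guess at the general shape, not the commenter's actual setup: the `Stage` and `run_workflow` names are hypothetical, and the lambdas stand in for real agent calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Stage:
    name: str
    run: Callable[[dict], dict]  # takes the manifest, returns an updated copy
    allowed_keys: set            # the only keys this stage may add or change


def run_workflow(stages, manifest, review=lambda name, result: True):
    """Run stages in order; return (manifest, name of stage where we stopped or None)."""
    for stage in stages:
        result = stage.run(dict(manifest))
        changed = {k for k in result if k not in manifest or manifest[k] != result[k]}
        illegal = changed - stage.allowed_keys
        if illegal:
            raise ValueError(f"{stage.name} touched forbidden keys: {sorted(illegal)}")
        if not review(stage.name, result):
            # Checkpoint: hand back the last accepted manifest so a human can step in.
            return manifest, stage.name
        manifest = result
    return manifest, None


# Stub stages; a real setup would call a model inside each `run`.
stages = [
    Stage("schema", lambda m: {**m, "schema": ["users(id, name)"]}, {"schema"}),
    Stage("code", lambda m: {**m, "code": "CREATE TABLE users (id, name);"}, {"code"}),
]
final, stopped_at = run_workflow(stages, {"task": "add a users table"})
print(final, stopped_at)
```

The `review` hook is where a partially successful run can be paused and repaired without rerunning the earlier stages.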
Like "Here's this consumer grade GPU. Using only this GPU but with whatever models and workflow you want, see how well you can do on xyz benchmark."
Contestants would be given like 1 hour max and scored based on % of questions answered, % of questions correct and total time to finish.
Like "The Local AI challenge"
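A scoring function for such a challenge could look like the sketch below. The comment only names the three measurements, so everything else here is an assumption: percentages are taken over all questions, runs over the one-hour limit are dropped, and ties are broken by coverage and then by time.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    answered: int
    correct: int
    seconds: float


def score(entry, total_questions, time_limit=3600.0):
    """Return the three metrics for one entry, or None if it ran over time."""
    if entry.seconds > time_limit:
        return None
    return {
        "name": entry.name,
        "pct_answered": 100.0 * entry.answered / total_questions,
        "pct_correct": 100.0 * entry.correct / total_questions,
        "seconds": entry.seconds,
    }


def rank(entries, total_questions):
    """Order by % correct, then % answered, then fastest time (assumed tie-breaks)."""
    scored = [s for e in entries if (s := score(e, total_questions)) is not None]
    return sorted(scored, key=lambda s: (-s["pct_correct"], -s["pct_answered"], s["seconds"]))


entries = [
    Entry("a", answered=80, correct=60, seconds=3000),
    Entry("b", answered=100, correct=60, seconds=3500),
    Entry("c", answered=100, correct=90, seconds=4000),  # over the limit
]
for row in rank(entries, total_questions=100):
    print(row)
```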
Some think the human brain works similarly: thousands of mini-brain cortical columns, each with a slightly different take on the situation, voting in a majority-rules system.
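Applied to several small models answering the same question, majority-rules voting is only a few lines. This is a generic sketch, not something any commenter here described running; the answers are hard-coded stand-ins for model outputs.

```python
from collections import Counter


def majority_vote(answers):
    """Return the answer given by more than half of the voters, else None."""
    if not answers:
        return None
    winner, votes = Counter(answers).most_common(1)[0]
    return winner if votes * 2 > len(answers) else None


# Stand-ins for the outputs of five small models on one question.
print(majority_vote(["42", "42", "41", "42", "40"]))
print(majority_vote(["a", "b", "c"]))
```

Returning None when no strict majority exists gives the caller a natural point to escalate to a larger model or a human.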
As for the question you’re likely asking: benchmarks that include speed across many models and providers available at various places e.g. https://artificialanalysis.ai/leaderboards/models