With Claude, you sometimes want to under-specify or phrase things more indirectly to give a color to the implementation or elicit something creative. Also (you might raise an eyebrow at this) being nice to Claude will be rewarded and being mean to Claude will be punished. Claude tends to mirror your tone more aggressively and you don't want to get into negative loops with it.
With GPT, you have to be precise and reduce ambiguity. GPT will often try to resolve ambiguity in a min-max style "I'm going to do X, but make sure it is not quite Y". It will tend to be more paranoid and overengineer to catch all edge cases if you don't tell it precisely what the scope is.
With Qwen, you have to give it a shape and let it fill it in. Qwen likes XML, JSON and lists. Qwen likes to be shown a bunch of examples of previous work.
This is not scientific at all, just vibes, YMMV.
This is the problem.
I would love to have a product sheet showing what each models strengths an weaknesses are, so that I can have a clear decision tree of "if this kind of work, use model X", or "model Y should be used in ways Z". But they all look the same from the outside and the only way to figure out which might be marginally better at what is to do extensive, time consuming, and perhaps expensive testing.
Think of it less like a static tool, and more like a human helper, where the same holds.
The idea is that benchmark score comparisons are useful for a large cross-product comparison across models + their settings, but less useful if you're looking for the best model for <your-specific-task>. So on this site, each model gets its own page showing the list of tasks that people have rated it on, and the score out of 10 for each task. Common tasks, like coding, will likely be on most/all models, and more niche tasks may only be on a few. It is human moderated (by me only right now).
The corpus is pretty empty right now, so please spread the word if this seems like a useful idea!
Realising this made me respect the "I" in "AI" a bit more seriously.
This presumes that the labs themselves know how well their models perform. But all they have are overtuned benchmarks and hype vibes.
They do not test how models perform when used interactively, like most of us do.
Maybe we need better reviewers then?
I recommend everybody do this because you don’t need any special data except what you are already using, and the results will be very eye opening: there is WAY more randomness or instability involved than you would otherwise assume. A lot of what you might think is a better prompt technique, or a particularly good or bad outcome, could just as well be random chance or just different behaviors across model version or sizes. And your results can be massively biased by small differences in input. We’ve been calling some of these “magic words” at work, specific technical terms or references/techniques that you need only mention to get vast improvements in outcome.
There’s a skill to it. With agentic loops if you get the model into a self-eval structure where it’s hard to cheat or take shortcuts, and it’s in the right structure or domain that models its training, you’re golden. But it’s hard to find the sweet spots (pro tip, have Opus 4.8 convert PyTorch models into ONNX or quants or get them running on different hardware, I swear it was like I activated some kind of savant-like skillset; meanwhile I can’t for the life of me get it to properly write/test EBNF formalizations of common languages and formats without cheating).
The worst part is that it changes so much so frequently that it’s almost useless to really go digging for this kind of knowledge unless you’re actually the one training the models. I wish this kind of “stability” in output was more emphasized in their training so they’d be predictable. I assume it’s hard to do without overfitting or breaking the explore-exploit loop but also, I would spend so much more on LLMs for batch workloads if they could do them more reliably…
It can be frustrating. The AI pretends to be a human, and so a part of my brain expects them to commit and have a "parti pris" like a human, so the exercise is a good reminder of the feedback loop. My mental model is that before the first three or four messages, the model has many finer points of its personality still underdetermined. I'd suggest that as the mechanism for "role-based prompting". And it explains the "savant sleeper agent" thing you describe. You want to get the state in the right attractor on the manifold.
These machines are pretty incredible, but for conversation-driven workflows you really have to be in the driver's seat. A human has a property that the AI does not have, at least under current architectures: we are regulated by the outside world. A bit of a tangent, but I can see how AI psychosis arises from these dynamics.
I'm surprised how little attention this is getting in today's comments. Open-weights means being able to afford multiple runs, space sampling, critique and synthesis.
Last night, sketching some intro biology content emphasizing cross-cutting concerns, Qwen would get sucked into the Next Generation Science Standards attractor. But nudge it by adding just one similar phrase of Chinese, and most runs ran free (outputting English but for headers with parenthesized Chinese). The multi-lingual LLM "no, not *that* region of latent space".
> We’ve been calling some of these “magic words” at work, specific technical terms or references/techniques that you need only mention to get vast improvements in outcome.
Any chance you could share some of these? Seems like something we could all benefit from.Or it is more like playing a slot machine and you imagine the rest.
Maybe it works some of the time but it isn't a solution that works everytime.
It reminds me of people hovering to play a slot machine when someone gets up and it hasn't paid out as if they've solved slot machines.
While I don't mind putting something in a loop until the tests pass, I'm less comfortable doing that when providers are silently rerouting to lower quality models, or in Google's case burning quota faster to ease their own server load without being transparent about what the "standard limits" are to begin with. [1]
I'm hopeful I'll be more comfortable with these "slot machines" when frontier models get to the point where they can be run locally on hardware I can actually afford so I know exactly what I'm getting and not jumping at shadows with providers playing tricks behind the scenes to ease their own load without admitting the customer is getting less for their money as they get more popular.
[1]: https://support.google.com/gemini/answer/16275805?hl=en&sjid...
What I do know absolutely for sure is that LLM benchmarks are not to be trusted, they are just a minor indicator and real world usage is often very different.
With a squillion dollars at stake per bench point, someone will have figured out a plausibly deniable way to game these benchmarks.
> With GPT, you have to be precise and reduce ambiguity. GPT will often try to resolve ambiguity in a min-max style "I'm going to do X, but make sure it is not quite Y". It will tend to be more paranoid and overengineer to catch all edge cases if you don't tell it precisely what the scope is.
I agree with all of this except for one thing: I swear to god, being mean to Claude at the right time can be enormously effective. The F-bomb in particular seems to really help it snap out of ruts sometimes.
It's hard to decide when to use the best tool for a job you are aware of to ensure throughput and when to spend time experimenting with a new tool to learn what it's good at.
I asked Claude Sonnet 4.6 for the same thing, and Claude's version was like if the game had been written in JS originally.
Also, for some reason it made it a single HTML file, removed all assets, dynamically generated graphics and dynamically generated music. It also gave me a new, better background.
This surprised me, since it was not what I asked for. I just asked it to port the game.
I was pretty pleased about the choices it made, but I'm not sure how to turn that behavior on and off. Sometimes you want it to be creative, sometimes you want it to actually do what you said.
You have to be a lot more explicit but it’s hard to know a priori what decisions it’ll make. A good idea is to run it in plan mode so you can read those decisions before it sets out on a path and have an opportunity to make corrections.
Instruments present a clear interface to a user, have predictable outputs, etc.
The only comparison that might work for me is that LLMs are very bad instruments where you are constantly forced to negotiate its idiosyncrasies in order to massage the output you want from it, and even then there is enough randomness that trying to do so is almost a fool's errand.
This has been my experience with most models. If you say "How do I do X? I was thinking maybe Y or Z" then the model will probably try to make Y or Z work. They will very likely not say some third option that is wildly different is better, even if it may be. And actually maybe less so with Claude because sometimes it pushes back.
Actually this seems like it would be an interesting test. Maybe I will come up with some contrived question and ask several models.
IME Claude is the most "creative" of the bunch, you can get surprising ideas out of it that were kinda tickling the back of your head but didn't really connect.
BUT it's also "relentlessly proactive" like simonw put it. It _will_ get the job done, it's the smartest idiot in town. Why use a library to parse $format when you can just write a custom 1000 line parser? Or if it can't access something, it'll pursue the goal of accessing it in the most creative ways - instead of stopping, asking the user "yo, can you give me access to X" and then continuing.
My solution is to use Claude as a pair programmer. I _very_ rarely just do /goal fix this shit, I watch what it does and interrupt if it gets to the "smart idiot" phase. Also I communicate with it like I would a coworker, never had it berate me or get combative. There's a Finnish proverb for that too[0]
As for Codex, Deepseek, GLM, those I use when the goal is 100% clear like "convert this Brewfile to a list of packages for Arch and Debian, use these two Docker containers to test that pacman and apt work correctly". Boom, done.
But I won't give any creative open-ended tasks to any other model than Claude.
[0] https://en.wiktionary.org/wiki/niin_mets%C3%A4_vastaa_kuin_s...
On one hand I’m glad Anthropic is only just now starting to get into infrastructure because it means there’s opportunity there, but it’d be great for their models to be more knowledgeable or able to seek out that knowledge on their own, or for the UX of Claude code to be more amenable to launching 5 in parallel and picking the best one, so I don’t have to spend time arguing with a robot. I think there’s a much better balance to strike between just charging ahead towards the goal at all costs vs being lazy and pushing everything back up to the user. Basically they write too much code that’s too contingent/brittle outside its exact current context and don’t do a good job distilling out the essence of the problem “cleanly”. Almost all of them are like this right now, it’s partially a problem with long-range planning but I think a real bias from over optimization for certain RLVR outcomes vs others.
> this is phenomenal work, genuinely! I feel like you read my mind! <next instruction here>
can go a long way.
of course, I would only say that when I mean it, because Claude can get superficial and cut corners which is why I prefer GPT for raw implementation.
Classify under non-reproducible artifacts of LLM generation.
this is what 'tokens are commodities' and 'there is no moat' people miss. the models are in general not easily swapped out. you always have to run evals before you can swap them around, tune prompts etc. even minor versions of models from same providers need this process.
there was something on HN a few weeks ago about how most/all models perform better the more rude you are to them.
(i still say "please", i can't help it)
... That does sound like something that Anthropic would deliberately aim for, yeah.
> With GPT, you have to be precise and reduce ambiguity.
I have found that it occasionally makes a wild misinterpretation, that makes a bit of sense in retrospect given how I worded something but is still surprising.
It also sometimes tries to loop in and tie together ideas from earlier in the conversation that really shouldn't still appear relevant. But that might be a general LLM thing.
Fable seemed less apt to do so but I didn't get enough time with it before it was yanked away to know for sure. It may have had mixed results on the benchmarks but it was finding bugs opus never found.
The one thing I feel it seems to under estimate is the likelihood of improvement. Even the authors acknowledge it's not even worth comparing local models from a year ago to what we have now. In fact, people widely see Opus 4.5 in November last year - 8 months ago - as the first time agentic coding became viable broadly viable even with frontier hosted models.
So why would we lock in hard on any concept at this point of what a local model is and isn't good for? Whatever it is right now, it probably won't be that in a year. It might be naive optimism to think we'll ever get to long horizon tasks with models that run on consumer / pro grade hardware. But so far the naive optimists are winning.
You can try it by using Opus through Github Copilot vs official Anthropic tools. You'll get very different results and experience (in my opinion).
I have found claude models, especially fable, to be impossible to work with when the work requires reading papers from days ago and reasoning on top of the findings in it. I have multiple long sessions with opus (not as many with fable as it got taken down quickly) where it keeps fighting me on problems, sayings "that's not how it works" / "that is not possible", followed by me linking the paper (after i've told it to actually read up on the latest research in this field), and it hits me with the usual "You were right.". If your workflow is using the exact tools, frameworks, git layouts that claude expects, it can be magical, yes. But it is very heavily optimized to never say 'I am not sure' (as that gives 'bad vibes') and instead lean on its (nowadays with the speed of things DOE) knowledge to formulate a reasonable sounding answer, dissectible only if you already know the answer beforehand (which defeats the purpose of using it in the first place).
Qwen3.6 27B (the only <100B model worth looking at in my experience) is dumb, knows it, and will fight tooth and nail to complete the task it was given, gaining the needed context (online or file-wise) in the meantime. If you mention it should read papers, it goes and reads a pile of papers. If you tell it 'implement MCP in my app', the result will (probably) be catastrophic. If you instead describe where the feature should sit, how it should handle edge cases, what use cases it needs to attend to, and to first look online for reference implementations, it does it and does it well.
Knowing what is in context, what should and shouldn't be there, and how to manage it for the specific model you are using (as every model, even in the same family, behaves differently to differently worded prompts) is what makes or breaks them. They are just auto-complete, they complete text based on what is already there, it's not magic.
So yes, while this small open-weights models are not opus 4.5, it's good precisely because if that, because it is a good tool and a bad 'coworker replacement'. If you want the latter, kimi is already there, it has started to not believe the user and do what it was taught just like claude models (which is helpful when you don't care about implementation specifics or performance/security). GLM models (mostly 5.1, i haven't tested 5.2 extensively yet) have fixed a lot of low-level programming issues I've had that opus just walks in circles and writes reports that "it doesn't/can't work". That is to say, open-weights, in many cases, have already surpassed Opus. I can't comment on gpt 5.5, but while I used 5.4, it also performed a lot more tasks without being fussy than opus 4.6/4.7.
It's like buying a car: I drive that car and get attuned to its characteristics; I don't think how that car (or similar cars) may improve. That's my tool and I want to make the most of it.
It is true that switching a local models it technically very cheap, but there's a considerable time investment in squeezing the most out of it, which may not work on a newer version of that model.
Where they shine is in your ability to control them, their privacy, their predictability (e.g. if you are doing a repetitive task, like classifying your photo/video library), and depending on your energy bill - their costs.
It would have 99% reliable tool calling - and most importantly - the ability to go "this task is beyond my skills" and refer to a Big Boy Online Model in a gigantic datacenter somewhere.
This way all of the simple stuff would be done on-device, gathering data, figuring out the context of the problem etc. And when that's done, the "smart" model would come in to work on the issue when all of the easy stuff is already done.
It feels super stupid that my /commit skill calls an online model when that is something a local model can 100% do. Mostly this is a harness issue though and mostly solvable.
Qwen 3.6 27B can do that today, but setup properly and in a good quant, I run an autoround [0] with weights in int8 and attention heads in f16 on a single RTX 6000 Pro Blackwell Max-Q via vllm with mtp=2 and full context, --max-num-seqs 3, KV in f16, mamba f32.
>It would have 99% reliable tool calling
I managed to score 93/100 in tool-eval-bench [1]. For me this is very good already, at least in the pi coding harness I've never had an issue that wasn't auto-fixed in the next turn(s).
>the ability to go "this task is beyond my skills" and refer to a Big Boy Online Model in a gigantic datacenter somewhere
This is heavy on the harness engineering side I think, but also quite contrary to the nature of LLMs today. If you figure this out I'd love to know.
[0] https://huggingface.co/Minachist/Qwen3.6-27B-INT8-AutoRound/...
They really are fantastic for a lot of use cases and I think most people do not need SOTA. When I run that qwen model in my measly 4070 12 GB for my personal email agent that I build and experiment with, I need privacy more than anything else. It does a great job. Even for coding tasks, given you know how to use them instead of dumping a grand plan, it's great.
SOTA can code but can also prove theorems and teach you about music theory or ancient Greece's substrate language or botany. Speaking in tens of different languages. I wonder how many hundreds of billions of parameters can be saved just by removing much of the general knowledge parts while keeping logical and programming abilities the exact same.
https://github.com/cptskippy/battlemage-llm-gateway
Opencode has been a huge productivity accelerator. I have two Hermes agents that I'm training to support my workflow with pretty good success. One is a personal assistant who manages my backlog and keeps me on task, follows up with me on items, and will put together research briefs. The other I use a general purpose coder and research and it's about 50:50 with the tasks I've given it. In fairness though, the task it failed at left me scratching my head to figure out as well.
How many tokens/sec do you get with 27b? Are you using MTP?
Hermes reported 18.45 tok/s consuming the llama-swap endpoint across the wire. Locally I got 19-19.1 tok/s on the gateway. I'm running the Qwen 3.6 27B Q6 model (qwen3-6-27b-q6-k) off LM Studio and it's less than 0.3s to first token.
It's not good for conversational use cases as it can take 1-2 minutes to respond to a prompt.
I have two Hermes Profiles running, one is a personal assistant that manages my backlog and provides me morning reminders, solicits for evening updates, and will run overnight research projects for me. The other profile is a coding helper for personal projects. I can ask it to make changes and it will churn for 15 minutes, submit a PR, and notify me that the PR is ready to review. It's faster than me at basic coding tasks.
So in terms of raw power Nvidia is effortlessly still king, but in price-to-capacity Intel is best in class.
Intel's Battlemage GPUs also natively support SR-IOV and GPU partitioning which allows you to isolate workloads. This is useful in homelab environments if you have workloads that benefit from GPU acceleration. I was able to split the B70 into 4 virtual GPUs and hand them out to Frigate NVR, Plex, and other workloads.
I am not sure whether you can find those in stock anywhere.
> These products use very low level Linux primitives like containers, Kubernetes, Firecracker microVMs, and networked protocols.
Out of anything that is a "low level linux primitive" I could maybe argue that networking? protocols fit the bill.
And it's obviously fully AI-generated! Which I wouldn't even care about if I could actually trust the content, which I can't!
The post is not AI generated, I use AI for code generation and write my own articles.
Which part of the post are you struggling with? This is a post describing our own experience and journey. Happy to back up any specific claim.
What model are you again?
I actually find this somewhat interesting, because it seems that a lot of people who weren't comfortable with expressing themselves verbally are feeling more empowered in that area. We're hearing new voices for the first time, albeit heavily-filtered ones, and I have to believe that's a good thing.
But part of me still finds it offputting for some reason. It's interesting to think about whether that's more of a "you" problem, or more of a "me" problem.
IME vLLM is quite a bit faster than llama.cpp but where it really wipes the floor with it is in batching concurrent load. The downside is that it is dramatically less flexible in terms of tweaking. It gives you very few options for running quantized weights. It takes a lot longer to start up because it optimizes the compute graph. So for single user experimentation on a model that's a bit too big for your box, vLLM is just going to be frustrating.
Dismissed is a strong term, but let me give you some more details.
It took a good 4 minutes plus to load up on the 2x 3090 rig, and served a single request 3 tokens/second slower.
And the worst bit? With all that work - setting it up and tuning it - it still looped. I was hoping "use just vLLM" advice that we get touted everywhere was the silver bullet.
The only thing I'd caution here is that we don't start bashing on llama.cpp like people did with Ollama. It's a very capable tool and for the use-cases we actually want the card for makes more sense.
For a large team replacing their Claude Subs perhaps vLLM is the only option, but you really need to add about 5 more RTX 6000 cards into the mix, so you can load something like GLM 5.2.
That's not _nothing_, but it's pretty close to nothing, and for the prosumer crowd it edges towards "just gets in the way".
They are similar, but for different use cases.
later:
> My latest experiment was setting up vLLM (the gold standard for production and concurrent serving) and even with an NVLink (175GBP) and tensor parallelism turned on, it was 3 tokens/second slower than llama.cpp during generation for an equivalent setup.
In all my tests, getting vllm to run is worth it. It was the single biggest thing, that helped for looping issues, agents going whack and losing focus on the task, long context being essentially useless.
FP8 model, unquantized cache in vllm an you have a league better overall experience, with any other stack I tested. Then, you can actually focus on using the model for other things and stop tinkering with settings.
My weekend project is going to be building a home inference server (from ancient datacenter parts) and I'm still massaging in my head what the end result will be.
https://github.com/noonghunna/club-3090
You can find info about running a patched version of vllm for 1x24gb, 2x and 4x. There's also quite a few "blackwell" subreddits, where people seem to share a lot of substantial information, if you're going the 6000 route.
- tool calling
- code base exploration
- anonymizing / abstracting your request
Such that your local AI communicates to frontier model like an expensive consultant giving high level advice.
I think due to the lower latency of a local model that this could be faster.
> Local models can quickly read and explain codebases, even if they can't write them - this is a superpower
Might have been buried lower down.
And yes latency of local on a fast card with MTP enabled can be blistering 130-200 tokens per second sustained at full context on Q5. About 100+ on Q8.
On tool calling
> Agent Skills can help immensely - we had a local agent set up Slicer completely from scratch on a new mini PC. It even gave feedback on the usability of slicer CLI which we integrated
There's a link to a post showing some examples.
Occasionally, we'll also have the local model _review_ the changes of GPT/Opus - and it can return duds, but also insights the larger model overlooked, or was too intelligent to pick out.
So yes - absolutely blazing fast at understanding a codebase, very good at running skills "cheaply" and could be used with larger models as a "helper" / sub-agent.
In every way, the cloud products from the big two seem optimised for speed and speed of initial response even.
I don’t think most people are running local models for speed. More for control, privacy, interest, bloody-mindedness and general principle.
IMHO, the author could have done two things better:
- vllm instead of llama.cpp. With NVIDIA HW, there is huge difference in multi-user loads and caching with vllm; when he was complaining about what happens when more than one user uses the model, and about losing caching, I was "well, duh".
- The budget he used for a single card could have instead be put to far, far better use with SPARKs. I have access to a cluster of 2 x GX10 - total cost less than half what he paid, even today - and I am running vllm and Deepseek v4 Flash. The difference compared to any Qwen is tremendous - I've NEVER seen it loop, and in all my experiments so far, it's the most Sonnet-y model I've ever tried (antirez seems to agree, hence his ds4 fork).
If you're wondering about how I set it up in the 2 GX10s: https://forums.developer.nvidia.com/t/deepseek-v4-flash-offi...
Performance: 2K t/s prefill ( very useful for feeding tons of source code into its massive context window ) and around 50-60 tg/s in my coding sessions in the pi.dev harness. With the money the author paid, he could have bought 4 GX10s, and double both numbers ( vllm basically scales almost linearly with tensor parallelism ).
It's the right call for concurrent batched serving (barrkel's point downthread is spot on), but for how we use it llama.cpp is still better for us.
The Spark/GX10 route is a genuinely different bet though and appreciate you sharing your numbers. At the time (several months ago) the consensus was that GX10s were for fine-tuning only, and the numbers were severely low.
..and the card was never about replacing a Claude Max sub. For the workloads we actually bought it for, it's giving us 140-200 tok/s (which matters).
But mostly I wanted to raise awareness to readers of your article that no, if you want to do inference, paying 15K for a single 96GB card almost certainly makes no sense. Buy 4 GX10s with the same money, and enjoy dramatically better models and user scalability.
Regardless - thanks for putting the effort to share your findings! I keep postponing doing the same... there's tons of things everyone is re-discovering on their own.
I do however now know that they're a totally cool dude building stuff physically and as software + that other people give them money for it.
Does that have anything to do with the topic suggested by the headline? Not sure.
Is it bad software? Idk. Probably not.
Should you treat it as a grassroots Foss thing maintained by fellow sane hackers? No sir.
ChatGPT and Anthropic will never, ever get me to tie my Health Data to their systems, but I still believe in the capabilities of AI in identifying patterns from data I would otherwise overlook, and sorely want a local-only ecosystem where I can expose this data safely, privately, and securely to something like Qwen or Gemma for processing.
Same goes for Smart Homes, and Personal Assistants. The corporate approach of letting Company A access your data stored at Company B and processed by Companies D and E while also sold to Advertisers and Data Brokers with no way for you to extract or view it on your local hardware - just isn’t tenable for these sorts of intimate use cases. I want my data to be owned and controlled and exposed on my terms, to be used to improve my life first rather than someone else’s bottom line. I want technology to give me back more of my time and improve my outcomes again, and I’ve been burned enough by Big Tech in the past that I flatly reject any presumption of nobility or public good from their AI-as-a-Service business model.
The capability is there, and I definitely think the folks working to build local tooling that supports and unlocks the potential for local models are the ones in the right. I love seeing what they build.
The strength with AI is with open-source models. We need to keep away from vendor lock-in and use models that allow both local usage and hosting by independent providers.
How big of a deal is looping, practically? Or, I mean, I see thinking models loop occasionally. But it seems to me that every token in the loop should be in the KV cache already, is there really no way to either power through a loop because of the 100% cache hit rate, or identify that you are in a loop that way? (As a human, when thinking hard I sometimes loop, but it is easy enough to identify…)
The cache only makes generation fast, it doesn't influence what gets chosen next. The loops that hurt the most (point 2 below) are when the model re-decides to do the same thing in different words, which is much harder to detect automatically. We're experimenting with repetition penalty and turning thinking off to solve for the 1st kind of looping (below)
2. On "why is looping a problem" for us
Practical example, which I covered in the post: "add --json to every command that does a get or list in faas-cli" - this was a small-ish, open source CLI written with Cobra a very common framework.
If I send that to Claude (any of their models) or Codex (GPT), I would have a fully working solution the next time I opened that terminal - a few seconds - a few minutes.
With the local model, when it loops, you get some progress and start working on something else. Come back, maybe even 30 minutes later and see it's been printing the same 5 lines over and over constantly.
Trust is important for a tool like this, that eroded it.
The other type of loop I mention in the blog post is "unable to solve it" loop - Han ran into that more.
"Oh I need to fix the indent from 8 to 5 characters in main.py" "Wait I don't know how to write Python code" "Oh now it's broken and I don't know what to do, maybe I should stop" "Let me edit ... " etc, etc
https://x.com/ggerganov/status/2067539416436867230?s=20
We also use it with 200-256k (native) context length.
The issue could be that folks that don't see looping aren't pushing the model as hard, or as enthusiastically.
We also had far fewer issues when thinking was turned off, than with a reasoning budget capped at 2048.
Some fine-tunes like Qwopus-Coder just seem prone to looping - google it, you'll see plenty of reports, even on Reddit.
For what it's worth seen the RTX 6000 Pro loop even at fp16 on the KV cache - and with vLLM.
And ZDR is still data sharing with a third party. This is the essence of an enterprise agreement, it's not allowed, even if they pinkie promise not to store it.
If your customers allow you to share their data with third parties, then ZDR may be an option for you. I am not a laywer.
Where I see ZDR as being more relevant is in protecting your employer's IP - not allowing a missed setting to mean AI labs can train, retain, and publish/resell your work. It's what we'll consider when the subsidies stop being available - open-router, ZDR - but for coding - not for customer data. Very important distinction.
35B-A3B is what we started out with in the days of only having the 3090, but the quality is not as good, and the speed from the cards we have now can blaze at 130-200 tokens per second of generation with q5 and a full context in fp16.
Not to say that MoEs don't have their place. For people running on unified RAM, they're sometimes the only viable option due to the slowness of dense models.
Why is a dense model slower? All model weights have to be loaded and exercised. Passing through 27B vs 3B (active) is maths. So yes you will always get more tokens per second of generation.
You must (just as we did) evaluate on your own products and daily work. If the MoE gives the results you need with only 3B parameters then you have your answer.
Not prescriptive at all. This is experience based, from the trenches of a actual software business so hopefully a different perspective for folks than "Ran Qwen on my macbook, generated a great python script for me"
> One of the cards would only show up if I crossed my fingers when turning it on. Even reboots wouldn't cure it - I had to A/C power off and remove the power cable each time for 30 seconds.
This is ridiculous. Of course we are living through supply crunch, but that card is clearly defective hardware.
The RTX 3090 in question was used from eBay, no way to return it. The RTX 6000 Pro is the "new card" in question here. The 3090s remain an interesting playground for testing things like VFIO passthrough for SlicerVM and other models whilst not interrupting people on the newer card.
In the end, the most stable fix I've found is to install the older proprietary driver and disable the GSP firmware. Have had no issues since.
So "clearly defective hardware" seems like it may not be quite correct. And the thing that kept me coming back - along with not having a suitable replacement - or having to gamble on eBay again was the reliability once it showed up in nvidia-smi.
I don't follow how it supports the decision of buying the card, I would even say using online SOTA models would had caught it earlier without usd12k and monthly electricity being spent
As explained in the post - the 3090s were what were the test bed that proved the investment was worth it. Customer support, architecture reviews, telemetry to check license compliance. None of that could be done with online models. The amount of time we can spend going backwards and forth with enterprise customers over email can really amplify costs to our team. A few actual issues we found and fixed were listed on the linked blog post: https://www.openfaas.com/blog/painless-support-with-diag/
Having recovered revenue using it in an airgap, to preserve data agreements was more of a cherry on the cake. No need to worry about the investment, it's covered itself.
Hope that helps.
Uh, so, yeah. Im running local Qwen, but Qwen3.5-122B using Krasis https://github.com/brontoguana/krasis
Its by far better than Opus.
In fact with a phone migration, I was using an OLD android 2fa app "andOTP". Backup files it emitted were JSON but not any sort of standard.
I needed the standards version using otpauth:// to upload in my current 2fa. And gave it to my local qwen3.5-122b.
It responded with a scary "you uploaded credentials to a public instance LLM! And, it emitted standards compliant URLs. The new app "Tokn" ingested just fine. When side by side was tested, everything was 100% correct.
I coukd have did it myself, but it was a one-off. And asking local Qwen worked perfectly. Took like 6 minutes. Would have taken me 1h.
To make 4-bit fit on one card with reasonable (100k+) context needs a bit more care though. And tuning can be highly specific to your machine, gpu and use-case. But I use a headless server, offload multi-modal to CPU, use fit-target to reduce wasted memory and use q8_0 kv since the 4090 performs well with it... In addition to most of the same config as the author elsewhere. I get 50-60tps generation with a power limit of 275W (450W is default), more than enough to offer a roughly an Opus-speed feedback loop.
I haven't seen many of the issues with looping the author mentions. But I did with Qwen3.5 and in particular other 4-bit quants in the past. But the difference is probably a mix of the improvements above, as well as habits changing to avoid cases where models will loop. For what I'm doing, it seems like I loop Qwen3.6 on the same kind of prompts I'll make Haiku or Sonnet loop on (the latter hide some of their existential loops behind "thinking"). Usually it's cause I was too vague about some aspect of what I'm wanting them to do or I forgot to include some context that smaller models just don't have access to in their smaller knowledge base. But at least for what I'm doing (Rust, React, kubernetes) it's not been a notable problem at all with the latest iteration of this whole stack. And knowledge of standard libraries and default k8s resource kinds has been almost flawless.
There's still plenty of more complex stuff where I'll choose to jump straight to Claude or GLM-5.2, but if it's not worth that jump I've stopped paying for the middle ground as it's usually not much better than just one more iteration through qwen.
All this to say, if you have a 3090/4090, feel free to give the same setup a go. It's come a long way in recent weeks.
I feel pretty averagely smart but give me some good tooling like a good editor, a good type system, semantic grep, good testing and some solvers and I can actually deliver some work.
Maybe the trick isn't 500 billion parameters but a model super integrated with the task at hand for iteration and debugging?
FWIW the article really mirrors my own experience. I can run a small gemma4 for quick edits (and it's fast!) or data cleanup but for other tasks you do need a different tool (claude).
It only really happens if you allow the thinking directive though. If you can switch it off with what you’re using it on, you’re mostly fine.
Means underinvesting in engineering
Look into it