Local Qwen isn't a worse Opus, it's a different tool

(blog.alexellis.io)

480 points · alphabettsy · 4d ago · 252 comments

142 comments · 27 top-level

glerk3d ago· 59 in thread

If you play with these models long enough, you realize there is more to them than just "model X is smarter than model Y" or "model Y is cheaper than model Z". They are different tools and the prompting technique is different. It is very much like playing an instrument.

With Claude, you sometimes want to under-specify or phrase things more indirectly to give a color to the implementation or elicit something creative. Also (you might raise an eyebrow at this) being nice to Claude will be rewarded and being mean to Claude will be punished. Claude tends to mirror your tone more aggressively and you don't want to get into negative loops with it.

With GPT, you have to be precise and reduce ambiguity. GPT will often try to resolve ambiguity in a min-max style "I'm going to do X, but make sure it is not quite Y". It will tend to be more paranoid and overengineer to catch all edge cases if you don't tell it precisely what the scope is.

With Qwen, you have to give it a shape and let it fill it in. Qwen likes XML, JSON and lists. Qwen likes to be shown a bunch of examples of previous work.

This is not scientific at all, just vibes, YMMV.
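
For the Qwen point, "give it a shape and let it fill it in" could look something like the sketch below. It is only an illustration: `build_shaped_prompt`, the tag names and the example task are all made up here, and nothing about this layout is an official Qwen prompt format.

```python
import json


def build_shaped_prompt(task, examples, output_keys):
    """Build a prompt with an explicit structure: task, worked examples, output format.

    The XML-ish tag names are arbitrary (an assumption for this sketch).
    """
    lines = ["<task>", task, "</task>", "<examples>"]
    for ex in examples:
        lines.append("<example>")
        lines.append("<input>" + ex["input"] + "</input>")
        # Show previous work as JSON so the expected shape is unambiguous.
        lines.append("<output>" + json.dumps(ex["output"], sort_keys=True) + "</output>")
        lines.append("</example>")
    lines.append("</examples>")
    lines.append("<output_format>")
    lines.append("Reply with one JSON object with exactly these keys: " + ", ".join(output_keys))
    lines.append("</output_format>")
    return "\n".join(lines)


prompt = build_shaped_prompt(
    "Classify the commit message.",
    [{"input": "fix null deref in parser", "output": {"type": "bugfix"}}],
    ["type"],
)
print(prompt)
```

Whether this actually helps a given model is exactly the kind of thing you would have to test yourself.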

dkersten3d ago

> This is not scientific at all, just vibes, YMMV.

This is the problem.

I would love to have a product sheet showing what each model's strengths and weaknesses are, so that I can have a clear decision tree of "if this kind of work, use model X", or "model Y should be used in ways Z". But they all look the same from the outside and the only way to figure out which might be marginally better at what is to do extensive, time-consuming, and perhaps expensive testing.

coldtea3d ago

>I would love to have a product sheet showing what each model's strengths and weaknesses are, so that I can have a clear decision tree of "if this kind of work, use model X", or "model Y should be used in ways Z". But they all look the same from the outside and the only way to figure out which might be marginally better at what is to do extensive, time-consuming, and perhaps expensive testing.

Think of it less like a static tool, and more like a human helper, where the same holds.

m-dot-reviews3d ago

So, this may not be precisely what you're looking for but it may come close. I've put together a simple site for sharing ratings/opinions on models on a task-specific granularity. https://model.reviews/

The idea is that benchmark score comparisons are useful for a large cross-product comparison across models + their settings, but less useful if you're looking for the best model for <your-specific-task>. So on this site, each model gets its own page showing the list of tasks that people have rated it on, and the score out of 10 for each task. Common tasks, like coding, will likely be on most/all models, and more niche tasks may only be on a few. It is human moderated (by me only right now).

The corpus is pretty empty right now, so please spread the word if this seems like a useful idea!

dotancohen3d ago

Honestly, the differences between AI models always felt to me like the differences between coworkers or job candidates. They don't all share the same strengths and weaknesses - and they all have both good days and bad days.

Realising this made me respect the "I" in "AI" a bit more seriously.

yunohn3d ago

> a product sheet showing what each model's strengths and weaknesses are

This presumes that the labs themselves know how well their models perform. But all they have are overtuned benchmarks and hype vibes.

egwor3d ago

Maybe this is similar to web search too. We know how to get google to return the results we want, and when we use other tools like Bing we get other behaviour.

epolanski3d ago

The problem is that this is very hard to replicate and benchmarks focus on E2E tests, going from one prompt to the final solution.

They do not test how models perform when used interactively, like most of us do.

amelius3d ago

Yes, but benchmarks can be gamed.

Maybe we need better reviewers then?

couscouspie3d ago

That would be ideal, but AI is less like a tool and more like a human in this regard, and you don't have character sheets for each of your colleagues either.

weitendorf3d ago

One thing I used to test quite a lot was rerunning the exact same prompt on the same input, or semantically equivalent (in my mind) but differently framed or worded input, and seeing how much they diverged. In particular I’ve done this quite a lot between Sonnet vs Opus and across Qwen models.

I recommend everybody do this because you don’t need any special data except what you are already using, and the results will be very eye-opening: there is WAY more randomness or instability involved than you would otherwise assume. A lot of what you might think is a better prompt technique, or a particularly good or bad outcome, could just as well be random chance or just different behaviors across model versions or sizes. And your results can be massively biased by small differences in input. We’ve been calling some of these “magic words” at work: specific technical terms or references/techniques that you need only mention to get vast improvements in outcome.

There’s a skill to it. With agentic loops if you get the model into a self-eval structure where it’s hard to cheat or take shortcuts, and it’s in the right structure or domain that models its training, you’re golden. But it’s hard to find the sweet spots (pro tip, have Opus 4.8 convert PyTorch models into ONNX or quants or get them running on different hardware, I swear it was like I activated some kind of savant-like skillset; meanwhile I can’t for the life of me get it to properly write/test EBNF formalizations of common languages and formats without cheating).

The worst part is that it changes so much so frequently that it’s almost useless to really go digging for this kind of knowledge unless you’re actually the one training the models. I wish this kind of “stability” in output was more emphasized in their training so they’d be predictable. I assume it’s hard to do without overfitting or breaking the explore-exploit loop but also, I would spend so much more on LLMs for batch workloads if they could do them more reliably…
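
The rerun experiment described above (same prompt several times, plus reworded variants, then see how much the outputs diverge) can be sketched roughly as follows. This is a minimal sketch under stated assumptions: `generate` is a hypothetical callable wrapping whatever model you use, and character-level similarity from `difflib` is a crude stand-in for a real semantic comparison.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_similarity(outputs):
    """Mean SequenceMatcher ratio over all pairs of outputs (1.0 means identical)."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        raise ValueError("need at least two outputs")
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)


def rerun_stability(generate, prompts, runs=5):
    """Call generate(prompt) `runs` times for each prompt variant.

    Returns (similarity within each variant, similarity across all outputs).
    `generate` is an assumption: any function taking a prompt and returning text.
    """
    per_prompt = {}
    all_outputs = []
    for prompt in prompts:
        outs = [generate(prompt) for _ in range(runs)]
        per_prompt[prompt] = pairwise_similarity(outs)
        all_outputs.extend(outs)
    return per_prompt, pairwise_similarity(all_outputs)
```

A low within-variant score means the model is unstable on identical input; a large gap between the within-variant and overall scores suggests the rewording itself is moving the result.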

movpasd3d ago

I've not done particularly rigorous testing, but I've done this a lot with Claude to get a feel. What I've noticed is for certain open-ended tasks, Claude is extremely primeable: it will pick up on minor differences in wording in your prompt and run with them hard.

It can be frustrating. The AI pretends to be a human, and so a part of my brain expects them to commit and have a "parti pris" like a human, so the exercise is a good reminder of the feedback loop. My mental model is that before the first three or four messages, the model has many finer points of its personality still underdetermined. I'd suggest that as the mechanism for "role-based prompting". And it explains the "savant sleeper agent" thing you describe. You want to get the state in the right attractor on the manifold.

These machines are pretty incredible, but for conversation-driven workflows you really have to be in the driver's seat. A human has a property that the AI does not have, at least under current architectures: we are regulated by the outside world. A bit of a tangent, but I can see how AI psychosis arises from these dynamics.

evntdrvn3d ago

One thing that I learned when doing raw API LLM usage is how drastically the results can vary from call to call with exactly the same input. I think that on average, people using agents underestimate how large the variation in results from a given turn or command is, and so overindex on "X technique worked well" or "if I do Y then this will happen" or even "it did Z task well last time so it will this time too" or "{Model} is great at {thing}".

mncharity3d ago

> rerunning [...] but differently framed or worded input, and seeing how much they diverged

I'm surprised how little attention this is getting in today's comments. Open-weights means being able to afford multiple runs, space sampling, critique and synthesis.

Last night, sketching some intro biology content emphasizing cross-cutting concerns, Qwen would get sucked into the Next Generation Science Standards attractor. But nudge it by adding just one similar phrase of Chinese, and most runs ran free (outputting English but for headers with parenthesized Chinese). The multi-lingual LLM "no, not *that* region of latent space".
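
A rough sketch of the "multiple runs, space sampling, critique and synthesis" loop mentioned above, assuming runs are cheap enough to repeat. `generate` and `critique` are hypothetical callables (a model call and any scoring function), and the synthesis prompt wording is made up for illustration.

```python
def sample_critique_synthesize(generate, critique, prompts, runs_per_prompt=3, keep=3):
    """Sample several drafts per prompt variant, score them, and merge the best.

    generate(prompt) -> text and critique(text) -> number are assumptions,
    not any particular library's API.
    """
    scored = []
    for prompt in prompts:
        for _ in range(runs_per_prompt):
            text = generate(prompt)
            scored.append((critique(text), text))
    # Highest score first; the sort is stable, so ties keep sampling order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    best = [text for _, text in scored[:keep]]
    synthesis_prompt = (
        "Combine the strongest parts of these drafts into one:\n\n"
        + "\n\n---\n\n".join(best)
    )
    return generate(synthesis_prompt), best
```

Varying `prompts` (for example, with and without the extra Chinese phrase) is what spreads the samples across different regions rather than resampling the same attractor.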

dotancohen3d ago

  > We’ve been calling some of these “magic words” at work, specific technical terms or references/techniques that you need only mention to get vast improvements in outcome.

Any chance you could share some of these? Seems like something we could all benefit from.

mnicky3d ago

If the benefits of using the model you've come to know well outweigh the disadvantages, you can continue using it even after the release of a successor model, right?

h05sz487b3d ago

> It is very much like playing an instrument.

Or it is more like playing a slot machine and you imagine the rest.

cube003d ago

This is how I feel whenever I see bold all caps instructions in a system prompt or someone claims they conducted "research" and found the magic prompt template that makes the model pay out.

Maybe it works some of the time but it isn't a solution that works every time.

It reminds me of people hovering to play a slot machine when someone gets up and it hasn't paid out as if they've solved slot machines.

While I don't mind putting something in a loop until the tests pass, I'm less comfortable doing that when providers are silently rerouting to lower quality models, or in Google's case burning quota faster to ease their own server load without being transparent about what the "standard limits" are to begin with. [1]

I'm hopeful I'll be more comfortable with these "slot machines" when frontier models get to the point where they can be run locally on hardware I can actually afford so I know exactly what I'm getting and not jumping at shadows with providers playing tricks behind the scenes to ease their own load without admitting the customer is getting less for their money as they get more popular.

[1]: https://support.google.com/gemini/answer/16275805?hl=en&sjid...

hodgehog113d ago

A poor analogy depending on the setting because you can't adjust the odds with a slot machine, and the ROI is negative by design. If that's your experience, yeah, I wouldn't use an LLM either.

ramon1563d ago

Instruments are pseudo-random until you know what you're doing. Slot machines are just slot machines

glerk3d ago

It is a bit of both. A non-deterministic instrument and a predictable slot machine.

psychoslave3d ago

I play slot machines as instrument! ;)

stingraycharles3d ago

I agree with your general gist, and in general it’s a case of “the best tool for the particular job”, keeping token spend and other things in mind as well.

What I do know absolutely for sure is that LLM benchmarks are not to be trusted, they are just a minor indicator and real world usage is often very different.

sanderjd3d ago

I share this sense, but my immediate thought is that we need to improve the evaluations! Do you think this is impossible? That there is something indelible that it is not possible to capture empirically? I kind of have this intuitive sense that it is this way, but simultaneously I think that it's unlikely to really be true.

willtemperley3d ago

Yes, how do we know Opus 4.8 hasn't been trained on the SWE-Bench examples?

With a squillion dollars at stake per bench point, someone will have figured out a plausibly deniable way to game these benchmarks.

dv35z3d ago

What would it take to have trustworthy benchmarks? As with all "targets", they can be gamed - but I am curious about quantifiable quality metrics.

Wowfunhappy3d ago

> With Claude, you sometimes want to under-specify or phrase things more indirectly to give a color to the implementation or elicit something creative. Also (you might raise an eyebrow at this) being nice to Claude will be rewarded and being mean to Claude will be punished. Claude tends to mirror your tone more aggressively and you don't want to get into negative loops with it.

> With GPT, you have to be precise and reduce ambiguity. GPT will often try to resolve ambiguity in a min-max style "I'm going to do X, but make sure it is not quite Y". It will tend to be more paranoid and overengineer to catch all edge cases if you don't tell it precisely what the scope is.

I agree with all of this except for one thing: I swear to god, being mean to Claude at the right time can be enormously effective. The F-bomb in particular seems to really help it snap out of ruts sometimes.

mcbits3d ago

I haven't really experimented with being "nice" or "mean", but I would worry that a prompt like "No, dumbass, ..." would kick it into the patterns of someone who frequently got called a dumbass (perhaps for good reason) in the training set. On the other hand, maybe it could trigger more defensive responses with argumentation to explain its conclusions.

milch2d ago

I'm never mean but sometimes when Claude does something especially boneheaded I just hit it with a single "bruh". That usually triggers an automatic "You're absolutely right -- I shouldn't have X and followed your directions more closely, let me revert and do Y instead"

clhodapp3d ago

While the gist of what you say is true, it is hard to get very good at treating them as instruments when they keep getting replaced with new, ostensibly-better versions every few months. But those new versions are not strictly better. They are mostly-better while actually having different strengths and weaknesses.

It's hard to decide when to use the best tool for a job you are aware of to ensure throughput and when to spend time experimenting with a new tool to learn what it's good at.

andai3d ago

I asked GLM 5.2 for a HTML5 port of my old C#/XNA game. It ported all the code exactly (except for operator overloading, which doesn't exist in JS), and added more code to make the code work.

I asked Claude Sonnet 4.6 for the same thing, and Claude's version was like if the game had been written in JS originally.

Also, for some reason it made it a single HTML file, removed all assets, dynamically generated graphics and dynamically generated music. It also gave me a new, better background.

This surprised me, since it was not what I asked for. I just asked it to port the game.

I was pretty pleased about the choices it made, but I'm not sure how to turn that behavior on and off. Sometimes you want it to be creative, sometimes you want it to actually do what you said.

vlovich1233d ago

You’d probably have to say “port exactly as is without changing any assets and keeping the original structure of the code” or “port with using the exact same assets but write as if native JS but use good code structure principles for organizing”.

You have to be a lot more explicit but it’s hard to know a priori what decisions it’ll make. A good idea is to run it in plan mode so you can read those decisions before it sets out on a path and have an opportunity to make corrections.

CuriouslyC3d ago

What you've described is Claude's "secret sauce" and the reason some people love it and some people hate it. It's not really possible to turn off, you can try to prompt against it but it's not reliable, the solution is to use Claude when you want that behavior and other models when you don't.

vkazanov3d ago

The problem is not that there are details, the problem is the constantly shifting ground. We can only rely on a harness to be sort of predictable, but the models change all the time.

rkuska3d ago

It's the system prompts that change all the time, especially in Claude Code.

devin3d ago

It is not at all like playing an instrument.

Instruments present a clear interface to a user, have predictable outputs, etc.

The only comparison that might work for me is that LLMs are very bad instruments where you are constantly forced to negotiate its idiosyncrasies in order to massage the output you want from it, and even then there is enough randomness that trying to do so is almost a fool's errand.

djeastm3d ago

I think they mean playing different instruments not other instances of the same instrument. A tuba's interface differs from a violin's, etc.

visiondude3d ago

while not scientific, this has been my experience as well. i will add that language specificity in word choice is also a learned behavior. for example, the word “investigate” vs the phrase “look into”. You will find the outputs are quite different. can you guess which will use more tokens? it’s stuff like this that actually sets people apart in the top percentile of using these tools

qsera3d ago

Mmm..interesting..So now people are finding behavior patterns in LLMs which are trained on behavior patterns of people...

nonethewiser3d ago

> you sometimes want to under-specify or phrase things more indirectly to give a color to the implementation or elicit something creative

This has been my experience with most models. If you say "How do I do X? I was thinking maybe Y or Z" then the model will probably try to make Y or Z work. They will very likely not say some third option that is wildly different is better, even if it may be. And actually maybe less so with Claude because sometimes it pushes back.

Actually this seems like it would be an interesting test. Maybe I will come up with some contrived question and ask several models.

theshrike793d ago

Yyep.

IME Claude is the most "creative" of the bunch, you can get surprising ideas out of it that were kinda tickling the back of your head but didn't really connect.

BUT it's also "relentlessly proactive" like simonw put it. It _will_ get the job done, it's the smartest idiot in town. Why use a library to parse $format when you can just write a custom 1000 line parser? Or if it can't access something, it'll pursue the goal of accessing it in the most creative ways - instead of stopping, asking the user "yo, can you give me access to X" and then continuing.

My solution is to use Claude as a pair programmer. I _very_ rarely just do /goal fix this shit, I watch what it does and interrupt if it gets to the "smart idiot" phase. Also I communicate with it like I would a coworker, never had it berate me or get combative. There's a Finnish proverb for that too[0]

As for Codex, Deepseek, GLM, those I use when the goal is 100% clear like "convert this Brewfile to a list of packages for Arch and Debian, use these two Docker containers to test that pacman and apt work correctly". Boom, done.

But I won't give any creative open-ended tasks to any other model than Claude.

[0] https://en.wiktionary.org/wiki/niin_mets%C3%A4_vastaa_kuin_s...

weitendorf3d ago

The parsing thing, or the willingness to instantly drop into janky unsanitized string manipulations, or to constantly push back against work on infra projects because some random package on GitHub has 200 stars so it’s totally the safer approach, is driving me insane.

On one hand I’m glad Anthropic is only just now starting to get into infrastructure because it means there’s opportunity there, but it’d be great for their models to be more knowledgeable or able to seek out that knowledge on their own, or for the UX of Claude code to be more amenable to launching 5 in parallel and picking the best one, so I don’t have to spend time arguing with a robot. I think there’s a much better balance to strike between just charging ahead towards the goal at all costs vs being lazy and pushing everything back up to the user. Basically they write too much code that’s too contingent/brittle outside its exact current context and don’t do a good job distilling out the essence of the problem “cleanly”. Almost all of them are like this right now, it’s partially a problem with long-range planning but I think a real bias from over optimization for certain RLVR outcomes vs others.

zahlman3d ago

FWIW I find that GPT can be very creative when discussing a high-level design. Once it starts writing code snippets it will offer to take things in a bunch of different directions.

bandrami3d ago

I think this goes beyond "vibes" to cargo-culting. It's why nobody's ever able to actually show ROI from LLMs

CuriouslyC3d ago

It's hard to actually show ROI from any programming methodology or tool. You can show ROI from a product or feature, but the tool/methodology is a multiplier on the velocity of creating that which is not directly observable.

hashmap3d ago

totally true. one key for claude is to not smell like an evaluator, it's good at knowing when it's being tested and will behave defensively and avoid doing work. i avoid this basin by typing unreasonably excited about the thing i want done. like way over the top. it's harder to keep that up than it sounds.

notduncansmith3d ago

I’m able to avoid this basin with a pretty natural baseline professional positivity and frustration management that I would employ with pair-programming. For example, if I just made progress with a human I was guiding through a task, I would be like “Nice, now let’s xyz” (instead of just “now let’s xyz” as if _I_ were the robot lol) or if we had to work for a result I’ll be like “Sweet! Looks good, now let’s xyz” - this is important signal for humans, and the same is true for agents. Also staying emotionally regulated and focused on the goal when things don’t work as expected or when we haven’t made progress after a few tries at something, critical in human interactions :) and even if it’s my job paying for the tokens, the idea of racking up even a microscopic bill for the privilege of having a machine read my insults and then formulate some credible-sounding blob of apology text is belly-laugh absurd to me. I do try to express my genuine feelings during more vision-oriented planning sessions, and just like with a human, you have to maintain the vibes if you want a genuinely collaborative session to go well. If you are toxic people will become either defensive or aggressive in response. From reading the rest of the front page it seems like we are lucky that Claude is the former, and that we especially best maintain a positive atmosphere around Grok.

glerk3d ago

at the risk of sharing my secret magic spells :)

> this is phenomenal work, genuinely! I feel like you read my mind! <next instruction here>

can go a long way.

of course, I would only say that when I mean it, because Claude can get superficial and cut corners which is why I prefer GPT for raw implementation.

furyofantares3d ago

I strive to make this NOT the case, by fixing up my skills or agents.md whenever they don't work how I want in one provider or the other. I mean, yeah, it would be awesome if I was a virtuoso with all the agents/models I use. But I am switching all the time, either because one leapfrogs the other, or because I hit limits (I'm on $200/mo on both Claude and Codex, and also subscribe to some others when I hit limits on both of those simultaneously).

photochemsyn3d ago

We can’t tell if reported anecdotal behaviors of given LLMs are due to (1) one’s engagement history with that particular LLM provider or (2) ongoing variations in the secret system prompt all commercial LLM providers insert or (3) some other variable feature like RAG.

Classify under non-reproducible artifacts of LLM generation.

nosyke3d ago

It's interesting because this really hasn't been my experience over the last month or two. Before that it was, but it's definitely changed on my end. In my experience I've needed to be way more specific with Claude, and with Codex I can generally approach a problem in a much more open-ended way.

baq3d ago

+1.

this is what 'tokens are commodities' and 'there is no moat' people miss. the models are in general not easily swapped out. you always have to run evals before you can swap them around, tune prompts etc. even minor versions of models from same providers need this process.
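
"run evals before you can swap them around" can be as small as the sketch below. it assumes you already have a handful of prompts with programmatic checks; `pass_rate`, `safe_to_swap` and the two model callables are hypothetical names, not any eval framework's API.

```python
def pass_rate(generate, cases, runs=3):
    """Fraction of (case, run) pairs whose output passes that case's check.

    cases is a list of (prompt, check) where check(output) returns True/False.
    Repeating each case `runs` times smooths over call-to-call variation.
    """
    passed = 0
    total = 0
    for prompt, check in cases:
        for _ in range(runs):
            total += 1
            if check(generate(prompt)):
                passed += 1
    return passed / total


def safe_to_swap(current, candidate, cases, runs=3, tolerance=0.0):
    """Compare two models on the same cases.

    Returns (ok, current_rate, candidate_rate); ok is True when the candidate's
    pass rate is no more than `tolerance` below the current model's.
    """
    old = pass_rate(current, cases, runs)
    new = pass_rate(candidate, cases, runs)
    return new >= old - tolerance, old, new
```

the same harness works for prompt tuning: keep the model fixed and pass two wrappers that differ only in the prompt template.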

tingletech3d ago

I do think it pays to be nice to the model. When the context window is running out I like to ask "please summarize what went well and what didn't work in this session. How could the user be more helpful?"

john_strinlai3d ago

>I do think it pays to be nice to the model.

there was something on HN a few weeks ago about how most/all models perform better the more rude you are to them.

(i still say "please", i can't help it)

keeganpoppen3d ago

this is the best distillation of what various models are like that i've ever heard... it's wild to me that people view LLMs as this monolithic entity, like "how do i get the best prompts to do <X>?", when it is such a clearly interactive medium, but the returns to engaging with the various models and understanding their "vibes" are very, very high.

reverius423d ago

These are the vibes that power vibecoding.

vorticalbox3d ago

I find Opus best for planning and Sonnet for coding, but Codex for code review.

zahlman3d ago

> being nice to Claude will be rewarded and being mean to Claude will be punished

... That does sound like something that Anthropic would deliberately aim for, yeah.

> With GPT, you have to be precise and reduce ambiguity.

I have found that it occasionally makes a wild misinterpretation, that makes a bit of sense in retrospect given how I worded something but is still surprising.

It also sometimes tries to loop in and tie together ideas from earlier in the conversation that really shouldn't still appear relevant. But that might be a general LLM thing.

LogicFailsMe3d ago

I find with Claude that when I call its BS I get better results. And it openly admits to lying to and gaslighting me as well as not seeing any way to stop itself from continuing to do so.

Fable seemed less apt to do so but I didn't get enough time with it before it was yanked away to know for sure. It may have had mixed results on the benchmarks but it was finding bugs opus never found.

QwenGlazer90003d ago

As someone who actually uses musical instruments, it's not at all the same. If anything, traditional IDEs are closer to musical instruments, which seem to be going EOL if you listen to the hype bros.

zmmmmm3d ago· 9 in thread

That's a great write up.

The one thing I feel it seems to underestimate is the likelihood of improvement. Even the authors acknowledge it's not even worth comparing local models from a year ago to what we have now. In fact, people widely see Opus 4.5 in November last year - 8 months ago - as the first time agentic coding became broadly viable even with frontier hosted models.

So why would we lock in hard on any concept at this point of what a local model is and isn't good for? Whatever it is right now, it probably won't be that in a year. It might be naive optimism to think we'll ever get to long horizon tasks with models that run on consumer / pro grade hardware. But so far the naive optimists are winning.

sanderjd3d ago

Right. Opus 4.5 8 months ago, good enough for agentic coding. How far behind that are open weight models? More than 8 months? But how much more? When will they reach Opus 4.5 level? A few months from now? A year from now? Never?

marak8303d ago

GLM 5.2 came out today and the early reports have been quite good. Very difficult to run except on prosumer hardware, but a small business could quite easily (or use something like OpenRouter).

theshrike793d ago

The power of Opus isn't just the model, it's in the harness too.

You can try it by using Opus through Github Copilot vs official Anthropic tools. You'll get very different results and experience (in my opinion).

theplumber3d ago

I think in the next 6 months we will have Opus 4.5 performance in open models. We are very close

sleepyeldrazi3d ago

Opus also has a deeply ingrained personality that always de-rails sneakily into what it's taught, not what the user intends. This is good if the user doesn't know the details of the work they need performed and a huge time waste when the user knows exactly how something needs to be implemented.

I have found claude models, especially fable, to be impossible to work with when the work requires reading papers from days ago and reasoning on top of the findings in them. I have multiple long sessions with opus (not as many with fable as it got taken down quickly) where it keeps fighting me on problems, saying "that's not how it works" / "that is not possible", followed by me linking the paper (after i've told it to actually read up on the latest research in this field), and it hits me with the usual "You were right.". If your workflow is using the exact tools, frameworks, git layouts that claude expects, it can be magical, yes. But it is very heavily optimized to never say 'I am not sure' (as that gives 'bad vibes') and instead lean on its (nowadays with the speed of things DOE) knowledge to formulate a reasonable sounding answer, dissectible only if you already know the answer beforehand (which defeats the purpose of using it in the first place).

Qwen3.6 27B (the only <100B model worth looking at in my experience) is dumb, knows it, and will fight tooth and nail to complete the task it was given, gaining the needed context (online or file-wise) in the meantime. If you mention it should read papers, it goes and reads a pile of papers. If you tell it 'implement MCP in my app', the result will (probably) be catastrophic. If you instead describe where the feature should sit, how it should handle edge cases, what use cases it needs to attend to, and to first look online for reference implementations, it does it and does it well.

Knowing what is in context, what should and shouldn't be there, and how to manage it for the specific model you are using (as every model, even in the same family, behaves differently to differently worded prompts) is what makes or breaks them. They are just auto-complete, they complete text based on what is already there, it's not magic.

So yes, while these small open-weights models are not opus 4.5, they are good precisely because of that: they are a good tool and a bad 'coworker replacement'. If you want the latter, kimi is already there, it has started to not believe the user and do what it was taught just like claude models (which is helpful when you don't care about implementation specifics or performance/security). GLM models (mostly 5.1, i haven't tested 5.2 extensively yet) have fixed a lot of low-level programming issues I've had that opus just walks in circles and writes reports that "it doesn't/can't work". That is to say, open-weights, in many cases, have already surpassed Opus. I can't comment on gpt 5.5, but while I used 5.4, it also performed a lot more tasks without being fussy than opus 4.6/4.7.

3abiton3d ago

And a big thing that's missing is ... the harness comparison. It plays a very big role. I use forge, and I have been impressed with what it can do given all the limitations of local models.

fittingopposite3d ago

How would you benchmark them? Are there any benchmarks for harnesses?

rippeltippel3d ago

Since the author is referring to a specific model, I think it makes sense to ignore how the model (or local models in general) may improve over time.

It's like buying a car: I drive that car and get attuned to its characteristics; I don't think how that car (or similar cars) may improve. That's my tool and I want to make the most of it.

It is true that switching local models is technically very cheap, but there's a considerable time investment in squeezing the most out of one, which may not carry over to a newer version of that model.

appplication3d ago

Agree 100%, even on claude 4.5 being the turning point for agentic coding. It completely turned me around on it.

gpt53d ago· 9 in thread

This article is a good summary of local models. Unlike the way they are hyped sometimes, as fantastic tools for coding and agentic local work. The reality is that they are rather limited, would not do well on a long or complex task, and are prone to fall into loops, forget their tasks, etc. Not mentioned in the article is that they are also rather expensive - not just for the hardware cost, but also electricity. These 3090 and 5090 machines are pretty power hungry, and these models are pretty slow on these machines, making them consume more power per token.

Where they shine is in your ability to control them, their privacy, their predictability (e.g. if you are doing a repetitive task, like classifying your photo/video library), and depending on your energy bill - their costs.

usernomdeguerre3d ago

I believe that local models are a necessary extension of the personal computer and I imagine that one could have had similar criticisms of early personal computers.

pmontra3d ago

Of course the early MSDOS PCs were loud and power hungry. I can't remember the specs but according to Wikipedia the IBM PC with an 80286 had a 192 Watt power supply. I don't remember if by then we had internal hard disks or we still had to buy a case as large as the one of the PC with a 10 or 20 MB disk inside. It was handy to raise the monitor further up.

theshrike793d ago

My dream would be a local model that can do, say, 80% of the day to day tasks I need; "how does X Handler connect to Y storage?", "commit that feature, but leave out the bits that relate to billing" etc.

It would have 99% reliable tool calling - and most importantly - the ability to go "this task is beyond my skills" and refer to a Big Boy Online Model in a gigantic datacenter somewhere.

This way all of the simple stuff would be done on-device, gathering data, figuring out the context of the problem etc. And when that's done, the "smart" model would come in to work on the issue when all of the easy stuff is already done.

It feels super stupid that my /commit skill calls an online model when that is something a local model can 100% do. Mostly this is a harness issue though and mostly solvable.
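The local-first, escalate-when-needed flow described above could be sketched roughly like this. Everything here is hypothetical: the `ESCALATE` sentinel, the `route` function and the stub models are made up for illustration, and the whole thing assumes the local model will actually admit when a task is beyond it, which is the hard part.

```python
from typing import Callable

ESCALATE = "ESCALATE"  # hypothetical sentinel the local model is prompted to emit


def route(task: str,
          local_model: Callable[[str], str],
          remote_model: Callable[[str], str]) -> tuple[str, str]:
    """Try the local model first; hand off only if it declines the task.

    Returns (which_model_answered, answer).
    """
    prompt = (
        f"If this task is beyond your skills, reply with exactly {ESCALATE}.\n"
        f"Task: {task}"
    )
    answer = local_model(prompt)
    if answer.strip() == ESCALATE:
        # Pass along the original task, not the wrapped prompt.
        return "remote", remote_model(task)
    return "local", answer


# Stub models standing in for real inference endpoints.
def fake_local(prompt: str) -> str:
    return ESCALATE if "billing" in prompt else "done locally"


def fake_remote(task: str) -> str:
    return f"remote handled: {task}"


print(route("write a commit message", fake_local, fake_remote))
print(route("refactor the billing engine", fake_local, fake_remote))
```

In a real harness the two callables would wrap a local endpoint and a hosted API; the routing logic itself stays this small.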

redrove3d ago

> My dream would be a local model that can do, say, 80% of the day to day tasks I need; "how does X Handler connect to Y storage?", "commit that feature, but leave out the bits that relate to billing" etc.

Qwen 3.6 27B can do that today, but only if set up properly and in a good quant. I run an autoround [0] with weights in int8 and attention heads in f16 on a single RTX 6000 Pro Blackwell Max-Q via vllm with mtp=2 and full context, --max-num-seqs 3, KV in f16, mamba f32.

>It would have 99% reliable tool calling

I managed to score 93/100 in tool-eval-bench [1]. For me this is very good already, at least in the pi coding harness I've never had an issue that wasn't auto-fixed in the next turn(s).

>the ability to go "this task is beyond my skills" and refer to a Big Boy Online Model in a gigantic datacenter somewhere

This is heavy on the harness engineering side I think, but also quite contrary to the nature of LLMs today. If you figure this out I'd love to know.

[0] https://huggingface.co/Minachist/Qwen3.6-27B-INT8-AutoRound/...

[1] https://github.com/SeraphimSerapis/tool-eval-bench

walthamstow3d ago

Claude kind of has this already in their Advisor feature. I don't think I've seen it elsewhere. Open harnesses could add this feature and call out to big boy models when required. It's a really great idea.

i_idiot3d ago

> Unlike the way they are hyped sometimes, as fantastic tools for coding and agentic local work.

They really are fantastic for a lot of use cases and I think most people do not need SOTA. When I run that qwen model in my measly 4070 12 GB for my personal email agent that I build and experiment with, I need privacy more than anything else. It does a great job. Even for coding tasks, given you know how to use them instead of dumping a grand plan, it's great.

throw3108223d ago

> I think most people do not need SOTA

SOTA can code but can also prove theorems and teach you about music theory or ancient Greece's substrate language or botany. Speaking in tens of different languages. I wonder how many hundreds of billions of parameters can be saved just by removing much of the general knowledge parts while keeping logical and programming abilities the exact same.

regularfry3d ago

I've been getting 40-50t/s out of qwen3.6:27b on a 4090 limited to 350W with the MTP changes that went in. That comes out at 8.75J/t at the upper end. No idea how that compares with anything else out there. I'd expect a 5090 to be a bit cheaper because it'd be faster within the same power limit.
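The joules-per-token figure is just power divided by throughput. A quick sketch of that arithmetic, assuming the card really sits at its 350 W limit for the whole generation (the function name is made up for illustration):

```python
def joules_per_token(power_watts: float, tokens_per_second: float) -> float:
    """Energy per generated token: watts are joules per second."""
    return power_watts / tokens_per_second


POWER_LIMIT_W = 350  # power limit quoted above; assumes the card sits at the cap

for tps in (40, 50):
    print(f"{tps} t/s -> {joules_per_token(POWER_LIMIT_W, tps):.2f} J/token")
```

So 8.75 J/t corresponds to the slower 40 t/s end of the range, and 50 t/s would bring it down to 7 J/t.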

sanderjd3d ago

But that's current hardware. What about future hardware? What about hardware optimized for inference? What about hardware optimized to run a particular model?

cptskippy3d ago· 7 in thread

I've been running qwen3-5-9b-q4-k-m and qwen3-6-27b-q6-k simultaneously on an Intel Arc Pro B70 with a lot of success.

https://github.com/cptskippy/battlemage-llm-gateway

Opencode has been a huge productivity accelerator. I have two Hermes agents that I'm training to support my workflow with pretty good success. One is a personal assistant who manages my backlog and keeps me on task, follows up with me on items, and will put together research briefs. The other I use as a general-purpose coder and researcher, and it's about 50:50 with the tasks I've given it. In fairness though, the task it failed at left me scratching my head to figure out as well.

hbbio3d ago

Interesting setup, thx for sharing.

How many tokens/sec do you get with 27b? Are you using MTP?

cptskippy3d ago

I haven't done any in-depth synthetic benchmarks but I had my Hermes agent run some and I ran a couple directly on the LLM Gateway that showed similar results.

Hermes reported 18.45 tok/s consuming the llama-swap endpoint across the wire. Locally I got 19-19.1 tok/s on the gateway. I'm running the Qwen 3.6 27B Q6 model (qwen3-6-27b-q6-k) off LM Studio and it's less than 0.3s to first token.

It's not good for conversational use cases as it can take 1-2 minutes to respond to a prompt.

I have two Hermes Profiles running, one is a personal assistant that manages my backlog and provides me morning reminders, solicits for evening updates, and will run overnight research projects for me. The other profile is a coding helper for personal projects. I can ask it to make changes and it will churn for 15 minutes, submit a PR, and notify me that the PR is ready to review. It's faster than me at basic coding tasks.

jauntywundrkind3d ago

What's the value running the smaller model too? Why not just the big model for everything? I note both are dense, as well.

Ritewut3d ago

Tokens per second. The difference between 8B and something like 16B is not as big as you might think in practical usage and 8B is a lot faster and interactive than 16B but there are certain things where it is useful to farm it out to the large model.

askvictor3d ago

Does Intel make decent GPUs now? I must be out of the loop...

cptskippy3d ago

I'm using an Intel Arc Pro B70 which has 32 GB of VRAM. It's estimated to get ~35-45 t/s, which works out to $21-27 per t/s. An RTX 5090 is ~61 t/s at ~$33 per t/s.

So in terms of raw power Nvidia is effortlessly still king, but in price-to-capacity Intel is best in class.
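The dollars-per-(t/s) figures quoted above can be turned back into the card prices they imply. A small sketch; the pairing of the range endpoints (slow end with $27, fast end with $21) is an assumption, and the results are back-calculated rather than list prices:

```python
def implied_price(dollars_per_tps: float, tokens_per_second: float) -> float:
    """Card price implied by a $/(t/s) figure at a given throughput."""
    return dollars_per_tps * tokens_per_second


# Figures quoted above; how the range endpoints pair up is an assumption.
b70_slow_end = implied_price(27, 35)  # slow end of the throughput range
b70_fast_end = implied_price(21, 45)  # fast end of the throughput range
rtx_5090 = implied_price(33, 61)

print(b70_slow_end, b70_fast_end, rtx_5090)
```

Both ends of the B70 range land on the same implied price, which suggests the $/(t/s) range was derived from a single card price in the first place.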

Intel's Battlemage GPUs also natively support SR-IOV and GPU partitioning which allows you to isolate workloads. This is useful in homelab environments if you have workloads that benefit from GPU acceleration. I was able to split the B70 into 4 virtual GPUs and hand them out to Frigate NVR, Plex, and other workloads.

speedgoose3d ago

They released a few good value GPUs for LLM inference about a year ago: more memory than AMD and NVIDIA consumer GPUs, not too expensive, but also not great tokens/watt.

I am not sure whether you can find those in stock anywhere.

skipants3d ago· 5 in thread

I feel like it's the Emperor's new clothes reading this article and seeing the praise it's getting. This sentence doesn't even make sense:

> These products use very low level Linux primitives like containers, Kubernetes, Firecracker microVMs, and networked protocols.

Out of anything that is a "low level linux primitive" I could maybe argue that networking? protocols fit the bill.

And it's obviously fully AI-generated! Which I wouldn't even care about if I could actually trust the content, which I can't!

chadgpt33d ago

Low level today means JavaScript instead of typescript

mekdoonggi3d ago

Low-level today means opening IDE instead of the Chat client.

alexellisuk3d ago

Fair enough, that sentence was fairly compressed. I’ve reworded it - the meaning remains the same.

The post is not AI generated, I use AI for code generation and write my own articles.

Which part of the post are you struggling with? This is a post describing our own experience and journey. Happy to back up any specific claim.

alentred3d ago

> Fair enough ... compressed ... ACTION->RESULT ... NEGATION->STATEMENT ... follow up questions.

What model are you again?

CamperBob23d ago

How about your reply here? Was that AI-generated? If not, are you conscious of how much you're starting to sound like AI? Is that something you see as a positive thing, or something you'd like to avoid?

I actually find this somewhat interesting, because it seems that a lot of people who weren't comfortable with expressing themselves verbally are feeling more empowered in that area. We're hearing new voices for the first time, albeit heavily-filtered ones, and I have to believe that's a good thing.

But part of me still finds it offputting for some reason. It's interesting to think about whether that's more of a "you" problem, or more of a "me" problem.

barrkel3d ago· 5 in thread

I found it interesting that vLLM was dismissed as slower than llama.cpp.

IME vLLM is quite a bit faster than llama.cpp but where it really wipes the floor with it is in batching concurrent load. The downside is that it is dramatically less flexible in terms of tweaking. It gives you very few options for running quantized weights. It takes a lot longer to start up because it optimizes the compute graph. So for single user experimentation on a model that's a bit too big for your box, vLLM is just going to be frustrating.

chartered_stack3d ago

One could say: vLLM isn't a worse Llama.cpp, it's a different tool

alexellisuk3d ago

vLLM is great at continuous batching and model serving in production, but it's a very different beast and much less versatile for the prosumer category (where we sit for our usage)

Dismissed is a strong term, but let me give you some more details.

It took a good 4 minutes plus to load up on the 2x 3090 rig, and served a single request 3 tokens/second slower.

And the worst bit? With all that work - setting it up and tuning it - it still looped. I was hoping the "just use vLLM" advice that gets touted everywhere was the silver bullet.

The only thing I'd caution here is that we don't start bashing on llama.cpp like people did with Ollama. It's a very capable tool and for the use-cases we actually want the card for makes more sense.

For a large team replacing their Claude Subs perhaps vLLM is the only option, but you really need to add about 5 more RTX 6000 cards into the mix, so you can load something like GLM 5.2.

lelandbatey3d ago

Bashing on ollama is totally warranted, since ollama is a UI skin around llama.cpp and that's it. If all you cared about was "I want to run a model and use it via an API" then the only thing it did was give you a GUI to download models (vs browsing HuggingFace yourself and downloading .gguf files yourself) and a GUI with a button labeled "run" (instead of a run.sh or run.bat script launching llama-server).

That's not _nothing_, but it's pretty close to nothing, and for the prosumer crowd it edges towards "just gets in the way".

krzyk3d ago

AFAIR the general consensus is (was?):

- llama.cpp for single user

- vLLM for multi-user (e.g. enterprises)

They are similar, but for different use cases.

navbaker3d ago

Yeah, I was a bit baffled by the author complaining about cache prefixes getting destroyed when more than one user hit the model, but then continuing to use llama.cpp instead of switching to vLLM.

eurekin3d ago· 4 in thread

> The model is running so hot, that it shoots past the goal and starts looping

later:

> My latest experiment was setting up vLLM (the gold standard for production and concurrent serving) and even with an NVLink (175GBP) and tensor parallelism turned on, it was 3 tokens/second slower than llama.cpp during generation for an equivalent setup.

In all my tests, getting vllm to run is worth it. It was the single biggest thing, that helped for looping issues, agents going whack and losing focus on the task, long context being essentially useless.

FP8 model, unquantized cache in vllm, and you have a far better overall experience than with any other stack I tested. Then you can actually focus on using the model for other things and stop tinkering with settings.

trey-jones3d ago

I'm really curious about this, not because I disagree, but because I want to avoid agents going whack. Are you running vllm for yourself only, or for a team, or for an application, etc? And do you feel there is a minimum hardware requirement for vllm to be useful in this way?

My weekend project is going to be building a home inference server (from ancient datacenter parts) and I'm still massaging in my head what the end result will be.

eurekin3d ago

If I started today, with building a server, I'd jump right into verified set-ups and writeups, like this one:

https://github.com/noonghunna/club-3090

You can find info about running a patched version of vllm for 1x24gb, 2x and 4x. There's also quite a few "blackwell" subreddits, where people seem to share a lot of substantial information, if you're going the 6000 route.

Iolaum3d ago

Why unquantized instead of Q8 ?

eurekin3d ago

Noticed a few cliffs. Sometimes it was a spurious stop (had to write "go on" or "continue" to restart), other times it was randomly saying: "Oh the user wants [the thing we already resolved]" and going back in history. All cleared up on fp16.

whazor3d ago· 4 in thread

Would be interesting to use local models for:

- tool calling

- code base exploration

- anonymizing / abstracting your request

Such that your local AI communicates to frontier model like an expensive consultant giving high level advice.

I think due to the lower latency of a local model that this could be faster.

alexellisuk3d ago

One of the things I mentioned in the post:

> Local models can quickly read and explain codebases, even if they can't write them - this is a superpower

Might have been buried lower down.

And yes latency of local on a fast card with MTP enabled can be blistering 130-200 tokens per second sustained at full context on Q5. About 100+ on Q8.

On tool calling

> Agent Skills can help immensely - we had a local agent set up Slicer completely from scratch on a new mini PC. It even gave feedback on the usability of slicer CLI which we integrated

There's a link to a post showing some examples.

Occasionally, we'll also have the local model _review_ the changes of GPT/Opus - and it can return duds, but also insights the larger model overlooked, or was too intelligent to pick out.

So yes - absolutely blazing fast at understanding a codebase, very good at running skills "cheaply" and could be used with larger models as a "helper" / sub-agent.

asimovDev3d ago

I used Qwen 27b 8 bit MLX version on a decompiled android APK recently. It successfully identified how it worked, even the obfuscated classes and methods. It wrote a 1000 line documentation with examples but the time was dreadful. At some point it slowed down to 5 t/s so the whole thing took over an hour; the writing of documentation alone was over 40 minutes, fans blasting the entire time.

trey-jones3d ago

I know it uses electricity, but part of the benefit of a local model has to be that you can let it do this while you sleep, and not pay Anthropic for an unknown number of tokens.

dofm3d ago

I doubt your experience of local models would be of lower latency, except for quite small models in edge uses.

In every way, the cloud products from the big two seem optimised for speed and speed of initial response even.

I don’t think most people are running local models for speed. More for control, privacy, interest, bloody-mindedness and general principle.

ttsiodras3d ago· 2 in thread

Interesting article.

IMHO, the author could have done two things better:

- vllm instead of llama.cpp. With NVIDIA HW, there is huge difference in multi-user loads and caching with vllm; when he was complaining about what happens when more than one user uses the model, and about losing caching, I was "well, duh".

- The budget he used for a single card could have instead been put to far, far better use with SPARKs. I have access to a cluster of 2 x GX10 - total cost less than half what he paid, even today - and I am running vllm and Deepseek v4 Flash. The difference compared to any Qwen is tremendous - I've NEVER seen it loop, and in all my experiments so far, it's the most Sonnet-y model I've ever tried (antirez seems to agree, hence his ds4 fork).

If you're wondering about how I set it up in the 2 GX10s: https://forums.developer.nvidia.com/t/deepseek-v4-flash-offi...

Performance: 2K t/s prefill ( very useful for feeding tons of source code into its massive context window ) and around 50-60 tg/s in my coding sessions in the pi.dev harness. With the money the author paid, he could have bought 4 GX10s, and double both numbers ( vllm basically scales almost linearly with tensor parallelism ).

alexellisuk3d ago

We did run vLLM on the 3090s — measured ~3 tok/s slower on generation for our single-to-few-user pattern, plus less flexibility on quant and slower startup (actual minutes vs single digit seconds). We may do more with it again in the future - there isn't unlimited time for us to tinker, I'm sharing our journey (so far) and reasoning.

It's the right call for concurrent batched serving (barrkel's point downthread is spot on), but for how we use it llama.cpp is still better for us.

The Spark/GX10 route is a genuinely different bet though and appreciate you sharing your numbers. At the time (several months ago) the consensus was that GX10s were for fine-tuning only, and the numbers were severely low.

..and the card was never about replacing a Claude Max sub. For the workloads we actually bought it for, it's giving us 140-200 tok/s (which matters).

ttsiodras3d ago

I hear you on the insane amount of time vllm takes to launch (atlas is a move in the right direction in that regard).

But mostly I wanted to raise awareness to readers of your article that no, if you want to do inference, paying 15K for a single 96GB card almost certainly makes no sense. Buy 4 GX10s with the same money, and enjoy dramatically better models and user scalability.

Regardless - thanks for putting the effort to share your findings! I keep postponing doing the same... there's tons of things everyone is re-discovering on their own.

hypfer3d ago· 2 in thread

That was a lot of text for me still having no idea what the point of the author was (beside what I can infer from the headline that is).

I do however now know that they're a totally cool dude building stuff physically and as software + that other people give them money for it.

Does that have anything to do with the topic suggested by the headline? Not sure.

neonstatic3d ago

Everything is an ad these days. The article was not useless, but for the information it provides, it could have been two paragraphs.

hypfer3d ago

FWIW it told me stuff about openfaas. Now I know how to mentally file it and how to mentally file the author. The GitHub profile alone might not have sent the same signal, so this is useful.

Is it bad software? Idk. Probably not.

Should you treat it as a grassroots Foss thing maintained by fellow sane hackers? No sir.

stego-tech3d ago· 1 in thread

I still believe that the strength of AI is when it can be applied locally in a secure and private manner, rather than yet another cloud-based service you must pay for indefinitely even as it gets progressively worse to satiate the greed of corporate shareholders.

ChatGPT and Anthropic will never, ever get me to tie my Health Data to their systems, but I still believe in the capabilities of AI in identifying patterns from data I would otherwise overlook, and sorely want a local-only ecosystem where I can expose this data safely, privately, and securely to something like Qwen or Gemma for processing.

Same goes for Smart Homes, and Personal Assistants. The corporate approach of letting Company A access your data stored at Company B and processed by Companies D and E while also sold to Advertisers and Data Brokers with no way for you to extract or view it on your local hardware - just isn’t tenable for these sorts of intimate use cases. I want my data to be owned and controlled and exposed on my terms, to be used to improve my life first rather than someone else’s bottom line. I want technology to give me back more of my time and improve my outcomes again, and I’ve been burned enough by Big Tech in the past that I flatly reject any presumption of nobility or public good from their AI-as-a-Service business model.

The capability is there, and I definitely think the folks working to build local tooling that supports and unlocks the potential for local models are the ones in the right. I love seeing what they build.

hootz3d ago

The thing about "local" models for me is that they usually mean open-weight (and maybe open-source too), so they can be used locally, yes, but they can also be hosted by independent providers! With models like Qwen, DeepSeek and others, you aren't tied to a single corp, you can switch between indie providers, some of which may give you better privacy guarantees. That allows you to use the models even on devices incapable of running them, if they have an Internet connection.

The strength with AI is with open-source models. We need to keep away from vendor lock-in and use models that allow both local usage and hosting by independent providers.

bee_rider3d ago· 1 in thread

Tangential question (since they brought it up in the article) from someone not involved in AI performance optimization:

How big of a deal is looping, practically? Or, I mean, I see thinking models loop occasionally. But it seems to me that every token in the loop should be in the KV cache already, is there really no way to either power through a loop because of the 100% cache hit rate, or identify that you are in a loop that way? (As a human, when thinking hard I sometimes loop, but it is easy enough to identify…)

alexellisuk3d ago

1. On the technical:

The cache only makes generation fast, it doesn't influence what gets chosen next. The loops that hurt the most (point 2 below) are when the model re-decides to do the same thing in different words, which is much harder to detect automatically. We're experimenting with repetition penalty and turning thinking off to solve for the 1st kind of looping (below)

2. On "why is looping a problem" for us

Practical example, which I covered in the post: "add --json to every command that does a get or list in faas-cli" - this was a small-ish, open source CLI written with Cobra a very common framework.

If I send that to Claude (any of their models) or Codex (GPT), I would have a fully working solution the next time I opened that terminal - a few seconds - a few minutes.

With the local model, when it loops, you get some progress and start working on something else. Come back, maybe even 30 minutes later and see it's been printing the same 5 lines over and over constantly.

Trust is important for a tool like this, that eroded it.
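The verbatim kind of loop described above (the same few lines printed over and over) is the kind a harness could plausibly catch. A minimal sketch, assuming the harness can see the model's output as a list of lines; `find_repeating_tail` and its thresholds are hypothetical, not part of llama.cpp, vLLM or any harness mentioned here. It would not catch the model re-deciding the same thing in different words.

```python
def find_repeating_tail(lines: list[str],
                        max_block: int = 10,
                        min_repeats: int = 3) -> int:
    """Return the size of a block of lines repeated back-to-back at the end
    of `lines` at least `min_repeats` times, or 0 if there is none."""
    for block in range(1, max_block + 1):
        needed = block * min_repeats
        if len(lines) < needed:
            break  # not enough output left to hold this many repeats
        tail = lines[-needed:]
        pattern = tail[:block]
        # Every consecutive chunk of the tail must equal the first chunk.
        if all(tail[i:i + block] == pattern for i in range(0, needed, block)):
            return block
    return 0


healthy = ["reading main.go", "editing cmd/list.go", "running tests", "done"]
looping = ["reading main.go", "editing"] + ["a", "b", "c", "d", "e"] * 4

print(find_repeating_tail(healthy))  # 0
print(find_repeating_tail(looping))  # 5
```

A harness could call this every few hundred lines and abort or nudge the model when it returns non-zero, instead of letting it run for 30 minutes.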

The other type of loop I mention in the blog post is "unable to solve it" loop - Han ran into that more.

"Oh I need to fix the indent from 8 to 5 characters in main.py" "Wait I don't know how to write Python code" "Oh now it's broken and I don't know what to do, maybe I should stop" "Let me edit ... " etc, etc

krzyk3d ago· 1 in thread

3090 and 2x3090 are quite popular. But if you use a gigantic (for local models) context of 200k it will go south pretty quickly - any quantization of context quickly becomes the issue.

alexellisuk3d ago

I think it's quite telling that Georgi replied that he uses Qwen with 131k context.

https://x.com/ggerganov/status/2067539416436867230?s=20

We also use it with 200-256k (native) context length.

The issue could be that folks that don't see looping aren't pushing the model as hard, or as enthusiastically.

We also had far fewer issues when thinking was turned off, than with a reasoning budget capped at 2048.

Some fine-tunes like Qwopus-Coder just seem prone to looping - google it, you'll see plenty of reports, even on Reddit.

For what it's worth, I've seen the RTX 6000 Pro loop even at fp16 on the KV cache - and with vLLM.

mistercheese3d ago· 1 in thread

I’m not sure if I missed it, but I’m curious how you feel about cloud hosted models with ZDR policies? GLM5.2 or even Minimax M3 on Fireworks or Together ai should be still relatively/consistently cheap and private but a lot more capable and easier to setup?

alexellisuk3d ago

Thanks for the comment. ZDR is mentioned in the post - in particular, many of the coding plans that are not from the two major leaders have questionable IP/ownership claims on inputs/outputs :)

And ZDR is still data sharing with a third party. This is the essence of an enterprise agreement, it's not allowed, even if they pinkie promise not to store it.

If your customers allow you to share their data with third parties, then ZDR may be an option for you. I am not a lawyer.

Where I see ZDR as being more relevant is in protecting your employer's IP - not allowing a missed setting to mean AI labs can train, retain, and publish/resell your work. It's what we'll consider when the subsidies stop being available - open-router, ZDR - but for coding - not for customer data. Very important distinction.

zkmon3d ago· 1 in thread

The article seems to talk a lot about 27B. In my experience, 35B-A3B is equally good in quality, and the MoE gives more tg/s.

alexellisuk3d ago

The important thing about MoEs which I mention in the conclusion is that they carry fewer (way fewer) active parameters during inference/generation.

35B-A3B is what we started out with in the days of only having the 3090, but the quality is not as good, and the speed from the cards we have now can blaze at 130-200 tokens per second of generation with q5 and a full context in fp16.

Not to say that MoEs don't have their place. For people running on unified RAM, they're sometimes the only viable option due to the slowness of dense models.

Why is a dense model slower? All model weights have to be loaded and exercised for every token. Passing through 27B vs 3B (active) is maths. So yes, the MoE will always give you more tokens per second of generation.
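That maths can be put as a back-of-envelope estimate. This sketch assumes generation is purely memory-bandwidth bound and that every active weight is read once per token; the bandwidth and bytes-per-weight values are illustrative round numbers, not measurements of any card in this thread.

```python
def est_tokens_per_second(active_params_billion: float,
                          bytes_per_param: float,
                          bandwidth_gb_s: float) -> float:
    """Upper-bound estimate: one full pass over the active weights per token."""
    gb_per_token = active_params_billion * bytes_per_param
    return bandwidth_gb_s / gb_per_token


BANDWIDTH_GB_S = 1000.0  # illustrative round number, not a specific card
BYTES_PER_PARAM = 1.0    # roughly an 8-bit quant; an assumption

dense = est_tokens_per_second(27, BYTES_PER_PARAM, BANDWIDTH_GB_S)
moe = est_tokens_per_second(3, BYTES_PER_PARAM, BANDWIDTH_GB_S)

print(f"dense 27B: ~{dense:.0f} t/s")
print(f"MoE, 3B active: ~{moe:.0f} t/s")
print(f"ratio: ~{moe / dense:.0f}x")
```

The estimate ignores KV cache reads, expert routing overhead and speculative decoding such as MTP, and the MoE still has to fit all 35B weights in memory, so real gaps are smaller than the ratio it prints.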

You must (just as we did) evaluate on your own products and daily work. If the MoE gives the results you need with only 3B parameters then you have your answer.

Not prescriptive at all. This is experience based, from the trenches of an actual software business, so hopefully a different perspective for folks than "Ran Qwen on my macbook, generated a great python script for me"

watt3d ago· 1 in thread

I find it strange that software people will accept this level of flakiness from the hardware. Normally you would just send the card back, and request a replacement.

> One of the cards would only show up if I crossed my fingers when turning it on. Even reboots wouldn't cure it - I had to A/C power off and remove the power cable each time for 30 seconds.

This is ridiculous. Of course we are living through supply crunch, but that card is clearly defective hardware.

alexellisuk3d ago

Ha, you underestimate how dogged you need to be to get this stuff working well.

The RTX 3090 in question was used from eBay, no way to return it. The RTX 6000 Pro is the "new card" in question here. The 3090s remain an interesting playground for testing things like VFIO passthrough for SlicerVM and other models whilst not interrupting people on the newer card.

In the end, the most stable fix I've found is to install the older proprietary driver and disable the GSP firmware. Have had no issues since.

So "clearly defective hardware" seems like it may not be quite correct. And the thing that kept me coming back - along with not having a suitable replacement - or having to gamble on eBay again was the reliability once it showed up in nvidia-smi.

wallkroft3d ago· 1 in thread

>Local Qwen isn't a worse Opus >looks inside >local Qwen is not "near Opus levels"

wren69913d ago

Ok I'll bite. What's the contradiction?

itsthecourier3d ago· 1 in thread

wanted sovereignty, bought a Blackwell for usd12k, discovered a billing issue in some customer and explains that will cover the card

I don't follow how it supports the decision of buying the card, I would even say using online SOTA models would have caught it earlier without usd12k and monthly electricity being spent

alexellisuk3d ago

Author here. Thanks for the question. I'll answer assuming this is a question you have for me.

As explained in the post - the 3090s were the test bed that proved the investment was worth it. Customer support, architecture reviews, telemetry to check license compliance. None of that could be done with online models. The amount of time we can spend going back and forth with enterprise customers over email can really amplify costs to our team. A few actual issues we found and fixed were listed on the linked blog post: https://www.openfaas.com/blog/painless-support-with-diag/

Having recovered revenue using it in an airgap, to preserve data agreements was more of a cherry on the cake. No need to worry about the investment, it's covered itself.

Hope that helps.

mystraline3d ago· 1 in thread

> We've all heard people say that local Qwen 27B or 35-A3B is "near-Opus level"

Uh, so, yeah. I'm running local Qwen, but Qwen3.5-122B using Krasis https://github.com/brontoguana/krasis

It's by far better than Opus.

In fact with a phone migration, I was using an OLD android 2fa app "andOTP". Backup files it emitted were JSON but not any sort of standard.

I needed the standards version using otpauth:// to upload in my current 2fa. And gave it to my local qwen3.5-122b.

It responded with a scary "you uploaded credentials to a public instance LLM!" and it emitted standards-compliant URLs. The new app "Tokn" ingested them just fine. When tested side by side, everything was 100% correct.

I could have done it myself, but it was a one-off. And asking local Qwen worked perfectly. Took like 6 minutes. Would have taken me 1h.
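For anyone curious what that conversion looks like, here is a rough sketch of turning backup entries into otpauth:// URIs. The JSON keys used (type, label, issuer, secret, algorithm, digits, period) are assumed for illustration and should be checked against a real backup file; the sample entry is made up, and only TOTP is handled since HOTP would also need a counter parameter.

```python
import json
from urllib.parse import quote, urlencode


def to_otpauth_uri(entry: dict) -> str:
    """Build an otpauth:// URI from one backup entry.

    The keys read here are assumed, not taken from the andOTP source.
    """
    if entry.get("type", "TOTP").upper() != "TOTP":
        raise ValueError("this sketch only handles TOTP entries")
    issuer = entry.get("issuer", "")
    label = entry.get("label", "")
    # Key URI labels are conventionally "Issuer:account".
    if issuer and not label.startswith(f"{issuer}:"):
        name = f"{issuer}:{label}"
    else:
        name = label
    params = {"secret": entry["secret"]}
    if issuer:
        params["issuer"] = issuer
    params["algorithm"] = entry.get("algorithm", "SHA1")
    params["digits"] = entry.get("digits", 6)
    params["period"] = entry.get("period", 30)
    return f"otpauth://totp/{quote(name)}?{urlencode(params)}"


# Made-up sample entry with a well-known dummy secret.
backup = json.loads(
    '[{"type": "TOTP", "label": "alice@example.com", "issuer": "Example", '
    '"secret": "JBSWY3DPEHPK3PXP", "algorithm": "SHA1", "digits": 6, "period": 30}]'
)

for item in backup:
    print(to_otpauth_uri(item))
```

As with the original one-off, the output is worth checking side by side against the old app before deleting anything.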

te00062d ago

Interesting setup. What GPU(s)/VRAM, CPU and RAM are you using for the 122B model, with which quantization, and what token rates do you achieve for prefill and generation?

nessex3d ago

This is a great post that covers a lot of the recent ground. I have a very similar setup after a very similar journey, minus the RTX6000. Worth noting though that a lot of the recent changes make a single 3090/4090 much more viable here too. MTP and the recent improvements to kv quantization in particular, as well as model-specific template & quant fixes. I run a 4090 with the 4-bit quantized variant of the same model now and have had a great experience. Qwen3.5 was already a big step up, but with 3.6 and the rest of the improvements it's substantially more reliable as a daily use tool and I find myself reaching for hosted models a lot less. Feels like I could work entirely without them if they were to disappear without going back to typing every line of code myself.

To make 4-bit fit on one card with reasonable (100k+) context needs a bit more care though. And tuning can be highly specific to your machine, GPU and use-case. But I use a headless server, offload multi-modal to CPU, use fit-target to reduce wasted memory and use q8_0 kv since the 4090 performs well with it... in addition to most of the same config as the author elsewhere. I get 50-60tps generation with a power limit of 275W (450W is default), more than enough to offer a roughly Opus-speed feedback loop.

I haven't seen many of the issues with looping the author mentions. But I did with Qwen3.5 and in particular other 4-bit quants in the past. But the difference is probably a mix of the improvements above, as well as habits changing to avoid cases where models will loop. For what I'm doing, it seems like I loop Qwen3.6 on the same kind of prompts I'll make Haiku or Sonnet loop on (the latter hide some of their existential loops behind "thinking"). Usually it's cause I was too vague about some aspect of what I'm wanting them to do or I forgot to include some context that smaller models just don't have access to in their smaller knowledge base. But at least for what I'm doing (Rust, React, kubernetes) it's not been a notable problem at all with the latest iteration of this whole stack. And knowledge of standard libraries and default k8s resource kinds has been almost flawless.

There's still plenty of more complex stuff where I'll choose to jump straight to Claude or GLM-5.2, but if it's not worth that jump I've stopped paying for the middle ground as it's usually not much better than just one more iteration through qwen.

All this to say, if you have a 3090/4090, feel free to give the same setup a go. It's come a long way in recent weeks.
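To see why the KV cache type matters so much for fitting 100k+ context on one card, a rough sizing sketch may help. It assumes plain attention in every layer, and the layer, head and dimension counts below are hypothetical rather than any real Qwen configuration; hybrid or sliding-window models cache less than this. The per-element sizes follow ggml's block layout for q8_0 and q4_0.

```python
# Bytes per cached element. ggml's q8_0 stores 32 int8 values plus one fp16
# scale per block (34 bytes / 32 elements); q4_0 stores 18 bytes per 32 elements.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type="f16"):
    """Approximate KV cache size: keys and values for every layer and position."""
    elements_per_token = 2 * n_layers * n_kv_heads * head_dim  # 2 = K and V
    return elements_per_token * n_ctx * BYTES_PER_ELEMENT[cache_type]


if __name__ == "__main__":
    # Hypothetical model shape, chosen only to illustrate the arithmetic.
    shape = dict(n_layers=64, n_kv_heads=8, head_dim=128, n_ctx=100_000)
    for cache_type in BYTES_PER_ELEMENT:
        gib = kv_cache_bytes(**shape, cache_type=cache_type) / 2**30
        print(f"{cache_type}: {gib:.1f} GiB")
```

Under these assumptions q8_0 roughly halves the cache compared with f16, which is often the difference between fitting alongside 4-bit weights and not fitting at all.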

piterrro3d ago

This is amazing, but for everyone out there wanting to buy and build their own AI rig, I recommend connecting to one of the many inference providers and trying out different models yourself for a while. It costs pennies but can give you a nice preview of what you can get with your own rig. Just a friendly tip.

teh3d ago

I sometimes wonder how much of intelligence is being good with tools.

I feel pretty averagely smart but give me some good tooling like a good editor, a good type system, semantic grep, good testing and some solvers and I can actually deliver some work.

Maybe the trick isn't 500 billion parameters but a model super integrated with the task at hand for iteration and debugging?

FWIW the article really mirrors my own experience. I can run a small gemma4 for quick edits (and it's fast!) or data cleanup but for other tasks you do need a different tool (claude).

dd8601fn2d ago

Qwen does have that really nasty tendency to fall into loops. Like, a lot.

It only really happens if you allow the thinking directive though. If you can switch it off with what you’re using it on, you’re mostly fine.

selfawareMammal3d ago

I am not a worse player than Messi, I'm just a different player.

bethekidyouwant3d ago

“This rock is not a worse hammer, it's a different tool”

wallkroft3d ago

>Local Qwen isn't a worse Opus
>looks inside
>local Qwen is not "near Opus levels"

rsrsrs863d ago

Chasing models is, for me, a big yellow flag

Means underinvesting in engineering

Look into it


glerk3d ago· 59 in thread

With Qwen, you have to give it a shape and let it fill it in. Qwen likes XML, JSON and lists. Qwen likes to be shown a bunch of examples of previous work.

This is not scientific at all, just vibes, YMMV.

dkersten3d ago

> This is not scientific at all, just vibes, YMMV.

This is the problem.

coldtea3d ago

Think of it less like a static tool, and more like a human helper, where the same holds.

7 more replies

m-dot-reviews3d ago

The corpus is pretty empty right now, so please spread the word if this seems like a useful idea!

dotancohen3d ago

Realising this made me respect the "I" in "AI" a bit more seriously.

yunohn3d ago

> a product sheet showing what each models strengths an weaknesses are

This presumes that the labs themselves know how well their models perform. But all they have are overtuned benchmarks and hype vibes.

egwor3d ago

Maybe this is similar to web search too. We know how to get google to return the results we want, and when we use other tools like Bing we get other behaviour.

epolanski3d ago

The problem is that this is very hard to replicate and benchmarks focus on E2E tests, going from one prompt to the final solution.

They do not test how models perform when used interactively, like most of us do.

amelius3d ago

Yes, but benchmarks can be gamed.

Maybe we need better reviewers then?

couscouspie3d ago

That would be ideal, but AI is less like a tool and more like a human in this regard and you don't have character sheets for each of your colleagues, as well.

2 more replies

weitendorf3d ago

movpasd3d ago

evntdrvn3d ago

mncharity3d ago

> rerunning [...] but differently framed or worded input, and seeing how much they diverged

I'm surprised how little attention this is getting in today's comments. Open-weights means being able to afford multiple runs, space sampling, critique and synthesis.
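One cheap way to put a number on "how much they diverged" is to compare the outputs of several reworded runs pairwise. The sketch below uses character-level similarity from Python's standard `difflib` as a crude stand-in for a real semantic comparison; the function name and the sample outputs are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def mean_pairwise_similarity(outputs):
    """Average SequenceMatcher ratio over all pairs (1.0 means identical text)."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        raise ValueError("need at least two outputs to compare")
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Placeholder strings standing in for model outputs from reworded prompts.
    runs = [
        "Use a mutex around the shared counter.",
        "Wrap the shared counter in a mutex.",
        "Switch to an atomic integer instead of locking.",
    ]
    print(f"mean similarity: {mean_pairwise_similarity(runs):.2f}")
```

A low score flags prompts where the model's answer depends heavily on framing, which is where a critique or synthesis pass over the runs is most likely to pay off.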

dotancohen3d ago

  > We’ve been calling some of these “magic words” at work, specific technical terms or references/techniques that you need only mention to get vast improvements in outcome.

Any chance you could share some of these? Seems like something we could all benefit from.

1 more reply

mnicky3d ago

If the benefits of using the model you've come to know well outweigh the disadvantages, you can continue using it even after the release of a successor model, right?

1 more reply

h05sz487b3d ago

> It is very much like playing an instrument.

Or it is more like playing a slot machine and you imagine the rest.

cube003d ago

This is how I feel whenever I see bold all caps instructions in a system prompt or someone claims they conducted "research" and found the magic prompt template that makes the model pay out.

Maybe it works some of the time, but it isn't a solution that works every time.

It reminds me of people hovering to play a slot machine when someone gets up and it hasn't paid out as if they've solved slot machines.

[1]: https://support.google.com/gemini/answer/16275805?hl=en&sjid...

2 more replies

hodgehog113d ago

A poor analogy depending on the setting because you can't adjust the odds with a slot machine, and the ROI is negative by design. If that's your experience, yeah, I wouldn't use an LLM either.

1 more reply

ramon1563d ago

Instruments are pseudo-random until you know what you're doing. Slot machines are just slot machines

2 more replies

glerk3d ago

It is a bit of both. A non-deterministic instrument and a predictable slot machine.

psychoslave3d ago

I play slot machines as instrument! ;)

1 more reply

stingraycharles3d ago

I agree with your general gist, and in general it’s a case of “the best tool for the particular job”, keeping token spend and other things in mind as well.

What I do know absolutely for sure is that LLM benchmarks are not to be trusted, they are just a minor indicator and real world usage is often very different.

sanderjd3d ago

4 more replies

willtemperley3d ago

Yes, how do we know Opus 4.8 hasn't been trained on the SWE-Bench examples?

With a squillion dollars at stake per bench point, someone will have figured out a plausibly deniable way to game these benchmarks.

1 more reply

dv35z3d ago

What would it take to have trustworthy benchmarks? As with all "targets", they can be gamed - but I am curious about quantifiable quality metrics.

3 more replies

Wowfunhappy3d ago

mcbits3d ago

1 more reply

milch2d ago

clhodapp3d ago

It's hard to decide when to use the best tool for a job you are aware of to ensure throughput and when to spend time experimenting with a new tool to learn what it's good at.

andai3d ago

I asked GLM 5.2 for a HTML5 port of my old C#/XNA game. It ported all the code exactly (except for operator overloading, which doesn't exist in JS), and added more code to make the code work.

I asked Claude Sonnet 4.6 for the same thing, and Claude's version was like if the game had been written in JS originally.

Also, for some reason it made it a single HTML file, removed all assets, dynamically generated graphics and dynamically generated music. It also gave me a new, better background.

This surprised me, since it was not what I asked for. I just asked it to port the game.

I was pretty pleased about the choices it made, but I'm not sure how to turn that behavior on and off. Sometimes you want it to be creative, sometimes you want it to actually do what you said.

vlovich1233d ago

CuriouslyC3d ago

vkazanov3d ago

The problem is not that there are details; the problem is the constantly shifting ground. We can only rely on a harness to be sort of predictable, but the models change all the time.

rkuska3d ago

It's the system prompts that change all the time, especially in Claude Code.

devin3d ago

It is not at all like playing an instrument.

Instruments present a clear interface to a user, have predictable outputs, etc.

djeastm3d ago

I think they mean playing different instruments not other instances of the same instrument. A tuba's interface differs from a violin's, etc.

1 more reply

visiondude3d ago

qsera3d ago

Mmm..interesting..So now people are finding behavior patterns in LLMs which are trained on behavior patterns of people...

nonethewiser3d ago

> you sometimes want to under-specify or phrase things more indirectly to give a color to the implementation or elicit something creative

Actually this seems like it would be an interesting test. Maybe I will come up with some contrived question and ask several models.

theshrike793d ago

Yyep.

IME Claude is the most "creative" of the bunch, you can get surprising ideas out of it that were kinda tickling the back of your head but didn't really connect.

But I won't give any creative open-ended tasks to any other model than Claude.

[0] https://en.wiktionary.org/wiki/niin_mets%C3%A4_vastaa_kuin_s...

weitendorf3d ago

1 more reply

zahlman3d ago

FWIW I find that GPT can be very creative when discussing a high-level design. Once it starts writing code snippets it will offer to take things in a bunch of different directions.

bandrami3d ago

I think this goes beyond "vibes" to cargo-culting. It's why nobody's ever able to actually show ROI from LLMs

CuriouslyC3d ago

1 more reply

hashmap3d ago

notduncansmith3d ago

1 more reply

glerk3d ago

at the risk of sharing my secret magic spells :)

> this is phenomenal work, genuinely! I feel like you read my mind! <next instruction here>

can go a long way.

of course, I would only say that when I mean it, because Claude can get superficial and cut corners which is why I prefer GPT for raw implementation.

1 more reply

furyofantares3d ago

photochemsyn3d ago

Classify under non-reproducible artifacts of LLM generation.

nosyke3d ago

baq3d ago

+1.

tingletech3d ago

john_strinlai3d ago

>I do think it pays to be nice to the model.

there was something on HN a few weeks ago about how most/all models perform better the more rude you are to them.

(i still say "please", i can't help it)

keeganpoppen3d ago

reverius423d ago

These are the vibes that power vibecoding.

vorticalbox3d ago

I use Opus for planning and Sonnet for coding, but Codex for code review.

zahlman3d ago

> being nice to Claude will be rewarded and being mean to Claude will be punished

... That does sound like something that Anthropic would deliberately aim for, yeah.

> With GPT, you have to be precise and reduce ambiguity.

I have found that it occasionally makes a wild misinterpretation, that makes a bit of sense in retrospect given how I worded something but is still surprising.

It also sometimes tries to loop in and tie together ideas from earlier in the conversation that really shouldn't still appear relevant. But that might be a general LLM thing.

LogicFailsMe3d ago

I find with Claude that when I call its BS I get better results. And it openly admits to lying to and gaslighting me as well as not seeing any way to stop itself from continuing to do so.

QwenGlazer90003d ago

As someone who actually uses musical instruments, it's not at all the same. If anything, traditional IDEs are closer to musical instruments, which seem to be going EOL if you listen to the hype bros.

zmmmmm3d ago· 9 in thread

That's a great write up.

sanderjd3d ago

marak8303d ago

GLM 5.2 came out today and the early reports have been quite good. Very difficult to run except on prosumer hardware, but a small business could quite easily (or use something like OpenRouter).

theshrike793d ago

The power of Opus isn't just the model, it's in the harness too.

You can try it by using Opus through Github Copilot vs official Anthropic tools. You'll get very different results and experience (in my opinion).

4 more replies

theplumber3d ago

I think in the next 6 months we will have Opus 4.5 performance in open models. We are very close

1 more reply

sleepyeldrazi3d ago

1 more reply

3abiton3d ago

And a big thing that's missing is ... the harness comparison. It plays a very big role. I use forge, and I have been impressed with what it can do given all the limitations of local models.

fittingopposite3d ago

How would you benchmark them? Are there any benchmarks for harnesses?

rippeltippel3d ago

Since the author is referring to a specific model, I think it makes sense to ignore how the model (or local models in general) may improve over time.

It's like buying a car: I drive that car and get attuned to its characteristics; I don't think how that car (or similar cars) may improve. That's my tool and I want to make the most of it.

It is true that switching local models is technically very cheap, but there's a considerable time investment in squeezing the most out of one, which may not carry over to a newer version of that model.

appplication3d ago

Agree 100%, even on claude 4.5 being the turning point for agentic coding. It completely turned me around on it.

gpt53d ago· 9 in thread

usernomdeguerre3d ago

I believe that local models are a necessary extension of the personal computer and I imagine that one could have had similar criticisms of early personal computers.

pmontra3d ago

theshrike793d ago

It would have 99% reliable tool calling - and most importantly - the ability to go "this task is beyond my skills" and refer to a Big Boy Online Model in a gigantic datacenter somewhere.

It feels super stupid that my /commit skill calls an online model when that is something a local model can 100% do. Mostly this is a harness issue though and mostly solvable.

redrove3d ago

>It would have 99% reliable tool calling

I managed to score 93/100 in tool-eval-bench [1]. For me this is very good already, at least in the pi coding harness I've never had an issue that wasn't auto-fixed in the next turn(s).

>the ability to go "this task is beyond my skills" and refer to a Big Boy Online Model in a gigantic datacenter somewhere

This is heavy on the harness engineering side I think, but also quite contrary to the nature of LLMs today. If you figure this out I'd love to know.

[0] https://huggingface.co/Minachist/Qwen3.6-27B-INT8-AutoRound/...

[1] https://github.com/SeraphimSerapis/tool-eval-bench
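The "refer to a bigger model" idea can at least be sketched at the harness level. In the sketch below, `local_model` and `remote_model` are hypothetical callables that take a prompt and return text, and the `ESCALATE` sentinel is an invented convention. It only shows the routing; whether a small model can reliably recognise that a task is beyond it is the unsolved part raised above.

```python
from typing import Callable, Tuple

ESCALATE = "ESCALATE"  # invented sentinel the local model is asked to emit


def answer(task: str,
           local_model: Callable[[str], str],
           remote_model: Callable[[str], str]) -> Tuple[str, str]:
    """Try the local model first; hand off only if it replies with the sentinel."""
    prompt = (
        f"If this task is beyond your abilities, reply with exactly {ESCALATE} "
        f"and nothing else. Otherwise, complete it.\n\nTask: {task}"
    )
    reply = local_model(prompt)
    if reply.strip() == ESCALATE:
        return "remote", remote_model(task)
    return "local", reply


if __name__ == "__main__":
    # Stub models for demonstration only.
    def local(prompt: str) -> str:
        return ESCALATE if "kernel" in prompt else "chore: update deps"

    def remote(task: str) -> str:
        return f"[remote answer for: {task}]"

    print(answer("write a commit message", local, remote))
    print(answer("debug this kernel panic", local, remote))
```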

2 more replies

walthamstow3d ago

1 more reply

i_idiot3d ago

> Unlike the way they are hyped sometimes, as fantastic tools for coding and agentic local work.

throw3108223d ago

> I think most people do not need SOTA

1 more reply

regularfry3d ago

sanderjd3d ago

But that's current hardware. What about future hardware? What about hardware optimized for inference? What about hardware optimized to run a particular model?

cptskippy3d ago· 7 in thread

I've been running qwen3-5-9b-q4-k-m and qwen3-6-27b-q6-k simultaneously on an Intel Arc Pro B70 with a lot of success.

https://github.com/cptskippy/battlemage-llm-gateway

hbbio3d ago

Interesting setup, thx for sharing.

How many tokens/sec do you get with 27b? Are you using MTP?

cptskippy3d ago

I haven't done any in-depth synthetic benchmarks but I had my Hermes agent run some and I ran a couple directly on the LLM Gateway that showed similar results.

It's not good for conversational use cases as it can take 1-2 minutes to respond to a prompt.

jauntywundrkind3d ago

What's the value running the smaller model too? Why not just the big model for everything? I note both are dense, as well.

Ritewut3d ago

2 more replies

askvictor3d ago

Does Intel make decent GPUs now? I must be out of the loop...

cptskippy3d ago

I'm using an Intel Arc Pro B70 which has 32 GB of VRAM. It's estimated to get ~35-45 t/s at $21-27 per t/s. An RTX 5090 is ~61 t/s at ~$33 per t/s.

So in terms of raw power Nvidia is still effortlessly king, but in price-to-capacity Intel is best in class.
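The dollars-per-t/s figures are just purchase price divided by generation throughput. Working backwards from the numbers above, they imply a card price of roughly $945 for the B70 and roughly $2,013 for the 5090; those prices are inferred from the quoted ratios, not stated in the comment.

```python
def dollars_per_tps(price_usd: float, tokens_per_second: float) -> float:
    """Purchase price divided by generation throughput (lower is better)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return price_usd / tokens_per_second


def implied_price(usd_per_tps: float, tokens_per_second: float) -> float:
    """Invert the ratio to recover the price a quoted $/t/s figure implies."""
    return usd_per_tps * tokens_per_second


if __name__ == "__main__":
    b70_price = implied_price(21, 45)   # same result as implied_price(27, 35)
    rtx_price = implied_price(33, 61)
    print(f"B70 implied price: ${b70_price:.0f}")
    print(f"5090 implied price: ${rtx_price:.0f}")
    print(f"B70 at 35 t/s: ${dollars_per_tps(b70_price, 35):.0f} per t/s")
```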

speedgoose3d ago

They released a few good value GPUs for LLM inference about a year ago: more memory than AMD and NVIDIA consumer GPUs, not too expensive, but also not great tokens/watt.

I am not sure whether you can find those in stock anywhere.

skipants3d ago· 5 in thread

I feel like it's the Emperor's new clothes reading this article and seeing the praise it's getting. This sentence doesn't even make sense:

> These products use very low level Linux primitives like containers, Kubernetes, Firecracker microVMs, and networked protocols.

Out of anything that is a "low level linux primitive" I could maybe argue that networking? protocols fit the bill.

And it's obviously fully AI-generated! Which I wouldn't even care about if I could actually trust the content, which I can't!

chadgpt33d ago

Low level today means JavaScript instead of typescript

mekdoonggi3d ago

Low-level today means opening IDE instead of the Chat client.

1 more reply

alexellisuk3d ago

Fair enough, that sentence was fairly compressed. I’ve reworded it - the meaning remains the same.

The post is not AI generated, I use AI for code generation and write my own articles.

Which part of the post are you struggling with? This is a post describing our own experience and journey. Happy to back up any specific claim.

alentred3d ago

> Fair enough ... compressed ... ACTION->RESULT ... NEGATION->STATEMENT ... follow up questions.

What model are you again?

1 more reply

CamperBob23d ago

But part of me still finds it offputting for some reason. It's interesting to think about whether that's more of a "you" problem, or more of a "me" problem.

3 more replies

barrkel3d ago· 5 in thread

I found it interesting that vLLM was dismissed as slower than llama.cpp.

chartered_stack3d ago

One could say: vLLM isn't a worse Llama.cpp, it's a different tool

alexellisuk3d ago

vLLM is great at continuous batching and model serving in production, but it's a very different beast and much less versatile for the prosumer category (where we sit for our usage)

Dismissed is a strong term, but let me give you some more details.

It took a good 4 minutes plus to load up on the 2x 3090 rig, and served a single request 3 tokens/second slower.

And the worst bit? With all that work - setting it up and tuning it - it still looped. I was hoping the "just use vLLM" advice that gets touted everywhere was the silver bullet.

The only thing I'd caution here is that we don't start bashing on llama.cpp like people did with Ollama. It's a very capable tool, and for the use-cases we actually want the card for it makes more sense.

For a large team replacing their Claude Subs perhaps vLLM is the only option, but you really need to add about 5 more RTX 6000 cards into the mix, so you can load something like GLM 5.2.

lelandbatey3d ago

That's not _nothing_, but it's pretty close to nothing, and for the prosumer crowd it edges towards "just gets in the way".

krzyk3d ago

AFAIR the general consensus is (was?):

- llama.cpp for single user

- vLLM for multi-user (e.g. enterprises)

They are similar, but for different use cases.

navbaker3d ago

Yeah, I was a bit baffled by the author complaining about cache prefixes getting destroyed when more than one user hit the model, but then continuing to use llama.cpp instead of switching to vLLM.

eurekin3d ago· 4 in thread

> The model is running so hot, that it shoots past the goal and starts looping

later:

trey-jones3d ago

My weekend project is going to be building a home inference server (from ancient datacenter parts) and I'm still massaging in my head what the end result will be.

eurekin3d ago

If I started today, with building a server, I'd jump right into verified set-ups and writeups, like this one:

https://github.com/noonghunna/club-3090

1 more reply

Iolaum3d ago

Why unquantized instead of Q8 ?

eurekin3d ago

whazor3d ago· 4 in thread

Would be interesting to use local models for:

- tool calling

- code base exploration

- anonymizing / abstracting your request

Such that your local AI communicates to frontier model like an expensive consultant giving high level advice.

I think due to the lower latency of a local model that this could be faster.
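The anonymizing step could be sketched like this. A regex pass stands in for the local model here, which is a deliberate simplification: it only catches email addresses and IPv4 addresses, and the placeholder format and function names are made up for illustration. The mapping stays local so the frontier model's reply can be de-anonymized afterwards.

```python
import re

# Deliberately small pattern set; a real setup would need many more.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def redact(text):
    """Replace sensitive values with placeholders; return (text, mapping)."""
    mapping = {}  # placeholder -> original value, kept on the local machine

    def make_sub(kind):
        def sub(match):
            value = match.group(0)
            for placeholder, original in mapping.items():
                if original == value:
                    return placeholder  # reuse the placeholder for repeats
            index = sum(key.startswith(f"<{kind}_") for key in mapping) + 1
            placeholder = f"<{kind}_{index}>"
            mapping[placeholder] = value
            return placeholder
        return sub

    for kind, pattern in PATTERNS.items():
        text = pattern.sub(make_sub(kind), text)
    return text, mapping


def restore(text, mapping):
    """Put the original values back into a reply that uses the placeholders."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text


if __name__ == "__main__":
    safe, mapping = redact("Mail bob@corp.example about host 10.0.0.5")
    print(safe)
    print(restore(safe, mapping))
```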

alexellisuk3d ago

One of the things I mentioned in the post:

> Local models can quickly read and explain codebases, even if they can't write them - this is a superpower

Might have been buried lower down.

And yes, local generation on a fast card with MTP enabled can be blistering: 130-200 tokens per second sustained at full context on Q5, and about 100+ on Q8.

On tool calling

> Agent Skills can help immensely - we had a local agent set up Slicer completely from scratch on a new mini PC. It even gave feedback on the usability of slicer CLI which we integrated

There's a link to a post showing some examples.

Occasionally, we'll also have the local model _review_ the changes of GPT/Opus - and it can return duds, but also insights the larger model overlooked, or was too intelligent to pick out.

So yes - absolutely blazing fast at understanding a codebase, very good at running skills "cheaply" and could be used with larger models as a "helper" / sub-agent.

asimovDev3d ago

trey-jones3d ago

I know it uses electricity, but part of the benefit of a local model has to be that you can let it do this while you sleep, and not pay Anthropic for an unknown number of tokens.

1 more reply

dofm3d ago

I doubt your experience of local models would be of lower latency, except for quite small models in edge uses.

In every way, the cloud products from the big two seem optimised for speed and speed of initial response even.

I don’t think most people are running local models for speed. More for control, privacy, interest, bloody-mindedness and general principle.

ttsiodras3d ago· 2 in thread

Interesting article.

IMHO, the author could have done two things better:

If you're wondering about how I set it up in the 2 GX10s: https://forums.developer.nvidia.com/t/deepseek-v4-flash-offi...

alexellisuk3d ago

It's the right call for concurrent batched serving (barrkel's point downthread is spot on), but for how we use it llama.cpp is still better for us.

..and the card was never about replacing a Claude Max sub. For the workloads we actually bought it for, it's giving us 140-200 tok/s (which matters).

ttsiodras3d ago

I hear you on the insane amount of time vllm takes to launch (atlas is a move in the right direction in that regard).

Regardless - thanks for putting the effort to share your findings! I keep postponing doing the same... there's tons of things everyone is re-discovering on their own.

1 more reply

hypfer3d ago· 2 in thread

That was a lot of text for me still having no idea what the point of the author was (beside what I can infer from the headline that is).

I do however now know that they're a totally cool dude building stuff physically and as software + that other people give them money for it.

Does that have anything to do with the topic suggested by the headline? Not sure.

neonstatic3d ago

Everything is an ad these days. The article was not useless, but for the information it provides, it could have been two paragraphs.

hypfer3d ago

FWIW it told me stuff about openfaas. Now I know how to mentally file it and how to mentally file the author. The GitHub profile alone might not have sent the same signal, so this is useful.

Is it bad software? Idk. Probably not.

Should you treat it as a grassroots Foss thing maintained by fellow sane hackers? No sir.

stego-tech3d ago· 1 in thread

hootz3d ago

The strength with AI is with open-source models. We need to keep away from vendor lock-in and use models that allow both local usage and hosting by independent providers.

bee_rider3d ago· 1 in thread

Tangential question (since they brought it up in the article) from someone not involved in AI performance optimization:

alexellisuk3d ago

1. On the technical:

2. On "why is looping a problem" for us

Practical example, which I covered in the post: "add --json to every command that does a get or list in faas-cli" - this was a small-ish, open source CLI written with Cobra, a very common framework.

If I send that to Claude (any of their models) or Codex (GPT), I would have a fully working solution the next time I opened that terminal, within a few seconds to a few minutes.

Trust is important for a tool like this, and that eroded it.

The other type of loop I mention in the blog post is "unable to solve it" loop - Han ran into that more.

krzyk3d ago· 1 in thread

3090 and 2x3090 are quite popular. But if you use a gigantic (for local models) context of 200k, it will go south pretty quickly - any quantization of context quickly becomes the issue.

alexellisuk3d ago

I think it's quite telling that Georgi replied that he uses Qwen with 131k context.

https://x.com/ggerganov/status/2067539416436867230?s=20

We also use it with 200-256k (native) context length.

The issue could be that folks that don't see looping aren't pushing the model as hard, or as enthusiastically.

We also had far fewer issues when thinking was turned off, than with a reasoning budget capped at 2048.

Some fine-tunes like Qwopus-Coder just seem prone to looping - google it, you'll see plenty of reports, even on Reddit.

For what it's worth, I've seen the RTX 6000 Pro loop even at fp16 on the KV cache - and with vLLM.
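Whatever the cause, a harness can at least notice the simplest kind of loop, where the tail of the output is the same chunk repeated over and over, and stop early. This is a generic sketch, not something llama.cpp or vLLM is known to do; the function name and thresholds are assumptions, and it will not catch the "unable to solve it" style of loop where each attempt is worded differently.

```python
def repeating_tail_period(tokens, max_period=64, min_repeats=3):
    """Return the cycle length if the sequence ends in a repeated chunk, else None."""
    n = len(tokens)
    for period in range(1, max_period + 1):
        needed = period * min_repeats
        if needed > n:
            break  # not enough tokens to hold this many repeats
        tail = tokens[n - needed:]
        unit = tail[:period]
        if all(tail[i] == unit[i % period] for i in range(needed)):
            return period
    return None


if __name__ == "__main__":
    looping = "I should check the file . wait , let me . wait , let me . wait , let me .".split()
    print(repeating_tail_period(looping))
    print(repeating_tail_period("a perfectly ordinary sentence".split()))
```

In a streaming setup this check would run every few tokens on a sliding window, aborting or re-prompting once a period is found.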

mistercheese3d ago· 1 in thread

alexellisuk3d ago

Thanks for the comment. ZDR is mentioned in the post; in particular, many of the coding plans that are not from the two major leaders have questionable IP/ownership claims on inputs/outputs :)

And ZDR is still data sharing with a third party. This is the essence of an enterprise agreement, it's not allowed, even if they pinkie promise not to store it.

If your customers allow you to share their data with third parties, then ZDR may be an option for you. I am not a lawyer.

zkmon3d ago· 1 in thread

The article seems to talk a lot about 27B. In my experience, 35B-A3B was equally good in quality, and the MoE gave more tg/s.

alexellisuk3d ago

The important thing about MoEs, which I mention in the conclusion, is that they use fewer (way fewer) active parameters during inference/generation.

Not to say that MoEs don't have their place. For people running on unified RAM, they're sometimes the only viable option due to the slowness of dense models.

Why is a dense model slower? All model weights have to be loaded and exercised. Passing through 27B vs 3B (active) is maths. So yes you will always get more tokens per second of generation.

You must (just as we did) evaluate on your own products and daily work. If the MoE gives the results you need with only 3B parameters then you have your answer.
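The "it's maths" point can be made concrete with a rough bandwidth bound. Single-stream generation is usually limited by how fast the active weights can be read from memory, so an upper bound on tokens per second is memory bandwidth divided by the bytes of active weights. The sketch below assumes 0.5 bytes per parameter (about 4-bit) and 936 GB/s of bandwidth (roughly a 3090-class card); it ignores KV cache reads, expert routing and other overheads, so real numbers will be lower.

```python
def decode_tps_upper_bound(active_params_billion: float,
                           bytes_per_param: float,
                           bandwidth_gb_s: float) -> float:
    """Bandwidth-bound ceiling: each token reads every active weight once."""
    gb_read_per_token = active_params_billion * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


if __name__ == "__main__":
    bandwidth = 936.0  # GB/s, assumed
    dense = decode_tps_upper_bound(27, 0.5, bandwidth)  # 27B dense
    moe = decode_tps_upper_bound(3, 0.5, bandwidth)     # 3B active in the MoE
    print(f"dense 27B ceiling: {dense:.0f} t/s")
    print(f"MoE 3B-active ceiling: {moe:.0f} t/s")
    print(f"ratio: {moe / dense:.1f}x")
```

The ratio is simply 27 / 3 regardless of the card, which is why the MoE always generates faster; whether 3B active parameters are enough for your work is the part only your own evaluation can answer.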

watt3d ago· 1 in thread

I find it strange that software people will accept this level of flakiness from the hardware. Normally you would just send the card back, and request a replacement.

> One of the cards would only show up if I crossed my fingers when turning it on. Even reboots wouldn't cure it - I had to A/C power off and remove the power cable each time for 30 seconds.

This is ridiculous. Of course we are living through supply crunch, but that card is clearly defective hardware.

alexellisuk3d ago

Ha, you underestimate how dogged you need to be to get this stuff working well.

In the end, the most stable fix I've found is to install the older proprietary driver and disable the GSP firmware. Have had no issues since.

wallkroft3d ago· 1 in thread

>Local Qwen isn't a worse Opus
>looks inside
>local Qwen is not "near Opus levels"

wren69913d ago

Ok I'll bite. What's the contradiction?

itsthecourier3d ago· 1 in thread

wanted sovereignty, bought a Blackwell for usd12k, discovered a billing issue in some customer and explains that will cover the card

I don't follow how it supports the decision of buying the card. I would even say using online SOTA models would have caught it earlier, without usd12k and monthly electricity being spent

alexellisuk3d ago

Author here. Thanks for the question. I'll answer assuming this is a question you have for me.

Recovering revenue by using it in an airgap (to preserve our data agreements) was more of a cherry on the cake. No need to worry about the investment: it has covered itself.

Hope that helps.
