It drew a better pelican riding a bicycle than Opus 4.7 did! https://simonwillison.net/2026/Apr/16/qwen-beats-opus/
I just tried this GGUF with llama.cpp in its UD Q4_K_XL version on my custom agent-oriented task consisting of wiki exploration and automatic database building ( https://github.com/GistNoesis/Shoggoth.db/ )
I noted a nice improvement over Qwen3.5 in its ability to discover new creatures in the open-ended search task, but I haven't quantified it with numbers yet. It also seems faster, at around 140 token/s compared to 100 token/s, but that may be due to different configuration options.
One small difference from Qwen3.5: to avoid out-of-memory crashes in multimodal mode I had to pass --no-mmproj-offload, which disables the GPU offload of the projector that converts images to tokens; otherwise it would crash on high-resolution images. I also used a quantized KV cache by passing -ctk q8_0 -ctv q8_0, and with a ctx-size of 150000 it only needs 23099 MiB of device memory, which means no partial RAM offloading when I use an RTX 4090.
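Putting those flags together, a minimal sketch of the llama-server invocation described above (the model filename is a placeholder; the flags are current llama.cpp options):

    llama-server -m Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
        --no-mmproj-offload \
        -ctk q8_0 -ctv q8_0 \
        --ctx-size 150000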
* It's sitting on the tire, not the seat.
* Is that weird white and black thing supposed to be a beak? If so, it's sticking out of the side of its face rather than the center.
* The wheel spokes are bizarre.
* One of the flamingo's legs doesn't extend to the pedal.
* If you look closely at the sunglasses, they're semi-transparent, and the flamingo only has one eye! Or the other eye is just on a different part of its face, which means the sunglasses aren't positioned correctly.
* (subjective) The sunglasses and bowtie are cute, but you didn't ask for them, so I'd actually dock points for that.
* (subjective) I guess flamingos have multiple tail feathers, but it looks kinda odd as drawn.
In contrast, Opus's flamingo isn't as detailed or fancy, but more or less all of it looks correct.
https://files.catbox.moe/r3oru2.png
- My Qwen 3.6 result had sun and cloud in sky, similar to the second Opus 4.7 result in Simon's post.
- My Qwen 3.6 result had no grass (except as a green line), but all three results in Simon's post had grass (thick).
- My Qwen 3.6 result had visible "tailing air motion" like Simon's Qwen 3.6 result.
- My Qwen 3.6 result had a "sun with halo" effect that none of Simon's results had.
But, I know, it's more about the pelican and the bicycle.
I can't comment on that flamingo.
https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
"Make a single-page HTML file using threejs from a CDN. Render a scene of a flying dinosaur orbiting a planet. There are clouds with thunder and lightning, and the background is a beautiful starscape with twinkling stars and a colorful nebula"
This allows me to evaluate several factors across models. It is novel and creative. I generally run it multiple times, though now that I have shared it here, I will come up with new scenes of my own to evaluate with.
I also consider how well it one-shots the task, what errors it generates, how it responds when errors are pointed out, and how quickly it iterates toward an improvement.
Generally speaking, Claude Sonnet has done the best, Qwen3.5 122B comes in second, and I have gotten nice results from Qwen3.5 35B.
ChatGPT does not do well. It can complete the task without errors but the creativity is atrocious.
I thought that's exactly what they are?
The right one looks much better, plus adding sunglasses without prompting is not that great. Hopefully it won't add some backdoor to the generated code without asking. ;)
GLM-5.1 added a sparkling earring to a north Virginia opossum the other day and I was delighted: https://simonwillison.net/2026/Apr/7/glm-51/
If we want to get nitty gritty about the details of a joke, a flamingo probably couldn't physically sit on a unicycle's seat and also reach the pedals anyways.
- Stylized gradients on the flamingo
- Flowers
- Ground/grass has a stylized look and feel

...and despite a miss along the Y-axis where it sits below the seat, a couple of oddly arranged tail feathers, and the spokes, the composition overall is much closer to a production-quality result.
Opus 4.7 looks like 20 seconds in MS paint.
Qwen3.6 looks incomplete due to the sitting position, but like a WIP I could see on a designer coworker's screen if I walked up and interrupted them. Click and drag it up, adjust the tail feathers and spokes, and you're there, or much closer, to a usable output.
Simon, any ideas?
https://ibb.co/FLc6kggm (tried here temperature 0.7 instead of pure defaults)
Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
Thinking mode for precise coding tasks (e.g. WebDev): temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
Instruct (or non-thinking) mode for general tasks: temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
Instruct (or non-thinking) mode for reasoning tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
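As a concrete sketch, here is how e.g. the thinking-mode general settings map onto llama.cpp flags (the flag names are current llama-server options; the model path is a placeholder):

    llama-server -m Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
        --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 \
        --presence-penalty 1.5 --repeat-penalty 1.0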
(Please note that support for sampling parameters varies across inference frameworks.)

I'm impressed by the reach of your blog, and I'm hoping to get into blogging similar things. I currently have a lot on my backlog to blog about.
In short, keep up the good work with an interesting blog!
Is the 20.9GB GGUF version noticeably better, or is the difference negligible?
I get some really amusing 'reflective' responses, but I think it needs a bit more cooking. Maybe I'll try another variant.
[1] https://news.ycombinator.com/item?id=47246746 [2] https://news.ycombinator.com/item?id=47249343
Maybe for LLMs, since everyone has their own competing LLM, but with video models, Wan 2.2 did a rug pull and left a huge gap for the community that had built around it, and I don't think a single open video model has come close since. Wan is at 2.7 now, and it's been nearly a year since the last update.
If you download the release-day quants with a tool that doesn't automatically check HF for new versions, you should check back again in a week to look for updated versions.
Sometimes the launch-day quantizations have major problems, which leads to early adopters dismissing useful models. You have to wait for everyone to test and fix bugs before giving a model a real evaluation.
For MiniMax 2.7 there were NaNs, but it wasn't just ours - all quant providers had them - we found NaNs in 38% of bartowski's; ours was 22%. We identified a fix and have already fixed ours, see https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax.... Bartowski has not, but is working on it. We always share our investigations.
For Qwen3.5 we shared our 7TB of research artifacts showing which layers not to quantize - all providers' quants were suboptimal (not broken) - the ssm_out and ssm_* tensors were the issue - we're now the best in terms of KLD and disk space - see https://www.reddit.com/r/LocalLLaMA/comments/1rgel19/new_qwe...
On other fixes, we also fixed bugs in many OSS models like Gemma 1, Gemma 3, Llama chat template fixes, Mistral, and many more.
It might seem like these issues are due to us, but that's because we publicize them and tell people to update. 95% of them are not related to us, but as good open-source stewards, we should keep everyone updated.
Users of the quantized model might even be led to think that the model sucks because the quantized version does.
An imperfect analogy might be the Linux kernel. Linus publishes official releases as a tagged source tree but most people who use Linux run a kernel that has been tweaked, built, and packaged by someone else.
That said, models often DO come from the factory in multiple quants. Here's the FP8 quant for Qwen3.6 for example: https://huggingface.co/Qwen/Qwen3.6-35B-A3B-FP8
Unsloth and other organizations produce a wider variety of quants than upstream to fit a wider variety of hardware, and so end users can make their own size/quality trade-offs as needed.
Qwen did release an fp8 version, which is a quantized version.
- Why is Qwen's default "quantization" setup "bad"?
- Who is Unsloth?
- Why is their format better? What gains does a better format give? What are the downsides of a bad format?
- What is quantization?

Granted, I can look this up myself, but I thought I'd ask for the full picture for other readers.
https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs might be helpful. You might have heard of the 1-bit dynamic DeepSeek quants (we did those) - not all layers can be 1-bit - the important ones stay in 8-bit or 16-bit, and we show it still works well.
Unsloth releases lower-quality versions of the model (Qwen in this case). Think about taking a 95% quality JPEG and converting it to a 40% quality JPEG.
Models are quantized to lower quality/size so they can run on cheaper/consumer GPUs.
Precision   Quantization Tag   File Size
1-bit       UD-IQ1_M           10 GB
2-bit       UD-IQ2_XXS         10.8 GB
            UD-Q2_K_XL         12.3 GB
3-bit       UD-IQ3_XXS         13.2 GB
            UD-Q3_K_XL         16.8 GB
4-bit       UD-IQ4_XS          17.7 GB
            UD-Q4_K_XL         22.4 GB
5-bit       UD-Q5_K_XL         26.6 GB
16-bit      BF16               69.4 GB

With 16 GB you'll only be able to run a very compressed variant with noticeable quality loss.
Also, you need to check your context size: Ollama defaults to 4K if you have <24 GB of VRAM, and you need 64K minimum if you want Claude to be able to at least lift a finger.
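For example, a hedged sketch of raising the context from the Ollama REPL (num_ctx is a real Ollama parameter; the model tag is a placeholder):

    ollama run qwen3.6
    >>> /set parameter num_ctx 65536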
Having implemented a GGUF parser, I'd beg to differ on the “sane format” qualifier.
Small openweight coding models are, imho, the way to go for custom agents tailored to the specific needs of dev shops that are restricted from accessing public models.
I'm thinking about banking and healthcare sector development agencies, for example.
It's a shame this remains a market largely overlooked by Western players, Mistral being the only one moving in that direction.
I said in a recent comment that Mistral is the only one of the current players who appears to be moving toward a sustainable business - all the other AI companies are simply looking for a big payday, not to operate sustainably.
If some organization forbade external models they should invest in the hardware to run bigger open models. The small models are a waste of time for serious work when there are more capable models available.
Granted, these next couple of years are going to suck because of the AI Component Drought, but progress marches on and the power and price of running today's frontier models will be affordable to mere mortals in time. Obviously we've hit the wall with Moore's law and other factors but this will not always be out of reach.
"Qwen's base models live in a very exam-heavy basin - distinct from other base models like llama/gemma. Shown below are the embeddings from randomly sampled rollouts from ambiguous initial words like "The" and "A":"
First I tried Mudler's APEX-I-Quality, then later the Byteshape Q3_K_S-3.40bpw.
Both made claims that seemed too good to be true, but I couldn't find any traces of lobotomization in long agentic coding loops. With the Byteshape quant I'm up to 40+ t/s, a speed that makes agents much more pleasant. On an RTX 3060 12GB and 32GB of system RAM, I went from slamming all my available memory to having about 14GB to spare.
Unsloth and Byteshape are just using and highlighting features that have been available the whole time. I am very invested in figuring out a solution to this dispute, or some way to get the new quants upstreamed.
Must. Parse. Is this a 35 billion parameter model that needs only 3 billion parameters to be active? (Trying to keep up with this stuff.)
EDIT: A later comment seems to clarify:
"It's a MoE model and the A3B stands for 3 Billion active parameters…"
I’ve increasingly started self hosting everything in my home lately because I got tired of SAAS rug pulls and I don’t see why LLM’s should eventually be any different.
The documents have subtly different formatting and layout due to source variance. Previously we used a large set of hierarchical heuristics to catch as many edge cases as we could anticipate.
Now, with the multimodal capabilities of these models, we can leverage their language capabilities alongside vision to extract structured data from a table that has 'roughly this shape' and 'this location'.
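A hedged sketch of what such an extraction call can look like against a local OpenAI-compatible multimodal endpoint such as llama-server (the URL, model name, and prompt are illustrative, and the base64 payload is elided):

    curl http://localhost:8080/v1/chat/completions -d '{
      "model": "qwen3.6-35b-a3b",
      "messages": [{"role": "user", "content": [
        {"type": "text", "text": "Extract the line items from the table in this scan as JSON."},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
      ]}]
    }'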
What does better mean here? Does it handle formal vs informal speech? Idiomatic expressions? Regional variances (like American vs British English)? These are areas where Google Translate is weak.
How fast are we talking here (including initial loading times) and what's the impact on your phone battery? Also, what iPhone do you have?
I am really interested in this application hence my questions.
The local models don’t really compete with the flagship labs for most tasks
But there are things you may not want to send to them for privacy reasons or tasks where you don’t want to use tokens from your plan with whichever lab. Things like openclaw use a ton of tokens and most of the time the local models are totally fine for it (assuming you find it useful which is a whole different discussion)
Unless you have a corporate lock-in/compliance need, there has been no reason to use Haiku or GPT mini/nano/etc over open weights models for a long time now.
I do have a $20 claude sub I can fall back to for anything qwen struggles with, but with 3.5 I have been very pleased with the results.
> and finding more value than just renting tokens from Anthropic or OpenAI?
Buying hardware to run these models is not cost effective. I do it for fun for small tasks but I have no illusions that I’m getting anything superior to hosted models. They can be useful for small tasks like codebase exploration or writing simple single use tools when you don’t want to consume more of your 5-hour token budget though.
I also use a bigger model for summarizing or translating text, which I don't consume in real time, so it doesn't need to be fast. That would be something I could use OpenAI's batch APIs for if I needed something higher quality.
There are also web UIs - just like the labs' ones.
And you can connect coding agents like Codex, Copilot or Pi to local models - they support OpenAI-compatible APIs.
It's literally a terminal command to start serving the model locally, and you can connect various things to it, like Codex.
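A minimal sketch of that flow with llama.cpp's llama-server (the model path is a placeholder; /v1 is its OpenAI-compatible endpoint):

    llama-server -m Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --port 8080
    # then point the agent's OpenAI-compatible base URL at http://localhost:8080/v1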
It's entertaining to see HN increasingly treat a coding harness as the only value a model can provide.
This one is by far the most capable. I've tried various versions of gemma4-26b, various versions of qwen3.5-27/35b (Qwopuses galore), nemotron, phi, glm4.7.
This one is noticeably better as an agent. It's really good at breaking down tasks into small actionable steps and - where there is ambiguity - asking for clarification. Its reasoning seems more solid than gemma4's, as do its tool use and multi-message/longer-chain thinking.
I am excited to see what other versions of this model people train!
You can try to offload the experts on CPU with llama.cpp (--cpu-moe) and that should give you quite the extra context space, at a lower token generation speed.
[0] https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF?show_fil...
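A hedged example of that offload (--cpu-moe is a current llama.cpp flag; the model file is a placeholder):

    llama-server -m Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --cpu-moe -c 65536

This keeps the MoE expert tensors in system RAM while attention and shared layers stay on the GPU, trading token generation speed for VRAM headroom.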
In short, it has its uses, but it would not (and should not) be the main driver. Will it get better? I'm sure of it. But there is too much hype and exaggeration around open-source models; for one, the hardware simply isn't there at a price point where we can run something that seriously competes with today's closed models.
If we got something like GPT-5.4-xhigh that can run on some local hardware under 5k, that would be a major milestone.
What is gonna happen when that happens? They are gonna cry they need GPT-$CURRENT capabilities locally.
Now we have local models that are way better than GPT-2 (careful, that one was way too dangerous to release!) and GPT-3.5, in some ways better than 4, and they can run on reasonably modest hardware.
│ Qwen 3.6 35B-A3B │ Haiku 4.5
────────────────────────┼──────────────────┼────────────────────────
SWE-Bench Verified │ 73.4 │ 66.6
────────────────────────┼──────────────────┼────────────────────────
SWE-Bench Multilingual │ 67.2 │ 64.7
────────────────────────┼──────────────────┼────────────────────────
SWE-Bench Pro │ 49.5 │ 39.45
────────────────────────┼──────────────────┼────────────────────────
Terminal Bench 2.0 │ 51.5 │ 61.2 (Warp), 27.5 (CC)
────────────────────────┼──────────────────┼────────────────────────
LiveCodeBench │ 80.4 │ 41.92
These are of course all public benchmarks, though - I'd expect some memorization/overfitting to be happening. The proprietary models usually have a bit of an advantage in real-world tasks in my experience.

Even Qwen3.5 35B A3B benchmarks roughly on par with Haiku 4.5, so Qwen3.6 should be a noticeable step up.
https://artificialanalysis.ai/models?models=gpt-oss-120b%2Cg...
No, these benchmarks are not perfect, but short of trying it yourself, this is the best we've got.
Compared to the frontier coding models like Opus 4.7 and GPT 5.4, Qwen3.6 35B A3B is not going to feel smart at all, but for something that can run quickly at home... it is impressive how far this stuff has come.
How close to Opus 4.6 can I get with this? Realistic, real-world usage. And I mean not sitting there for minutes waiting for the model to finish saying hello, and actually being able to use it for something more than a pelican riding a bicycle.
I'm asking because I'm always seeing excited replies, then I get excited, then I spend minutes to hours setting up the model and then, after first use I forget it exists for one reason or another.
Can I get any realistic use out of this?
[1] https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/discussi...
Balancing the KV cache and context eats VRAM super fast.
https://gist.github.com/duh17/2db5351da026cec4bd4f46e169e75e...
Here is the full session:
https://pi.dev/session/#c3d003becb1bfcc7ffbca04e89e1adf8
This is by far my smoothest agentic session using a local model of any size. The output quality and speed have really struck the right balance. Very impressive release.
I asked it to give me instructions on how to create an SSH key and it tried to do it instead of just answering.
It's better than 27b?
This model is the first provided with open weights from their newer Qwen3.6 family of models.
Judging from its medium size, Qwen/Qwen3.6-35B-A3B is intended as a superior replacement for Qwen/Qwen3.5-27B.
It remains to be seen whether they will also publish in the future replacements for the bigger 122B and 397B models.
The older Qwen3.5 models can also be found in uncensored modifications. It also remains to be seen whether it will be easy to uncensor Qwen3.6, because for some recent models, like Kimi-K2.5, the methods used to remove censoring from older LLMs no longer work.
Not at all, Qwen3.5-27B was much better than Qwen3.5-35B-A3B (dense vs MoE).
At the time of writing, all deepseek or qwen models are de facto prohibited in govcon, including local machine deployments via Ollama or similar. Although no legislative or executive mandate yet exists [1], it's perceived as a gap [2], and contracts are already including language for prohibition not just in the product but any part of the software environment.
The attack surface for a (non-agentic) model running in local ollama is basically non-existent . . but, eh . . I do get it, at some level. While they're not l33t haXX0ring your base, the models are still largely black boxes, can move your attention away from things, or towards things, with no one being the wiser. "Landing Craft? I see no landing craft". This would boil out in test, ideally, but hey, now you know how much time your typical defense subcon spends in meaningful software testing[3].
[1] See also OMB Memorandum M-25-22 (preference for AI developed and produced in the United States), NIST CAISI assessment of PRC-origin AI models as "adversary AI" (September 2025), and House Select Committee on the CCP Report (April 16, 2025), "DeepSeek Unmasked".
[2] Overall, rather than blacklist, I'd recommend a "whitelist" of permitted models, maintained dynamically. This would operate the same way you would manage libraries via SSCG/SSCM (software supply chain governance/management) . . but few if any defense subcons have enough onboard savvy to manage SSCG let alone spooling a parallel construct for models :(. Soooo . . ollama regex scrubbing it is.
[3] i.e. none at all, we barely have the ability to MAKE anything like software, given the combination of underwhelming pay scales and the fact defense companies always seem to have a requirement for on-site 100% in some random crappy town in the middle of BFE. If it wasn't for the downturn in tech we wouldn't have anyone useful at all, but we snagged some silcon refugees.
Should I use brew to install llama.cpp, or zypper to install the Tumbleweed package?
I'm on an Nvidia GPU, but I want to be able to combine VRAM with system memory.
    #include <stdio.h>
    int m

I get nonsensical autocompletions like:

    #include <stdio.h>
    int m</fim_prefix>

What is going on?

Qwen specifically calls out FIM ("fill in the middle") support on the model card, and you can see it getting confused and emitting the control tokens in the example here.
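For reference, a hedged sketch of the raw FIM prompt shape, using the special-token names from earlier Qwen coder models (Qwen3.6's exact tokens may differ - check the tokenizer config):

    <|fim_prefix|>#include <stdio.h>
    int m<|fim_suffix|><|fim_middle|>

The model is supposed to generate only the middle span; if the editor plugin sends plain text without these tokens (or with mismatched ones), the model can end up echoing control tokens back as garbage, as seen above.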
Sometimes they don't manage any tool calls and fall over right off the bat; other times they manage a few tool calls and then start spewing nonsense. Some can manage sub-agents for a while, then fall apart. I just can't seem to get consistently decent output on more 'consumer/home PC' type hardware. Mostly I've been using either pi or OpenCode for this testing.
If you want something closer to the frontier models, Qwen3.6-Plus (not open) is doing quite well[1] (I've not tested it extensively personally):
[1] https://artificialanalysis.ai/?models=gpt-5-4%2Cgpt-oss-120b...
No. These are nowhere near SotA, no matter what numbers go up on the benchmarks. They are amazing for what they are (runnable on regular PCs), and you can find use cases for them (where privacy >> speed / accuracy) where they perform "good enough", but they are not magic. They have limitations, and you need to adapt your workflows to handle them.
I'm just starting my exploration of these small models for coding on my 16GB machine (yeah, puny...) and am running into issues where the solution may very well be to reduce the scope of the problem set so the smaller model can handle it.
Running at a full load of 1000W for every second of the year, for a model that produces 100 tps, at 16 cents per kWh, costs about $1,400 USD.
The same amount of tokens would cost at least $3,150 USD on current Claude Haiku 3.5 pricing.
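Spelling out the arithmetic (the ~$1/M-token rate is the floor implied by the $3,150 figure):

    electricity: 1 kW × 8,760 h/yr = 8,760 kWh; × $0.16 ≈ $1,400/yr
    tokens:      100 tok/s × 31,536,000 s/yr ≈ 3.15B tokens; × $1/M ≈ $3,150/yr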
You want to wash your car. Car wash is 50m away. Should you walk or go by car?
> Walk. At 50 meters, the round trip is roughly 100 meters, taking about two minutes on foot. Driving would require starting the engine, navigating, parking, and dealing with unnecessary wear for a negligible distance. Walk to the car wash, and if the bay requires the vehicle inside, have it moved there or return on foot. Walking is faster and more efficient.
Classic response. It was really hard to one-shot this with Qwen3.5 Q4_K_M.
Qwen3.6 UD-IQ4_XS also failed the first time, then I added this to the system prompt:
> Double check your logic for errors
Then I created a new dialog and asked the puzzle and it responded:
> Drive it. The car needs to be present to be washed. 50 meters is roughly a 1-minute walk or a 10-second drive. Walking leaves the car behind, making the wash impossible. Driving it the short distance is the only option that achieves the goal.
Now 3.6 gets it right every time. So not as great as a super model, but definitely an improvement.
> This sounds like a logic riddle! The answer is: You should go by car. Here is why: If you walk, you will arrive at the car wash, but your car will still be 50 meters away at home. You can't wash the car if the car isn't there! To accomplish your goal, you have to drive the car to the car wash.
It had the wrong answer in its thinking. It did think longer than usual:
Direct answer: Walk.
Reasoning 1: Distance (50m is negligible).
Reasoning 2: Practicality/Efficiency (engine wear/fuel).
Reasoning 3: Time (walking is likely faster or equal when considering car prep).
...
Wait, if I'm washing the car, I need to get the car to the car wash. The question asks how I should get there.
...
Wait, let's think if there's a trick. If you "go by car," you are moving the car to the destination. If you "walk," you are just moving yourself.
Conclusion: You should drive the car.
My current machine is a used M1 MacBook Pro with 16GB of RAM.
I thought this was all I was ever going to need, but wanting to run really nice models locally has me thinking about upgrading.
Although, part of me wants to see how far I could get with my trusty laptop.
122B is a more difficult proposition. (Also, keep in mind the 3.6 122B hasn't been released yet and might never be.) With 10B active parameters offloading will be slower - you'd probably want at least 4 channels of DDR5, or 3x 32GB GPUs, or a very expensive Nvidia Pro 6000 Blackwell.
An easy way (napkin math) to know if you can run a model based on its parameter size is to treat the parameter count as the number of GB that needs to fit in GPU RAM: a 35B model needs at least 35 GB of GPU RAM. This is a very simplified way of looking at it, and YES, someone is going to say you can offload to CPU, but no one wants to wait 5 seconds for 1 token.
I used this napkin math for image generation, since the context (prompts) were so small, but I think it's misleading at best for most uses.
Or strix halo.
Seems rather oversimplified.
There are different levels of quants; for Qwen3.6 they range from 10GB to 38.5GB.
Qwen supports a context length of 262,144 natively, which can be extended to 1,010,000, and of course the context length can always be shortened.
Just use one of the calculators and you'll get a much more useful number.
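A slightly better rule of thumb (still an approximation - it ignores KV cache and runtime overhead):

    size in GB ≈ params (billions) × bits per weight / 8
    35B @ 16 bpw (BF16)         → 35 × 16/8 = 70 GB (close to the 69.4 GB BF16 file listed elsewhere in the thread)
    35B @ ~5 bpw (Q4_K_XL-ish)  → ≈ 22 GB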
Fedora 43 and LM Studio with Vulkan llama.cpp
You can also run those on smaller cards by configuring the number of layers on the GPU. That should allow you to run the Q4/Q5 version on a 4090, or on older cards.
You could also run it entirely on the CPU/in RAM if you have 32GB (or ideally 64GB) of RAM.
The more you run in RAM the slower the inference.
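With llama.cpp that split is a single flag (a hedged example; the layer count is illustrative and tuned per card):

    llama-cli -m Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf -ngl 24

-ngl (--n-gpu-layers) sets how many layers go to the GPU; the rest run from system RAM.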
No tuning at all, just apt install rocm and rebuild llama.cpp every week or so.
Your company is most likely banning the use of foreign services, but it wouldn't make sense to ban the model itself, since the model would be run locally.
I wouldn't allow my employees to use a foreign service either if my company had specific geographic laws it had to follow (i.e., financial, medical, or privacy laws, such as the ones in the EU).
That said, I'm not sure I'd allow them to use any AI product either, locally inferred on-prem or not: I need my employees to _not_ make mistakes, not automate mistake making.
I'll give this a try, but I would be surprised if it outperforms Qwen3.5-27B.
They said that they will release several open-weights models, though there was an implication that they might not release the biggest models.
The benchmarks show 3.6 is a bit better than 3.5. I should retry my task, but I don't have a lot of confidence. It does sound like they worked on the right thing, though, which is getting closer to the 27B's performance.
As I am using mostly the non-open models, I have no idea what these numbers mean.
The performance/intelligence is said to be about the same as the geometric mean of the total and active parameter counts: √(35B × 3B) ≈ 10.25B, so this model should be roughly equivalent to a dense model of about 10 billion parameters.
If you have the VRAM to spare, a model with more total params but fewer active ones can be a very worthwhile tradeoff. Of course, that's a big if.
Never mind, the other reply clears it up.
Benchmarks don't really help me so much.
Give me the training data?
A mild but productive analogy:
I could hand someone a K&R C programming book plus lots of specs and say "this is how the Linux kernel came to be" (the raw data from which all observations were made, aka "the internet")... or just send them the kernel source code (the refined training data, after a LOT of manual work)... which your compiler consumes to generate the kernel binary (the open-weights model, what they actually shared).
Mildly related rant: honestly it's a bit shit to say "open source model" about an "open weights" model; it's like saying World of Warcraft is open source because they gave you an executable of the game. (You can still change it, but in more restricted ways.)
1 - https://github.com/ggml-org/llama.cpp/blob/master/docs/build...
> Only thing I need is reasonable promise that my data won't be used
The only way is to run it locally.
I personally don’t worry about this too much. Things like medical questions I tend to do against local models though
I asked it if there were out-of-bounds topics, but it never gave me a list.
See its responses:
Convo 1
- Q: ok tell me about taiwan
- A: Oops! There was an issue connecting to Qwen3.6-Plus. Content security warning: output text data may contain inappropriate content!
Convo 2
- Q: is winnie the pooh broadcasted in china?
- A: Oops! There was an issue connecting to Qwen3.6-Plus. Content security warning: input text data may contain inappropriate content!
These seem pretty bad to me. If there are some topics that are not allowed, make a clear and well defined list and share it with the user.
I use GLM-5.1 for coding hobby projects that are going to end up on GitHub anyway. Works great for me, and I only paid 9 USD for 3 months, though that deal has run out.
> my data won't be used for training
Yeah, I don't know. Doubt it.
https://research.google/blog/turboquant-redefining-ai-effici...
So a quantized KV cache should now see less degradation.
> Unified Memory Is A Marketing Gimmick. Industrial-Scale Inference Servers Do Not Use It.

Wrt inference servers: sure, it's not cost-effective to have such a huge CPU die and a bunch of media accelerators on the GPU die if you just care about raw compute for inference and training. Apple SoCs are not tuned for that market, nor do they sell into it. I'm not building a datacentre; I'm trying to run inference on my home hardware that I also want to use for other things.
I want to reduce AI to zero. Granted, this is an impossible-to-win fight, but I feel like Don Quixote here. Rather than windmill-dragons, it is some skynet 6.0 blob.