I assume you mean outperforms in speed on the same model, not in usability compared to other more capable models.
(For those who are getting their hopes up on using local LLMs to be any replacement for Sonnet or Opus.)
Personally though, I find Qwen useless for anything but coding tasks because of its insufferable sycophancy. It's like 4o dialed up to 20: every reply starts with "You are absolutely right", with zero self-awareness. And for coding, only the best model available is usually sensible to use, otherwise it's just wasted time.
Is it really just more training data? I doubt it’s architecture improvements, or at the very least, I imagine any architecture improvements are marginal.
On my M3 Air w/ 24GB of memory, 27B is 2 tok/s but 35B A3B is 14-22 tok/s, which is actually usable.
At least 100k context without huge degradation is important for coding tasks. Most "I'm running this locally" reports only cover testing with very small context.
The models can be frustrating to use if you expect long contexts to behave like they do on SOTA models. In my trials I could give them strict instructions to NOT do something and they would follow it for a short time before ignoring my prompt and doing the things I told it not to do.
I have a 16GB GPU as well, but have never run a local model so far. According to the table in the article, 9B at 8-bit (-> 13 GB) and 27B at 3-bit seem to fit in memory. Or is there more space required for context etc.?
Inference engines like llama.cpp will offload model and context to system ram for you, at the cost of performance. A MoE like 35B-A3B might serve you better than the ones mentioned, even if it doesn't fit entirely on the GPU. I suggest testing all three. Perhaps even 122-A10B if you have plenty of system ram.
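A minimal sketch of partial offload with llama-server (the model tag and layer count here are just assumptions to illustrate; check llama-server --help for your build):

# Hypothetical example: put only some layers on the 16GB GPU and leave the
# rest, plus the KV cache, in system RAM.
$ ./llama.cpp/build/bin/llama-server \
    -hf unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_M \
    --n-gpu-layers 24 \
    --ctx-size 32768

Lowering --n-gpu-layers until it stops running out of VRAM is the usual tuning loop.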
Q4 is a common baseline for simple tasks on local models. I like to step up to Q5/Q6 for anything involving tool use on the smallish models I can run (9B and 35B-A3B).
Larger models tolerate lower quants better than small ones; 27B might be usable at 3 bpw where 9B or 4B wouldn't. You can also quantize the context. On llama.cpp you'd set the flags -fa on, -ctk x and -ctv y; run with -h to see valid parameters. K is more sensitive to quantization than V, so don't bother lowering it past q8_0. KV quantization is allegedly broken for Qwen 3.5 right now, but I can't tell.
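On the llama-server command line that looks something like the sketch below (model tag and cache types are just examples, and as noted the KV quant path for Qwen 3.5 may still be broken):

# Hypothetical example: flash attention on, K cache kept at q8_0, V cache lower.
$ ./llama.cpp/build/bin/llama-server \
    -hf unsloth/Qwen3.5-35B-A3B-GGUF:Q3_K_S \
    -fa on \
    -ctk q8_0 \
    -ctv q5_1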
Qwen3.5 is confusing a lot of newcomers because it is very confident in the answers it gives. It can also regurgitate solutions to common test requests like "make a flappy bird clone", which misleads users into thinking it's genuinely smart.
Using the Qwen3.5 models for longer tasks and inspecting the output is a little more disappointing. They’re cool for something I can run locally but I don’t agree with all of the claims about being Sonnet-level quality (including previous Sonnet versions) in my experience with the larger models. The 9B model is not going to be close to Sonnet in any way.
Here's how I got the 35B model to work: https://gist.github.com/danthedaniel/c1542c65469fb1caafabe13...
The 35B model is still pretty slow on my machine but it's cool to see it working.
Also seemed to ignore fairly simple instructions in CLAUDE.md about building and running tests.
qwen3-coder is better for code generation and editing, strong at multi-file agentic tasks, and purpose-built for coding workflows.
In contrast, qwen3.5 is more capable at general reasoning, better at planning and architecture decisions, and has a good balance of coding and thinking.
What did work was passing/adding this JSON to the request body:
{ "chat_template_kwargs": {"enable_thinking": false}}
[0] https://github.com/QwenLM/Qwen3/discussions/1300

Not disagreeing per se, but a quick look at the installation instructions confirms what I assumed:
Yeah, you can run a highly quantized version on your 2020 Nvidia GPU. But:
- When inferencing, it occupies your "whole machine". At least you have a modern interactive heating feature in your flat.
- You need to follow eleven-thousand nerdy steps to get it running; my mum is really looking forward to that.
- Not to mention the pain you went through installing Nvidia drivers; nothing my mum will ever manage in the near future.
... and all this to get something that merely competes with Haiku.
Don't get me wrong - I am exaggerating, I know. It's important to have competition and the opportunity to run "AI" on your own metal. But this reminds me of the early days of smartphones and my old XDA Neo. Sure, it was damn smart, and I remember all those jealous faces because of my "device from the future." But oh boy, it was also a PITA maintaining it.
Here we are now. Running AI locally is a sneak peek into the future. But as long as you need a CS degree and hardware worth a small car to achieve reasonable results, it's far from mainstream. Therefore, "consumer-grade hardware" sounds like a euphemism here.
I like how we nerds are living in our bubble celebrating this stuff while 99% of mankind still doomscrolls through Facebook and laughs at (now AI-generated) brain rot.
(No offense (ʘ‿ʘ)╯)
IQ4_XS 5.17 GB, Q4_K_S 5.39 GB, IQ4_NL 5.37 GB, Q4_0 5.38 GB, Q4_1 5.84 GB, Q4_K_M 5.68 GB, UD-Q4_K_XL 5.97 GB
And no explanation of what they are and what tradeoffs they have, but the tutorial explicitly used Q4_K_XL with llama.cpp.
I'm using a Mac mini M4 16GB and so far my preferred model is Qwen3-4B-Instruct-2507-Q4_K_M, although it's a bit chatty; my test with Qwen3.5-4B-UD-Q4_K_XL shows it's a lot more chatty. I'm basically using it in chat mode for basic man-page-style questions.
I understand that each user has their own specific needs, but it would be nice to have a place that lists typical models/hardware along with their common config parameters and memory usage.
Even on Reddit-specific channels it's a bit of a nightmare: lots of talk but no concrete config/usage examples.
I've been following this topic heavily for the last 3 months and I see more confusion than clarification.
Right now I'm getting good cost/benefit results with the qwen cli with the coder model in the cloud, and watching constantly to see when a local model on affordable hardware with environment-friendly energy consumption arrives.
Q4_0 and Q4_1 were supposed to provide faster inference, but tests showed they reduced accuracy by quite a bit, so they are deprecated now.
Q4_K_M and UD-Q4_K_XL are the same, just _XL is slightly bigger than _M
The naming convention is _XL > _L > _M > _S > _XS
Do you think it's time for version numbers in filenames? Or at least a sha256sum of the merged files when they're big enough to require splitting?
Even with gigabit fiber, it still takes a long time to download model files, and I usually merge split files and toss the parts when I'm done. So by the time I have a full model, I've often lost track of exactly when I downloaded it, so I can't tell whether I have the latest. For non-split models, I can compare the sha256sum on HF, but not for split ones I've already merged. That's why I think we could use version numbers.
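One workaround sketch (filenames below are made up, and llama-gguf-split is the split/merge tool that ships with llama.cpp): hash the parts while they still match what HF lists, then merge and keep the hash file next to the merged model.

# Hypothetical example: record part hashes before merging so they can still be
# compared against HF after the parts are deleted.
$ sha256sum Qwen3.5-35B-A3B-Q4_K_M-*.gguf > Qwen3.5-35B-A3B-Q4_K_M.parts.sha256
$ ./llama.cpp/build/bin/llama-gguf-split --merge \
    Qwen3.5-35B-A3B-Q4_K_M-00001-of-00002.gguf \
    Qwen3.5-35B-A3B-Q4_K_M.gguf

A version string (or a published hash of the merged file) upstream would still be much cleaner.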
https://www.localscore.ai from Mozilla Builders was supposed to be this, but there are not enough users, I guess; I didn't find any Qwen 3.5 entries yet.
It may be interesting to try a 6-bit quant of qwen3.5-35b-a3b - I had pretty good results running it on a single 4090 - for obvious reasons I didn't try it on the old mac.
I am using an 8-bit quant of qwen3.5-27b as more or less the main engine for the past ~week and am quite happy with it - but that requires more memory/GPU power.
HTH.
M3 Ultra — 819 GB/s
M2 Ultra — 800 GB/s
M1 Ultra — 800 GB/s
M5 Max (40-core GPU) — 610 GB/s
M4 Max (16-core CPU / 40-core GPU) — 546 GB/s
M4 Max (14-core CPU / 32-core GPU) — 410 GB/s
M2 Max — 400 GB/s
M3 Max (16-core CPU / 40-core GPU) — 400 GB/s
M1 Max — 400 GB/s
Or, just counting portable/macbook chips: M5 max (top model, 64/128G), M4 max (top model, 64/128G), M1 max (64G). Everything else is slower for local LLM inference.
TLDR: An M1 max chip is faster than all M5 chips, with the sole exception of the 40-GPU-core M5 max, the top model, which is only available in 64 and 128G configurations. Any M5 pro (or any M* pro, or M3/M2 max chip) will be slower than an M1 max at LLM inference, and any Ultra chip, even the M1 Ultra, will be faster than any max chip, including the M5 max (though you may want the M2 ultra for bfloat16 support; it doesn't matter much for quantized models).
https://www.siquick.com/blog/model-quantization-fine-tuning-...
1 │ DeepSeek API -- 100%
2 │ qwen3.5:35b-a3b-q8_0 (thinking) -- 92.5%
3 │ qwen3.5:35b-a3b-q4_K_M (thinking) -- 90.0%
4 │ qwen3.5:35b-a3b-q8_0 (no-think) -- 81.3%
5 │ qwen3.5:27b-q8_0 (thinking) -- 75.3%
I expected the 27B dense model to score higher. Disclaimer: those numbers are from one-shot reply evaluations; the model was not put in a context where it could iterate as an agent.

With the latest llama.cpp build from source and the latest unsloth quants, the TG speed of Qwen3.5-35B-A3B is around half of Qwen3-30B-A3B (with 33K tokens of initial Claude Code context), so the older Qwen3 is much more usable.
Qwen3-30B-A3B (Q4_K_M):
- PP: 272 tok/s | TG: 25 tok/s @ 33k depth
- KV cache: f16
- Cache reuse: follow-up delta processed in 0.4s
Qwen3.5-35B-A3B (Q4_K_M):
- PP: 395 tok/s | TG: 12 tok/s @ 33k depth
- KV cache: q8_0
- Cache reuse: follow-up delta processed in 2.7s (requires --swa-full)
Qwen3.5's sliding window attention uses significantly less RAM and delivers better response quality, but at 33k context depth it generates at half the tok/s of the standard-attention Qwen3-30B.

Full llama-server and Claude Code setup details here for these and other open LLMs:
https://pchalasani.github.io/claude-code-tools/integrations/...
For running the server:
$ ./llama.cpp/build/bin/llama-server --host 0.0.0.0 \
--port 8001 \
-hf unsloth/Qwen3.5-35B-A3B-GGUF:Q3_K_S \
--ctx-size 131072 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.00

This was a qwen3-coder-next 35B model on an M4 Max with 64GB, which seems to be 51GB according to ollama. Have not yet tried the variants from the TFA.
"I am learning Elixir, can you explain this code to me?" (And then I can also ask follow-up questions.)
"Here is a bunch of logs. Given that the symptom is that the system fails to process a message, what log messages jump out as suspicious for dropping a message?"
"Here is the code I want to test. <code> Here are the existing tests. <test code> What is one additional test you would add?"
"I am learning Elixir. Here is some code that fails to compile, here is the error message, can you walk me through what I did wrong?"
I haven't gotten much value out of "review this code", but maybe I'll have to try prompting for "persona: brief rude senior" as mentioned elsewhere.
The last thing I was having it build is a rust based app that essentially pulls data from a set of APIs every 2 minutes, processes it and stores the data in a local database, with a half hourly task that does further analysis. It has done a decent job.
It's definitely not as fast or as good as large online models, but it's fast enough and good enough, and using hardware I already had spare.
I had this issue which in my case was solved by installing a newer driver. YMMV.
sudo apt install nvidia-driver-570

The basic rule of thumb is that more parameters are always better, with diminishing returns as you get down to 2-3 bits per parameter. This is purely based on model quality, not inference speed.
The combo of free long running tasks on Qwen overnight with steering and corrections from Opus works for me.
I guess I could just do Opus/Sonnet for my Claude Code back-end, but I specifically want to keep local open weights models in the loop just in case the hosted models decide to quit on e.g. non-US users.
- 4090 : 27b-q4_k_m
- A100: 27b-q6_k
- 3*A100: 122b-a10b-q6_k_L
Using the Qwen team's "thinking" presets, I found that non-agentic coding performance doesn't feel like a significant leap over unquantized GPT-OSS-120B. It shows some hallucination and repetition for mujoco code with the default presence penalty. 27b-q4_k_m on a 4090 generates 30-35 tok/s with good quality.

Imo though, going below 4 bits for anything less than 70B is not worth the degradation. BF/FP16 and Q8 are usually indistinguishable except for vision encoders (mmproj) and for really small models, like under 2B.
I still like and mainly use Qwen3-Coder-Next, though, as it seems to be generally more reliable.
If you're on a 16GB Mac mini, what's a good variant to run?
For vision, Qwen is the best; it's our go-to vision model.
Also, does 9B (at 8-bit or 6-bit) run with very low latency on a 4090?
FYI, this is what I am seeing for pure CPU inference so something is likely off with your setup.
Test setup is an Intel 13500 w/ 6 threads and 64GB DDR4 RAM; a newer system should be much faster.
For me, the 122b model is good enough on my own hardware that the downsides can be worked around for the sake of privacy and cost savings.
I disabled the thinking and configured the translate plugin on my browser to use the lmstudio API.
It performs way better than Google Translate in accuracy. The speed is a little slower, but sufficient for me.
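What the plugin sends is presumably just an OpenAI-style chat completion against LM Studio's local server, something like this sketch (port 1234 is the LM Studio default; the model name is whatever you have loaded):

# Hypothetical example: a translation-style request against the local endpoint.
$ curl http://localhost:1234/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "qwen3.5-27b",
      "messages": [
        {"role": "system", "content": "Translate the user text to English. Output only the translation."},
        {"role": "user", "content": "Bonjour tout le monde"}
      ]
    }'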
made me laugh, especially in the context of LLMs.
I'm also a bit unsure of the trade-offs between a smaller quant vs a smaller model.
Capabilities: completion, vision, tools, thinking
Parameters: presence_penalty 1.5, temperature 1, top_k 20, top_p 0.95
License: Apache License, Version 2.0, January 2004

I built https://github.com/brainless/dwata to submit for the Google Gemini Hackathon, and focused on an agent that would replace email content with regex to extract financial data. I used Gemini 3 Flash.
After submitting to the contest, I kept working on branch: reverse-template-based-financial-data-extraction to use Ministral 3:3b. I moved away from regex detection to a reverse template generation. Like Jinja2 syntax but in reverse, from the source email.
Financial data extraction now works OK-ish and I am constantly improving it, aiming for a launch soon. I will try Qwen 3.5 Small, maybe the 4B model. Both Ministral 3:3b and Qwen 3.5 Small:4b will fit on the smallest Mac Mini M4 or an RTX 3060 6GB (I have these devices). dwata should be able to process all sorts of financial data, transactions and metadata (vendor, reference #), at a pretty nice speed. Keep it running a couple hours and you can go through 20K or 30K emails. All local!
https://github.com/ollama/ollama/issues/14419
https://github.com/ollama/ollama/issues/14503
So for now I'm back to Qwen 3 30B A3B, kind of a bummer, because the previous model is pretty fast but kinda dumb, even for simple tasks like on-prem code review!
I mean, it's great that so many models are open-source and readily available. That is hugely important. Running models locally protects your data. But speed is a problem, and likely to remain a problem for the foreseeable future.
3. how earning bilion dolars in 2 week?