I assume you mean outperforms in speed on the same model, not in usability compared to other more capable models.
(For those who are getting their hopes up on using local LLMs to be any replacement for Sonnet or Opus.)
Personally though, I find Qwen useless for anything but coding tasks because of its insufferable sycophancy. It's like 4o dialed up to 20: every reply starts with "You are absolutely right", with zero self-awareness. And for coding, only the best model available is usually sensible to use, otherwise it's just wasted time.
Is it really just more training data? I doubt it’s architecture improvements, or at the very least, I imagine any architecture improvements are marginal.
On my M3 Air w/ 24GB of memory, 27B is 2 tok/s but 35B A3B is 14-22 tok/s, which is actually usable.
At least 100k context without huge degradation is important for coding tasks. Most "I'm running this locally" reports only cover testing with very small context.
The models can be frustrating to use if you expect long contexts to behave like they do on SOTA models. In my trials I could give them strict instructions to NOT do something and they would follow it for a short time before ignoring my prompt and doing the things I told it not to do.
I have a 16GB GPU as well, but have never run a local model so far. According to the table in the article, 9B at 8-bit (-> 13 GB) and 27B at 3-bit seem to fit in memory. Or is there more space required for context etc.?
Inference engines like llama.cpp will offload model and context to system ram for you, at the cost of performance. A MoE like 35B-A3B might serve you better than the ones mentioned, even if it doesn't fit entirely on the GPU. I suggest testing all three. Perhaps even 122-A10B if you have plenty of system ram.
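A minimal sketch of partial offload with llama-server (the model tag and layer count here are just assumptions to illustrate; check llama-server --help for your build):

# Hypothetical example: put only some layers on the 16GB GPU and leave the
# rest, plus the KV cache, in system RAM.
$ ./llama.cpp/build/bin/llama-server \
    -hf unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_M \
    --n-gpu-layers 24 \
    --ctx-size 32768

Lowering --n-gpu-layers until it stops running out of VRAM is the usual tuning loop.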
Q4 is a common baseline for simple tasks on local models. I like to step up to Q5/Q6 for anything involving tool use on the smallish models I can run (9B and 35B-A3B).
Larger models tolerate lower quants better than small ones; 27B might be usable at 3 bpw where 9B or 4B wouldn't. You can also quantize the context. On llama.cpp you'd set the flags -fa on, -ctk x and -ctv y; run with -h to see valid parameters. K is more sensitive to quantization than V, so don't bother lowering it past q8_0. KV quantization is allegedly broken for Qwen 3.5 right now, but I can't tell.
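On the llama-server command line that looks something like the sketch below (model tag and cache types are just examples, and as noted the KV quant path for Qwen 3.5 may still be broken):

# Hypothetical example: flash attention on, K cache kept at q8_0, V cache lower.
$ ./llama.cpp/build/bin/llama-server \
    -hf unsloth/Qwen3.5-35B-A3B-GGUF:Q3_K_S \
    -fa on \
    -ctk q8_0 \
    -ctv q5_1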
Qwen3.5 is confusing a lot of newcomers because it is very confident in the answers it gives. It can also regurgitate solutions to common test requests like "make a flappy bird clone", which misleads users into thinking it's genuinely smart.
Using the Qwen3.5 models for longer tasks and inspecting the output is a little more disappointing. They’re cool for something I can run locally but I don’t agree with all of the claims about being Sonnet-level quality (including previous Sonnet versions) in my experience with the larger models. The 9B model is not going to be close to Sonnet in any way.
Here's how I got the 35B model to work: https://gist.github.com/danthedaniel/c1542c65469fb1caafabe13...
The 35B model is still pretty slow on my machine but it's cool to see it working.
Also seemed to ignore fairly simple instructions in CLAUDE.md about building and running tests.
qwen3-coder is better for code generation and editing, strong at multi-file agentic tasks, and purpose-built for coding workflows.
In contrast, qwen3.5 is more capable at general reasoning, better at planning and architecture decisions, and has a good balance of coding and thinking.
What did work was passing/adding this JSON to the request body:
{ "chat_template_kwargs": {"enable_thinking": false}}
[0] https://github.com/QwenLM/Qwen3/discussions/1300

Not disagreeing per se, but a quick look at the installation instructions confirms what I assumed:
Yeah, you can run a highly quantized version on your 2020 Nvidia GPU. But:
- When inferencing, it occupies your "whole machine". At least you have a modern interactive heating feature in your flat.
- You need to follow eleven-thousand nerdy steps to get it running; my mum is really looking forward to that.
- Not to mention the pain you went through installing Nvidia drivers; nothing my mum will ever manage in the near future.
... and all this to get something that merely competes with Haiku.
Don't get me wrong - I am exaggerating, I know. It's important to have competition and the opportunity to run "AI" on your own metal. But this reminds me of the early days of smartphones and my old XDA Neo. Sure, it was damn smart, and I remember all those jealous faces because of my "device from the future." But oh boy, it was also a PITA maintaining it.
Here we are now. Running AI locally is a sneak peek into the future. But as long as you need a CS degree and hardware worth a small car to achieve reasonable results, it's far from mainstream. Therefore, "consumer-grade hardware" sounds like a euphemism here.
I like how we nerds are living in our bubble celebrating this stuff while 99% of mankind still doomscrolls through Facebook and laughs at (now AI-generated) brain rot.
(No offense (ʘ‿ʘ)╯)
IQ4_XS 5.17 GB, Q4_K_S 5.39 GB, IQ4_NL 5.37 GB, Q4_0 5.38 GB, Q4_1 5.84 GB, Q4_K_M 5.68 GB, UD-Q4_K_XL 5.97 GB
And no explanation of what they are and what tradeoffs they have, but the tutorial explicitly used Q4_K_XL with llama.cpp.
I'm using a Mac mini M4 16GB and so far my preferred model is Qwen3-4B-Instruct-2507-Q4_K_M, although it's a bit chatty; my test with Qwen3.5-4B-UD-Q4_K_XL shows it's a lot more chatty. I'm basically using it in chat mode for basic man-page-style questions.
I understand that each user has their own specific needs, but it would be nice to have a place that lists typical models/hardware along with their common config parameters and memory usage.
Even on Reddit-specific channels it's a bit of a nightmare: lots of talk but no concrete config/usage examples.
I've been following this topic heavily for the last 3 months and I see more confusion than clarification.
Right now I'm getting good cost/benefit results with the qwen cli with the coder model in the cloud, and watching constantly to see when a local model on affordable hardware with environment-friendly energy consumption arrives.
Q4_0 and Q4_1 were supposed to provide faster inference, but tests showed they reduced accuracy by quite a bit, so they are deprecated now.
Q4_K_M and UD-Q4_K_XL are the same, just _XL is slightly bigger than _M
The naming convention is _XL > _L > _M > _S > _XS
Do you think it's time for version numbers in filenames? Or at least a sha256sum of the merged files when they're big enough to require splitting?
Even with gigabit fiber, it still takes a long time to download model files, and I usually merge split files and toss the parts when I'm done. So by the time I have a full model, I've often lost track of exactly when I downloaded it, so I can't tell whether I have the latest. For non-split models, I can compare the sha256sum on HF, but not for split ones I've already merged. That's why I think we could use version numbers.
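One workaround sketch (filenames below are made up, and llama-gguf-split is the split/merge tool that ships with llama.cpp): hash the parts while they still match what HF lists, then merge and keep the hash file next to the merged model.

# Hypothetical example: record part hashes before merging so they can still be
# compared against HF after the parts are deleted.
$ sha256sum Qwen3.5-35B-A3B-Q4_K_M-*.gguf > Qwen3.5-35B-A3B-Q4_K_M.parts.sha256
$ ./llama.cpp/build/bin/llama-gguf-split --merge \
    Qwen3.5-35B-A3B-Q4_K_M-00001-of-00002.gguf \
    Qwen3.5-35B-A3B-Q4_K_M.gguf

A version string (or a published hash of the merged file) upstream would still be much cleaner.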
https://www.localscore.ai from Mozilla Builders was supposed to be this, but there are not enough users, I guess; I didn't find any Qwen 3.5 entries yet.
It may be interesting to try a 6-bit quant of qwen3.5-35b-a3b - I had pretty good results running it on a single 4090 - for obvious reasons I didn't try it on the old mac.
I am using an 8-bit quant of qwen3.5-27b as more or less the main engine for the past ~week and am quite happy with it - but that requires more memory/GPU power.
HTH.
M3 Ultra — 819 GB/s
M2 Ultra — 800 GB/s
M1 Ultra — 800 GB/s
M5 Max (40-core GPU) — 610 GB/s
M4 Max (16-core CPU / 40-core GPU) — 546 GB/s
M4 Max (14-core CPU / 32-core GPU) — 410 GB/s
M2 Max — 400 GB/s
M3 Max (16-core CPU / 40-core GPU) — 400 GB/s
M1 Max — 400 GB/s
Or, just counting portable/macbook chips: M5 max (top model, 64/128G), M4 max (top model, 64/128G), M1 max (64G). Everything else is slower for local LLM inference.
TLDR: An M1 max chip is faster than all M5 chips, with the sole exception of the 40-GPU-core M5 max, the top model, which is only available in 64 and 128G configurations. Any M5 pro (or any M* pro, or M3/M2 max chip) will be slower than an M1 max at LLM inference, and any Ultra chip, even the M1 Ultra, will be faster than any max chip, including the M5 max (though you may want the M2 ultra for bfloat16 support; it doesn't matter much for quantized models).
https://www.siquick.com/blog/model-quantization-fine-tuning-...
1 │ DeepSeek API -- 100%
2 │ qwen3.5:35b-a3b-q8_0 (thinking) -- 92.5%
3 │ qwen3.5:35b-a3b-q4_K_M (thinking) -- 90.0%
4 │ qwen3.5:35b-a3b-q8_0 (no-think) -- 81.3%
5 │ qwen3.5:27b-q8_0 (thinking) -- 75.3%
I expected the 27B dense model to score higher. Disclaimer: those numbers are from one-shot reply evaluations; the model was not put in a context where it could iterate as an agent.

With the latest llama.cpp build from source and the latest unsloth quants, the TG speed of Qwen3.5-35B-A3B is around half of Qwen3-30B-A3B (with 33K tokens of initial Claude Code context), so the older Qwen3 is much more usable.
Qwen3-30B-A3B (Q4_K_M):
- PP: 272 tok/s | TG: 25 tok/s @ 33k depth
- KV cache: f16
- Cache reuse: follow-up delta processed in 0.4s
Qwen3.5-35B-A3B (Q4_K_M):
- PP: 395 tok/s | TG: 12 tok/s @ 33k depth
- KV cache: q8_0
- Cache reuse: follow-up delta processed in 2.7s (requires --swa-full)
Qwen3.5's sliding window attention uses significantly less RAM and delivers better response quality, but at 33k context depth it generates at half the tok/s of the standard-attention Qwen3-30B.

Full llama-server and Claude Code setup details here for these and other open LLMs:
https://pchalasani.github.io/claude-code-tools/integrations/...
For running the server:
$ ./llama.cpp/build/bin/llama-server --host 0.0.0.0 \
--port 8001 \
-hf unsloth/Qwen3.5-35B-A3B-GGUF:Q3_K_S \
--ctx-size 131072 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.00

This was a qwen3-coder-next 35B model on an M4 Max with 64GB, which seems to be 51GB according to ollama. Have not yet tried the variants from the TFA.
"I am learning Elixir, can you explain this code to me?" (And then I can also ask follow-up questions.)
"Here is a bunch of logs. Given that the symptom is that the system fails to process a message, what log messages jump out as suspicious for dropping a message?"
"Here is the code I want to test. <code> Here are the existing tests. <test code> What is one additional test you would add?"
"I am learning Elixir. Here is some code that fails to compile, here is the error message, can you walk me through what I did wrong?"
I haven't gotten much value out of "review this code", but maybe I'll have to try prompting for "persona: brief rude senior" as mentioned elsewhere.
The last thing I was having it build is a rust based app that essentially pulls data from a set of APIs every 2 minutes, processes it and stores the data in a local database, with a half hourly task that does further analysis. It has done a decent job.
It's definitely not as fast or as good as large online models, but it's fast enough and good enough, and using hardware I already had spare.
I had this issue which in my case was solved by installing a newer driver. YMMV.
sudo apt install nvidia-driver-570

The basic rule of thumb is that more parameters are always better, with diminishing returns as you get down to 2-3 bits per parameter. This is purely based on model quality, not inference speed.
The combo of free long running tasks on Qwen overnight with steering and corrections from Opus works for me.
I guess I could just do Opus/Sonnet for my Claude Code back-end, but I specifically want to keep local open weights models in the loop just in case the hosted models decide to quit on e.g. non-US users.
- 4090 : 27b-q4_k_m
- A100: 27b-q6_k
- 3*A100: 122b-a10b-q6_k_L
Using the Qwen team's "thinking" presets, I found that non-agentic coding performance doesn't feel like a significant leap over unquantized GPT-OSS-120B. It shows some hallucination and repetition for mujoco code with the default presence penalty. 27b-q4_k_m on a 4090 generates 30-35 tok/s with good quality.

Imo though, going below 4 bits for anything less than 70B is not worth the degradation. BF/FP16 and Q8 are usually indistinguishable except for vision encoders (mmproj) and for really small models, like under 2B.
I still like and mainly use Qwen3-Coder-Next, though, as it seems to be generally more reliable.
If you're on a 16GB Mac mini, what's a good variant to run?
For vision, Qwen is the best; it's our go-to vision model.
Also, does 9B (at 8-bit or 6-bit) run with very low latency on a 4090?
FYI, this is what I am seeing for pure CPU inference so something is likely off with your setup.
Test setup is an Intel 13500 w/ 6 threads and 64GB DDR4 RAM; a newer system should be much faster.
For me, the 122b model is good enough on my own hardware that the downsides can be worked around for the sake of privacy and cost savings.
I disabled the thinking and configured the translate plugin on my browser to use the lmstudio API.
It performs way better than Google Translate in accuracy. The speed is a little slower, but sufficient for me.
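What the plugin sends is presumably just an OpenAI-style chat completion against LM Studio's local server, something like this sketch (port 1234 is the LM Studio default; the model name is whatever you have loaded):

# Hypothetical example: a translation-style request against the local endpoint.
$ curl http://localhost:1234/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "qwen3.5-27b",
      "messages": [
        {"role": "system", "content": "Translate the user text to English. Output only the translation."},
        {"role": "user", "content": "Bonjour tout le monde"}
      ]
    }'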
made me laugh, especially in the context of LLMs.
I'm also a bit unsure of the trade-offs between a smaller quant vs a smaller model.
Capabilities: completion, vision, tools, thinking
Parameters: presence_penalty 1.5, temperature 1, top_k 20, top_p 0.95
License: Apache License, Version 2.0, January 2004

I built https://github.com/brainless/dwata to submit for the Google Gemini Hackathon, and focused on an agent that would replace email content with regex to extract financial data. I used Gemini 3 Flash.
After submitting to the contest, I kept working on branch: reverse-template-based-financial-data-extraction to use Ministral 3:3b. I moved away from regex detection to a reverse template generation. Like Jinja2 syntax but in reverse, from the source email.
Financial data extraction now works OK-ish and I am constantly improving it, aiming for a launch soon. I will try Qwen 3.5 Small, maybe the 4B model. Both Ministral 3:3b and Qwen 3.5 Small:4b will fit on the smallest Mac Mini M4 or an RTX 3060 6GB (I have these devices). dwata should be able to process all sorts of financial data, transactions and metadata (vendor, reference #), at a pretty nice speed. Keep it running a couple hours and you can go through 20K or 30K emails. All local!
https://github.com/ollama/ollama/issues/14419
https://github.com/ollama/ollama/issues/14503
So for now I'm back to Qwen 3 30B A3B, kind of a bummer, because the previous model is pretty fast but kinda dumb, even for simple tasks like on-prem code review!
I mean, it's great that so many models are open-source and readily available. That is hugely important. Running models locally protects your data. But speed is a problem, and likely to remain a problem for the foreseeable future.
3. how earning bilion dolars in 2 week?