
tarruda 4d ago

I can't say anything about the OP's method, but I already tested the smol-IQ2_XS quant (which has 2.46 BPW) with the pi harness. I did not do a very long session because token generation and prompt processing get very slow, but I think I worked up to ~70k context and it maintained a lot of coherence in the session. IIRC GPQA Diamond is supposed to exercise long chains of thought, and it scored exceptionally well at 82% (the official number for the original BF16 is 88%: https://huggingface.co/Qwen/Qwen3.5-397B-A17B).

Note that not all quants are the same at a given BPW. The smol-IQ2_XS quant I linked is pretty dynamic, with some tensors using q8_0, some q6_k and some q4_k (while the majority is iq2_xs). In my testing, this smol-IQ2_XS quant is the best available in this BPW range.
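To make the "dynamic quant" point concrete, here is a rough sketch of how a mix of tensor types averages out to an overall BPW, and what a given BPW means for file size. The per-type bit widths follow from the llama.cpp block layouts; the mix fractions are made up for illustration and are not the actual recipe of the smol-IQ2_XS quant, and the 397e9 parameter count is just read off the model name.

```python
# Bits per weight for llama.cpp block formats:
# (bytes per block * 8) / (weights per block).
BITS_PER_WEIGHT = {
    "q8_0": 34 * 8 / 32,     # 8.5
    "q6_k": 210 * 8 / 256,   # 6.5625
    "q4_k": 144 * 8 / 256,   # 4.5
    "iq2_xs": 74 * 8 / 256,  # 2.3125
}


def effective_bpw(mix):
    """Weighted-average BPW for a {quant_type: fraction_of_weights} mix."""
    total = sum(mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(BITS_PER_WEIGHT[t] * frac for t, frac in mix.items())


def size_gb(n_params, bpw):
    """Approximate weight size in decimal GB (ignores file metadata)."""
    return n_params * bpw / 8 / 1e9


# HYPOTHETICAL mix, chosen only to show the arithmetic.
example_mix = {"iq2_xs": 0.94, "q4_k": 0.04, "q6_k": 0.01, "q8_0": 0.01}

print(f"effective BPW of the example mix: {effective_bpw(example_mix):.3f}")
# Parameter count assumed from the "397B" in the model name.
print(f"approx size at 2.46 BPW: {size_gb(397e9, 2.46):.1f} GB")
```

Two quants can land on the same average while spending their higher-precision tensors in very different places, which is why the BPW number alone doesn't tell you much about quality.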

Eventually I might try a more practical eval such as terminal bench.


Aurornis 4d ago

> I did not do a very long session

This is always the problem with the 2-bit and even 3-bit quants: They look promising in short sessions but then you try to do real work and realize they’re a waste of time.

Running a smaller dense model like 27B produces better results than 2-bit quants of larger models in my experience.

amelius 4d ago

> This is always the problem with the 2-bit and even 3-bit quants: They look promising in short sessions but then you try to do real work and realize they’re a waste of time.

It would be nice to see a scientific assessment of that statement.

singpolyma3 4d ago

Lots of people seem to use 4-bit. Do you think that's worth it vs a smaller model in some cases?

Aurornis 4d ago

4 bit is as low as I like to go. There are KLD and perplexity tests that compare quantizations where you can see the curve of degradation, but perplexity and KLD numbers can be misleading compared to real world use where small errors compound over long sessions.
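For anyone unfamiliar with the two metrics, here is a minimal sketch of what they measure, plus a toy illustration of the compounding argument. The logits are invented, the function names are mine, and the compounding model assumes every token fails independently with the same probability, which is a big simplification rather than a description of how real models degrade.

```python
import math


def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats; p = reference model, q = quantized model."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def perplexity(token_probs):
    """exp of the mean negative log-likelihood assigned to the true tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


def p_no_error(per_token_error, n_tokens):
    """Chance of zero errors in n tokens, assuming independent errors."""
    return (1 - per_token_error) ** n_tokens


# Made-up next-token logits over a 3-token vocabulary.
ref = softmax([2.0, 1.0, 0.1])
quant = softmax([1.8, 1.1, 0.3])
print(f"KLD for this one position: {kl_divergence(ref, quant):.5f} nats")

# A tiny per-token error rate still adds up over a long session.
for n in (1_000, 10_000, 70_000):
    print(f"{n} tokens: P(no error) = {p_no_error(0.0001, n):.4f}")
```

The averages reported by KLD and perplexity runs are per token, so a difference that looks negligible there can still matter once a session runs to tens of thousands of tokens.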

In my anecdotal experience I’ve been happier with Q6 and dealing with the tradeoffs that come with it over Q4 for Qwen3.5 27B.

hnfong 4d ago

Generally the perplexity charts indicate that quality drops significantly below 4-bit, so in that sense 4-bit is the sweet spot if you're resource constrained.
