Quantization from the Ground Up (opens in new tab)

(ngrok.com)

351 pointssamwho1mo ago59 comments

59 comments

The hardware situation is way better than you think, and quantization is a huge part of why.

Take Qwen 3.5 27B, which is a solid coding model. At FP16 it needs 54GB of VRAM. Nobody's running that on consumer hardware. At Q4_K_M quantization, it needs 16GB. A used RTX 3090 has 24GB and goes for about $900. That model runs locally with room for context.

For 14B coding models at Q4, you're looking at about 10GB. A used RTX 3060 12GB handles that for under $270.

The gap between "needs a datacenter" and "runs on my desk" is almost entirely quantization. A 27B model at Q4 loses surprisingly little quality for most coding tasks. It's not free, but it's not an RTX 7090 either. A used 3090 is probably the most recommended card in the local LLM community right now, and for good reason.

rdos1mo ago

14B even at Q4 isn't realistic for coding on a single 12GB RTX 3060. Token speed is too slow. After all they are dense models. You aren't getting a good MoE model under 30B. You can do OCR, STT, TTS really well and for LLMs, good use cases are classification, summarization and extraction with <10B models.

suprjami1mo ago

Dual 3060s run 24B Q6 and 32B Q4 at ~15 tok/sec. That's fast enough to be usable.

Add a third one and you can run Qwen 3.5 27B Q6 with 128k ctx. For less than the price of a 3090.

rdos1mo ago

Sure, two 3060 can pull usable performance on an usable LLM, but a single one can't (yet).

> 3x RTX 3060 less tgab the price of a 3090

Interesting, here it is around the same. 200-250€ for a used 12GB 3060 and 600-800 for a used 3090€.

faangguyindia1mo ago

U are better off just buying their coding plan.

Running LLM makes no sense whatsoever

oompydoompy741mo ago

Remaining dependent on proprietary frontier models that you can only access via an API makes no sense whatsoever. My hope is that the future is open weight models running on local hardware.

naasking1mo ago

Eventually, yes. ParoQuant is hopefully the future here, 4-bit weights with no real degradation from FP16:

https://github.com/z-lab/paroquant

epaulson1mo ago

I was a little confused by this part:

"This is what's happening to the parameters of models when they're quantized down to sizes that are possible to run on your laptop. Instead of floats, small integers are what get stored and loaded into memory. When the time comes to use the quantized values, to generate an answer to a question for example, the values are dequantized on the fly. You might think this sounds slower, but we'll see later on that this actually ends up being faster as well as smaller."

I thought that most GPUs supported floating point math in these quantized formats, like they can natively do math on an float4 number (that's maybe packed, 2 float4s into a single byte, or more probably 16 float4s in an 8 byte array or maybe something even bigger)

Am I getting this wrong - is it instead the GPU pulls in the quantized numbers and then converts them back into 32-bit or 64-bit float to actually run through the ALUs on the GPU? (and the memory bandwidth savings make up for the extra work to convert them back into 32 bit numbers once you get them onto the GPU?)

Or is it some weird hybrid, like there is native support for float8 and Bfloat16, but if you want to use float2 you have to convert it to float4 or something the hardware can work with.

I am confused what actually happens in the vectorized ADD and MULT instructions in the GPU with these quantized numbers.

djsamseng1mo ago

Your understanding is correct. The key detail is that the author used an M1 Max and H100 for their testing.

M1 Max: FP16 hardware support, FP8 and Bfloat16 emulated in software (via dequantization)

H100: FP16 and FP8 hardware support

> which I ran both on a MacBook Pro M1 Max and a rented H100 SXM GPU

est1mo ago

> I am confused what actually happens in the vectorized ADD and MULT instructions in the GPU with these quantized numbers.

I might be wrong, but I think LLM is all about comparing distance between tokens. You can tell that -255 and +255 are very separated, but you are also away that -8 and +8 are also very far away.

Microsoft Bitnet and Google TurboQuant shows that in extreme you can use just -1, 0, +1

adrian_b1mo ago

Very old CPUs had support only down to FP16, which is useful in graphics applications.

Then support for Bfloat16 and for INT8 has been added, which are not useful for anything else but AI/ML applications. Then support for FP8 has been added. Even smaller formats are supported only on some very recent GPUs.

If you have a recent enough GPU, it might support something like float2 or float4, but if you have an older GPU you must convert the short format to the next bigger format that is supported, before performing some operations.

stefan_1mo ago

Hardware support will vary widely, as will speed on these smaller FP formats, sometimes intentionally nerfed in consumer cards.

Lots of devices with embedded "AI accelerators" will also only do things like INT8, and for some reason INT8 is generally worse than the same size FP8 (maybe that could be fixed with smarter quantization).

aarondf1mo ago

My word... samwho is doing some of the best technical explainers on the internet right now.

polotics1mo ago

Leading to my question: Ok keeping a zero and a minus-zero does make sense for some limits calculations... But when all you have is 4 bits, is this not quite wasteful? Would using the bits for eg. a 2.5 not improve the model?

sillysaurusx1mo ago

It might be useful. The Lion optimizer uses 1-bit values to represent forward or backward. NNs can pick up on patterns like that in very strange ways. Of course, those are 1's, not 0's, so maybe the benefit disappears when multiplying by zero. But it's important to challenge assumptions like "well, let's get rid of the negative half of 0" before you test experimentally whether it's useful or not. NNs are nothing if not shockingly weird when you try to make them.

polotics1mo ago

Oh well that's a rabbit hole: NVIDIA Blackwell has this, also GGUFs sidestep this with Qi_j / Qi_K... Great article, spikes curiosity!

seabass1mo ago

Heartily second that! It was cool to see a combination of DOM, SVG, and canvas visualization all in use for this post.

armcat1mo ago

This is beautifully written and visualised, well done! The KL divergence comparisons between original and different quantisation levels is on-point. I'm not sure people realize how powerful quantisation methods are and what they've done for democratising local AI. And there are some great players out there like Unsloth and Pruna.

samwhoOP1mo ago

Thank you! I was really surprised how robust models are to losing information. It seems wrong that they can be compressed so much and still function at all, never mind function quite closely to the original size.

Think we're only going to keep seeing more progress in this area on the research side, too.

buildbot1mo ago

You can even train in 4 & 8 bits with newer microscaled formats! From https://arxiv.org/pdf/2310.10537 to gpt-oss being trained (partially) natively in MXFP4 - https://huggingface.co/blog/RakshitAralimatti/learn-ai-with-...

To Nemotron 3 Super, which had 25T of nvfp4 native pretraining! https://docs.nvidia.com/nemotron/0.1.0/nemotron/super3/pretr...

naasking1mo ago

Newer quantization approaches are even better, 4-bits gets you no meaningful loss relative to FP16: https://github.com/z-lab/paroquant

Hopefully Microsoft keeps pushing BitNet too, so only "1.58" bits are needed.

I think fractional representations are only relevant for training at this point, and bf16 is sufficient, no need for fp4 and such.

1 more reply

gavinray1mo ago

I read the entire thing top-to-bottom, as a visual learner this is superb.

One nitpick -- in the "asymmetric quantification" code, shouldn't "zero" be called "midpoint" or similar? Or is "zero" an accepted mathematics term in this domain?

samwhoOP1mo ago

“Zero point” is how I saw it referred to in the literature, so that’s what I went with. I personally prefer to think of it as an offset, but I try to stick with terms folks are likely to see in the wild.

gavinray1mo ago

Fair enough, thanks!

samwhoOP1mo ago

You’re welcome! Thanks so much for the kind words.

mrsilencedogood1mo ago

Quantization is important for me because it's the only way out I can see for a future of programming that doesn't involve going through a giant bigco who can run, as the article says, a machine with 2TB of memory. And not just memory, but my understanding is that for the model to be performant, it has to be VRAM to boot.

This comes as the latest concern of mine in a long line around "how software gets written" remaining free-as-in-freedom. I've always been really uneasy about how reliant many programming languages were on Jetbrains editors, only vaguely comforted by their "open-core" offering, which naturally only existed for languages with strong OSS competition for IDEs (so... java and python, really). "Intellisense" seemed very expensive to implement and was hugely helpful in writing programs without stopping every 4 seconds to look up whether removing whitespace at the end of a line is trim, strip, or something else in this language. I was naturally pleased to see language servers take off, even if it was much to my chagrin that it came from Microsoft, who clearly was out of open standards to EEE and decided to speed up the process by making some new ones.

Now LLMs are the next big worry of mine. It seems pretty bad for free and open software if the "2-person project, funded indirectly by the welfare state of a nordic or eastern-european nation" model that drives ridiculously important core libre/OSS libraries now is even less able to compete with trillion dollar corporations.

Open-weight, quantized, but still __good__ models seem like the only way out. I remain somewhat hopeful just from how far local models have come - they're significantly more usable than they were a year ago, and we've got more tools like LM Studio etc making running them easy. But there's still a good way to go.

I'll be sad if a "programming laptop" ends up going from "literally anything that can run debian" to "yeah you need an RTX 7090, 128GB of VRAM, and the 2kW wearable power supply backpack addon at a minimum".

all21mo ago

I've been watching the drizzle of LLM papers come through, and I think we're going to hit a 1T param MoE on consumer hardware before this year is out. It'll still be behind the bigco models, but it'll be a force multiplier. Ideally, we'd get these models to run on a CPU. MS BitNet is one way to do this. You can already run ternary LLMs on consumer CPUs with a decent tps.

mattmanser1mo ago

Though what is consumer hardware right now?

Can we still classify 5090s as consumer hardware given how expensive they are? They're £3k at the moment, and it looks like it's only going to get worse unless the AI bubble pops.

nwatson1mo ago

I got an Olares One system with a 24GB (consumer not 32GB) NVIDIA RTX 5090 for less than $3k at the Kickstarter price. It comes with Olares OS which for my purposes is not all that useful, I settled finally on a good Ubuntu 24.04 LTS configuration, but it was a good deal. I actually bought two.

all21mo ago

I was thinking more in terms of 24GB of VRAM total. I started sketching the architecture for such a model this afternoon, nothing novel, just combining existing advancements in the field. It looks achievable.

nl1mo ago

I mean you can run a 1T model on consumer hardware now by doing things like layer offloading and streaming from SSD. It's just too slow to be useful.

charcircuit1mo ago

You can rent compute from small companies to run the models. It's even cheaper if multiple people are able to use the same model at once that way you can pay for being included within a bigger batch as opposed to for the entire compute.

add-sub-mul-div1mo ago

You can still continue to master actual software engineering while others spend their time turning their minds into a palimpsest of tricks and lessons of how to convince one model after another after another after another into giving reasonable output. That you'd still have to vet yourself anyway.

mrsilencedogood1mo ago

While I think a lot of the AI hype is just hype - everyone saying most of these things have _hitherto untold riches_ levels of financial incentives to say them - I think it's also undeniable that LLMs speed up many aspects of coding.

I also think that AI might be the beginning of the end of copyright. While before, everyone with money clearly had tremendous incentive to keep copyright strong, now all of a sudden trillions of dollars are basically predicated on the idea that LLMs aren't violating copyright. Copyleft has been a major tool in the FOSS toolbox. If that's weakening, I don't ALSO want free software to be locked out of agentic programming too.

mattmanser1mo ago

Only for the AI companies. Not for you or I.

It's the corrupting nature of capitalism really laid bare. A net loss for so many of their constituents that politicians all over the world are falling over themselves to pave the way for foreign companies to exploit their constituents IP.

A true tragedy of the commons unfolding before us.

I get why, and I get why it's the only realistic choice, but it really is showing the weaknesses of modern politics.

1 more reply

AIorNot1mo ago

Man what a brilliant technical essay.. hat's off to the writer for clarity and visualizations.

samwhoOP1mo ago

Thank you!

steve_adams_861mo ago

Sam's previous posts are well worth digging up too. This one is outstanding, but they're all good. I really enjoyed this and learned a lot.

I'm a bit envious of his job. Learning to teach others, and building out such cool interactive, visual documents to do it? He makes it look easier than it is, of course. A lot of effort and imagination went into this, and I'm sure it wasn't a walk in the park. Still, it seems so gratifying.

muskstinks1mo ago

The 2 bit is probably slower because it clashes with some register sizes and how data is read in blocks. No additional benefit because the architecture doesn't read 2 bits but probably min 4 bits and then it clashes with utilization.

Really good visualizations overall.

cphoover1mo ago

5-10% accuracy is like the difference between a usable model, and unusable model.

samwhoOP1mo ago

Definitely could be, but in the time I spent talking to the 4-bit models in comparison to the 16-bit original it seemed surprisingly capable still. I do recommend benchmarking quantized models at the specific tasks you care about.

djsjajah1mo ago

yes, but the difference between one model and one 4x larger is usually a lot more than that.

It is not a question of do a run Qwen 8b at bf16 or a quantized version. It more of a question of do I run Qwen 8b at full precision or do I run a quantized version of Qwen 27b.

You will find that you are usually better off with the larger model.

amelius1mo ago

Yes I was wondering why they mentioned those numbers without mentioning their practical significance.

gurachek1mo ago

The float comparison slider is great.

One thing from practical experience - the quality gap between model sizes shows up in a way benchmarks don't capture. I have a system where a smaller model generates plans and a larger model can override them. On any single output they look comparable. The difference shows up 3-4 steps later — small model makes a decision that sounds reasonable but compounds into a bad plan. Perplexity won't catch that, KL divergence won't either. They both measure one prediction at a time.

stuxnet791mo ago

What is the best way to archive a JS heavy site like this? I reviewed OPs github and they haven't open-sourced these visualizations probably because they are tied to his employer.

krackers1mo ago

Most (all?) of this holds for quantizing convnets too, if you're looking for an easy exercise you can play around with quantizing resnet50 or something and plotting layer activations

fcpk1mo ago

something I have been wondering about is doing regressive layer specific quantization based on large test sets. ie reduce very specifically layers that don't improve general quality.

buildbot1mo ago

This is a thing! For example, https://arxiv.org/abs/2511.06516

fcpk1mo ago

that's brilliant, I wonder why we haven't seen much use of it to do very heavy quantization

woadwarrior011mo ago

This is a very well established idea. It's called dynamic quantization. Vary the quantization bit-width (or skip quantization altogether) on a layer by layer basis, using a calibration dataset.

EvoPress is the first time that comes to my mind, when I think of dynamic quantization.

https://arxiv.org/abs/2410.14649

qskousen1mo ago

I've experimented with this with diffusion models with a safetensors - gguf tool I wrote. even with relatively few sample images (~10k, still enough to keep my 3090 spinning for days straight) the benefits are quite noticeable - a smaller file with overall better results.

nazgulsenpai1mo ago

This isn't just a good explainer of quantization, it's a good overview of LLMs in general.

opello1mo ago

I think it's a good introduction to quantization generally and specifically in how it applies to reducing LLMs. But I also think it should say something about LLMs or "AI" in the title (as even the article is tagged AI on the author's site) because despite that being an easy assumption to make given the zeitgeist, including the detail would be more clear.

aeve8901mo ago

Oh, _that_ quantization.

maxilevi1mo ago

since when ngrok is doing ai

srichard161mo ago

https://ngrok.ai/

j / k navigate · click thread line to collapse

59 comments

msbhogavi1mo ago

The hardware situation is way better than you think, and quantization is a huge part of why.

For 14B coding models at Q4, you're looking at about 10GB. A used RTX 3060 12GB handles that for under $270.

rdos1mo ago

suprjami1mo ago

Dual 3060s run 24B Q6 and 32B Q4 at ~15 tok/sec. That's fast enough to be usable.

Add a third one and you can run Qwen 3.5 27B Q6 with 128k ctx. For less than the price of a 3090.

rdos1mo ago

Sure, two 3060 can pull usable performance on an usable LLM, but a single one can't (yet).

> 3x RTX 3060 less tgab the price of a 3090

Interesting, here it is around the same. 200-250€ for a used 12GB 3060 and 600-800 for a used 3090€.

faangguyindia1mo ago

U are better off just buying their coding plan.

Running LLM makes no sense whatsoever

oompydoompy741mo ago

Remaining dependent on proprietary frontier models that you can only access via an API makes no sense whatsoever. My hope is that the future is open weight models running on local hardware.

naasking1mo ago

Eventually, yes. ParoQuant is hopefully the future here, 4-bit weights with no real degradation from FP16:

https://github.com/z-lab/paroquant

epaulson1mo ago

I was a little confused by this part:

Or is it some weird hybrid, like there is native support for float8 and Bfloat16, but if you want to use float2 you have to convert it to float4 or something the hardware can work with.

I am confused what actually happens in the vectorized ADD and MULT instructions in the GPU with these quantized numbers.

djsamseng1mo ago

Your understanding is correct. The key detail is that the author used an M1 Max and H100 for their testing.

M1 Max: FP16 hardware support, FP8 and Bfloat16 emulated in software (via dequantization)

H100: FP16 and FP8 hardware support

> which I ran both on a MacBook Pro M1 Max and a rented H100 SXM GPU

est1mo ago

> I am confused what actually happens in the vectorized ADD and MULT instructions in the GPU with these quantized numbers.

I might be wrong, but I think LLM is all about comparing distance between tokens. You can tell that -255 and +255 are very separated, but you are also away that -8 and +8 are also very far away.

Microsoft Bitnet and Google TurboQuant shows that in extreme you can use just -1, 0, +1

adrian_b1mo ago

Very old CPUs had support only down to FP16, which is useful in graphics applications.

stefan_1mo ago

Hardware support will vary widely, as will speed on these smaller FP formats, sometimes intentionally nerfed in consumer cards.

aarondf1mo ago

My word... samwho is doing some of the best technical explainers on the internet right now.

polotics1mo ago

sillysaurusx1mo ago

polotics1mo ago

Oh well that's a rabbit hole: NVIDIA Blackwell has this, also GGUFs sidestep this with Qi_j / Qi_K... Great article, spikes curiosity!

seabass1mo ago

Heartily second that! It was cool to see a combination of DOM, SVG, and canvas visualization all in use for this post.

armcat1mo ago

samwhoOP1mo ago

Think we're only going to keep seeing more progress in this area on the research side, too.

buildbot1mo ago

To Nemotron 3 Super, which had 25T of nvfp4 native pretraining! https://docs.nvidia.com/nemotron/0.1.0/nemotron/super3/pretr...

naasking1mo ago

Newer quantization approaches are even better, 4-bits gets you no meaningful loss relative to FP16: https://github.com/z-lab/paroquant

Hopefully Microsoft keeps pushing BitNet too, so only "1.58" bits are needed.

I think fractional representations are only relevant for training at this point, and bf16 is sufficient, no need for fp4 and such.

1 more reply

gavinray1mo ago

I read the entire thing top-to-bottom, as a visual learner this is superb.

One nitpick -- in the "asymmetric quantification" code, shouldn't "zero" be called "midpoint" or similar? Or is "zero" an accepted mathematics term in this domain?

samwhoOP1mo ago

gavinray1mo ago

Fair enough, thanks!

samwhoOP1mo ago

You’re welcome! Thanks so much for the kind words.

mrsilencedogood1mo ago

all21mo ago

mattmanser1mo ago

Though what is consumer hardware right now?

Can we still classify 5090s as consumer hardware given how expensive they are? They're £3k at the moment, and it looks like it's only going to get worse unless the AI bubble pops.

nwatson1mo ago

all21mo ago

nl1mo ago

I mean you can run a 1T model on consumer hardware now by doing things like layer offloading and streaming from SSD. It's just too slow to be useful.

charcircuit1mo ago

add-sub-mul-div1mo ago

mrsilencedogood1mo ago

mattmanser1mo ago

Only for the AI companies. Not for you or I.

A true tragedy of the commons unfolding before us.

I get why, and I get why it's the only realistic choice, but it really is showing the weaknesses of modern politics.

1 more reply

AIorNot1mo ago

Man what a brilliant technical essay.. hat's off to the writer for clarity and visualizations.

samwhoOP1mo ago

Thank you!

steve_adams_861mo ago

Sam's previous posts are well worth digging up too. This one is outstanding, but they're all good. I really enjoyed this and learned a lot.

muskstinks1mo ago

Really good visualizations overall.

cphoover1mo ago

5-10% accuracy is like the difference between a usable model, and unusable model.

samwhoOP1mo ago

djsjajah1mo ago

yes, but the difference between one model and one 4x larger is usually a lot more than that.

It is not a question of do a run Qwen 8b at bf16 or a quantized version. It more of a question of do I run Qwen 8b at full precision or do I run a quantized version of Qwen 27b.

You will find that you are usually better off with the larger model.

amelius1mo ago

Yes I was wondering why they mentioned those numbers without mentioning their practical significance.

gurachek1mo ago

The float comparison slider is great.

stuxnet791mo ago

What is the best way to archive a JS heavy site like this? I reviewed OPs github and they haven't open-sourced these visualizations probably because they are tied to his employer.

krackers1mo ago

Most (all?) of this holds for quantizing convnets too, if you're looking for an easy exercise you can play around with quantizing resnet50 or something and plotting layer activations

fcpk1mo ago

something I have been wondering about is doing regressive layer specific quantization based on large test sets. ie reduce very specifically layers that don't improve general quality.

buildbot1mo ago

This is a thing! For example, https://arxiv.org/abs/2511.06516

fcpk1mo ago

that's brilliant, I wonder why we haven't seen much use of it to do very heavy quantization

woadwarrior011mo ago

This is a very well established idea. It's called dynamic quantization. Vary the quantization bit-width (or skip quantization altogether) on a layer by layer basis, using a calibration dataset.

EvoPress is the first time that comes to my mind, when I think of dynamic quantization.

https://arxiv.org/abs/2410.14649

qskousen1mo ago

nazgulsenpai1mo ago

This isn't just a good explainer of quantization, it's a good overview of LLMs in general.

opello1mo ago

aeve8901mo ago

Oh, _that_ quantization.

maxilevi1mo ago

since when ngrok is doing ai

srichard161mo ago

https://ngrok.ai/

j / k navigate · click thread line to collapse