TurboQuant: Redefining AI efficiency with extreme compression (opens in new tab)

(research.google)

576 pointsray__1mo ago166 comments

166 comments

This is a great development for KV cache compression. I did notice a missing citation in the related works regarding the core mathematical mechanism, though. The foundational technique of applying a geometric rotation prior to extreme quantization, specifically for managing the high-dimensional geometry and enabling proper bias correction, was introduced in our NeurIPS 2021 paper, "DRIVE" (https://proceedings.neurips.cc/paper/2021/hash/0397758f8990c...). We used this exact rotational approach and a similar bias correction mechanism to achieve optimal distributed mean estimation. I also presented this work and subsequent papers in a private invited talk at Google shortly after publication. Given the strong theoretical overlap with the mechanisms in TurboQuant and PolarQuant, I hope to see this prior art acknowledged in the upcoming camera-ready versions.

jjssmith1mo ago

LOL. This is a classical technique, Johnson-Linderstrauss etc. In this context, rediscovered every few years (recently months), e.g. here's 2017: https://proceedings.mlr.press/v70/suresh17a

amitport1mo ago

We do mention and the paper you shared. Please read our paper to see how the rotation-aware bias correction we introduced efficiently fixes the bias and provides a better worst-case error.

busfahrer1mo ago

I just today learned about Multi-Head Latent Attention, which is also sort of a way of compressing the KV cache. Can someone explain how this new development relates to MHLA?

yorwba1mo ago

Multi-Head Latent attention is a redesigned attention mechanism that produces lower-dimensional KV-cache entries. Vector quantization can store KV-cache entries using a small number of bits per dimension while ensuring that the resulting attention scores don't change too much. So MLA needs to be part of the model from the beginning of training, whereas VQ can be retrofitted afterwards, and you could also combine the two.

tripplyons1mo ago

MLA makes it so the keys and values used are a function of a smaller latent vector you cache instead of a key and a value for each token. KV cache quantization reduces the size of the values in the cache by using less bits to store each value. These two approaches operate on different parts of the process so they can be used in combination. For example, you can quantize the latents that are stored for MLA.

eecc1mo ago

Pardon my simplistic question, but when you mean rotation you’re essentially talking about diagonalization aren’t you?

So storing the diagonal as a matrix and the new bases is more compact?

amitport1mo ago

In this context, the rotation is for spreading energy and ensuring predictable coordinate distributions rather than diagonalization; it makes coordinate-wise quantization much more computationally efficient, though it throws away learnable structure.

eecc1mo ago

ah ok, so intuitively it's like minimizing the error when replacing the values with a well-known distribution. So all you need to carry along is the rotation and the assumption that there is some amount of loss.

tripplyons1mo ago

There are papers that try to quantize angles associated with weights because angles have a more uniform distribution. I haven't read this specific paper, but it looks like it uses a similar trick at a glance.

dmacfour1mo ago

Check out the most recent comment about the paper on OpenReview. This doesn't seem like isolated behavior:

https://openreview.net/forum?id=tO3ASKZlok

sva_1mo ago

Schmidhuber'd

jmalicki1mo ago

If they didn't cite your paper that's bullshit.

But if they read your paper enough that they invited you to a talk, that probably means they were far enough along to independently inventing it they were going to do so anyway, and wanted to chat with someone who was also doing the thing they were already doing. Good ideas tend to reveal themselves to anyone who is aware of the problem.

amitport1mo ago

To be clear, I am not claiming they stole an idea. They have made significant independent research. However, a specific part regarding the treatment of rotation with bias correction relates to prior work, and it would be appropriate to have that recognized.

jmalicki1mo ago

If they didn't at least cite it, it is complete bullshit.

If they cited it, but you feel you deserved more credit than that... I feel you, but it's less clear cut.

ekjhgkejhgk1mo ago

Doesn't matter, you should still cite. It's basic manners in science.

kleiba1mo ago

Exactly, that's why the section is called "Related Work".

cubefox1mo ago

> But if they read your paper enough that they invited you to a talk, that probably means they were far enough along to independently inventing it

That's more than a stretch. They likely invited them because someone thought the abstract sounded interesting, or something like that.

CyberDildonics1mo ago

That's rationalizing like crazy. If they knew about it they should have cited it.

jmalicki1mo ago

That's what I'm saying - not citing is total bullshit.

But if they invited a talk, and published a paper and cited it, it might be a little off, but not horrible.

efavdb1mo ago

The earlier paper was from 2021!

gavinray1mo ago

Can someone ELI5 these two concepts please, which make no sense to me:

  > "TurboQuant starts by randomly rotating the data vectors. This clever step simplifies the data's geometry"

I don't understand how taking a series of data and applying a random rotation could mathemetically lead every time to "simpler" geometry.

If I throw a bunch of shapes on the ground, tightly packed and touching each other, then rotate all of them, you can't guarantee that the new conglomerate shape is any more/less "simple" than before, right?

  > "Johnson-Lindenstrauss Transform to shrink complex, high-dimensional data while preserving the essential distances and relationships between data points. It reduces each resulting vector number to a single sign bit (+1 or -1)."

How can a boolean value preserve all of the relational and positional information between data points?

kingstnap1mo ago

Other people have answered here but the real answer is that deep neural networks don't learn isotropic distributions of activations.

What happens is that you get very spikey activations, there are so called "outlier" activations. A easy to read paper that tells you about this is SmoothQuant [0]. Another source from Anthropic and the Mechanistic Interperability people is calling these "privileged basis" [1].

Now based on the weight symmetries of a typical transformer, these actually don't need to exist. Weight symmetries means the ways you can change the weights without actually affecting the mathematical function, there are a broad class of these because the linear algebra has a lot of redundancies in it.

But the behaviour of the Adam optimizer is such that you do end up w/ these things because it sort of more quickly optimizes to produce them. This comes from the fact it is an elementwise dynamic learning rate (and probably partly to do with the epsilon).

[0] https://arxiv.org/pdf/2211.10438 [1] https://transformer-circuits.pub/2023/privileged-basis/index...

gavinray1mo ago

From your second paper:

  > In particular, we can generate fixed random rotation matrices at initialization, and multiply them into the activations any time we read from or write to the residual stream.

I guess I was mistaken in assuming this part was part of the TurboQuant-specific innovations. Still an interesting concept though

Bolwin1mo ago

Do you know if this also applies to the muon optimizer? It seems to be replacing adamw

kingstnap1mo ago

My guess is that probably not for Muon. What I said about ADAM was partly based on this blogpost I read some time ago, should have cited it as well [0].

The thing about Muon is that it doesn't have this specific feature of ADAM that causes it to "move along the diagonal". Basically if you flatten weights as a huge vector of a few billion elements. SGD moves along the gradient, which isn't biased. ADAM normalizes everything elementwise, so it sort of moves along a vector of +-1.

This isn't a proof or anything, but what you can imagine might be happening is that if you move along +-1, then you find spikey solutions somehow. Not sure how to prove that. Muon doesn't really do this, but it has its own sort of funky reshaping of the update (it moves along low rank directions).

[0] https://www.lesswrong.com/posts/yrhu6MeFddnGRSLtQ/adam-optim...

lumost1mo ago

They are saying that models should be invariant to data's orientation - and only sensitive to the distance between vectors. This has a pretty significant effect on reducing the set of possible models, and may stabilize the optimization.

In simple terms, large ML models like LLMs often learn trivial rules such as "if the 21st decimal place of the 5th dimension in the embedding vector is 5 - then the image is of a cat." Learning such a memorization function is usually not what we are trying to do, and there are a variety of techniques to avoid these trivial solutions and "smooth" the optimization geometry.

photon_lines1mo ago

The whole goal of quantisation is to put the data into 'bins' so that it can easily be 'packed' so that you can represent it using less bits (less information). You can think of it like rounding essentially (3.14159 -> 3). Now, sometimes within data, the distribution will be non-ideal for separating it out into bins (let's say that our rounding rules are simple -- we simply use a floor function so 2.45 maps to 2 and 6.4543 maps to 6 etc...) and our bins simply map to the floor -- if we had a set of numbers which look like this: [3.11, 4.43, 5.78, 12.33, 34.32], they would simply map to [3, 4, 5, 12, 34]. Now, we have one huge outlier in our data (34) so to create bins for those sets of numbers, we would need 6 bits of information (2 to the power of 6 = 64), but this is mostly due to the fact that we have one huge outlier (34.32). To get rid of this -- the algorithms applies a random rotation matrix which 'distorts' the original data so that it is more evenly distributed among the possible bins which are assigned to the data set. In linear algebra, a rotation matrix is an orthogonal matrix. When you multiply your vector by this matrix, you aren't changing the "amount" of data (the length of the vector remains the same), but you are recalculating every single number in that vector as a weighted sum of the originals. According to the Central Limit Theorem, when you sum up many random things, the result always starts looking like a bell curve. This is the magic TurboQuant relies on: they don't know what your data looks like, but they know that after the rotation, the data must look like a Beta Distribution and they use this fact to transform the original data into a more 'tightly packed' distribution which allows them to more efficiently pack (or quantise) the information. If most of the transformed data is huddled together into a predictable Bell curve shape, you can pack your bins tightly around that shape leading to much higher precision with fewer needed bits to store it. For example, after applying a rotation matrix, our original transform [3.11, 4.43, 5.78, 12.33, 34.32] might get mapped to something like [8.12, 8.65, 9.25, 10.53, 12.86] and we can crate bins which both are more accurate and need less bits in order to hold our original data set. To create the most optimal bins -- the Lloyd-Max algorithm is used. This algorithm is the gold standard for 1D quantisation. Its goal is to find the best places to put your "boundaries" (where you cut the data) and your "reconstruction values" (the number you store) to minimise the Mean Squared Error (MSE). After applying this, you have your 'rounded' values (or quantized data), but there is still an error value which is missing from our data set: and this is where the residual bit comes in. That bit doesn't represent the original data (or vector) - it simply represents our 'bias' after we apply the above algorithms. It's basically like a '1-bit note' which allows you to perfectly cancel out all the bias terms which our above quantisation algorithm produces to make the 'interactions' (or inner products) when we multiply our values together extremely accurate again even after transforming our original data. Does this make sense?

nico1mo ago

Amazing explanation! Thank you so much for taking the time to put it together. It makes a lot of sense. I’m not the one who asked the question, but I was impressed by such eloquent and clearly explained answer

photon_lines1mo ago

Thank you! I'm glad you found it helpful (and that others did too)!!

thrtythreeforty1mo ago

This is a fantastic explanation. Thank you. The only part I am not following is how it is guaranteed that 1 bit is sufficient for the error value. Is this something the Lloyd-Max algorithm is responsible for ensuring? (Seems to me that if your quantization algorithm is crappy enough, you could need a large number of bits to store the error.)

rtrgrd1mo ago

Added to my non-llm username list :)

Thanks so much for the explanation

psidium1mo ago

Wow, thank you for the explanation. Such a complex topic and yet you’ve made it simple to understand.

functional_dev1mo ago

i wonder what is the limit of quantization when it starts to destroy the logic of weights?

gavinray1mo ago

I had to read this over a few times to piece it together, thanks for the thorough and digestable explanation!

rohansood151mo ago

Thank you.

gopalv1mo ago

> I don't understand how taking a series of data and applying a random rotation could mathemetically lead every time to "simpler" geometry.

Let's pick a simpler compression problem where changing the frame of reference improves packing.

There's a neat trick in the context of floating point numbers.

The values do not always compress when they are stored exactly as given.

[0.1, 0.2, 0.3, 0.4, 0.5]

Maybe I can encode them in 15 bytes instead of 20 as float32.

Up the frame of reference to be decibels instead of bels and we can encode them as sequential values without storing exponent or sign again.

Changing the frame of reference, makes the numbers "more alike" than they were originally.

But how do you pick a good frame of reference is all heuristics and optimization gradients.

redanddead1mo ago

AI and graphics are matrices

Matrices are numbers [x,y,z]

GPUs are matrix processing units

Models are big matrices, we quantize them to make them small. That is lossy. Makes AI dumber the harder you quantize but lets you run inference with lesser hardware

What if you could quantize less destructively/lossy? You could make a model way smaller or make much bigger models that run on less RAM

That is what they achieved here. They're not saying that multiplying the matrices with scalars up or down helps. They're saying that by mutating and transforming the matrix with a function (ie. rotating the dimensions by the same "random" rotation) you have matrices that make smarter models fit in smaller boxes, needing way less RAM to achieve the same performance

If we quantized it as aggressively as we would have without the distribution/mutation function, the drop in benchmarks would be even more noticeable

It's actually a huge breakthrough and commercially its probably only a short term loss in valuation for the manufacturers

wordpad1mo ago

They are not doing random rotation, simplification here means they are aligning the outliers. If you threw a bunch of shapes on the ground they are picking up one that rolled away and putting it with the others.

>How can a boolean value preserve all of the relational and positional information between data points?

They aren't reducing entire vector to a bollean only each of its dimensions.

elif1mo ago

i could be mistaken but from my read, the 'rotation' aspect is nothing new and not dissimilar from normal spin quant, where the importance matrix is rotated during calibration such that the local minima/maxima are more evenly smoothed and excessive/redundant quantization of parameters is avoided.

as for the J-L transformation is way above my head so i'm almost certainly mistaken but it seems to be some clever way to use a bit as a sort of pointer in order to reuse existing chunks of parameter weight data like in a jpeg or zip compression algorithm.

akhenakh1mo ago

Someone implementing it on llamacpp already https://github.com/mudler/llama.cpp/commit/dee102db1bfd723c9...

GistNoesis1mo ago

He even attempts to improve on the paper by replacing the random rotation operation which is O(d^2), by a Subsampled Randomized Hadamard Transform which can be computed in O(d*log d).

Hopefully Johnson–Lindenstrauss lemma applies in the same way for SRHTransformed vectors as they do for randomly rotated vectors and the independence of the distribution laws of the coordinates remains and therefore the quantization of each coordinates independently is still theoretically sound.

1 more reply

cpburns20091mo ago

For some reason I thought the implementation would be way more complicated than that. I obviously lack the domain knowledge to tackle something like this, but it looks straight forward.

qingcharles1mo ago

Agreed. Actual LOC is tiny. Very impressive PR.

vibe421mo ago

The pace of development in llama.cpp is really high, could see an implementation being merged in 4-6 weeks.

parsimo20101mo ago

This blog post sucks. It does not make me want to read the papers.

Look at this figure: https://storage.googleapis.com/gweb-research2023-media/image...

The speedup labels on the vertical axis are 0, 2, 2, 4, 6, 8... Why is 2 repeated? Did they just have nano-banana make them some charts? Can they not be bothered to use matplotlib or bokeh and directly render a graph? I don't know, maybe there is some legitimate reason that I don't know about for making a single value occur multiple times on a graph axes, but if that is the case, then they probably need to explain it in the figure caption. So it's either a "GenAI special" or it's poor communication about how to read the graph...

Look at this video visualization: https://storage.googleapis.com/gweb-research2023-media/media...

Do you have literally any clue what Polar Quantization is? Would this make me think, "I kind of have a high level understanding of that, let me go get the details from the paper."

Look at this figure: https://storage.googleapis.com/gweb-research2023-media/image...

The left hand side of the graph, which is normally assumed to start at 0, starts at 48. Those MASSIVE differences you see in the figure? Only a few percent. And that's a deception but only if the figure is even accurate, because we saw earlier they can't even get figure axes correct.

davesque1mo ago

Yeah, the viz for polar quantization is straight up nonsensical. Okay, so some colors are converted into clocks and then into a bigger box with a pink box inside of it. Got it. Even understanding what polar coordinates are doesn't help you make sense out of it.

alkenrinnstet1mo ago

It's slop. The text is also clearly generated by a chatbot with its nonsensical comparisons and bizarrely superlative language.

flux31251mo ago

I bet the paper was vibe written too

pstoll1mo ago

And a group has published an independent working implementation today, nice to see:

https://github.com/tonbistudio/turboquant-pytorch

ilija1391mo ago

It has a lot clearer explanation of the method than Google's own post.

ramon1561mo ago

Well, yeah. Claude simplified it. That doesn't mean it's a better explanation.

adi_kurian1mo ago

Did it lose important detail?

benob1mo ago

This is the worst lay-people explanation of an AI component I have seen in a long time. It doesn't even seem AI generated.

spencerflem1mo ago

I think it is though-

“ TurboQuant, QJL, and PolarQuant are more than just practical engineering solutions; they’re fundamental algorithmic contributions backed by strong theoretical proofs. These methods don't just work well in real-world applications; they are provably efficient and operate near theoretical lower bounds.”

integralid1mo ago

I also instinctively reacted to that fragment, but at this point I think this is overreacting to a single expression. It's not just a normal thing to say in English, it's something people have been saying for a long time before LLMs existed.

nvme0n1p11mo ago

There are tells all over the page:

> Redefining AI efficiency with extreme compression

"Redefine" is a favorite word of AI. Honestly no need to read further.

> the key-value cache, a high-speed "digital cheat sheet" that stores frequently used information under simple labels

No competent engineer would describe a cache as a "cheat sheet". Cheat sheets are static, but caches dynamically update during execution. Students don't rewrite their cheat sheets during the test, do they? LLMs love their inaccurate metaphors.

> QJL: The zero-overhead, 1-bit trick

> It reduces each resulting vector number to a single sign bit (+1 or -1). This algorithm essentially creates a high-speed shorthand that requires zero memory overhead.

Why does it keep emphasizing zero overhead? Why is storing a single bit a "trick?" Either there's currently an epidemic of algorithms that use more than one bit to store a bit, or the AI is shoving in extra plausible-sounding words to pad things out. You decide which is more likely.

It's 1:30am and I can't sleep, and I still regret wasting my time on this slop.

5 more replies

g-mork1mo ago

Another instinctual reaction here. This specific formulation pops out of AI all the time, there might as well have been an emdash in the title

NoahZuniga1mo ago

Genius new idea: replace the em-dashes with semicolons so it looks less like AI.

tux31mo ago

You're absolutely right. That's not just a genius idea; it's a radical new paradigm.

Quarrel1mo ago

Damnit.

There goes another bit of my writing style that will get mistaken for an LLM.

zarzavat1mo ago

I read "this clever step" and immediately came to the comments to see if anyone picked up on it.

It reads like a pop science article while at the same time being way too technical to be a pop science article.

Turing test ain't dead yet.

TeMPOraL1mo ago

> Turing test ain't dead yet.

Only because people are lazy, and don't bother with a simple post-processing step: attach a bunch of documents or text snippets written by a human (whether yourself or, say, some respected but stylistically boring author), and ask the LLM to match style/tone.

benob1mo ago

Maybe they quantized a bit too much the model parameters...

BenoitP1mo ago

It is AI generated. Or was written by someone a bit far from the technical advances IMHO. The Johnson-Lindenstrauss Lemma is a very specific and powerful concept, when in the article the QLJ explanation is vacuous. A knowledgeable human would not have left the reader wanting for how that relates to the Lemma.

1 more reply

davesque1mo ago

Yeah, and some parts of the article are just bizarre:

> Instead of looking at a memory vector using standard coordinates (i.e., X, Y, Z) that indicate the distance along each axis, PolarQuant converts the vector into polar coordinates using a Cartesian coordinate system. This is comparable to replacing "Go 3 blocks East, 4 blocks North" with "Go 5 blocks total at a 37-degree angle”

Why bother explaining this? Were they targeting the high school and middle school student reader base??

jeeeb1mo ago

It feels very much like Gemini’s writing style - overly excited with lots of unnecessary contrasts.

mesuvash1mo ago

TurboQuant explained with an easy to understand (no-math) animation https://mesuvash.github.io/blog/2026/turboquant-interactive/

fc417fc8021mo ago

Someone else linked that elsewhere in the comments and while it's certainly a nice visual it seems like it's not accurately portraying the paper. Isn't the grid supposed to have a weird alignment that depends on the bit depth? And there's supposed to be a second quantization step involving the residual.

mesuvash1mo ago

Fair point. I've updated the animation to address this. The grid now uses the correct non-uniform centroids (optimal for the arcsine distribution in 2D), so you'll see grid lines cluster near the edges where unit-circle coordinates actually concentrate, rather than being evenly spaced. The spacing does change with bit depth.

On the second quantization step: the paper's inner-product variant uses (b-1) bits for the MSE quantizer shown here, then applies a 1-bit QJL (Quantized Johnson-Lindenstrauss) encoding of the residual to make dot-product estimates unbiased. I chose to omit QJL from the animation to keep it digestible as a visual, but I've added a note calling this out explicitly.

fc417fc8021mo ago

It looks nice! Fair enough about QJL - it seems to be nothing more than an unbiasing measure anyway.

I'm not sure if it's my own misunderstanding or if the paper [0] has something of an error. Section 3.1 starts out to the effect "let x be on the unit hypersphere" (but I'm fairly certain it's actually not). Neither algorithm 1 nor algorithm 2 show a normalization step prior to rotating x. Algorithm 2 line 8 shows that the scalar returned is actually the magnitude of the residual without accounting for QJL.

Anyway I'm pretty sure the authors inadvertently omitted that detail which really had me confused for a while there.

[0] https://arxiv.org/abs/2504.19874

1 more reply

wbsun1mo ago

The blog is new but the paper was submitted almost one year ago: https://arxiv.org/abs/2504.19874. Anyone has ideas if this is already implemented in many models (at least Gemini, I guess)? If that's the case, can I expect cheaper RAM for my computer :D

mskkm1mo ago

seems to be a scam

"The TurboQuant paper (ICLR 2026) contains serious issues in how it describes RaBitQ, including incorrect technical claims and misleading theory/experiment comparisons. We flagged these issues to the authors before submission. They acknowledged them, but chose not to fix them. The paper was later accepted and widely promoted by Google, reaching tens of millions of views.

We’re speaking up now because once a misleading narrative spreads, it becomes much harder to correct. We’ve written a public comment on openreview (https://openreview.net/forum?id=tO3ASKZlok).

We would greatly appreciate your attention and help in sharing it."

https://x.com/gaoj0017/status/2037532673812443214

bdcs1mo ago

Here's my attempt at a undergrad-level summary (corrections welcome!):

The core idea is to quantize KV cache, but do so in a way that destroys minimal information. In this case, it's similarly scores between vectors. The simplest way to do this is to change all the elements from 16bit of precision to, say, 4 bits (Scalar Quant.). These papers improve on it by realizing: almost all the energy (concentration of measure) is towards the equator of the hypersphere (normally distributed as 1/d; d=vector dimensionality). (The curse/blessing of hyper dimensionality strikes again.) So when we quantize the elements (think "latitudes", e.g. to the nearest degree) we destroy a lot of information because basically all the vectors were around the equator (so some latitudes have a lot of vectors and some have very few). The idea is to rotate the vectors away from the equator so they're more consistently distributed (to better preserve the entropy during quantization, which I guess was amitport's DRIVE idea). PolarQuant does a hyperpolar coordinate transform which superficially seems neat for preserving entropy because of this equator/polar framing (and ultimately unnecessary as shown by TurboQuant). They also realized there's a bias to the resulting vectors during similarity, so they wrote the QJL paper to fix the bias. And then the TurboQuant paper took PolarQuant + QJL, removed the hyperpolar coords, and added in some gross / highly-pragmatic extra bits for important channels (c.f. elements of the vectors) which is sort of a pathology of LLMs these days but it is what it is. Et voila, highly compressed KV Cache. If you're curious why you can randomly rotate the input, it's because all the vectors are rotated the same, so similarity works out. You could always un-rotate to get the original, but there's no need because the similarity on rotated/unrotated is the same if you compare apples to apples (with the QJL debiasing). Why was PolarQuant even published? Insu Han is solely on that paper and demanded/deserved credit/promotion, would be my guess. The blog post is chock-full of errors and confusions.

bdcs1mo ago

Some corrections: the vectors are un-rotated in practice for future query vectors. This could be removed with a slightly different LLM arch.

PolarQuant does live on in TurboQuant's codebooks for quantization which borrows from the hyperpolar coords

fc417fc8021mo ago

> added in some gross / highly-pragmatic extra bits for important channels

I'm curious what you meant by that. I understood it to only have the MSE quantization vector, a 1-bit QJL vector, and a scalar magnitude.

> PolarQuant does live on in TurboQuant's codebooks for quantization which borrows from the hyperpolar coords

Isn't the turbo codebook the irregularly spaced centroid grid?

bdcs1mo ago

> extra bits per channel

Page 18 of the paper: > As shown in Table 1, our approach outperforms other methods for both Llama-3.1-8B-Instruct and Ministral-7B-Instruct, achieving significantly higher average scores. We evaluate our method using 2.5-bit and 3.5-bit quantization during text generation. These non-integer bit precisions result from our strategy of splitting channels into outlier and non-outlier sets, and applying two independent instances of TurboQuant to each, allocating higher bit precision to outliers. This outlier treatment strategy is consistent with prior work [63, 51] . For example, in our 2.5-bit setup, 32 outlier channels are quantized at 3 bits, while the remaining 96 channels use 2 bits, leading to an effective bit precision of (32 ×3 + 96×2)/128 = 2.5. For 3.5-bit quantization, a different ratio of outliers and regular channels leads to a higher effective bit precision. Despite using fewer bits than competing techniques, TurboQuant maintains performance comparable to unquantized models

So they find channels / indicies-of-the-vector that are important and give them more bits (3 bits) than the rest (2 bits).

>Isn't the turbo codebook the irregularly spaced centroid grid?

yes I believe so. They mention it's informed by the concentration of measure and the uncorrelated/independent vectors after the initial conditioning rotation. I feel like it was informed by PolarQuant, but that may just be how I intuit what's going on (because thinking about this in polar coordinates makes more sense in my head). IOW, I think the irregular spacing is maybe informed by TurboQuant.

However they do say, slightly to the contrary: "We find optimal scalar quantizers for random variables with Beta distributions by solving a continuous 1-dimensional k-means problem using the Max-Lloyd algorithm."

krackers1mo ago

Beautiful explanation, thanks!

zeeshana07x1mo ago

The gap between how this is described in the paper vs the blog post is pretty wide. Would be nice to see more accessible writing from research teams — not everyone reading is a ML engineer

om81mo ago

These are very different media types with very different goals.

dev_tools_lab1mo ago

Agreed. The practical implications are often more interesting than the math anyway — smaller models running locally means you can afford to run multiple models in parallel for cross-validation, which changes how you approach tasks like code analysis or bug detection.

bluequbit1mo ago

I did not understand what polarQuant is.

Is is something like pattern based compression where the algorithm finds repeating patterns and creates an index of those common symbols or numbers?

Maxious1mo ago

https://mesuvash.github.io/blog/2026/turboquant-interactive/ has a little visualisation

spencerflem1mo ago

I like the visualization, but I don’t understand the grid quantization. If every point is on the unit circle aren’t all the center grid cords unused?

fc417fc8021mo ago

Yeah that's odd. It seems like you'd want an n-1 dimensional grid on the surface of the unit sphere rather than an n dimensional grid within which the sphere resides.

Looking at the paper (https://arxiv.org/abs/2504.19874) they cite earlier work that does exactly that. They object that grid projection and binary search perform exceptionally poorly on the GPU.

I don't think they're using a regular grid as depicted on the linked page. Equation 4 from the paper is how they compute centroids for the MSE optimal quantizer.

Why specify MSE optimal you ask? Yeah so it turns out there's actually two quantization steps, a detail also omitted from the linked page. They apply QJL quantization to the residual of the grid quantized data.

My description is almost certainly missing key details; I'm not great at math and this is sufficiently dense to be a slog.

mesuvash1mo ago

Yes. Great catch. I simplified the grid just for visualization purpose.

I've updated the visualization. The grid is actually not uniformly spaced. Each coordinate is quantized independently using optimal centroids for the known coordinate distribution. In 2D, unit-circle coordinates follow the arcsine distribution (concentrating near ±1), so the centroids cluster at the edges, not the center.

1 more reply

vincnetas1mo ago

i think grid can be a surface of the unit sphere

Geee1mo ago

Is there an error in the visualization? It shows that every vector is rotated the same amount. My understanding was that they are randomized with different values, which results in a predictable distribution, which is easier to quantize.

mesuvash1mo ago

That's actually correct and intentional. TurboQuant applies the same rotation matrix to every vector. The key insight is that any unit vector, when multiplied by a random orthogonal matrix, produces coordinates with a known distribution (Beta/arcsine in 2D, near-Gaussian in high-d). The randomness is in the matrix itself (generated once from a seed), not per-vector. Since the distribution is the same regardless of the input vector, a single precomputed quantization grid works for everything. I've updated the description to make this clearer.

1 more reply

fc417fc8021mo ago

I believe they are all rotated by the same random matrix, the purpose being (IIUC) to distribute the signal evenly across all dimensions. So effectively it drowns any structure that might be present in noise. That's essential for data efficiency in addition to avoiding bias related issues during the initial quantization step. However there are still some other issues due to bias that are addressed by a second quantization step involving the residual.

That said, I don't believe the visualization is correct. The grid for one doesn't seem to match what's described in the paper.

Also it's entirely possible I've misunderstood or neglected to notice key details.

pstoll1mo ago

Good post but link at the end is broken.

“”” For the full technical explanation with equations, proofs, and PyTorch pseudocode, see the companion post: TurboQuant: Near-Optimal Vector Quantization Without Looking at Your Data.“

mesuvash1mo ago

Author here. Sorry still working on refining the post. Will share once the post is ready.

Rapzid1mo ago

Awesome! So it nudges the vectors into stepped polar rays.. It's effectively angle snapping? Plus a sort of magnitude clustering.

mrugge1mo ago

1. Efficient recursive transform of kv embeddings into polar coordinates 2. Quantize resulting angles without the need for explicit normalization. This saves memory via key insight: angles follow a distribution and have analytical form.

quotemstr1mo ago

Reminds me vaguely of Burrows-Wheeler transformations in bzip2.

viktorcode1mo ago

The way I understand it, it's a way of compressing vectors by switching from their per-component representation to polar coordinates representation, where the nearby vectors are clumped together to a single line, allowing to describe them by different lengths

Rapzid1mo ago

That overview is frustratingly high-level. I know what a vector is, a bit, and yet that compression description is crazy uninformative. And that PolarQuant visualization is.. Very abstract.

htrp1mo ago

The actual paper from April 2025

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

https://arxiv.org/abs/2504.19874

bilsbie1mo ago

It seems like most breakthroughs I see are for efficiency? What are the most importsnt breakthroughs from the past two or three years for intelligence?

Lerc1mo ago

If you think of it from the point of view of the universal approximation theorem, it's all efficiency optimisation. We know that it works if we do it incredibly inefficiently.

Every architecture improvement is essentially a way to achieve the capability of a single fully-connected hidden layer network n wide. With fewer parameters.

Given these architectures usually still contain fully connected layers, unless they've done something really wrong, they should still be able to do anything if you make the entire thing large enough.

That means a large enough [insert model architecture] will be able to approximate any function to arbitrary precision. As long as the efficiency gains with the architecture are retained as the scale increases they should be able to get there quicker.

ertgbnm1mo ago

Most breakthroughs that are published are for efficiency because most breakthroughs that are published are for open source.'

All the foundation model breakthroughs are hoarded by the labs doing the pretraining. That being said, RL reasoning training is the obvious and largest breakthrough for intelligence in recent years.

WarmWash1mo ago

With all the floating around of AI researchers though, I kind of wonder how "secret" all these secrets are. I'm sure they have internal siloing, but even still, big players seem to regularly defect to other labs. On top of this, all the labs seem to be pretty neck and neck, with no one clearly pulling ahead across the board.

cubefox1mo ago

> What are the most importsnt breakthroughs from the past two or three years for intelligence?

The most important one in that timeframe was clearly reasoning/RLVR (reinforcement learning with verifiable rewards), which was pioneered by OpenAI's Q* aka Strawberry aka o1.

irthomasthomas1mo ago

Efficiency gains can be used to make existing models more profitable, or to make new larger and more intelligent models.

cubefox1mo ago

Some yes, others no. Distillation and quantization can't be used to make new base models since they require a preexisting one.

irthomasthomas1mo ago

it enables models larger than was previously possible.

1 more reply

redanddead1mo ago

This is an intelligence breakthrough

antiresonant1mo ago

At this rate, the current AI era is going to clear the queue of all mathematics that's ever been created but not yet applied.

naasking1mo ago

This sounds great! TurboQuant does KV cache compression using quantization via rotations, and ParoQuant [1] does weight compression using quantization via rotations! So we can get 4-bit weights that match bf16 precision, the KV cache goes down to 3 bits per key. This brings larger models and long contexts into the range of "possibly runnable" on beefy consumer hardware.

[1] https://github.com/z-lab/paroquant

mrbonner1mo ago

I feel like I’m not the only who feel excited about the whole “compression” tricks while maintaining fidelity in our AI era. In a way, it has a vibe similar to the early 2000s when digital music became popular and the need for lossless compression was paramount. Sort of a pied piper moment for us now . Someone please make a Weisseman score for this stuff.

1 more reply

ssijak1mo ago

For my grug brain can somebody translate this to ELIgrug terms?

Does this mean I would be able to run 500b model on my 48gb macbook without loosing quality?

x_may1mo ago

KV cache compression, so how much memory the model needs to use for extending its context. Does not affect the weight size.

prabal971mo ago

I wrote this more intuitive explanation. I think you might find it helpful!

https://prabal.ca/posts/google-long-context-cheaper/

maurelius21mo ago

I'm somewhat at a loss here other than understanding the fundamentals. Can someone tell me how the compression impact performance?

dryarzeg1mo ago

If in short, for many inference tasks the bottleneck is memory bandwidth. Suppose you have a machine with a memory bandwidth of 256 GB/s, and let's say you want to do inference for 4B model (model with 4 billion parameters). If you will load the model in BF16 format (16 bits), each forward pass (i.e. each token generated) will require roughly ~8 GB of memory bandwidth. So, 256/8 = 32 t/s, and that's the generation speed you will be strictly capped at even if your processing power is measured in exaFLOPS. But let's say now that you have decided to instead quantize the model and then run the quantized version. Suppose you have made a Q4_K_M version (4 bits + some weights will take more). Now each of your forward passes will take roughly 2-3 GB (rough approximations, reality is different) of memory bandwith (actually, it will be around 2 GB), and even in the worst case 256/3 = 85.3, while 256/2 = 128 t/s. Quants can reduce quality of the model and lower it's performance, but in most modern quantization methods those losses are usually negligible (although, of course, they're still present). So, as you can see, it can be concluded that quantization "widens" (it's not removing it fully) memory bottleneck while still preserving (not always though) acceptable quality.

(Sorry for my terrible English, it's not my native language)

rohansood151mo ago

The paper is about vector quantization, which affects KV cache not model weights/sizes.

valine1mo ago

So let’s start with a really simple decoder transformer with a single layer and single attention head, and train it to predict the next token in a sequence of text. To predict the next token you need a few things: a query for the very last token in the sequence, and a key and value for every prior token. You take your query and compute a dot product with every prior key (two large vectors in, scaler attention score out). That scaler attention score first goes through softmax, and then becomes the weight you use to compute a weighted average of your values, new value goes through the mlp, mlp output is projected into the logits from which you sample your next token (that’s the general idea at least skipped a few steps).

The last query in the sequence will be new for every new token you predict, but the set of prior keys and values stay the same, ie keys and values are reusable. The key value cache gets bigger and bigger for each new token you add to the sequence, and that’s where compression comes in. You have to store the keys and values in vram, and you’d like to keep the size down by not storing the raw uncompressed tensors. To make this work well your compression needs two things: it needs to be fast so that you can compress and decompress on the fly, and it needs to play well with softmax attention. Prior attempts at compression usually suck at one or the other, either the speed to decompress is too slow and your token/s takes a hit, or you lose important precision and the model output quality suffers. The claim in the paper is that they’ve made progress on both.

edg50001mo ago

So limiting max context length also reduces VRAM needs a bit? If cache is 20% of total, 1/10th of context as a limit would mean 18% total memory reduction.

valine1mo ago

Yup exactly, in principle it helps with both inference speed by reducing memory bandwidth usage and also reduces the memory footprint of your kvcache.

prabal971mo ago

Reposting it here ... I wrote this more intuitive explanation. I think you might find it helpful too!

https://prabal.ca/posts/google-long-context-cheaper/

iddan1mo ago

I am guessing as Google is vertically integrated and "actually pays" for AI infra (compared to OpenAI & Anthropic that receives hardware as partnerships) they have a more urgent incentive to reduce model sizes. Also, Google and Apple will be the first to gain from running model on-device

mrcwinn1mo ago

I can assure you OpenAI and Anthropic pay for hardware. They don’t receive it for free.

skybrian1mo ago

This seems to be an inference-time optimization and they are putting AI on every search result page. That seems like plenty of incentive to optimize.

macleginn1mo ago

"TurboQuant proved it can quantize the key-value cache to just 3 bits without requiring training or fine-tuning and causing any compromise in model accuracy" -- what do each 3 bits correspond to? Hardly individual keys or values, since it would limit each of them to 8 different vectors.

carlosvega1mo ago

Is the number of bits per coordinate. So, 1 bit is 2x2 grid. 3 bit is a 64 cell grid (2^3 x 2^3). Here you have a demo.

https://mesuvash.github.io/blog/2026/turboquant-interactive/

jbellis1mo ago

The explanation is terrible, but it's clear that it's not actually lossless.

mmastrac1mo ago

Is this a tradeoff between GPU-computation-expense vs accuracy? ie: you could quantize into segments or grids on the unit circle/sphere/etc, but that's too expensive so it's better to just quantize to a Cartesian grid because the GPU can decompress cheaper?

lwhi1mo ago

Will this help us run models locally?

antoniuschan991mo ago

It could turn a 1M context system to a 4M context system. TurboQuant-style KV-cache compression makes longer context windows cheaper to serve. Not exactly sure how much increase in context size though.

moktonar1mo ago

Aren’t polar coordinates still n-1 + 1 for radius for n-dim vector? If so I understand that angles can be quantized better but when radius r is big the error is large for highly quantized angles right? What am I missing?

amitport1mo ago

r is a single value per vector. You don't have to quantize it, you can keep it and quantize the billion+ other coordinates of the vector.

mungoman21mo ago

What they're saying is that the error for a vector increases with r, which is true.

Trivially, with r=0, the error is 0, regardless of how heavily the direction is quantized. Larger r means larger absolute error in the reconstructed vector.

amitport1mo ago

Yes, the important part is that the normalized error does not increase with the dimension of the vector (which does happen when using biased quantizers)

It is expected that bigger vectors have proportionally bigger error, nothing can be done by the quantizer about that.

1 more reply

lucrbvi1mo ago

Sounds like Multi-Head Latent Attention (MLA) from DeepSeek

veunes1mo ago

Nah, those are completely different beasts. DeepSeek's MLA solves the KV cache issue via low-rank projection - they literally squeeze the matrix through a latent vector at train time. TurboQuant is just Post-Training Quantization where they mathematically compress existing weights and activations using polar coordinates

esafak1mo ago

No, it is about compressing the KV cache; see How TurboQuant works.

_s_a_m_1mo ago

has the word "advanced", gotta be good

alkenrinnstet1mo ago

This article is AI-generated slop.

> This clever step simplifies the data's geometry

No self-respecting researcher talks about their work in this way. But it is characteristic of these chatbots' tendency to over-use superlatives and sycophantic language.

mskkm1mo ago

Pied Piper vibes. As far as I can tell, this algorithm is hardly compatible with modern GPU architectures. My guess is that’s why the paper reports accuracy-vs-space, but conveniently avoids reporting inference wall-clock time. The baseline numbers also look seriously underreported. “several orders of magnitude” speedups for vector search? Really? anyone has actually reproduced these results?

NitpickLawyer1mo ago

Apparently MLX confirmed it - https://x.com/prince_canuma/status/2036611007523512397

mskkm1mo ago

They confirmed on the accuracy on NIAH but didn't reproduce the claimed 8x efficiency.

fc417fc8021mo ago

Efficient execution on the GPU appears to have been one of the specific aims of the authors. Table 2 of their paper shows real world performance that would appear at a glance to be compatible with inference.

mskkm1mo ago

This is not an LLM inference result. Table 2 is the part I find most questionable. Claiming orders-of-magnitude improvements in vector search over standard methods is an extraordinary claim. If it actually held up in practice, I would have expected to see independent reproductions or real-world adoption by now. It’s been about a year since the paper came out, and I haven’t seen much of either. That doesn’t prove the claim is false, but it certainly doesn’t inspire confidence.

veunes1mo ago

Classic academic move. If the authors show accuracy-vs-space charts but hide end-to-end latency, it usually means their code is slower in practice than vanilla fp16 without any compression. Polar coordinates are absolute poison for parallel GPU compute

fc417fc8021mo ago

I don't think they're using polar coordinates? They're quantizing to grid centroids.

j / k navigate · click thread line to collapse

166 comments

amitport1mo ago

jjssmith1mo ago

LOL. This is a classical technique, Johnson-Linderstrauss etc. In this context, rediscovered every few years (recently months), e.g. here's 2017: https://proceedings.mlr.press/v70/suresh17a

amitport1mo ago

We do mention and the paper you shared. Please read our paper to see how the rotation-aware bias correction we introduced efficiently fixes the bias and provides a better worst-case error.

busfahrer1mo ago

I just today learned about Multi-Head Latent Attention, which is also sort of a way of compressing the KV cache. Can someone explain how this new development relates to MHLA?

yorwba1mo ago

tripplyons1mo ago

eecc1mo ago

Pardon my simplistic question, but when you mean rotation you’re essentially talking about diagonalization aren’t you?

So storing the diagonal as a matrix and the new bases is more compact?

amitport1mo ago

eecc1mo ago

tripplyons1mo ago

dmacfour1mo ago

Check out the most recent comment about the paper on OpenReview. This doesn't seem like isolated behavior:

https://openreview.net/forum?id=tO3ASKZlok

sva_1mo ago

Schmidhuber'd

jmalicki1mo ago

If they didn't cite your paper that's bullshit.

amitport1mo ago

jmalicki1mo ago

If they didn't at least cite it, it is complete bullshit.

If they cited it, but you feel you deserved more credit than that... I feel you, but it's less clear cut.

ekjhgkejhgk1mo ago

Doesn't matter, you should still cite. It's basic manners in science.

kleiba1mo ago

Exactly, that's why the section is called "Related Work".

cubefox1mo ago

> But if they read your paper enough that they invited you to a talk, that probably means they were far enough along to independently inventing it

That's more than a stretch. They likely invited them because someone thought the abstract sounded interesting, or something like that.

CyberDildonics1mo ago

That's rationalizing like crazy. If they knew about it they should have cited it.

jmalicki1mo ago

That's what I'm saying - not citing is total bullshit.

But if they invited a talk, and published a paper and cited it, it might be a little off, but not horrible.

efavdb1mo ago

The earlier paper was from 2021!

gavinray1mo ago

Can someone ELI5 these two concepts please, which make no sense to me:

  > "TurboQuant starts by randomly rotating the data vectors. This clever step simplifies the data's geometry"

I don't understand how taking a series of data and applying a random rotation could mathemetically lead every time to "simpler" geometry.

  > "Johnson-Lindenstrauss Transform to shrink complex, high-dimensional data while preserving the essential distances and relationships between data points. It reduces each resulting vector number to a single sign bit (+1 or -1)."

How can a boolean value preserve all of the relational and positional information between data points?

kingstnap1mo ago

Other people have answered here but the real answer is that deep neural networks don't learn isotropic distributions of activations.

[0] https://arxiv.org/pdf/2211.10438 [1] https://transformer-circuits.pub/2023/privileged-basis/index...

gavinray1mo ago

From your second paper:

  > In particular, we can generate fixed random rotation matrices at initialization, and multiply them into the activations any time we read from or write to the residual stream.

I guess I was mistaken in assuming this part was part of the TurboQuant-specific innovations. Still an interesting concept though

Bolwin1mo ago

Do you know if this also applies to the muon optimizer? It seems to be replacing adamw

kingstnap1mo ago

My guess is that probably not for Muon. What I said about ADAM was partly based on this blogpost I read some time ago, should have cited it as well [0].

[0] https://www.lesswrong.com/posts/yrhu6MeFddnGRSLtQ/adam-optim...

lumost1mo ago

photon_lines1mo ago

nico1mo ago

photon_lines1mo ago

Thank you! I'm glad you found it helpful (and that others did too)!!

thrtythreeforty1mo ago

rtrgrd1mo ago

Added to my non-llm username list :)

Thanks so much for the explanation

psidium1mo ago

Wow, thank you for the explanation. Such a complex topic and yet you’ve made it simple to understand.

functional_dev1mo ago

i wonder what is the limit of quantization when it starts to destroy the logic of weights?

gavinray1mo ago

I had to read this over a few times to piece it together, thanks for the thorough and digestable explanation!

rohansood151mo ago

Thank you.

gopalv1mo ago

> I don't understand how taking a series of data and applying a random rotation could mathemetically lead every time to "simpler" geometry.

Let's pick a simpler compression problem where changing the frame of reference improves packing.

There's a neat trick in the context of floating point numbers.

The values do not always compress when they are stored exactly as given.

[0.1, 0.2, 0.3, 0.4, 0.5]

Maybe I can encode them in 15 bytes instead of 20 as float32.

Up the frame of reference to be decibels instead of bels and we can encode them as sequential values without storing exponent or sign again.

Changing the frame of reference, makes the numbers "more alike" than they were originally.

But how do you pick a good frame of reference is all heuristics and optimization gradients.

redanddead1mo ago

AI and graphics are matrices

Matrices are numbers [x,y,z]

GPUs are matrix processing units

Models are big matrices, we quantize them to make them small. That is lossy. Makes AI dumber the harder you quantize but lets you run inference with lesser hardware

What if you could quantize less destructively/lossy? You could make a model way smaller or make much bigger models that run on less RAM

If we quantized it as aggressively as we would have without the distribution/mutation function, the drop in benchmarks would be even more noticeable

It's actually a huge breakthrough and commercially its probably only a short term loss in valuation for the manufacturers

wordpad1mo ago

>How can a boolean value preserve all of the relational and positional information between data points?

They aren't reducing entire vector to a bollean only each of its dimensions.

elif1mo ago

akhenakh1mo ago

Someone implementing it on llamacpp already https://github.com/mudler/llama.cpp/commit/dee102db1bfd723c9...

GistNoesis1mo ago

He even attempts to improve on the paper by replacing the random rotation operation which is O(d^2), by a Subsampled Randomized Hadamard Transform which can be computed in O(d*log d).

1 more reply

cpburns20091mo ago

For some reason I thought the implementation would be way more complicated than that. I obviously lack the domain knowledge to tackle something like this, but it looks straight forward.

qingcharles1mo ago

Agreed. Actual LOC is tiny. Very impressive PR.

vibe421mo ago

The pace of development in llama.cpp is really high, could see an implementation being merged in 4-6 weeks.

parsimo20101mo ago

This blog post sucks. It does not make me want to read the papers.

Look at this figure: https://storage.googleapis.com/gweb-research2023-media/image...

Look at this video visualization: https://storage.googleapis.com/gweb-research2023-media/media...

Do you have literally any clue what Polar Quantization is? Would this make me think, "I kind of have a high level understanding of that, let me go get the details from the paper."

Look at this figure: https://storage.googleapis.com/gweb-research2023-media/image...

davesque1mo ago

alkenrinnstet1mo ago

It's slop. The text is also clearly generated by a chatbot with its nonsensical comparisons and bizarrely superlative language.

flux31251mo ago

I bet the paper was vibe written too

pstoll1mo ago

And a group has published an independent working implementation today, nice to see:

https://github.com/tonbistudio/turboquant-pytorch

ilija1391mo ago

It has a lot clearer explanation of the method than Google's own post.

ramon1561mo ago

Well, yeah. Claude simplified it. That doesn't mean it's a better explanation.

adi_kurian1mo ago

Did it lose important detail?

benob1mo ago

This is the worst lay-people explanation of an AI component I have seen in a long time. It doesn't even seem AI generated.

spencerflem1mo ago

I think it is though-

integralid1mo ago

nvme0n1p11mo ago

There are tells all over the page:

> Redefining AI efficiency with extreme compression

"Redefine" is a favorite word of AI. Honestly no need to read further.

> the key-value cache, a high-speed "digital cheat sheet" that stores frequently used information under simple labels

> QJL: The zero-overhead, 1-bit trick

> It reduces each resulting vector number to a single sign bit (+1 or -1). This algorithm essentially creates a high-speed shorthand that requires zero memory overhead.

It's 1:30am and I can't sleep, and I still regret wasting my time on this slop.

5 more replies

g-mork1mo ago

Another instinctual reaction here. This specific formulation pops out of AI all the time, there might as well have been an emdash in the title

NoahZuniga1mo ago

Genius new idea: replace the em-dashes with semicolons so it looks less like AI.

tux31mo ago

You're absolutely right. That's not just a genius idea; it's a radical new paradigm.

Quarrel1mo ago

Damnit.

There goes another bit of my writing style that will get mistaken for an LLM.

zarzavat1mo ago

I read "this clever step" and immediately came to the comments to see if anyone picked up on it.

It reads like a pop science article while at the same time being way too technical to be a pop science article.

Turing test ain't dead yet.

TeMPOraL1mo ago

> Turing test ain't dead yet.

benob1mo ago

Maybe they quantized a bit too much the model parameters...

BenoitP1mo ago

1 more reply

davesque1mo ago

Yeah, and some parts of the article are just bizarre:

Why bother explaining this? Were they targeting the high school and middle school student reader base??

jeeeb1mo ago

It feels very much like Gemini’s writing style - overly excited with lots of unnecessary contrasts.

mesuvash1mo ago

TurboQuant explained with an easy to understand (no-math) animation https://mesuvash.github.io/blog/2026/turboquant-interactive/

fc417fc8021mo ago

mesuvash1mo ago

fc417fc8021mo ago

It looks nice! Fair enough about QJL - it seems to be nothing more than an unbiasing measure anyway.

Anyway I'm pretty sure the authors inadvertently omitted that detail which really had me confused for a while there.

[0] https://arxiv.org/abs/2504.19874

1 more reply

wbsun1mo ago

mskkm1mo ago

seems to be a scam

We’re speaking up now because once a misleading narrative spreads, it becomes much harder to correct. We’ve written a public comment on openreview (https://openreview.net/forum?id=tO3ASKZlok).

We would greatly appreciate your attention and help in sharing it."

https://x.com/gaoj0017/status/2037532673812443214

bdcs1mo ago

Here's my attempt at a undergrad-level summary (corrections welcome!):

bdcs1mo ago

Some corrections: the vectors are un-rotated in practice for future query vectors. This could be removed with a slightly different LLM arch.

PolarQuant does live on in TurboQuant's codebooks for quantization which borrows from the hyperpolar coords

fc417fc8021mo ago

> added in some gross / highly-pragmatic extra bits for important channels

I'm curious what you meant by that. I understood it to only have the MSE quantization vector, a 1-bit QJL vector, and a scalar magnitude.

> PolarQuant does live on in TurboQuant's codebooks for quantization which borrows from the hyperpolar coords

Isn't the turbo codebook the irregularly spaced centroid grid?

bdcs1mo ago

> extra bits per channel

So they find channels / indicies-of-the-vector that are important and give them more bits (3 bits) than the rest (2 bits).

>Isn't the turbo codebook the irregularly spaced centroid grid?

krackers1mo ago

Beautiful explanation, thanks!

zeeshana07x1mo ago

The gap between how this is described in the paper vs the blog post is pretty wide. Would be nice to see more accessible writing from research teams — not everyone reading is a ML engineer

om81mo ago

These are very different media types with very different goals.

dev_tools_lab1mo ago

bluequbit1mo ago

I did not understand what polarQuant is.

Is is something like pattern based compression where the algorithm finds repeating patterns and creates an index of those common symbols or numbers?

Maxious1mo ago

https://mesuvash.github.io/blog/2026/turboquant-interactive/ has a little visualisation

spencerflem1mo ago

I like the visualization, but I don’t understand the grid quantization. If every point is on the unit circle aren’t all the center grid cords unused?

fc417fc8021mo ago

Yeah that's odd. It seems like you'd want an n-1 dimensional grid on the surface of the unit sphere rather than an n dimensional grid within which the sphere resides.

Looking at the paper (https://arxiv.org/abs/2504.19874) they cite earlier work that does exactly that. They object that grid projection and binary search perform exceptionally poorly on the GPU.

I don't think they're using a regular grid as depicted on the linked page. Equation 4 from the paper is how they compute centroids for the MSE optimal quantizer.

My description is almost certainly missing key details; I'm not great at math and this is sufficiently dense to be a slog.

mesuvash1mo ago

Yes. Great catch. I simplified the grid just for visualization purpose.

1 more reply

vincnetas1mo ago

i think grid can be a surface of the unit sphere

Geee1mo ago

mesuvash1mo ago

1 more reply

fc417fc8021mo ago

That said, I don't believe the visualization is correct. The grid for one doesn't seem to match what's described in the paper.

Also it's entirely possible I've misunderstood or neglected to notice key details.

pstoll1mo ago

Good post but link at the end is broken.

“”” For the full technical explanation with equations, proofs, and PyTorch pseudocode, see the companion post: TurboQuant: Near-Optimal Vector Quantization Without Looking at Your Data.“

mesuvash1mo ago

Author here. Sorry still working on refining the post. Will share once the post is ready.

Rapzid1mo ago

Awesome! So it nudges the vectors into stepped polar rays.. It's effectively angle snapping? Plus a sort of magnitude clustering.

mrugge1mo ago

quotemstr1mo ago

Reminds me vaguely of Burrows-Wheeler transformations in bzip2.

viktorcode1mo ago

Rapzid1mo ago

That overview is frustratingly high-level. I know what a vector is, a bit, and yet that compression description is crazy uninformative. And that PolarQuant visualization is.. Very abstract.

htrp1mo ago

The actual paper from April 2025

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

https://arxiv.org/abs/2504.19874

bilsbie1mo ago

It seems like most breakthroughs I see are for efficiency? What are the most importsnt breakthroughs from the past two or three years for intelligence?

Lerc1mo ago

If you think of it from the point of view of the universal approximation theorem, it's all efficiency optimisation. We know that it works if we do it incredibly inefficiently.

Every architecture improvement is essentially a way to achieve the capability of a single fully-connected hidden layer network n wide. With fewer parameters.

Given these architectures usually still contain fully connected layers, unless they've done something really wrong, they should still be able to do anything if you make the entire thing large enough.

ertgbnm1mo ago

Most breakthroughs that are published are for efficiency because most breakthroughs that are published are for open source.'

All the foundation model breakthroughs are hoarded by the labs doing the pretraining. That being said, RL reasoning training is the obvious and largest breakthrough for intelligence in recent years.

WarmWash1mo ago

cubefox1mo ago

> What are the most importsnt breakthroughs from the past two or three years for intelligence?

The most important one in that timeframe was clearly reasoning/RLVR (reinforcement learning with verifiable rewards), which was pioneered by OpenAI's Q* aka Strawberry aka o1.

irthomasthomas1mo ago

Efficiency gains can be used to make existing models more profitable, or to make new larger and more intelligent models.

cubefox1mo ago

Some yes, others no. Distillation and quantization can't be used to make new base models since they require a preexisting one.

irthomasthomas1mo ago

it enables models larger than was previously possible.

1 more reply

redanddead1mo ago

This is an intelligence breakthrough

antiresonant1mo ago

At this rate, the current AI era is going to clear the queue of all mathematics that's ever been created but not yet applied.

naasking1mo ago

[1] https://github.com/z-lab/paroquant

mrbonner1mo ago

1 more reply

ssijak1mo ago

For my grug brain can somebody translate this to ELIgrug terms?

Does this mean I would be able to run 500b model on my 48gb macbook without loosing quality?

x_may1mo ago

KV cache compression, so how much memory the model needs to use for extending its context. Does not affect the weight size.

prabal971mo ago

I wrote this more intuitive explanation. I think you might find it helpful!

https://prabal.ca/posts/google-long-context-cheaper/

maurelius21mo ago

I'm somewhat at a loss here other than understanding the fundamentals. Can someone tell me how the compression impact performance?

dryarzeg1mo ago

(Sorry for my terrible English, it's not my native language)

rohansood151mo ago

The paper is about vector quantization, which affects KV cache not model weights/sizes.

valine1mo ago

edg50001mo ago

So limiting max context length also reduces VRAM needs a bit? If cache is 20% of total, 1/10th of context as a limit would mean 18% total memory reduction.

valine1mo ago

Yup exactly, in principle it helps with both inference speed by reducing memory bandwidth usage and also reduces the memory footprint of your kvcache.

prabal971mo ago

Reposting it here ... I wrote this more intuitive explanation. I think you might find it helpful too!

https://prabal.ca/posts/google-long-context-cheaper/

iddan1mo ago

mrcwinn1mo ago

I can assure you OpenAI and Anthropic pay for hardware. They don’t receive it for free.

skybrian1mo ago

This seems to be an inference-time optimization and they are putting AI on every search result page. That seems like plenty of incentive to optimize.

macleginn1mo ago

carlosvega1mo ago

Is the number of bits per coordinate. So, 1 bit is 2x2 grid. 3 bit is a 64 cell grid (2^3 x 2^3). Here you have a demo.

https://mesuvash.github.io/blog/2026/turboquant-interactive/

jbellis1mo ago

The explanation is terrible, but it's clear that it's not actually lossless.

mmastrac1mo ago

lwhi1mo ago

Will this help us run models locally?

antoniuschan991mo ago

moktonar1mo ago

amitport1mo ago

r is a single value per vector. You don't have to quantize it, you can keep it and quantize the billion+ other coordinates of the vector.

mungoman21mo ago

What they're saying is that the error for a vector increases with r, which is true.

Trivially, with r=0, the error is 0, regardless of how heavily the direction is quantized. Larger r means larger absolute error in the reconstructed vector.

amitport1mo ago

Yes, the important part is that the normalized error does not increase with the dimension of the vector (which does happen when using biased quantizers)

It is expected that bigger vectors have proportionally bigger error, nothing can be done by the quantizer about that.

1 more reply

lucrbvi1mo ago

Sounds like Multi-Head Latent Attention (MLA) from DeepSeek

veunes1mo ago

esafak1mo ago

No, it is about compressing the KV cache; see How TurboQuant works.

_s_a_m_1mo ago

has the word "advanced", gotta be good

alkenrinnstet1mo ago

This article is AI-generated slop.

> This clever step simplifies the data's geometry

No self-respecting researcher talks about their work in this way. But it is characteristic of these chatbots' tendency to over-use superlatives and sycophantic language.

mskkm1mo ago

NitpickLawyer1mo ago

Apparently MLX confirmed it - https://x.com/prince_canuma/status/2036611007523512397

mskkm1mo ago

They confirmed on the accuracy on NIAH but didn't reproduce the claimed 8x efficiency.

fc417fc8021mo ago

mskkm1mo ago

veunes1mo ago

fc417fc8021mo ago

I don't think they're using polar coordinates? They're quantizing to grid centroids.

j / k navigate · click thread line to collapse