They always hope the speed increase makes up for the lower quality, but it never does. The quadratic time seems inherent to the problem.
Indeed, there are lower bounds showing that sub n^2 algorithms can't work: https://arxiv.org/pdf/2302.13214
> In practice, we find that four Taylor terms (P = 4) suffice for recovering conventional attention with elementwise errors of approximately the same magnitude as Float16 resolution, acceptable for many AI applications.
ie., the claim is that this method reproduces the results of conventional attention, up to float16 numerical precision.
and they really do mean that, their results show +/- 1 on log10 plots.
This paper at least aspires to reproduce 'true' attention, which distinguishes it from many of the others. TBD if its successful in that.
I ask because in practice, for inference, attention is typically computed with low-precision (4-bit, 8-bit, 16-bit) floats.
Numerical error, in fact, may be a key factor as to why quadratic attention, in practice, exhibits context rot as context gets longer, analogous to an RNN:
https://www.anthropic.com/engineering/effective-context-engi...
[1] Attention Is Not What You Need, https://arxiv.org/abs/2512.19428
The github repository's first toy example is with 8 Taylor terms, applied to a context of 1B tokens, with attention computed over 1K heads per token. (Note that applying the quadratic formulation to 1B tokens, each with 1K heads, is not practical with current hardware, because it would require computing 1K attention matrices, each with 1B×1B dot-product scores.
Like every other proposed method, this one must be tested too. If it works, AI service providers who ignore it will find themselves at a disadvantage.
It's worth mentioning also that the mathematical techniques introduced by this work are likely of interest for other applications besides attention.
The big mitigation for this is that in causal transformers (i.e. all the chatbot type applications, where each token is only allowed to see tokens before it), you're running inference repeatedly on the same prefix in order to grow it by one token at a time. So if you cache the computations for tokens 0..N-1, on each inference pass you only have to compute O(N) for the newly added token at the end of the sequence.
That's why caching (and caching charges) appear so prominently everywhere in the pricing of inference.
In practice, caching is most beneficial at inference time, because you typically have relatively long conversations that start with the same cacheable prefix (the system prompt). At training time the same optimization can apply, but you're typically not pushing the same prefixes through the model repeatedly so you end up paying the quadratic cost more often.
The quadratic cost of attention is the fundamental compute bottleneck for transformer architectures, which is why there's research like this trying to find shortcuts in computing attention, as well as research into completely new primitives to replace attention (e.g. SSM, which is O(N) on a cold cache and O(1) on a warm cache).
There are other experiments where model designers mix full-attention layers with limited-memory ones. (Which still doesn't avoid N^2, but if e.g. 3/4 of layers use 'light' attention, it still improves efficiency a lot.) The idea is the model can still pull information from far back in context, just not in every layer. Use so far is limited to smaller models (maybe it costs too much model capability to use at the high end?) but it seems like another interesting angle on this stuff.
> DSA reduces the core attention complexity of the main model from O(L^2) to O(Lk), where k (<< L) is the number of selected tokens. Although the lightning indexer still has a complexity of O(L^2), it requires much less computation compared with MLA in DeepSeek-V3.1-Terminus
I wonder if there's a connection to your Taylor truncation order. In RG terms, higher-order polynomial interactions are "irrelevant operators"—they get suppressed as you flow toward the fixed point. If trained attention heads are sitting near this fixed point, that might explain why modest truncation orders work: the network has already learned to concentrate its computation in the lower-order terms. A testable prediction: layers with α closer to 2 (measurable via weightwatcher https://github.com/CalculatedContent/WeightWatcher) might need fewer Taylor terms for accurate approximation than layers with α far from 2. If true, you could potentially use the spectral statistics to adaptively choose truncation order per-head.
Sections 2.1 through 2.4 talk about the decomposing the per-token-pair attention (key vector from the ith token with query vector from the jth token, where, in inference, the jth token is the one being sampled) into an approximation that is only mildly outrageously exponential in size compared to the original exponential-of-a-dot product. And they get something that's a polynomial (in the mathematical sense -- you're literally evaluating a polynomial) and has a size that's manageable at 4th order.
Okay, great, they took something simple and made it bigger and nastier but less transcendental without losing too much precision. (As far as I know, there is really nothing special about the exp in attention in the first place, so trying to approximate it well seems mostly useful insofar as it will keep existing models working.)
But the reason that attention is quadratic is that each token gets evaluated with respect to each other token. They haven't changed this at all. Section 2.5 seems like it's deferring this to an appendix. Section 2.6 gives the hidden state size per token, which, on first read, is strictly larger than the hidden state in normal attention (in normal attention it's d_v * d_k -- I'm not sure where their +1 comes from).
So what did the paper gain? Is there some detail that I missed or that the paper completely glossed over that explains why there is any gain of efficiency at all?
For what it's worth, the paper's overall claim is, in some sense, impossible. You can think of attention as being a sort of vector database, and this gets more accurate the sharper you make the exponential. If you replace softmax with actual max, a query locates the key that is the closest match to the query and returns the associated value. This operation is a plain linear search, it's possible (in principle anyway) to do lots of queries and recover the entire contents of the database, and I think that any paper claiming to do it faster than linear time should explain how it's compressing the data and where the loss is.
In language model terms, imagine an prompt like so:
1: [string 1]
2: [string 2]
3: [string 3]
...
n: [string n]
Tell me the string associated with the number k.
As long as there's enough precision and enough query/key space to fit some embedding of the number k that will match the right thing (and there is a lot of room in high-dimensional spaces), one might expect a transformer to be able to answer this question. But this obviously requires memory with size linear in the prompt length. If you try to get rid of that, you necessarily lose something. (This is not to say that nice attention scaling is impossible -- one could imagine schemes where it takes the model multiple tokens to answer the question, and the number of tokens needed could scale, say, logarithmically with prompt size. But you still need that linear memory.)They defer it to the appendix because it's a standard construction (Q'K)V = Q'(KV), where Q'K is an n×n matrix and requires O(n²) to compute, but KV has a constant size and can be computed in O(n) time, and the multiplication with Q' can also be done in O(n) time.
> Section 2.6 gives the hidden state size per token, which, on first read, is strictly larger than the hidden state in normal attention (in normal attention it's d_v * d_k -- I'm not sure where their +1 comes from).
Actually, their hidden state has a (large) constant size, so strike the words "per token" from section 2.6. In normal attention, the total state is n(d_v + d_k), but their state is basically (d_v + 1)D_k, where D_k is much larger than d_k, but independent of n. The +1 is because they also need to compute the normalization factor for the softmax.
It's true that a constant state size implies that you cannot use it to losslessly store arbitrarily large databases, but LLMs in practice cannot do this either, so there's no loss of capability in that sense. (In fact, if you use enough terms in the Taylor expansion to get the same result as standard attention to within machine precision, the resulting constant state size should give you an upper bound for the amount of data the LLM can effectively retrieve from its context.)
I think you've nailed it: Machine precision puts an upper bound (of constant size) on how much information an LLM can retrieve from its context.
Let's say you consider the 3 most-recent tokens. The first insight is that you can use a Taylor approximation: At token position 3 you compute A_3 = ((q1, q2, q3) . (k1, k2, k3))^1, B_3 = ((q1, q2, q3) . (k1, k2, k3)^2, C_3 = ((q1, q2, q3) . (k1, k2, k3))^3, etc. [1] [2]
The second insight is that you can compute e.g. B_{i+1} incrementally from B_i, with much fewer FLOPS than computing B_{i+1} from scratch. [3]
[1] I'd buy that it's empirically "good enough" that you don't need to go beyond D_3 (fourth degree polynomial).
[2] I'd also buy that it's empirically "good enough" to assume the inputs aren't extreme enough for E_3, F_3 etc. to matter. I agree with other posters that radius of convergence worries aren't addressed. I find it plausible that these issues don't sink the paper. I'd not be surprised to learn that either it doesn't matter in practice, or workarounds can be implemented without much performance impact.
[3] The author's choice to bury this insight in an appendix rather than putting it front and center is a baffling pedagogical choice but it's a small issue in the grand scheme of things. Perhaps that second insight is prior work (possibly by others) that experts in the latest LLM linear algebra could reasonably be expected to be familiar with, but is included as an appendix because it's not universally known in e.g. HN comment sections?
This is where you’ve gone off track. The “hidden state” for their model is a fixed size thing, like in an RNN, not per token. For a transformer, the “hidden state” is called the KV cache, and it grows with sequence length. This is why their method is linear not quadratic.
The Taylor Series they derive isn’t just for softmax (after all, real implementations of softmax will likely already use the Taylor series!), it’s for the entire tensor-level softmax(QK) computation.
(I'm imagining that if in the context there's ~4-8 "similar" attention-targets that should be sharp, and regular attention learns to select the correct one, this taylor approximation version would wash out any difference and they'd all loosly be attended to, and it'd fail to isolate the correct signal)
Really wish this had some downstream tests -- apply it to a pretrained model and see how performance degrades, train a fresh one, etc. The tests are worth doing, but I somehow don't feel that hopeful this is the unlock required for sub-quadratic attention. It's possible that a freshly trained model with this learns to attend without the sharp attention signals, but that seems a bit dubious to me.
But also, maybe this combined with some other selective (sparse attention) trick, means that the hybrid model gets the "fuzzy long tail" of attention well represented as well as the sharpness well represented, and all together it could actually be a part of the larger solution.
"In practice, we find that four Taylor terms (P = 4) suffice for recovering conventional attention with elementwise errors of approximately the same magnitude as Float16 resolution"
I think this does soften, but not linearly. That is to say the fixed state size limitation means that it softens more as it gets further into the past.
Video presentation if someone prefers it: https://www.youtube.com/watch?v=PN3nYBowSvM
Linear attention is a first-degree approximation of Softmax attention, and model performance gets better as you increase the degree of the Taylor approximation.
I'm thinking about adapting an existing model to Taylor-approximated attention. I think it should be possible with some model surgery and rehabilitation training.
My other concern would be that Taylor itself is fairly complex. I wonder how well GPU's handle this in comparison to good old fashioned softmax? The last time I used Taylor with a custom Triton kernel it was still very slow. That could just have been my own jank vibe-coded implementation though.
They also haven't' tried to write a high performance kernel for triton yet. If it goes the way my last experiment with Taylor did they're in for some bad news.
I'm just a hobbyist though, it's certainly possible that people with more time/resources could outperform me without much effort. I just want to see it tested on something familiar and benchmark-able.
https://github.com/glassroom/sata_attention
That toy example is not practical with the quadratic formulation, because it would require computing and storing 1K attention matrices, each with 1B×1B dot-product scores. For example, at Float32 precision, those attention matrices would consume approximately 1K x 1B x 1B x 4 bytes = 3,725,290,298.5 Terabytes of memory, which is not practical.
Like every other proposed method, this one must be tested too. If it performs well in practice, AI service providers who ignore it will find themselves at a disadvantage.
Otherwise, the mathematical techniques introduced by this work are likely useful for other applications besides Transformer attention.
Now this is a very interesting paper, which hopefully should address the chronic inefficiencies of the AI lack of efficient methods and approaches in reducing their significant computational and energy demands which are off the charts.
> These factors penalize performance relative to what a fused, hardware-optimized implementation could achieve, and the reported runtime results should therefore be interpreted conservatively.
It's still early with several limitations, but the need for wasting billions on GPUs will begin to not make any sense soon.
> these models dominate both exponential attention and linear attention at long-context training
There is no exponential attention; standard attention is quadratic. Strange mistake.