> In a future post, we will explain how to improve the learning of linear transformers
So the techniques here are useless without special secret sauce that they're not disclosing. Yet. Mamba is already out there solving similar problems, but the more the merrier. I hope they publish the useful part soon.
The sad part about the whole situation is that one has to hype the research as the new best thing ever, rather than as an experiment that was well motivated (not all of them are) with results that weren’t as nice as hoped.
(Disclaimer: I am an author on the linked paper)
Also I note the only thing you have posted before is a link to this paper in particular.
The posted algorithm and the one mentioned in my paper are very similar. It is just that the cumulative sum computation is parallelized in the posted website.
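To make the "cumulative sum" remark concrete, here is a minimal NumPy sketch of unnormalized causal linear attention written as a prefix sum. The function names, the shapes, and the omission of any normalization or feature map are my assumptions for illustration, not code from the post or the paper. `np.cumsum` itself runs sequentially here; the point is only that a prefix sum is an operation that can be parallelized on suitable hardware.

```python
import numpy as np

def linear_attention_loop(Q, K, V):
    """Sequential form: keep a running sum of outer(K[t], V[t])."""
    T, d_k = Q.shape
    d_v = V.shape[1]
    running = np.zeros((d_k, d_v))
    out = np.empty((T, d_v))
    for t in range(T):
        running = running + np.outer(K[t], V[t])
        out[t] = Q[t] @ running
    return out

def linear_attention_cumsum(Q, K, V):
    """Same computation, with the running sum expressed as one prefix sum."""
    kv = np.einsum("tk,tv->tkv", K, V)      # per-step outer products, (T, d_k, d_v)
    states = np.cumsum(kv, axis=0)          # prefix sums over time
    return np.einsum("tk,tkv->tv", Q, states)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
assert np.allclose(linear_attention_loop(Q, K, V), linear_attention_cumsum(Q, K, V))
```

Note that the cumsum version materializes all T intermediate states, so it trades memory for the ability to compute them in one pass.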
Simplifying the model to be linear costs it expressiveness, which is why they don't fit the training data as well.
By persisting the state variable across subsequent computations they transform the quadratic formula for computing output into a linear formula computing output and next state from current state.
It's kind of like memoization, but since it's a number it's constant space too.
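A rough sketch of what that looks like, assuming unnormalized causal attention with no softmax (the names and shapes are mine, not taken from the post). The quadratic version builds a T x T score matrix; the recurrent version carries a fixed-size state, which in this sketch is a d_k x d_v matrix rather than a single number, but it still does not grow with sequence length.

```python
import numpy as np

def attention_quadratic(Q, K, V):
    """O(T^2): full score matrix with the future masked out, no softmax."""
    scores = np.tril(Q @ K.T)        # (T, T), zero above the diagonal
    return scores @ V                # (T, d_v)

def attention_recurrent(Q, K, V):
    """O(T): output and next state computed from the current state."""
    T, d_k = Q.shape
    d_v = V.shape[1]
    state = np.zeros((d_k, d_v))     # constant size, independent of T
    out = np.empty((T, d_v))
    for t in range(T):
        state = state + np.outer(K[t], V[t])   # next state
        out[t] = Q[t] @ state                  # output for step t
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((6, 4)) for _ in range(3))
assert np.allclose(attention_quadratic(Q, K, V), attention_recurrent(Q, K, V))
```

The two agree only because the softmax is gone: with it, the sum over past keys cannot be folded into a fixed-size state, which is the expressiveness trade-off mentioned above.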
$ curl -s https://manifestai.com/blogposts/faster-after-all/ | grep generator
<meta name="generator" content="quarto-1.3.450">
Even when heavily optimized, they are still (nearly) no better than normal flash attention up to context length 10^4.
And then you haven't even started to account for the degradation in learning.
Maybe if you're doing 100k attention at inference it starts making sense... But then there are other methods you can start using too.