Linear transformers are faster after all (opens in new tab)

(manifestai.com)

116 pointsJoelEinbinder2y ago22 comments

22 comments

20 comments · 6 top-level

modeless2y ago· 4 in thread

> our linear transformers are somewhat useless, as the positive impact from the speedup seen in long contexts is undermined by the negative impact of degraded learning.

> In a future post, we will explain how to improve the learning of linear transformers

So the techniques here are useless without special secret sauce that they're not disclosing. Yet. Mamba is already out there solving similar problems, but the more the merrier. I hope they publish the useful part soon.

sigmoid102y ago

That's pretty common in academia. You publish something new that is worse than the state-of-the-art. To maintain some semblence of meaning for your work, you you then say that the shortcomings will be addressed in future papers. Often these papers never surface because somewhere along the way it turns out that even though your approach was new, it is fundamentally worse. This kind of stuff happens all the time in research and it only makes it to the surface thanks to this twisted publish or perish world academics now live in.

rsfern2y ago

A counter perspective: it’s a good thing these ideas make it to the surface! Clearly someone thought it was a good enough idea to try, now others can have better data before trying that same rabbit hole

The sad part about the whole situation is that one has to hype the research as the new best thing ever rather than an experiment that was well motivated (not all of them are) with results that weren’t as nice as hoped

1 more reply

tnecniv2y ago

Or you get distracted and move on to a different project. Thats the biggest reason I never followed up on my future work ideas in papers.

3abiton2y ago

Wonder why they did not compare it with it.

SmartestUnknown2y ago· 4 in thread

This is not a new algorithm. The same algorithm is described in Figure 4 (Theorem 3.1) of https://arxiv.org/pdf/2310.01655.pdf

(Disclaimer: I am an author on the linked paper)

TTPrograms2y ago

I don't think the posted algorithm is particularly novel, but the algorithm you cite is deeply different.

Also I note the only thing you have posted before is a link to this paper in particular.

SmartestUnknown2y ago

I didn't want to associate this account w/ my real name but now that you mentioned it wasn't right of me to not point that out. I added a disclaimer.

The posted algorithm and the one mentioned in my paper are very similar. It is just that the cumulative sum computation is parallelized in the posted website.

1 more reply

WeMoveOn2y ago

his username checks out

smsx2y ago

The point of this post isn’t the linear transformer algorithm. They’re surveying a variety of Linear transformers and showing a general form in order to talk at large about their performance characteristics.

hacketthfk2y ago· 3 in thread

I don't understand something, why do they claim they go from O(N*N) to O(N), but all they claim they are doing is removing one exponentiation operation, which is O(1)? Where is the O(N) they are removing?

robrenaud2y ago

Removing the exponential allows some linear algebra based tricks. It makes the state space linear. Linearity allows a kind of running sum, where the state space at time T is quickly computable from the state space at time T-1.

That linearity model simplification has model expressiveness costs, which is why they don't fit the training data as well.

cherryteastain2y ago

Wonder if it'd ve possible to have our cake and eat it too by treating layer outputs as log(state space) in that case?

duped2y ago

It's described explicitly in section 1 where they first reduce to a linear relationship and then recognize that a portion of the formula can be captured in a state variable, and rewrite as a recurrence relation.

By persisting the state variable across subsequent computations they transform the quadratic formula for computing output into a linear formula computing output and next state from current state.

It's kind of like memoization, but since it's a number it's constant space too.

bbertelsen2y ago· 2 in thread

What did they use to build this site? I could have sworn I saw what looked like LaTex when it was loading?

Cieplak2y ago

Looks like quarto. The LaTex you saw may have been MathJax.

    $ curl -s https://manifestai.com/blogposts/faster-after-all/ | grep generator
    <meta name="generator" content="quarto-1.3.450">

JoelEinbinderOP2y ago

Developer tools point to MathJax https://www.mathjax.org/. If you disable javascript you can see some LaTex.

thomasahle2y ago· 1 in thread

To be honest this makes me less excited about linear transformers.

If even heavily optimized, they are still (nearly) no better than normal flash attention up to context length 10^4.

And then you haven't even started to account for the degradation in learning.

Maybe if you're doing 100k attention at inference it starts making sense... But then there are other methods you can start using too.

f_devd2y ago

You should check out MoE-Mamba (https://arxiv.org/abs/2401.04081), it's faster and more accurate than Transformer-MoE. Of course only time will tell if it's better when scaled up further than the paper goes.

deepsquirrelnet2y ago

Great writeup and interesting experiments. I can’t help but wonder what would happen in you instead made a rectified linear attention. Is that even possible?

j / k navigate · click thread line to collapse

22 comments

20 comments · 6 top-level

modeless2y ago· 4 in thread

> our linear transformers are somewhat useless, as the positive impact from the speedup seen in long contexts is undermined by the negative impact of degraded learning.

> In a future post, we will explain how to improve the learning of linear transformers

sigmoid102y ago

rsfern2y ago

1 more reply

tnecniv2y ago

Or you get distracted and move on to a different project. Thats the biggest reason I never followed up on my future work ideas in papers.

3abiton2y ago

Wonder why they did not compare it with it.

SmartestUnknown2y ago· 4 in thread

This is not a new algorithm. The same algorithm is described in Figure 4 (Theorem 3.1) of https://arxiv.org/pdf/2310.01655.pdf

(Disclaimer: I am an author on the linked paper)

TTPrograms2y ago

I don't think the posted algorithm is particularly novel, but the algorithm you cite is deeply different.

Also I note the only thing you have posted before is a link to this paper in particular.

SmartestUnknown2y ago

I didn't want to associate this account w/ my real name but now that you mentioned it wasn't right of me to not point that out. I added a disclaimer.

The posted algorithm and the one mentioned in my paper are very similar. It is just that the cumulative sum computation is parallelized in the posted website.

1 more reply

WeMoveOn2y ago

his username checks out

smsx2y ago

hacketthfk2y ago· 3 in thread

robrenaud2y ago

That linearity model simplification has model expressiveness costs, which is why they don't fit the training data as well.

cherryteastain2y ago

Wonder if it'd ve possible to have our cake and eat it too by treating layer outputs as log(state space) in that case?

duped2y ago

By persisting the state variable across subsequent computations they transform the quadratic formula for computing output into a linear formula computing output and next state from current state.

It's kind of like memoization, but since it's a number it's constant space too.

bbertelsen2y ago· 2 in thread

What did they use to build this site? I could have sworn I saw what looked like LaTex when it was loading?

Cieplak2y ago

Looks like quarto. The LaTex you saw may have been MathJax.

    $ curl -s https://manifestai.com/blogposts/faster-after-all/ | grep generator
    <meta name="generator" content="quarto-1.3.450">

JoelEinbinderOP2y ago

Developer tools point to MathJax https://www.mathjax.org/. If you disable javascript you can see some LaTex.

thomasahle2y ago· 1 in thread

To be honest this makes me less excited about linear transformers.

If even heavily optimized, they are still (nearly) no better than normal flash attention up to context length 10^4.

And then you haven't even started to account for the degradation in learning.

Maybe if you're doing 100k attention at inference it starts making sense... But then there are other methods you can start using too.

f_devd2y ago

deepsquirrelnet2y ago

Great writeup and interesting experiments. I can’t help but wonder what would happen in you instead made a rectified linear attention. Is that even possible?

j / k navigate · click thread line to collapse