Only superficially.
You should give the Fast Weight Programmers paper a chance, and a thorough reading. It sounds like you already appreciate a fair bit of its main point.
The best part about the FWP paper is the derivation of an FWP equation from the transformer equation. It's remarkably straightforward: you remove the softmax operation (i.e. you linearize attention), and the rest is just algebraic manipulation -- a formal proof of equivalence.
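To see the equivalence numerically, here's a toy sketch (my own variable names and shapes, not taken from the paper): softmax-free causal attention produces exactly the same outputs as a fast-weight memory that is "programmed" with the outer product of each value and key, then queried.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 4, 6
K = rng.normal(size=(T, d))  # keys
V = rng.normal(size=(T, d))  # values
Q = rng.normal(size=(T, d))  # queries

# Linearized (softmax-free) causal attention:
#   out_t = sum_{i<=t} (q_t . k_i) v_i
attn_out = np.zeros((T, d))
for t in range(T):
    scores = K[:t + 1] @ Q[t]        # unnormalized dot-product scores
    attn_out[t] = scores @ V[:t + 1]

# Fast Weight Programmer view of the same computation:
#   W_t = W_{t-1} + v_t k_t^T   (program the fast weights)
#   out_t = W_t q_t             (query the associative memory)
W = np.zeros((d, d))
fwp_out = np.zeros((T, d))
for t in range(T):
    W += np.outer(V[t], K[t])
    fwp_out[t] = W @ Q[t]

assert np.allclose(attn_out, fwp_out)
```

The outer-product update is what makes the "memory" content-addressable: each key-value pair is superimposed into one weight matrix, and a query retrieves values by key similarity.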
Transformers are just NNs that learn to control a Content-Addressable Memory (CAM).
This perspective has far-reaching implications for ML, sort of like category theory did for metamathematics and type theory. For example, an LSTM cell can be viewed as an NN that learns to control a flip-flop (the "deluxe" kind found on FPGAs, with output-enable, clock-enable, and reset inputs). I've found that this is by far the easiest way to explain LSTMs to people. It also raises the obvious question of what other kinds of simple blocks can be controlled by NNs. I think that question will lead to another wave of breakthroughs.
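To make the flip-flop analogy concrete, here's a toy sketch (names, shapes, and the single-layer gate parameterization are my own simplifications): the cell state is dumb storage, and the learned network does nothing but drive its control lines each step.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GatedRegister:
    """A 'deluxe flip-flop': plain storage with soft control lines."""
    def __init__(self, n):
        self.c = np.zeros(n)  # stored state (the flip-flop contents)

    def step(self, data, write_en, keep_en, out_en):
        # keep_en ~ (inverse of) reset, write_en ~ clock/write enable,
        # out_en ~ output enable -- all softened to [0, 1] so the
        # controller can be trained by gradient descent.
        self.c = keep_en * self.c + write_en * data
        return out_en * np.tanh(self.c)

def lstm_step(reg, x, h, Wf, Wi, Wo, Wg):
    """One LSTM step = a small NN computing the control signals."""
    z = np.concatenate([x, h])
    f = sigmoid(Wf @ z)   # forget gate  -> keep_en
    i = sigmoid(Wi @ z)   # input gate   -> write_en
    o = sigmoid(Wo @ z)   # output gate  -> out_en
    g = np.tanh(Wg @ z)   # candidate data to store
    return reg.step(g, i, f, o)

# Usage: drive the register with random inputs and weights.
rng = np.random.default_rng(0)
nx, nh = 3, 2
reg = GatedRegister(nh)
Wf, Wi, Wo, Wg = (rng.normal(size=(nh, nx + nh)) for _ in range(4))
h = np.zeros(nh)
for _ in range(5):
    h = lstm_step(reg, rng.normal(size=nx), h, Wf, Wi, Wo, Wg)
```

(I've left out biases and peephole connections for brevity; the point is only the separation between the storage element and the learned controller.)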
The ultimate limit of this approach is the Gödel Machine -- although no attempt to build one has come anywhere close to success yet.