Thinking Like Transformers (2021) [pdf]

(arxiv.org)

112 points · jbay808 · 3y ago · 20 comments

HarHarVeryFunny 3y ago

Maybe it's clear to others, but it's certainly not to me, how exactly transformers - or rather transformer-based LLMs - are operating.

I understand how transformers work, but my mental model is that a transformer is the processor and the LLM is an application that runs on it. After all, transformers can be trained to do lots of things, and what one learns when trained with a "predict next word" LLM objective is going to differ from what it learns (and hence how it operates?) in a different setting.

There have been various LLM interpretability papers analyzing aspects of them, such as the discovery of pairs of attention heads in consecutive layers acting as "search and copy" "induction heads", and analysis of the linear layers as key-value stores. That perhaps leads to another weak abstraction: the linear layers store knowledge, and perhaps the reasoning "program", while the "attention" layers are the mechanism being programmed to do the data tagging/shuffling?

No doubt there's a lot more to be discovered about how these LLMs are operating - perhaps a wider variety of primitives built out of attention heads than just induction heads? It seems a bit early to be building a high-level model of the primitives these LLMs have learnt, and I'm not sure whether attempting a crude transformer-level model really works, given that the residual context is additive - it's not just tokens being moved around.
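For anyone who wants something concrete, the "search and copy" behaviour attributed to induction heads can be sketched in plain Python. This is only a toy illustration of the idea, not a description of how a trained model implements it; the function name `induction_copy` is made up for this example.

```python
def induction_copy(tokens):
    """Toy 'search and copy': find the most recent earlier occurrence of the
    last token and return the token that followed it (None if no match)."""
    if not tokens:
        return None
    current = tokens[-1]
    # Search backwards over the earlier positions (the last one is excluded).
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            # Copy whatever came right after the earlier match.
            return tokens[i + 1]
    return None


# The earlier "Mr" was followed by "Dursley", so that is what gets copied.
print(induction_copy(["Mr", "Dursley", "said", "Mr"]))  # Dursley
```

A real model does this softly, with attention weights rather than an explicit loop, and writes the result into the additive residual stream instead of returning a token.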

Juicyy 3y ago

I saved this HN thread from a couple months ago that had a lot of great resources.

https://news.ycombinator.com/item?id=35697627

Legend2440 3y ago

Transformers - and all other ML models - are ways to represent computer programs. You can think of them as a programming language designed to be easy for optimization instead of for human understanding.

It's not clear to anybody exactly what kind of program structure LLMs have internally. Figuring that out is a major goal for the field of mechanistic interpretability.

potatoman22 3y ago

I think your mental model could be making LLMs seem more confusing than they are. LLMs are stacks of transformers and generative LLMs typically have another model that samples the transformer output.

Maybe there are useful abstractions for analyzing them, but LLMs are just another deep learning model.
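The sampling step mentioned above can be sketched as follows. This is a minimal sketch assuming the network emits one score (logit) per vocabulary entry; the vocabulary, the logits and the function name `sample_next` are all invented for illustration.

```python
import math
import random


def sample_next(logits, temperature=1.0, rng=random):
    """Draw an index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in scaled]
    # random.choices normalises the weights itself.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


vocab = ["cat", "dog", "fish"]  # made-up vocabulary
logits = [2.0, 1.0, -1.0]       # made-up scores
rng = random.Random(0)          # seeded so runs are repeatable
print(vocab[sample_next(logits, temperature=0.7, rng=rng)])
```

Lower temperatures concentrate the draw on the highest-scoring token; higher ones flatten the distribution.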

HarHarVeryFunny 3y ago

The "attention" mechanism (a bit of a misnomer really) is what makes transformers more complex than many other neural nets - data isn't simply flowing through the model from layer to layer, but rather it is being copied and moved around by the attention heads. The "next word" it is generating doesn't even have to be a word it has ever seen before - it may be copying it from the prompt.
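A rough picture of what "copied and moved around" means: each attention head outputs a weighted mix of value vectors from other positions, and when one key matches the query much better than the rest, the mix is close to a straight copy. Below is a plain-Python sketch of a single dot-product attention step for one query; the function name `attend` and all the numbers are invented for this example.

```python
import math


def attend(query, keys, values):
    """Single-head dot-product attention for one query, in plain Python."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    # The output is a weighted mix of the value vectors.
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, weights


keys = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]                  # made-up keys
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]     # one-hot "tokens"
out, weights = attend([0.0, 10.0], keys, values)
# Position 1 matches the query, so it receives almost all of the weight
# and the output is very nearly a copy of values[1].
print([round(w, 3) for w in weights])
print([round(x, 3) for x in out])
```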

inciampati 3y ago

This has been built on extensively over the past two years. For instance: "Tighter Bounds on the Expressivity of Transformer Encoders" (https://arxiv.org/abs/2301.10743). I find it interesting that transformers are equivalent to first-order logic on circuits with counters. Amazing what you can do even if you're not Turing complete!

tambourine_man 3y ago

Transformers are Turing complete, right?

nyrikki 3y ago

The paper from yesterday:

https://news.ycombinator.com/item?id=36332033

Showed that attention with positional encodings and arbitrary precision rational activation functions is Turing complete.

With a finite-precision, nonrational activation function and/or without positional encodings, it is not Turing complete.

Plus Turing completeness does not tell you anything about practical computation in reasonable time or space constraints.

printf() format strings are TC, and while interesting, probably won't help you solve real problems.

canjobear 3y ago

No, they are actually very limited formally. For example you can't model a language of nested brackets to arbitrary depth (as you can with an RNN). That makes it all the more interesting that they are so successful.
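For reference, the "nested brackets" language (usually called Dyck-1 when there is a single bracket type) is easy to recognise with one unbounded counter, which is what makes the limitation claimed above notable. A minimal sketch; `is_balanced` is a name chosen for this example.

```python
def is_balanced(s):
    """Return True if s is a well-nested string over '(' and ')'."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A closing bracket with nothing open.
                return False
        else:
            # Any other character is outside the language.
            return False
    # Every opened bracket must have been closed.
    return depth == 0


print(is_balanced("(()(()))"))  # True
print(is_balanced("(()"))       # False
```

The counter can grow without bound as the nesting gets deeper; the question in this subthread is whether a fixed-size model can track that.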

sp332 3y ago

Being technically maybe Turing complete doesn't mean we know how to program it usefully.

https://blog.wtf.sg/posts/2023-02-03-the-new-xor-problem/

To be completely fair, the Transformer architecture does not map neatly into being analysed like automata and categorised in the Chomsky Hierarchy. "Neural Networks and the Chomsky Hierarchy" trains different architectures on formal languages curated from different levels of the Chomsky hierarchy.

rolisz 3y ago

To quote someone: RASP is like Matlab, designed by Satan.

There is an interpreter for a RASP-like language if you want to try it out: https://srush.github.io/raspy/

And DeepMind published a compiler from RASP to Transformer weights: https://github.com/deepmind/tracr
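To give a flavour of the style, here is a plain-Python imitation of the two core operations the paper describes: select (build a boolean attention pattern) and aggregate (average the selected values). It is not the raspy or tech-srl/RASP API; the function signatures and the key/query argument order are assumptions made for this sketch, it handles numeric values only, and it reads the sequence length directly instead of computing it.

```python
def select(keys, queries, predicate):
    """Return sel[q][k] = predicate(keys[k], queries[q])."""
    return [[predicate(k, q) for k in keys] for q in queries]


def aggregate(selector, values, default=0):
    """For each query position, average the values at the selected positions."""
    out = []
    for row in selector:
        picked = [v for chosen, v in zip(row, values) if chosen]
        out.append(sum(picked) / len(picked) if picked else default)
    return out


def reverse(values):
    """Reverse a numeric sequence using only select and aggregate."""
    n = len(values)
    indices = list(range(n))
    flipped = [n - 1 - i for i in indices]
    # Position i attends to exactly one position: n - 1 - i.
    sel = select(indices, flipped, lambda k, q: k == q)
    return aggregate(sel, values)


print(reverse([3, 1, 4, 1, 5]))  # [5.0, 1.0, 4.0, 1.0, 3.0]
```

The results come back as floats because aggregate is an average, even when only one position is selected.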

srush 3y ago

There is also a more interactive version if you want to challenge yourself: a Python notebook of interactive puzzles to build an adder with transformers.

https://github.com/srush/Transformer-Puzzles

gailw 3y ago

There's also an interpreter for RASP as described in the paper :) https://github.com/tech-srl/RASP

And Sasha's blog (your link) has a nice walkthrough of long addition with RASP!

mistrial9 3y ago

Is it possible to describe important science (or business) elements without using direct religious terms for other purposes? "angels", "bible", "satan", etc.?

--posting for a friend

ljlolel 3y ago

This is cool, but I think a more fundamental primitive is the probability distribution over next tokens and how that changes depending on each layer's computation.
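One way to make this concrete is to turn each layer's intermediate scores into a next-token distribution and measure how far it moves from one layer to the next. The sketch below does that with softmax and KL divergence. The per-layer logits are invented numbers, not taken from any model, and reading a distribution out of intermediate layers is itself an assumption of this sketch.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q is non-zero wherever p is non-zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Made-up logits over a three-token vocabulary, one row per layer.
layer_logits = [
    [0.1, 0.0, -0.1],
    [1.0, 0.2, -0.5],
    [3.0, 0.1, -1.0],
]
dists = [softmax(row) for row in layer_logits]
for i in range(1, len(dists)):
    shift = kl_divergence(dists[i], dists[i - 1])
    print(f"layer {i - 1} -> {i}: KL = {shift:.3f}")
```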
