I understand how transformers work, but my mental model is that a transformer is the processor and the LLM is an application that runs on it. After all, transformers can be trained to do lots of things, and what one learns when trained with a "predict next word" LLM objective is going to differ from what it learns (and hence how it operates?) in a different setting.
There have been various LLM interpretability papers analyzing aspects of them, such as the discovery of pairs of attention heads in consecutive layers acting as "search and copy" induction heads, and the analysis of the linear layers as key-value stores. That perhaps leads to another weak abstraction: the linear layers store knowledge and perhaps the reasoning "program", while the attention layers are the mechanism being programmed to do the data tagging/shuffling?
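To make the "search and copy" idea concrete, here's a toy sketch in plain Python. It is not how a real transformer computes anything (no attention weights, no residual stream), and `induction_predict` is a made-up name for illustration. It only mimics the behaviour usually attributed to an induction head: look back for the previous occurrence of the current token and copy whatever followed it.

```python
def induction_predict(tokens):
    """Toy 'search and copy': find the most recent earlier occurrence of the
    last token and return the token that followed it (None if no match)."""
    if not tokens:
        return None
    current = tokens[-1]
    # Search backwards over earlier positions, excluding the last one.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            # "Copy" step: predict the token that came right after the match.
            return tokens[i + 1]
    return None


if __name__ == "__main__":
    # Having seen "A B" earlier, a repeated "A" suggests "B" comes next.
    print(induction_predict(["A", "B", "C", "A"]))
```

In the real thing the "search" and the "copy" are reportedly split across two heads in consecutive layers, and the result is added into the residual rather than returned as a single token.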
No doubt there's a lot more to be discovered about how these LLMs are operating - perhaps a wider variety of primitives built out of attention heads, other than just induction heads? It seems a bit early to be building a high-level model of the primitives these LLMs have learnt, and I'm not sure attempting a crude transformer-level model really works, given how the residual context is additive - it's not just tokens being moved around.
It's not clear to anybody exactly what kind of program structure LLMs have internally. Figuring that out is a major goal for the field of mechanistic interpretability.
Maybe there are useful abstractions for analyzing them, but LLMs are just another deep learning model.
https://news.ycombinator.com/item?id=36332033
It showed that attention with positional encodings and arbitrary-precision rational activation functions is Turing complete.
With finite precision, a non-rational activation function, and/or no positional encodings, it is not Turing complete.
Plus Turing completeness does not tell you anything about practical computation in reasonable time or space constraints.
printf() format strings are TC, and while interesting, probably won't help you solve real problems.
https://blog.wtf.sg/posts/2023-02-03-the-new-xor-problem/ To be completely fair, the Transformer architecture does not map neatly into being analysed like automata and categorised in the Chomsky Hierarchy. "Neural Networks and the Chomsky Hierarchy" trains different architectures on formal languages curated from different levels of the Chomsky hierarchy.
There is an interpreter for a RASP-like language if you want to try it out: https://srush.github.io/raspy/
And DeepMind published a compiler from RASP to Transformer weights: https://github.com/deepmind/tracr
And Sasha's blog (your link) has a nice walkthrough of long addition with RASP!
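For anyone who hasn't looked at RASP, here's a rough plain-Python sketch of its two core ideas as I understand them: a "select" that builds a boolean attention pattern from keys and queries, and an "aggregate" that averages the selected values. To be clear, this is NOT the raspy or tracr API. The `select` and `aggregate` functions below are hypothetical stand-ins written just for illustration, and the `default` fill value is my own assumption.

```python
def select(keys, queries, predicate):
    """Build a boolean attention pattern: row q, column k is True when
    query position q is allowed to attend to key position k."""
    return [[predicate(k, q) for k in keys] for q in queries]


def aggregate(selector, values, default=0):
    """For each query position, average the values at the selected key
    positions (or use `default` when nothing is selected)."""
    out = []
    for row in selector:
        picked = [v for v, on in zip(values, row) if on]
        out.append(sum(picked) / len(picked) if picked else default)
    return out


if __name__ == "__main__":
    values = [3, 1, 4, 1]
    indices = list(range(len(values)))
    # Each position attends to the position just before it: a shift right.
    prev = select(indices, indices, lambda k, q: k == q - 1)
    print(aggregate(prev, values))
```

Swapping the predicate for `k <= q` gives a causal running mean instead of a shift, which is the sort of building block the long addition walkthrough composes.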
--posting for a friend