Visualizing Attention, a Transformer's Heart [video]

(3blue1brown.com)

999 points · rohitpaulk · 2y ago · 172 comments

91 comments · 23 top-level

seydor2y ago· 17 in thread

I have found the youtube videos by CodeEmporium to be simpler to follow https://www.youtube.com/watch?v=Nw_PJdmydZY

Transformer is hard to describe with analogies, and TBF there is no good explanation why it works, so it may be better to just present the mechanism, "leaving the interpretation to the viewer". Also, it's simpler to describe dot products as vectors projecting on one another

mjburgess2y ago

The explanation is just that NNs are a stat fitting alg learning a conditional probability distribution, P(next_word|previous_words). Their weights are a model of this distribution. LLMs are a hardware innovation: they make it possible for GPUs to compute this at scale across TBs of data.

Why does, 'mat' follow from 'the cat sat on the ...' because 'mat' is the most frequent word in the dataset; and the NN is a model of those frequencies.

Why is 'London in UK' "known" but 'London in France' isn't? Just because 'UK' occurs much more frequently in the dataset.

The algorithm isn't doing anything other than aligning computation to hardware; the computation isn't doing anything interesting. The value comes from the conditional probability structure in the data, and that comes from people arranging words usefully, because they're communicating information with one another.

nerdponx2y ago

I think you're downplaying the importance of the attention/transformer architecture here. If it was "just" a matter of throwing compute at probabilities, then we wouldn't need any special architecture at all.

P(next_word|previous_words) is ridiculously hard to estimate in a way that is actually useful. Remember how bad text generation used to be before GPT? There is innovation in discovering an architecture that makes it possible to learn P(next_word|previous_words), in addition to the computing techniques and hardware improvements required to make it work.

IanCal2y ago

This is wrong, or at least a simplification to the point of removing any value.

> NNs are a stat fitting alg learning a conditional probability distribution, P(next_word|previous_words).

They are trained to maximise this, yes.

> Their weights are a model of this distribution.

That doesn't really follow, but let's leave that.

> Why does, 'mat' follow from 'the cat sat on the ...' because 'mat' is the most frequent word in the dataset; and the NN is a model of those frequencies.

Here's the rub. If how you describe them is all they're doing then a sequence of never-before-seen words would have no valid response. All words would be equally likely. It would mean that a single brand new word would result in absolute gibberish following it as there's nothing to go on.

Let's try:

Input: I have one kjsdhlisrnj and I add another kjsdhlisrnj, tell me how many kjsdhlisrnj I now have.

Result: You now have two kjsdhlisrnj.

I would wager a solid amount that kjsdhlisrnj never appears in the input data. If it does pick another one, it doesn't matter.

So we are learning something more general than the frequencies of sequences of tokens.

I always end up pointing to this but OthelloGPT is very interesting https://thegradient.pub/othello/

While it's trained on sequences of moves, what it does is more than just "sequence a,b,c is followed by d most often"
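
To make the contrast concrete, here is a minimal sketch of a model that really is nothing but a table of frequencies (a bigram counter). The toy corpus and the function names are made up for illustration; this is not how any LLM is implemented. It shows the failure mode described above: a never-before-seen word leaves such a model with nothing to go on.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_distribution(counts, prev):
    """Return P(next | prev) as a dict, or {} if prev never occurred."""
    if prev not in counts:
        return {}
    total = sum(counts[prev].values())
    return {word: c / total for word, c in counts[prev].items()}

# Hypothetical toy corpus, for illustration only.
corpus = ["the cat sat on the mat", "the dog sat on the mat"]
counts = train_bigram(corpus)

seen = next_word_distribution(counts, "the")            # 'mat' is the most frequent follower
unseen = next_word_distribution(counts, "kjsdhlisrnj")  # no data at all for this word
```

A pure frequency table returns an empty distribution for the unseen word, whereas the LLM in the example above still answers sensibly.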

albertzeyer2y ago

You are more speaking about n-gram models here. NNs do far more than that.

Or if you just want to say that NNs are used as a statistical model here: Well, yea, but that doesn't really tell you anything. Everything can be a statistical model.

E.g., you could also say "this is exactly the way the human brain works", but it doesn't really tell you anything how it really works.

michaelt2y ago

That's not really an explanation that tells people all that much, though.

I can explain that car engines 'just' convert gasoline into forward motion. But if the person hearing the explanation is hoping to learn what a cam belt or a gearbox is, or why cars are more reliable now than they were in the 1970s, or what premium gas is for, or whether helicopter engines work on the same principle - they're going to need a more detailed explanation.

forrestthewoods2y ago

I find this take super weak sauce and shallow.

This recent $10,000 challenge is super super interesting imho. https://twitter.com/VictorTaelin/status/1778100581837480178

State of the art models are doing more than “just” predicting the probability of the next symbol.

sirsinsalot2y ago

It isn't some kind of Markov chain situation. Attention cross-links the abstract meaning of words, subtle implications based on context and so on.

So, "mat" follows "the cat sat on the" where we understand the entire worldview of the dataset used for training; not just the next-word probability based on one or more previous words ... it's based on all previous meaning probability, and those meaning probabilities, and so on.

seydor2y ago

People specifically would like to know what the attention calculations add to this learning of the distribution

astrange2y ago

LLMs don't work on words, they work on sequences of subword tokens. "It doesn't actually do anything" is a common explanation that's clearly a form of cope, because you can't even explain why it can form complete words, let alone complete sentences.

fspeech2y ago

There are an infinite number of distributions that can fit the training data well (e.g., one that completely memorizes the data and therefore replicates the frequencies). The trick is to find the distributions that generalize well, and here the NN architecture is critical.

fellendrone2y ago

> Why does, 'mat' follow from 'the cat sat on the ...'

You're confidently incorrect by oversimplifying all LLMs to a base model performing a completion from a trivial context of 5 words.

This is tantamount to a straw man. Not only do few people use untuned base models, it completely ignores in-context learning that allows the model to build complex semantic structures from the relationships learnt from its training data.

Unlike base models, instruct and chat fine-tuning teaches models to 'reason' (or rather, perform semantic calculations in abstract latent spaces) with their "conditional probability structure", as you call it, to varying extents. The model must learn to use its 'facts', understand semantics, and perform abstractions in order to follow arbitrary instructions.

You're also conflating the training metric of "predicting tokens" with the mechanisms required to satisfy this metric for complex instructions. It's like saying "animals are just performing survival of the fittest". While technically correct, complex behaviours evolve to satisfy this 'survival' metric.

You could argue they're "just stitching together phrases", but then you would be varying degrees of wrong:

For one, this assumes phrases are compressed into semantically addressable units, which is already a form of abstraction ripe for allowing reasoning beyond 'stochastic parroting'.

For two, it's well known that the first layers perform basic structural analysis such as grammar, and later layers perform increasing levels of abstract processing.

For three, it shows a lack of understanding in how transformers perform semantic computation in-context from the relationships learnt by the feed-forward layers. If you're genuinely interested in understanding the computation model of transformers and how attention can perform semantic computation, take a look here: https://srush.github.io/raspy/

For a practical example of 'understanding' (to use the term loosely), give an instruct/chat tuned model the text of an article and ask it something like "What questions should this article answer, but doesn't?" This requires not just extracting phrases from a source, but understanding the context of the article on several levels, then reasoning about what the context is not asserting. Even comparatively simple 4x7B MoE models are able to do this effectively.

raindear2y ago

But why do transformers perform better than older language models, including other neural language models?

nextaccountic2y ago

> Why does, 'mat' follow from 'the cat sat on the ...' because 'mat' is the most frequent word in the dataset; and the NN is a model of those frequencies.

What about cases that are not present in the dataset?

The model must be doing something besides storing raw probabilities to avoid overfitting and enable generalization (imagine that you could have a very performant model - when it works - but it sometimes would spew "Invalid input, this was not in the dataset so I don't have a conditional probability and I will bail out")

blt2y ago

As a computer scientist, the "differentiable hash table" interpretation worked for me. The AIAYN paper alludes to it by using the query/key/value names, but doesn't explicitly say the words "hash table". I guess some other paper introduced them?
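
A rough sketch of that "differentiable hash table" reading, using only the standard library. A normal hash table returns exactly one value for an exact key match; the soft version below returns a blend of all values, weighted by how similar the query is to each key. The function name, the `sharpness` parameter, and the toy keys and values are illustrative assumptions, not anything taken from the paper.

```python
import math

def soft_lookup(keys, values, query, sharpness=1.0):
    """Softmax-weighted average of values, weighted by query-key dot products."""
    scores = [sharpness * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)                                # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# Hypothetical toy "table" with two entries.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0], [20.0]]

# A query equally similar to both keys gets an even blend of both values.
blended = soft_lookup(keys, values, [1.0, 1.0])

# A query that matches one key strongly behaves almost like an exact lookup.
sharp = soft_lookup(keys, values, [1.0, 0.0], sharpness=50.0)
```

Because every step is a smooth function of the keys, values and query, gradients can flow through the lookup, which is the "differentiable" part.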

nerdponx2y ago

> TBF there is no good explanation why it works

My mental justification for attention has always been that the output of the transformer is a sequence of new token vectors such that each individual output token vector incorporates contextual information from the surrounding input token vectors. I know it's incomplete, but it's better than nothing at all.

eurekin2y ago

> TBF there is no good explanation why it works

I thought the general consensus was: "transformers allow neural networks to have adaptive weights".

As opposed to the previous architectures, where every edge connecting two neurons always has the same weight.

EDIT: a good video, where it's actually explained better: https://youtu.be/OFS90-FX6pg?t=750&si=A_HrX1P3TEfFvLay

rcarmo2y ago

You're effectively steering the predictions based on adjacent vectors (and precursors from the prompt). That mental model works fine.

bilsbie2y ago· 13 in thread

I finally understand this! Why did every other video make it so confusing!

chrishare2y ago

It is confusing, 3b1b is just that good.

visarga2y ago

At the same time it feels extremely simple

attention(Q, K, V) = softmax(Q K^T / √d_K) @ V

is just half a row; the multi-head, masking and positional stuff are just toppings

we have many basic algorithms in CS that are more involved, it's amazing we get language understanding from such simple math
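
For anyone who wants to see how little code that one line needs, here is a minimal single-head sketch in plain Python. Matrices are lists of row vectors, there is no masking and there are no learned projections; the helper names and the toy inputs are assumptions made for this example.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # One row of Q K^T, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

# Toy inputs: a zero query scores every key equally, so the output is the mean of V.
Q = [[0.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
result = attention(Q, K, V)
```

Multi-head attention runs several copies of this with different learned projections of the inputs and concatenates the results.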

ur-whale2y ago

> Why did every other video make it so confusing!

In my experience, with very few notable exceptions (e.g. Feynman), researchers are the worst when it comes to clearly explaining to others what they're doing.

I'm at the point where I'm starting to believe that pedagogy and research generally are mutually exclusive skills.

namaria2y ago

It's extraordinarily difficult to imagine how it feels not to understand something. Great educators can bridge that gap. I don't think it's correlated with research ability in any way. It's just a very rare skill set, to be able to empathize with people who don't understand what you do.

thomasahle2y ago

I'm someone who would love to get better at making educational videos/content. 3b1b is obviously the gold standard here.

I'm curious what things other videos did worse compared to 3b1b?

bilsbie2y ago

I think he had a good, intuitive understanding that he wanted to communicate and he made it come through.

I like how he was able to avoid going into the weeds and stay focused on leading you to understanding. I remember another video where I got really hung up on positional encoding and I felt like I couldn't continue until I understood that. Or other videos that overfocus on matrix operations or softmax, etc.

thinkingtoilet2y ago

Grant has a gift of explaining complicated things very clearly. There's a good reason his channel is so popular.

Al-Khwarizmi2y ago

Not sure if you mean it as a rhetorical question but I think it's an interesting question. I think there are at least three factors why most people are confused about Transformers:

1. The standard terminology is "meh" at best. The word "attention" itself is just barely intuitive, "self-attention" is worse, and don't get me started about "key" and "value".

2. The key papers (Attention is All You Need, the BERT paper, etc.) are badly written. This is probably an unpopular opinion. But note that I'm not diminishing their merits. It's perfectly possible to write a hugely impactful, transformative paper describing an amazing breakthrough, but just not explain it very well. And that's exactly what happened, IMO.

3. The way in which these architectures were discovered was largely by throwing things at the wall and seeing what stuck. There was no reflection process that ended in a prediction that such an architecture would work well, which was then empirically verified. It's empirical all the way through. This means that we don't have a full understanding of why it works so well; all explanations are post hoc rationalizations (in fact, lately there is some work implying that other architectures may work equally well if tweaked enough). It's hard to explain something that you don't even fully understand.

Everyone who is trying to explain transformers has to overcome these three disadvantages... so most explanations are confusing.

cmplxconjugate2y ago

>This is probably an unpopular opinion.

I wouldn't say so. Historically it's quite common. Maxwell's EM papers used such convoluted notation that they are quite difficult to read. It wasn't until they were reformulated in vector calculus that they became infinitely more digestible.

I think though your third point is the most important; right now people are focused on results.

maleldil2y ago

> This is probably an unpopular opinion

There's a reason The Illustrated Transformer[1] was/is so popular: it made the original paper much more digestible.

[1] https://jalammar.github.io/illustrated-transformer/

Solvency2y ago

Because:

1. good communication requires an intelligence that most people sadly lack

2. because the type of people who are smart enough to invent transformers have zero incentive to make them easily understandable.

most documents are written by authors subconsciously desperate to mentally flex on their peers.

penguin_booze2y ago

Pedagogy requires empathy, to know what it's like to not know something. Good teachers will often draw on experiences the listener is already familiar with, and then bridge the gap. This skill is orthogonal to the mastery of the subject itself, which I think is the reason most descriptions sound confusing, inadequate, and/or incomprehensible.

Often, the disseminating medium is one-sided, like a video or a blog post, which doesn't help, either. A conversational interaction would help the expert sense why someone outside the domain finds the subject confusing ("ah, I see what you mean"...), discuss common pitfalls ("you might think it's like this... but no, it's more like this...") etc.

WithinReason2y ago

2. It's not malice. The longer you have understood something the harder it is to explain it, since you already forgot what it was like to not understand it.

Xcelerate2y ago· 11 in thread

As someone with a background in quantum chemistry and some types of machine learning (but not neural networks so much) it was a bit striking while watching this video to see the parallels between the transformer model and quantum mechanics.

In quantum mechanics, the state of your entire physical system is encoded as a very high dimensional normalized vector (i.e., a ray in a Hilbert space). The evolution of this vector through time is given by the time-translation operator for the system, which can loosely be thought of as a unitary matrix U (i.e., a probability preserving linear transformation) equal to exp(-iHt), where H is the Hamiltonian matrix of the system that captures its “energy dynamics”.

From the video, the author states that the prediction of the next token in the sequence is determined by computing the next context-aware embedding vector from the last context-aware embedding vector alone. Our prediction is therefore the result of a linear state function applied to a high dimensional vector. This seems a lot to me like we have produced a Hamiltonian of our overall system (generated offline via the training data), then we reparameterize our particular subsystem (the context window) to put it into an appropriate basis congruent with the Hamiltonian of the system, then we apply a one step time translation, and finally transform the resulting vector back into its original basis.

IDK, when your background involves research in a certain field, every problem looks like a nail for that particular hammer. Does anyone else see parallels here or is this a bit of a stretch?

francasso2y ago

I don't think the analogy holds: even if you forget all the preceding non linear steps, you are still left with just a linear dynamical system. It's neither complex nor unitary, which are two fundamental characteristics of quantum mechanics.

bdjsiqoocwk2y ago

I think you're just describing a state machine, no? The fact that you encode the state in a vector and steps by matrices is an implementation detail...?

Xcelerate2y ago

Perhaps a probabilistic FSM describes the actual computational process better since we don’t have a concept equivalent to superposition with transformers (I think?), but the framework of a FSM alone doesn’t seem to capture the specifics of where the model/machine comes from (what I’m calling the Hamiltonian), nor how a given context window (the subsystem) relates to it. The change of basis that involves the attention mechanism (to achieve context-awareness) seems to align better with existing concepts in QM.

One might model the human brain as a FSM as well, but I’m not sure I’d call the predictive ability of the brain an implementation detail.

feoren2y ago

Not who you asked (and I don't quite understand everything) but I think that's about right, except in the continuous world. You pick an encoding scheme (either the Lagrangian or the Hamiltonian) to go from state -> vector. You have a "rules" matrix, very roughly similar to a Markov matrix, H, and (stretching the limit of my knowledge here) exp(-iHt) very roughly "translates" from the discrete stepwise world to the continuous world. I'm sure that last part made more knowledgeable people cringe, but it's roughly in the right direction. The part I don't understand at all is the -i factor: exp(-it) just circles back on itself after t=2pi, so it feels like exp(-iHt) should be a periodic function?
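
On the periodicity question, a short worked expansion may help. This assumes units where ħ = 1 and a finite-dimensional Hermitian H with eigenvalues E_k and eigenvectors |k⟩ (standard textbook notation, not something from the video):

```latex
H = \sum_k E_k \, |k\rangle\langle k|
\quad\Longrightarrow\quad
e^{-iHt} = \sum_k e^{-iE_k t} \, |k\rangle\langle k|
```

Each factor e^{-iE_k t} is periodic on its own, with period 2π/|E_k| for E_k ≠ 0. The operator as a whole only returns to the identity when every E_k t is a multiple of 2π at the same time, which requires the nonzero eigenvalues to be rational multiples of one another. So exp(-iHt) is periodic for some H, and only quasi-periodic in general.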

BoGoToTo2y ago

I've been thinking about this a bit lately. If time is non-continuous then could you model the time evolution of the universe as some operator recursively applied to the quantum state of the universe? If each application of the operator progresses the state of the universe by a single Planck time, could we even observe a difference between that and a universe where time is continuous?

tweezy2y ago

So one of the most "out there" non-fiction books I've read recently is called "Alien Information Theory". It's a wild ride and there's a lot of flat-out crazy stuff in it but it's a really engaging read. It's written by a computational neuroscientist who's obsessed with DMT. The DMT parts are pretty wild, but the computational neuroscience stuff is intriguing.

In one part he talks about a thought experiment modeling the universe as a multidimensional cellular automaton, where fundamental particles are nothing more than the information they contain, and particles colliding is a computation that tells that node and the adjacent nodes how to update their state.

Way out there, and I'm not saying there's any truth to it. But it was a really interesting and fun concept to chew on.

BobbyTables22y ago

I think Wolfram made news proposing something roughly along these lines.

Either way, I find Planck time/energy to be a very spooky concept.

https://wolframphysics.org/

pas2y ago

This sounds like the Bohmian pilot wave theory (which is a global formulation of QM). ... Which might not be that crazy, since spooky action at a distance is already a given. And in cosmology (or quantum gravity) some models describe a region of space based only on its surface. So in some sense the universe is much less information-dense than we think.

https://en.m.wikipedia.org/wiki/Holographic_principle

cmgbhm2y ago

Not a direct comment on the question but I had a math PhD as an intern before. One of his comments was that having tons of high-dimensional linear algebra was super advanced in the 1900s, and that it still has plenty of room for new CS discovery.

Didn’t make the “what was going on then in physics” connection until now.

tpurves2y ago

So what you are saying is that, we've reached the point where our own most sophisticated computer models are starting to approach the same algorithms that define the universe we live in? Aka, the simulation is showing again?

lagrange772y ago

I only understand half of it, but it sounds very interesting. I've always wondered, if the principle of stationary action could be of any help with machine learning, e.g. provide an alternative point of view / formulation.

rollinDyno2y ago· 9 in thread

Hold on, every predicted token is only a function of the previous token? I must have something wrong. This would mean that the whole context is captured within the embedding of "was", which is of length 12,288 in this example. Is it really possible that this space is so rich as to have a single point in it encapsulate a whole novel?

jgehring2y ago

That's what happens in the very last layer. But at that point the embedding for "was" got enriched multiple times, i.e., in each attention pass, with information from the whole context (which is the whole novel here). So for the example, it would contain the information to predict, let's say, the first token of the first name of the murderer.

Expanding on that, you could imagine that the intent of the sentence to complete (figuring out the murderer) would have to be captured in the first attention passes so that other layers would then be able to integrate more and more context in order to extract that information from the whole context. Also, it means that the forward passes for previous tokens need to have extracted enough salient high-level information already since you don't re-compute all attention passes for all tokens for each next token to predict.

causal2y ago

> you don't re-compute all attention passes for all tokens for each next token to predict.

You don't? I imagine the attention maps could be pretty different between n and n+1 tokens.

Edit: Or maybe you just meant you don't compute attention Σ(n) times for each new token?

diedyesterday2y ago

> "Is it really possible that this space is so rich as to have a single point in it encapsulate a whole novel?"

Not with this GPT. The context size would not allow keeping attention to the total meaning of more than 2048 tokens (as reflected in the transformed embedding of that context's last token). For a substantial part of a novel, it would require a much larger context size, which then presumably will need a higher dimensional embedding/semantic space.

causal2y ago

I read this comment yesterday and keep thinking about it. That final token really must "comprehend" everything leading up to it, right? In which case longer context lengths are just trying to pack more meaning into that embedding state.

Which means the embedding model must do a lot of the lifting to be able to accurately represent meaning across long contexts so well. Now I want to know more about how those models are derived.

vanjajaja12y ago

at that point what it has is not a representation of the input, it's a representation of what the next output could be. i.e. it's a lossy process and you can't extract what came in the past, only the details relevant to next word prediction

(is my understanding)

rollinDyno2y ago

If the point was the representation of only the next token, and predicted tokens were a function of only the preceding token, then the vector of the new token wouldn’t have the information to produce new tokens that kept telling the novel.

faramarz2y ago

it's not about a single point encapsulating a novel, but how sequences of such embeddings can represent complex ideas when processed by the model's layers.

each prediction is based on a weighted context of all previous tokens, not just the immediately preceding one.

rollinDyno2y ago

That weighted context is the 12,288-dimensional vector, no?

I suppose that when each element in the vector weighs 16 bits, the space is immense and capable of holding a novel in a point.

evolvingstuff2y ago

You are correct, that is an error in an otherwise great video. The k+1 token is not merely a function of the kth vector, but rather all prior vectors (combined using attention). There is nothing "special" about the kth vector.
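
One way to see how the two descriptions fit together is a toy causal self-attention layer. The sketch below is a deliberate simplification and an assumption of this example: it uses Q = K = V = the inputs, with no learned weights, no multiple heads and a single layer. It only illustrates that the vector at the last position, after attention, already depends on every earlier token.

```python
import math

def causal_self_attention(xs):
    """Each position attends to itself and all earlier positions (Q = K = V = xs)."""
    d = len(xs[0])
    outputs = []
    for i, q in enumerate(xs):
        visible = xs[: i + 1]                       # causal mask: no peeking ahead
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in visible]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        outputs.append(
            [sum(e / total * v[j] for e, v in zip(exps, visible)) for j in range(d)]
        )
    return outputs

# Two hypothetical contexts that differ ONLY in their first token.
context_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context_b = [[-1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

last_a = causal_self_attention(context_a)[-1]
last_b = causal_self_attention(context_b)[-1]
# last_a and last_b differ even though the last input token is identical.
```

So the prediction can be read off a single vector at the end, but that vector has been mixed with all prior positions by attention.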

rayval2y ago· 3 in thread

Here's a compelling visualization of the functioning of an LLM when processing a simple request: https://bbycroft.net/llm

This complements the detailed description provided by 3blue1brown

bugthe0ry2y ago

When visualised this way, the scale of GPT-3 is insane. I can't imagine what 4 would look like here.

spi2y ago

IIRC, GPT-4 would actually be a bit _smaller_ to visualize than GPT-3. Details are not public, but from the leaks GPT-4 (at least, some by-now old version of it) was a mixture of experts, with every model having around 110B parameters [1]. So, while the total number of parameters is bigger than GPT-3 (1800B vs. 175B), it is "just" 16 copies of a smaller (110B) parameter model. So if you wanted to visualize it in any meaningful way, the plot wouldn't grow bigger - or it would, if you included all different experts, but they are just copies of the same architecture with different parameters, which is not all that useful for visualization purposes.

[1] https://medium.com/@daniellefranca96/gpt4-all-details-leaked...

lying4fun2y ago

amazing visualisation

shahbazac2y ago· 3 in thread

Is there a reference which describes how the current architecture evolved? Perhaps from a very simple core idea to the famous “all you need” paper?

Otherwise it feels like lots of machinery created out of nowhere. Lots of calculations and very little intuition.

Jeremy Howard made a comment on Twitter that he had seen various versions of this idea come up again and again - implying that this was a natural idea. I would love to see examples of where else this has come up so I can build an intuitive understanding.

HarHarVeryFunny2y ago

Roughly:

1) The initial seq-2-seq approach was using LSTMs - one to encode the input sequence, and one to decode the output sequence. It's amazing that this worked at all - encode a variable length sentence into a fixed size vector, then decode it back into another sequence, usually of different length (e.g. translate from one language to another).

2) There are two weaknesses of this RNN/LSTM approach - the fixed size representation, and the corresponding lack of ability to determine which parts of the input sequence to use when generating specific parts of the output sequence. These deficiencies were addressed by Bahdanau et al in an architecture that combined encoder-decoder RNNs with an attention mechanism ("Bahdanau attention") that looked at each past state of the RNN, not just the final one.

3) RNNs are inefficient to train, so Jakob Uszkoreit was motivated to come up with an approach that better utilized available massively parallel hardware, and noted that language is as much hierarchical as sequential, suggesting a layered architecture where at each layer the tokens of the sub-sequence would be processed in parallel, while retaining a Bahdanau-type attention mechanism where these tokens would attend to each other ("self-attention") to predict the next layer of the hierarchy. Apparently in initial implementation the idea worked, but not better than other contemporary approaches (incl. convolution), but then another team member, Noam Shazeer, took the idea and developed it, coming up with an architecture (which I've never seen described) that worked much better, which was then experimentally ablated to remove unnecessary components, resulting in the original transformer. I'm not sure who came up with the specific key-based form of attention in this final architecture.

4) The original transformer, as described in the "Attention Is All You Need" paper, still had a separate encoder and decoder, copying earlier RNN-based approaches. Some early models kept only one half: Google's BERT used just the encoder, while OpenAI's GPT used just the decoder, which is all a language model needs and is what most LLMs use today. With this decoder-only transformer architecture the input sentence is input into the bottom layer of the transformer, and transformed one step at a time as it passes through each subsequent layer, before emerging at the top. The input sequence has an end-of-sequence token appended to it, which is what gets transformed into the next-token (last token) of the output sequence.

krat0sprakhar2y ago

Thank you for this summary! Very well explained. Any tips on what resources you use to keep updated on this field?

ollin2y ago

karpathy gave a good high-level history of the transformer architecture in this Stanford lecture https://youtu.be/XfpMkf4rD6E?si=MDICNzZ_Mq9uzRo9&t=618

tylerneylon2y ago· 2 in thread

Awesome video. This helps to show how the Q*K matrix multiplication is a bottleneck, because if you have sequence (context window) length S, then you need to store an SxS size matrix (the result of all queries times all keys) in memory.

One great way to improve on this bottleneck is a new-ish idea called Ring Attention. This is a good article explaining it:

https://learnandburn.ai/p/how-to-build-a-10m-token-context

(I edited that article.)

danielhanchen2y ago

Oh, with Flash Attention you never have to construct the (S, S) matrix at all (also in the article). Since it's softmax(Q @ K^T / sqrt(d)) @ V, you can form the final output in tiles.

In Unsloth, memory usage scales linearly (not quadratically) due to Flash Attention (+ you get 2x faster finetuning, 80% less VRAM use + 2x faster inference). Still O(N^2) FLOPs though.
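
The tiling idea can be sketched in plain Python for a single query row. This illustrates the running-max ("online softmax") rescaling trick that makes tiling possible; it is not Flash Attention's actual GPU kernel, and the function names, tile size and toy data are assumptions made for the example.

```python
import math

def attention_row_full(q, K, V):
    """Reference version: materialize every score for one query, then softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum(e * v[j] for e, v in zip(exps, V)) / total for j in range(len(V[0]))]

def attention_row_tiled(q, K, V, tile=2):
    """Same result, but only one tile of scores exists at any time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0                  # softmax denominator so far
    acc = [0.0] * len(V[0])            # unnormalized weighted sum of values so far
    for start in range(0, len(K), tile):
        k_tile = K[start:start + tile]
        v_tile = V[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new reference maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

# Hypothetical toy data: 5 keys/values, processed 2 at a time.
q = [1.0, 0.5]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.5, 0.5]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, -1.0], [0.0, 0.0]]
full = attention_row_full(q, K, V)
tiled = attention_row_tiled(q, K, V, tile=2)
```

The tiled version still touches every query-key pair, so the FLOPs stay quadratic; only the peak memory for scores drops from the full row to one tile.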

On that note, on long contexts, Unsloth's latest release fits 4x longer contexts than HF+FA2 with +1.9% overhead. So 228K context on H100.

rahimnathwani2y ago

He lists Ring Attention and half a dozen other techniques, but they're not within the scope of this video: https://youtu.be/eMlx5fFNoYc?t=784

abotsis2y ago· 2 in thread

I think what made this so digestible for me were the animations. The timing, how they expand/contract and unfold while he’s speaking.. is all very well done.

_delirium2y ago

That is definitely one of the things he does better than most. He actually wrote his own custom animation library for math animations: https://github.com/3b1b/manim

divan2y ago

Also check out community edition: https://www.manim.community

spacecadet2y ago· 2 in thread

Fun video. Much of my "art" lately has been dissecting models, injecting or altering attention, and creating animated visualizations of their inner workings. Some really fun shit.

j_bum2y ago

Link? Sounds fun and reminds me of this tweet [0]

[0] https://x.com/jaschasd/status/1756930242965606582

spacecadet2y ago

Nah someone down voted it. And yes, it looks like that + 20 others that are animated.

YossarianFrPrez2y ago· 1 in thread

This video (with a slightly different title on YouTube) helped me realize that the attention mechanism isn't exactly a specific function so much as it is a meta-function. If I understand it correctly, Attention + learned weights effectively enables a Transformer to learn a semi-arbitrary function, one which involves a matching mechanism (i.e., the scaled dot-product.)

hackinthebochs2y ago

Indeed. The power of attention is that it searches the space of functions and surfaces the best function given the constraints. This is why I think linear attention will never come close to the ability of standard attention: the quadratic term is a necessary feature of searching over all pairs of inputs and outputs.

nostrebored2y ago· 1 in thread

Working in a closely related space and this instantly became part of my team's onboarding docs.

Worth noting that a lot of the visualization code is available on GitHub.

https://github.com/3b1b/videos/tree/master/_2024/transformer...

sthatipamala2y ago

Sounds interesting; what else is part of those onboarding docs?

jiggawatts2y ago· 1 in thread

It always blows my mind that Grant Sanderson can explain complex topics in such a clear, understandable way.

I've seen several tutorials, visualisations, and blogs explaining Transformers, but I didn't fully understand them until this video.

chrishare2y ago

His content and impact are phenomenal

1 more reply

mastazi2y ago· 1 in thread

That example with the "was" token at the end of a murder novel (at 3:58 - 4:28 in the video) is genius: really easy for a non-technical person to understand.

hamburga2y ago

I think Ilya gets credit for that example — I’ve heard him use it in his interview with Jensen Huang.

justanotherjoe2y ago· 1 in thread

It seems he brushes over the positional encoding, which for me was the most puzzling part of transformers. The way I understood it, positional encoding is much like dates. Just like dates, there are repeating minutes, hours, days, months, etc. Each of these values has a shorter 'wavelength' than the next. The values are then used to identify the position of each token. Like, 'oh, I'm seeing January 5th tokens. I'm January 4th. This means this is after me'. Of course the real positional encoding is much smoother and doesn't have abrupt ends like dates/times, but I think this was the original motivation for positional encodings.
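The clock analogy can be sketched with the fixed sinusoidal scheme from the original Transformer paper, where each pair of dimensions is a sine/cosine "hand" with a longer wavelength than the previous pair. This is only an illustrative sketch: the function name and the `base` parameter name are assumptions, `d_model` is assumed to be even, and many later models learn their position embeddings or use other schemes instead.

```python
import numpy as np

def sinusoidal_positions(num_positions, d_model, base=10000.0):
    """Return a (num_positions, d_model) table of sinusoidal encodings."""
    positions = np.arange(num_positions)[:, None]   # (P, 1)
    pair_index = np.arange(0, d_model, 2)[None, :]  # (1, d_model / 2)
    # Each successive pair of dimensions rotates more slowly than the last,
    # like seconds, minutes, and hours on a clock.
    angles = positions / base ** (pair_index / d_model)
    enc = np.zeros((num_positions, d_model))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc
```

Each row of the table is added to the token embedding at that position, so the model receives position as a smooth signal rather than a raw integer.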

nerdponx2y ago

That's one way to think about it.

It's a clever way to encode "position in sequence" as some kind of smooth signal that can be added to each input vector. You might appreciate this detailed explanation: https://towardsdatascience.com/master-positional-encoding-pa...

Incidentally, you can encode dates (e.g. day of week) in a model as sin(day of week) and cos(day of week) to ensure that "day 7" is mathematically adjacent to "day 1".
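A small sketch of that day-of-week trick, with one assumption spelled out: the day number is first scaled to an angle on a circle (2π · day / 7) before taking the sine and cosine, otherwise day 7 and day 1 would not land next to each other. The function names are hypothetical.

```python
import math

def encode_day_of_week(day, period=7):
    """Map day 1..7 to a point on the unit circle."""
    angle = 2 * math.pi * (day - 1) / period
    return (math.sin(angle), math.cos(angle))

def distance(a, b):
    """Euclidean distance between two encoded days."""
    return math.dist(a, b)

# Day 7 is as close to day 1 as day 1 is to day 2.
wrap_gap = distance(encode_day_of_week(7), encode_day_of_week(1))
adjacent_gap = distance(encode_day_of_week(1), encode_day_of_week(2))
midweek_gap = distance(encode_day_of_week(1), encode_day_of_week(4))
```

With a raw integer feature, day 7 and day 1 would be six units apart; on the circle they are neighbours.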

1 more reply

thomasahle2y ago· 1 in thread

I like the way he uses a low-rank decomposition of the Value matrix instead of Value+Output matrices. Much more intuitive!

imjonse2y ago

It is the first time I hear about the Value matrix being low rank, so for me this was the confusing part. Codebases I have seen also have value + output matrices, so it is clearer that Q, K, V are similar sizes and there's a separate projection matrix that adapts to the dimensions of the next network layer. UPDATE: He mentions this in the last sections of the video.
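The two descriptions can be reconciled with a short NumPy check: multiplying a per-head value matrix by the matching slice of the output matrix gives one square matrix whose rank is at most the head dimension. The sizes and names below are made up for illustration and are far smaller than in any real model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 16, 4  # toy sizes, assumed for this sketch

W_value = rng.normal(size=(d_head, d_model))   # "value down": d_model -> d_head
W_output = rng.normal(size=(d_model, d_head))  # "value up" / output slice: d_head -> d_model

# The composition is a (d_model, d_model) map, but its rank is capped by d_head.
W_combined = W_output @ W_value
rank = np.linalg.matrix_rank(W_combined)

x = rng.normal(size=d_model)
two_step = W_output @ (W_value @ x)  # how codebases usually compute it
one_step = W_combined @ x            # the single low-rank "value map" view
```

In multi-head code the per-head output slices are typically concatenated into one output matrix, which is why the low-rank factorization is not visible as a separate object there.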

promiseofbeans2y ago

His previous post 'But what is a GPT?' is also really good: https://www.3blue1brown.com/lessons/gpt

namelosw2y ago

You might also want to check out other 3b1b videos on neural networks since there are sort of progressions between each video https://www.3blue1brown.com/topics/neural-networks

bjornsing2y ago

This was the best explanation I’ve seen. I think it comes down to essentially two aspects: 1) he doesn’t try to hide complexity and 2) he explains what he thinks is the purpose of each computation. This really reduces the room for ambiguity that ruins so many other attempts to explain transformers.

stillsut2y ago

In training we learn a.) the embeddings and b.) the KQ/MLP-weights.

How well do Transformers perform given learned embeddings but only randomly initialized decoder weights? Do they produce word soup of related concepts? Anything syntactically coherent?

Once a well-trained high-dimensional representation of tokens is established, can they learn KQ/MLP weights significantly faster?

kordlessagain2y ago

What I'm now wondering about is how intuition to connect completely separate ideas works in humans. I will have very strong intuition something is true, but very little way to show it directly. Of course my feedback on that may be biased, but it does seem some people have "better" intuition than others.

cs7022y ago

Fantastic work by Grant Sanderson, as usual.

Attention has won.[a]

It deserves to be more widely understood.

---

[a] Nothing has outperformed attention so far, not even Mamba: https://arxiv.org/abs/2402.01032

mehulashah2y ago

This is one of the best explanations that I’ve seen on the topic. I wish there was more work, however, not on how Transformers work, but why they work. We are still figuring it out, but I feel that the exploration is not at all systematic.

kjhenner2y ago

The first time I really dug into transformers (back in the BERT days) I was working on a MS thesis involving link prediction in a graph of citations among academic documents. So I had graphs on the brain.

I have a spatial intuition for transformers as a sort of analog to a message passing network over a "leaky graph" in an embedding space. If each token is a node, its key vector sets the position of an outlet pipe that it spews value to diffuse out into the embedding space, while the query vector sets the position of an input pipe that sucks up value other tokens have pumped out into the same space. Then we repeat over multiple attention layers, meaning we have these higher order semantic flows through the space.

Seems to make a lot of sense to me, but I don't think I've seen this analogy anywhere else. I'm curious if anybody else thinks of transformers in this way. (Or wants to explain how wrong/insane I am?)


seydor2y ago· 17 in thread

I have found the youtube videos by CodeEmporium to be simpler to follow https://www.youtube.com/watch?v=Nw_PJdmydZY

mjburgess2y ago

Why does, 'mat' follow from 'the cat sat on the ...' because 'mat' is the most frequent word in the dataset; and the NN is a model of those frequencies.

Why is 'London in UK' "known" but 'London in France' isnt? Just because 'UK' much more frequently occurs in the dataset.

nerdponx2y ago

2 more replies

IanCal2y ago

This is wrong, or at least a simplification to the point of removing any value.

> NNs are a stat fitting alg learning a conditional probability distribution, P(next_word|previous_words).

They are trained to maximise this, yes.

> Their weights are a model of this distribution.

That doesn't really follow, but let's leave that.

> Why does, 'mat' follow from 'the cat sat on the ...' because 'mat' is the most frequent word in the dataset; and the NN is a model of those frequencies.

Let's try:

Input: I have one kjsdhlisrnj and I add another kjsdhlisrnj, tell me how many kjsdhlisrnj I now have.

Result: You now have two kjsdhlisrnj.

I would wager a solid amount that kjsdhlisrnj never appears in the input data. If it does, pick another one; it doesn't matter.

So we are learning something more general than the frequencies of sequences of tokens.

I always end up pointing to this but OthelloGPT is very interesting https://thegradient.pub/othello/

While it's trained on sequences of moves, what it does is more than just "sequence a,b,c is followed by d most often"

2 more replies

albertzeyer2y ago

You are speaking more about n-gram models here. NNs do far more than that.

Or if you just want to say that NNs are used as a statistical model here: Well, yeah, but that doesn't really tell you anything. Everything can be a statistical model.

E.g., you could also say "this is exactly the way the human brain works", but it doesn't really tell you anything about how it really works.
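For contrast, here is what a pure frequency-table model looks like: a toy bigram counter that predicts the most frequent follower of the previous token. The corpus and function names are invented for this sketch. Such a model has nothing to say about a token it has never seen, which is the gap the made-up-word example earlier in the thread points at.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each previous token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, prev):
    """Return the most frequent follower, or None for an unseen token."""
    if prev not in counts:
        return None
    return counts[prev].most_common(1)[0][0]

corpus = (
    "the cat sat on the mat . the cat sat on the mat . the dog sat on the rug ."
).split()
model = train_bigram(corpus)

seen = predict_next(model, "sat")            # a follower that was counted
unseen = predict_next(model, "kjsdhlisrnj")  # nothing was ever counted
```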

2 more replies

michaelt2y ago

That's not really an explanation that tells people all that much, though.

1 more reply

forrestthewoods2y ago

I find this take super weak sauce and shallow.

This recent $10,000 challenge is super super interesting imho. https://twitter.com/VictorTaelin/status/1778100581837480178

State of the art models are doing more than “just” predicting the probability of the next symbol.

1 more reply

sirsinsalot2y ago

It isn't some kind of Markov chain situation. Attention cross-links the abstract meaning of words, subtle implications based on context and so on.

seydor2y ago

People specifically would like to know what the attention calculations add to this learning of the distribution

1 more reply

astrange2y ago

fspeech2y ago

fellendrone2y ago

> Why does, 'mat' follow from 'the cat sat on the ...'

You're confidently incorrect by oversimplifying all LLMs to a base model performing a completion from a trivial context of 5 words.

You could argue they're "just stitching together phrases", but then you would be varying degrees of wrong:

For one, this assumes phrases are compressed into semantically addressable units, which is already a form of abstraction ripe for allowing reasoning beyond 'stochastic parroting'.

For two, it's well known that the first layers perform basic structural analysis such as grammar, and later layers perform increasing levels of abstract processing.

raindear2y ago

But why do transformers perform better than older language models, including other neural language models?

nextaccountic2y ago

> Why does, 'mat' follow from 'the cat sat on the ...' because 'mat' is the most frequent word in the dataset; and the NN is a model of those frequencies.

What about cases that are not present in the dataset?

blt2y ago

nerdponx2y ago

> TBF there is no good explanation why it works

eurekin2y ago

> TBF there is no good explanation why it works

I thought the general consensus was: "transformers allow neural networks to have adaptive weights".

As opposed to the previous architectures, where every edge connecting two neurons always has the same weight.

EDIT: a good video, where it's actually explained better: https://youtu.be/OFS90-FX6pg?t=750&si=A_HrX1P3TEfFvLay

rcarmo2y ago

You're effectively steering the predictions based on adjacent vectors (and precursors from the prompt). That mental model works fine.

bilsbie2y ago· 13 in thread

I finally understand this! Why did every other video make it so confusing!

chrishare2y ago

It is confusing, 3b1b is just that good.

visarga2y ago

At the same time it feels extremely simple

attention(Q, K, V) = softmax(Q @ K^T / sqrt(d_K)) @ V

is just half a row; the multi-head, masking and positional stuff just toppings

we have many basic algorithms in CS that are more involved, it's amazing we get language understanding from such simple math
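The one-line formula translates almost directly into NumPy. This is a minimal single-head sketch with an optional causal mask as one of the "toppings"; the function names and the `causal` flag are choices made here, and batching, multiple heads, and learned projections are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, causal=False):
    """softmax(Q @ K^T / sqrt(d_K)) @ V for a single head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if causal:
        # Block each position from attending to later positions.
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ V
```

With the mask on, the first position can only attend to itself, so its output is exactly its own value vector.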

2 more replies

ur-whale2y ago

> Why did every other video make it so confusing!

In my experience, with very few notable exceptions (e.g. Feynman), researchers are the worst when it comes to clearly explaining to others what they're doing.

I'm at the point where I'm starting to believe that pedagogy and research generally are mutually exclusive skills.

namaria2y ago

thomasahle2y ago

I'm someone who would love to get better at making educational videos/content. 3b1b is obviously the gold standard here.

I'm curious what things other videos did worse compared to 3b1b?

bilsbie2y ago

I think he had a good, intuitive understanding that he wanted to communicate and he made it come through.

thinkingtoilet2y ago

Grant has a gift of explaining complicated things very clearly. There's a good reason his channel is so popular.

Al-Khwarizmi2y ago

Not sure if you mean it as a rhetorical question but I think it's an interesting question. I think there are at least three factors why most people are confused about Transformers:

1. The standard terminology is "meh" at most. The word "attention" itself is just barely intuitive, "self-attention" is worse, and don't get me started about "key" and "value".

Everyone who is trying to explain transformers has to overcome these three disadvantages... so most explanations are confusing.

cmplxconjugate2y ago

>This is probably an unpopular opinion.

I think though your third point is the most important; right now people are focused on results.

maleldil2y ago

> This is probably an unpopular opinion

There's a reason The Illustrated Transformer[1] was/is so popular: it made the original paper much more digestible.

[1] https://jalammar.github.io/illustrated-transformer/

Solvency2y ago

Because:

1. good communication requires an intelligence that most people sadly lack

2. because the type of people who are smart enough to invent transformers have zero incentive to make them easily understandable.

most documents are written by authors subconsciously desperate to mentally flex on their peers.

penguin_booze2y ago

1 more reply

WithinReason2y ago

2. It's not malice. The longer you have understood something the harder it is to explain it, since you already forgot what it was like to not understand it.

Xcelerate2y ago· 11 in thread

IDK, when your background involves research in a certain field, every problem looks like a nail for that particular hammer. Does anyone else see parallels here or is this a bit of a stretch?

francasso2y ago

bdjsiqoocwk2y ago

I think you're just describing a state machine, no? The fact that you encode the state in a vector and steps by matrices is an implementation detail...?

Xcelerate2y ago

One might model the human brain as a FSM as well, but I’m not sure I’d call the predictive ability of the brain an implementation detail.

1 more reply

feoren2y ago

1 more reply

BoGoToTo2y ago

tweezy2y ago

Way out and not saying there's any truth to it. But it was a really interesting and fun concept to chew on.

3 more replies

BobbyTables22y ago

I think Wolfram made news proposing something roughly along these lines.

Either way, I find Planck time/energy to be a very spooky concept.

https://wolframphysics.org/

pas2y ago

https://en.m.wikipedia.org/wiki/Holographic_principle

cmgbhm2y ago

Didn’t make the “what was going on then in physics” connection until now.

tpurves2y ago

lagrange772y ago

rollinDyno2y ago· 9 in thread

jgehring2y ago

causal2y ago

> you don't re-compute all attention passes for all tokens for each next token to predict.

You don't? I imagine the attention maps could be pretty different between n and n+1 tokens.

Edit: Or maybe you just meant you don't compute attention Σ(n) times for each new token?
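One way to see what is and is not recomputed is a toy key/value cache. In the sketch below (single layer, single head, random stand-in embeddings and weights, all names hypothetical), each new token adds one row to the cached keys and values and runs one query against them. Under a causal mask this gives the same outputs as recomputing the whole attention pattern from scratch, because earlier rows never look at later tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, K, V):
    """One query vector attending over all cached keys and values."""
    scores = K @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
tokens = rng.normal(size=(5, d))  # stand-in embeddings for 5 tokens

# Incremental decoding: append one key/value row per new token.
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
incremental = []
for x in tokens:
    K_cache = np.vstack([K_cache, x @ W_k])
    V_cache = np.vstack([V_cache, x @ W_v])
    incremental.append(attend(x @ W_q, K_cache, V_cache))
incremental = np.array(incremental)

# Full recomputation with a causal mask, for comparison.
Q, K, V = tokens @ W_q, tokens @ W_k, tokens @ W_v
scores = Q @ K.T / np.sqrt(d)
future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
full = softmax(np.where(future, -np.inf, scores)) @ V
```

So the new token's row of the attention map is genuinely new work each step, but the rows for earlier tokens do not change and their keys and values can be reused.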

diedyesterday2y ago

> "Is it really possible that this space is so rich as to have a single point in it encapsulate a whole novel?"

causal2y ago

Which means the embedding model must do a lot of the lifting to be able to accurately represent meaning across long contexts so well. Now I want to know more about how those models are derived.

vanjajaja12y ago

(is my understanding)

rollinDyno2y ago

faramarz2y ago

it's not about a single point encapsulating a novel, but how sequences of such embeddings can represent complex ideas when processed by the model's layers.

each prediction is based on a weighted context of all previous tokens, not just the immediately preceding one.

rollinDyno2y ago

That weighted context is the 12288-dimensional vector, no?

I suppose that when each element in the vector weighs 16 bits, the space is immense and capable of holding a novel in a point.
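The raw size of that space is easy to work out. The snippet below assumes the 12,288-dimensional embedding discussed in the video and 16 bits per element; it only counts bits, which is an upper bound on capacity and says nothing about how much of it a trained model actually uses.

```python
import math

dims = 12288           # embedding dimension, assumed from the video's GPT-3 figures
bits_per_element = 16  # e.g. a 16-bit float per coordinate

total_bits = dims * bits_per_element
total_bytes = total_bits // 8
total_kib = total_bytes / 1024

# Number of decimal digits in 2 ** total_bits, the count of distinct bit patterns.
decimal_digits = math.floor(total_bits * math.log10(2)) + 1
```

That comes to 196,608 bits, or 24 KiB, per vector.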

2 more replies

evolvingstuff2y ago

rayval2y ago· 3 in thread

Here's a compelling visualization of the functioning of an LLM when processing a simple request: https://bbycroft.net/llm

This complements the detailed description provided by 3blue1brown

bugthe0ry2y ago

When visualised this way, the scale of GPT-3 is insane. I can't imagine what 4 would look like here.

spi2y ago

[1] https://medium.com/@daniellefranca96/gpt4-all-details-leaked...

1 more reply

lying4fun2y ago

amazing visualisation

shahbazac2y ago· 3 in thread

Is there a reference which describes how the current architecture evolved? Perhaps from a very simple core idea to the famous “all you need” paper?

Otherwise it feels like lots of machinery created out of nowhere. Lots of calculations and very little intuition.

HarHarVeryFunny2y ago

Roughly:

krat0sprakhar2y ago

Thank you for this summary! Very well explained. Any tips on what resources you use to keep updated on this field?

1 more reply

ollin2y ago

karpathy gave a good high-level history of the transformer architecture in this Stanford lecture https://youtu.be/XfpMkf4rD6E?si=MDICNzZ_Mq9uzRo9&t=618

tylerneylon2y ago· 2 in thread

One great way to improve on this bottleneck is a new-ish idea called Ring Attention. This is a good article explaining it:

https://learnandburn.ai/p/how-to-build-a-10m-token-context

(I edited that article.)
