
0 points · fastball · 3y ago · 0 comments

Multi-head attention just means that you're looking at all the words at once, rather than only one word at a time, and using that to generate the next word. So instead of attending only to the last word, you also attend to the penultimate word, the one before that, and so on. I think it is fairly obvious why this gives better results than, say, an RNN: you are using context better than a recurrent system does, which is also closer to how a human brain works. When you read or write a sentence, you're not really going one word at a time; you're thinking about all the words at once, even if the last word is technically the most important.

The other clear benefit of transformers over an architecture like RNNs (and what has probably made more of a difference, imo) is that they're properly parallelizable, which means you can do huge training runs in a fraction of the time. RNNs might be able to reach a level of coherence that approaches GPT-3, but with current hardware that would be prohibitively slow.


4 comments · 2 top-level

heyitsguay · 3y ago · 1 in thread

That's not what multi-head attention means. Multi-head attention is the use of learned projection operators to perform attention operations within multiple lower-dimensional subspaces of the network's embedding space, rather than a single attention operation in the full embedding space. E.g., projecting ten 512-D vectors into eighty 64-D vectors, attending separately within each of the 8 sets of 10 projected embeddings, then concatenating the results to re-form ten 512-D output vectors.

In fact the projection operations are the only learned part of a Transformer's self-attention function -- the rest of self-attention is just a weighted sum of the input vectors, where the weights come from the (scaled) vector correlation matrix.
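To make the shapes in that description concrete, here is a minimal pure-Python sketch of the same idea: 10 input vectors of 512 dimensions, 8 heads of 64 dimensions each. The function names, the random weights, and the fixed seed are all assumptions made for illustration; a trained model would have learned projections, and the original Transformer also applies a final output projection after concatenation, which is left out here to match the description above.

```python
import math
import random


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both stored as lists of rows."""
    cols = list(zip(*b))
    return [[dot(row, col) for col in cols] for row in a]


def softmax(values):
    """Numerically stable softmax over one row of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: each output row is a weighted sum of the rows of v."""
    scale = math.sqrt(len(q[0]))
    weights = [softmax([dot(qi, kj) / scale for kj in k]) for qi in q]
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    """Stand-in for a learned matrix (assumption: random values, not trained weights)."""
    std = 1.0 / math.sqrt(rows)
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


def multi_head_attention(x, heads):
    """x: seq_len rows of d_model floats; heads: list of (w_q, w_k, w_v) projections."""
    head_outputs = []
    for w_q, w_k, w_v in heads:
        # Project into this head's lower-dimensional subspace, then attend there.
        out, _ = attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        head_outputs.append(out)
    # Concatenate the per-head outputs for each position back to d_model values.
    return [[value for head in head_outputs for value in head[i]]
            for i in range(len(x))]


rng = random.Random(0)
seq_len, d_model, n_heads = 10, 512, 8
d_head = d_model // n_heads  # 64
x = random_matrix(seq_len, d_model, rng)
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(n_heads)]
out = multi_head_attention(x, heads)
print(len(out), len(out[0]))  # 10 512
```

Note that every head already looks at all 10 positions; what the heads add is several independent attention patterns, each computed in its own 64-D projection.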

fastball (OP) · 3y ago

How is that different from what I said?

petra · 3y ago · 1 in thread

So in training, ChatGPT turned words into embeddings and, given a context window of N, looked at N embeddings and created a list of probabilities for the next embedding?

And if I tell it something that was exactly in its training context windows, I get the most likely next word and the one after it.

But what happens if I ask it something slightly different from its training context? Or something largely different?

MacsHeadroom · 3y ago

By "embedding" in this context, what you're actually referring to is called a "token": a sub-word string, usually 1-4 characters long.

It's not possible for you to ask it things even slightly different from its training data, unless you ask exclusively in emojis that didn't exist yet when it was trained (in which case it sees nothing, just like when someone sends you an emoji your phone doesn't support).

Any novel sentence and even novel words like "Blobdarfnk" ARE in its training data. "Blobdarfnk" is encoded as the five tokens Bl, ob, dar, fn, and k.
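A rough sketch of how a made-up word can still break down into known pieces. This uses greedy longest-match splitting as a stand-in; production tokenizers such as byte-pair encoding apply learned merge rules instead, so the real split of "Blobdarfnk" may well differ from the one quoted above. The function name and the tiny vocabulary are hypothetical, chosen only so the output matches the example in the comment.

```python
def greedy_tokenize(text, vocab):
    """Split text into the longest vocabulary pieces, scanning left to right."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character
        # so that unknown characters still produce a token.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


# Hypothetical vocabulary (assumption), not taken from any real tokenizer.
VOCAB = {"Bl", "ob", "dar", "fn", "k", "the", "cat"}

print(greedy_tokenize("Blobdarfnk", VOCAB))  # ['Bl', 'ob', 'dar', 'fn', 'k']
```

The point of the sketch is only that the word as a whole never needs to appear in the vocabulary; each piece does.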
