But what is a GPT? Visual intro to Transformers [video] (opens in new tab)

(youtube.com)

473 pointshuhhuh2y ago53 comments

53 comments

I've just started this video, but already have a question if anyone's familiar with GPT workings - I thought that these models chose the next word based on what's most likely. But if they choose based on "one of the likely" words, could (in general) that not lead to a situation where the list of predictions for the next word are much less likely? Running possibilities of "two words together", then, would be more beneficial if computationally possible (and so on for 3, 4 and n words). Does this exist?

(I realize that choosing the most likely word wouldn't necessarily solve the issue, but choosing the most likely phrase possibly might.)

Edit, post seeing the video and comments: it's beam search, along with temperature to control these things.

authorfly2y ago

In practice, beam search doesn't seem to work well for generative models.

Temperature and top_k (two very similar parameters) were both introduced to account for the fact that human text is unpredictable stochastically for each sentence someone might say as such - as shown in this 2021 similar graph/reproduction of an older graph from the 2018/2019 HF documentation: https://lilianweng.github.io/posts/2021-01-02-controllable-t...

It could be that beam search with much longer length does turn out to be better or some merging of the techniques works well, but I don't think so. The query-key-value part of transformers is focused on a single total in many ways - in relation to the overall context. The architecture is not meant for longer forms as such - there is no default "two token" system. And with 50k-100k tokens in most GPT models, you would be looking at 50k*50k = A great deal more parameters and then issues with sparsity of data.

Just everything about GPT models (e.g. learned positional encodings/embeddings depending on the model iteration) is so focused on bringing the richness of a single token or single token index that the architecture is not designed for beam search like this one could say. Without considering the training complications.

doctoboggan2y ago

The temperature setting is used to select how rare of a next token is possible. If set to 0 the. The top of the likely list is chosen, if set greater than 0 then some lower probability tokens may be chosen.

HarHarVeryFunny2y ago

Not quite... temperature and token selection are two different things.

At the output of an LLM the raw next-token prediction values (logits) are passed through a softmax to convert them into probabilities, then these probabilities drive token selection according to the chosen selection scheme such as greedy selection (always choose highest probability token), or a sampling scheme such as top-k or top-p. Under top-k sampling a random token selection is made from one of the top k most probable tokens.

The softmax temperature setting preserves the relative order of output probabilities, but at higher temperatures gives a boost to outputs that would otherwise have been low probability such that the output probabilities are more balanced. The effect of this on token selection depends on the selection scheme being used.

If greedy selection was chosen, then temperature has no effect since it preserves the relative order of probabilities, and the highest probability token will always be chosen.

If a sampling selection scheme (top-k or top-p) was chosen, then increased temperature will have boosted the likelihood of sampling choosing an otherwise lower probability token. Note however, that even with the lowest temperature setting, sampling is always probabilistic, so there is no guarantee (or desire!) for the highest probability token to be selected.

DrawTR2y ago

Can this be potentially dangerous -- e.g. if a user types "The answer to the expression 2 + 2 is", isn't there a chance it chooses an output beyond the most likely one?

Habgdnv2y ago

Unless you screw something, a different next token does not mean wrong answer. Examples:

(80% of the time) The answer to the expression 2 + 2 is 4

(15% of the time) The answer to the expression 2 + 2 is Four

(5% of the time) The answer to the expression 2 + 2 is certainly

(95% of the time) The answer to the expression 2 + 2 is certainly Four

This is how you can asp ChatGPT the same question few times and it can give you different words each time, and still be correct.

2 more replies

weitendorf2y ago

Yes, although it's also possible that the most likely token is incorrect and perhaps the next 4 most likely tokens would lead to a correct answer.

For example if you ask a model what is 0^0, the highest probability output may be "1", which is incorrect. The next most probable outputs may be words like "although", "because", "Due to", "unfortunately", etc. as the model prepares to explain to the user that the value of the expression is undefined; because there are many more ways to express and explain the undefined answer than there are to express a naively incorrect answer, the correct answer is split across more tokens so that even if eg the softmax value of "1" is 0.1 and across "although"+"because"+"due to"+"unfortunately">0.3, at temperature of 0, "1" gets chosen. At slightly higher temperatures, sampling across all outputs would increase the probability of a correct answer.

So it's true that increasing the temperature increases the probability that the model outputs tokens other than the single-most-likely token, but that might be what you want. Temperature purely controls the distribution of tokens, not "answers".

1 more reply

x-complexity2y ago

> Can this be potentially dangerous -- e.g. if a user types "The answer to the expression 2 + 2 is", isn't there a chance it chooses an output beyond the most likely one?

This is where the semi-ambiguity of the human languages helps a lot with.

There are multiple ways to answer with "4" that are acceptable, meaning that it just needs to be close enough to the desired outcome to work. This means that there isn't a single point that needs to be precisely aimed at, but a broader plot of space that's relatively easier to hit.

The hefty tolerances, redundancies, & general lossiness of the human language act as a metaphorical gravity well to drag LLMs to the most probable answer.

MrYellowP2y ago

> potentially dangerous

> 2 + 2

You really couldn't come up with an actual example of something that would be dangerous? I'd appreciate that, because I'm not seeing reason to believe that an "output beyond the most likely one" output would end up ever being dangerous, as in, harming someone or putting someone's life at risk.

Thanks.

2 more replies

jldugger2y ago

Yes, but the chance is quite small if the gap between "4" and any other token is quite large.

pelillian2y ago

That’s why we use top p and top k! They limit the probability space to a certain % or number of tokens ordered by likelihood

davekeck2y ago

> then some lower probability tokens may be chosen

Can you explain how it chooses one of the lower-probability tokens? Is it just random?

acchow2y ago

Reducing temperature reduces the impact of differences between raw output values giving a higher probability to pick other tokens.

1 more reply

not_a_dane2y ago

It is the part of softmax layer, but not all the time.

user_78322y ago

Thanks, learnt something new today!

ahzhou2y ago

Yes, this is a fundamental weakness with LLMs. Unfortunately this is likely unsolvable because the search space is exponential. Techniques like beam search help, but can only introduce a constant scaling factor.

That said, LLM reach their current performance despite this limitation.

mvsin2y ago

Something like this does exist, production systems rarely use greedy search but have more holistic search algorithms.

An example is Beam Search:https://www.width.ai/post/what-is-beam-search

Essentially we keep a window of probabilities of predicted tokens to improve the final quality of output.

user_78322y ago

Thanks, that's exactly what I was looking for! Any idea if it's possible to use beam search on local models like mistral? It sounds like the choice of beam search vs say top-p or top-k should be in the software and not embedded, right?

activatedgeek2y ago

If you use HuggingFace models, then a few simpler decoding algorithms are already implemented for `generate` method of all supported models.

Here is a blog post that describes it: https://huggingface.co/blog/how-to-generate.

I will warn you though that beam search is typically what you do NOT want. Beam search approximately optimizes for the "highest likely sequence at the token level." This is rarely what you need in practice with open-ended generations (e.g. a question-answering chat bot). In practice, you need "highest likely semantic sequence," which is much harder problem.

Of course, various approximations for semantic alignment are currently in the literature, but still a wide open problem.

yunohn2y ago

This is actually a great question for which I found an interesting attempt: https://andys.page/posts/llm_sampling_strategies/

(No affiliation)

qeternity2y ago

> production systems rarely use greedy search

I have no idea why you say this. Most of our pipelines will run greedy, for reproducibility.

Maybe we turn the temp up if we are returning conversational text back to a user.

jimmySixDOF2y ago

>words together

Thats basically chunking or at least how it starts. I was impressed by the ability to add and subtract the individual word vector embeddings and get meaningful results. Chunking a larger block blends this whole process so you can do the same thing but in conseptual space the so take a baseline method like sentence embedding and that becomes your working block for comparison.

lxe2y ago

There's a whole bunch of different normalization and sampling techniques that you can perform that can alter the quality or expressiveness of the model, e.g. https://docs.sillytavern.app/usage/common-settings/#sampler-...

chessgecko2y ago

There’s some fancier stuff too like techniques that take into account where recent tokens were drawn from in the distribution and update either the top_p or the temperature so that sequences of tokens have a minimum unlikeliness. Beam search is less common with really large models because the computation is really expensive.

lucidrains2y ago

I can't think of anyone better to teach attention mechanism to the masses. This is a dream come true

acchow2y ago

Incredible. This 3B1B series was started 6 years ago and keeps going today with chapter 5.

If you haven't seen the first few chapters, I cannot recommend enough.

user_78322y ago

Would you be able to compare them to Andrew Ng's course?

mFixman2y ago

IMO the style, formatting, and animations in 3B1B videos is what Coursera courses should have been about in the first place.

Andrew Ng's course doesn't use video effectively at all: half of each class is Andrew talking to the camera, while the other half is him slowly writing things down with a mouse. There's a reason why a lot of people recommend watching at 1.5x speed.

Online classes are online classes. If they try to make copy in-person classes, like most Coursera courses do, they will keep all of the weaknesses of online classes without any of its strengths.

abraxas2y ago

I personally preferred Andrej Karpathy's CS231n taught by him and his private videos about neural nets in general and transformers in particular. He has a youtube vid where he builds one from scratch in Python!

3BlueOneBrown videos are a great complement to Karpathy's lectures to aid in visualising what is going on.

spmurrayzzz2y ago

IMO I think the 3Blue1Brown video is a good place to start to build intuitions about how things work generally if you're new, and Andrew Ng's courses will help you dig into more detail, experiment, and implement things to build on those intuitions.

sk110012y ago

They're not really comparable - if you're wondering if you should do one or the other, you should do both.

ctrw2y ago

The way you compare a technical drawing of a steam engine to The Fighting Temeraire oil painting.

Vespasian2y ago

If you liked that, Andrej karpathy has a few interesting videos on his channels explaining Neural Networks and their inner workings which are aimed at people who know how to program.

jtonz2y ago

As a reasonably experienced programmer that has watched Andrej's videos the one thing I would recommend is that they not be used as a starting point to learn neural networks but as a reinforcement or enhancement method once you know the fundamentals.

I was ignorant enough to try and jump straight in to his videos and despite him recommending I watch his preceeding videos I incorrectly assumed I could figure it out as I went. There is verbiage in there that you simply must know to get the most out of it. After giving up, going away and filling in the gaps though some other learnings, I went back and his videos become (understandably) massively more valueable for me.

I would strongly recommend anyone else wanting to learn neural networks that they learn from my mistake.

kovrik2y ago

Could you please share what other learning materials you used?

6mian2y ago

For me 3brown1blue series: https://m.youtube.com/watch?v=aircAruvnKk was an excellent introduction that made Andrej's videos understandable. Then I did 3 first chapters of fastai book, but found it too high level, while I was interested in how things works under the hood.

Going through Andrej's makemore tutorials required quite a lot of time but it's definitely worth it. I used free tier of Google Colab until the last one.

Pausing the video a lot after he explains what he plans to do and trying to do it by myself was a very rewarding way to learn, with a lot of "aha" moments.

yinser2y ago

What an unbelievable salve for all the April Fool's content. Pipe this directly into my veins.

Terr_2y ago

Also relevant would be this interactive visualization: https://bbycroft.net/llm

Prior discussion: https://news.ycombinator.com/item?id=38505211

__loam2y ago

3B1B is one of the best stem educators in YouTube.

throwawayk7h2y ago

The next token is taken by sampling the logits in the final column after unembedding. But isn't that just the last token again? Or is the matrix resized to N+1 at some step?

HarHarVeryFunny2y ago

There is an end-of-sequence token appended to the input sequence, and this is what is transformed into the predicted next token.

lovestaco2y ago

Good channel, I just get most of what he says.

lxe2y ago

Can't wait for the next videos. I think I'll finally be able to internalize and understand how these things work.

RecycledEle2y ago

Great video.

Thank you for sharing.

j / k navigate · click thread line to collapse

53 comments

user_78322y ago

(I realize that choosing the most likely word wouldn't necessarily solve the issue, but choosing the most likely phrase possibly might.)

Edit, post seeing the video and comments: it's beam search, along with temperature to control these things.

authorfly2y ago

In practice, beam search doesn't seem to work well for generative models.

doctoboggan2y ago

HarHarVeryFunny2y ago

Not quite... temperature and token selection are two different things.

If greedy selection was chosen, then temperature has no effect since it preserves the relative order of probabilities, and the highest probability token will always be chosen.

DrawTR2y ago

Can this be potentially dangerous -- e.g. if a user types "The answer to the expression 2 + 2 is", isn't there a chance it chooses an output beyond the most likely one?

Habgdnv2y ago

Unless you screw something, a different next token does not mean wrong answer. Examples:

(80% of the time) The answer to the expression 2 + 2 is 4

(15% of the time) The answer to the expression 2 + 2 is Four

(5% of the time) The answer to the expression 2 + 2 is certainly

(95% of the time) The answer to the expression 2 + 2 is certainly Four

This is how you can asp ChatGPT the same question few times and it can give you different words each time, and still be correct.

2 more replies

weitendorf2y ago

Yes, although it's also possible that the most likely token is incorrect and perhaps the next 4 most likely tokens would lead to a correct answer.

1 more reply

x-complexity2y ago

> Can this be potentially dangerous -- e.g. if a user types "The answer to the expression 2 + 2 is", isn't there a chance it chooses an output beyond the most likely one?

This is where the semi-ambiguity of the human languages helps a lot with.

The hefty tolerances, redundancies, & general lossiness of the human language act as a metaphorical gravity well to drag LLMs to the most probable answer.

MrYellowP2y ago

> potentially dangerous

> 2 + 2

Thanks.

2 more replies

jldugger2y ago

Yes, but the chance is quite small if the gap between "4" and any other token is quite large.

pelillian2y ago

That’s why we use top p and top k! They limit the probability space to a certain % or number of tokens ordered by likelihood

davekeck2y ago

> then some lower probability tokens may be chosen

Can you explain how it chooses one of the lower-probability tokens? Is it just random?

acchow2y ago

Reducing temperature reduces the impact of differences between raw output values giving a higher probability to pick other tokens.

1 more reply

not_a_dane2y ago

It is the part of softmax layer, but not all the time.

user_78322y ago

Thanks, learnt something new today!

ahzhou2y ago

That said, LLM reach their current performance despite this limitation.

mvsin2y ago

Something like this does exist, production systems rarely use greedy search but have more holistic search algorithms.

An example is Beam Search:https://www.width.ai/post/what-is-beam-search

Essentially we keep a window of probabilities of predicted tokens to improve the final quality of output.

user_78322y ago

activatedgeek2y ago

If you use HuggingFace models, then a few simpler decoding algorithms are already implemented for `generate` method of all supported models.

Here is a blog post that describes it: https://huggingface.co/blog/how-to-generate.

Of course, various approximations for semantic alignment are currently in the literature, but still a wide open problem.

yunohn2y ago

This is actually a great question for which I found an interesting attempt: https://andys.page/posts/llm_sampling_strategies/

(No affiliation)

qeternity2y ago

> production systems rarely use greedy search

I have no idea why you say this. Most of our pipelines will run greedy, for reproducibility.

Maybe we turn the temp up if we are returning conversational text back to a user.

jimmySixDOF2y ago

>words together

lxe2y ago

chessgecko2y ago

lucidrains2y ago

I can't think of anyone better to teach attention mechanism to the masses. This is a dream come true

acchow2y ago

Incredible. This 3B1B series was started 6 years ago and keeps going today with chapter 5.

If you haven't seen the first few chapters, I cannot recommend enough.

user_78322y ago

Would you be able to compare them to Andrew Ng's course?

mFixman2y ago

IMO the style, formatting, and animations in 3B1B videos is what Coursera courses should have been about in the first place.

Online classes are online classes. If they try to make copy in-person classes, like most Coursera courses do, they will keep all of the weaknesses of online classes without any of its strengths.

abraxas2y ago

3BlueOneBrown videos are a great complement to Karpathy's lectures to aid in visualising what is going on.

spmurrayzzz2y ago

sk110012y ago

They're not really comparable - if you're wondering if you should do one or the other, you should do both.

ctrw2y ago

The way you compare a technical drawing of a steam engine to The Fighting Temeraire oil painting.

Vespasian2y ago

If you liked that, Andrej karpathy has a few interesting videos on his channels explaining Neural Networks and their inner workings which are aimed at people who know how to program.

jtonz2y ago

I would strongly recommend anyone else wanting to learn neural networks that they learn from my mistake.

kovrik2y ago

Could you please share what other learning materials you used?

6mian2y ago

Going through Andrej's makemore tutorials required quite a lot of time but it's definitely worth it. I used free tier of Google Colab until the last one.

Pausing the video a lot after he explains what he plans to do and trying to do it by myself was a very rewarding way to learn, with a lot of "aha" moments.

yinser2y ago

What an unbelievable salve for all the April Fool's content. Pipe this directly into my veins.

Terr_2y ago

Also relevant would be this interactive visualization: https://bbycroft.net/llm

Prior discussion: https://news.ycombinator.com/item?id=38505211

__loam2y ago

3B1B is one of the best stem educators in YouTube.

throwawayk7h2y ago

The next token is taken by sampling the logits in the final column after unembedding. But isn't that just the last token again? Or is the matrix resized to N+1 at some step?

HarHarVeryFunny2y ago

There is an end-of-sequence token appended to the input sequence, and this is what is transformed into the predicted next token.

lovestaco2y ago

Good channel, I just get most of what he says.

lxe2y ago

Can't wait for the next videos. I think I'll finally be able to internalize and understand how these things work.

RecycledEle2y ago

Great video.

Thank you for sharing.

j / k navigate · click thread line to collapse