The Illustrated Word2vec

(jalammar.github.io)

348 points · jalammar · 7y ago · 37 comments

jalammar (OP) · 7y ago

Hello HN,

Author here. I wrote this blog post attempting to visually explain the mechanics of word2vec's skipgram with negative sampling algorithm (SGNS). It's motivated by:

1- The need to develop more visual language around embedding algorithms.

2- The need for a gentle on-ramp to SGNS for people who are using it for recommender systems. A use-case I find very interesting (there are links in the post to such applications)

I'm hoping it could also be useful if you wanted to explain to someone new to the field the value of vector representations of things. Hope you enjoy it. All feedback is appreciated!

Radim · 7y ago

Nice work jalammar! Author of gensim here. Quotes from Dune are always appreciated :-)

Here's some more layman reading "from back when", for people interested in how word2vec compares to other methods and works technically:

- https://rare-technologies.com/making-sense-of-word2vec/ (my experiments with word2vec vs GloVe vs sparse SVD / PMI)

- https://www.youtube.com/watch?v=vU4TlwZzTfU&t=3s (my PyData talk on optimizing word2vec)

titanix2 · 7y ago

I read some of your posts a few weeks ago when searching for more info about gensim; they're well explained and understandable even for a beginner. Thanks.

wyldfire · 7y ago

The Dune references aren't limited to this article. :)

The BERT article [1] has 'em too!

[1] https://jalammar.github.io/illustrated-bert/

jalammar (OP) · 7y ago

Oh wow. Hi Radim! Huge fan of Gensim! Thanks for the links!

misterman0 · 7y ago

I'm half-way through your excellent article. How do you produce such great artwork?

I believe I understand the concepts of CBOW and skip-gram. But I'm a little bit stuck. I kind of don't understand this [0]. In fact I understand it so poorly that I can't even formulate a question around it.

Now what do we do?

[0] https://skymind.ai/images/wiki/word2vec_diagrams.png

Edit: An attempt at formulating a question: is it the process of feeding the model with the [context][context][output] vector that you are depicting?

jalammar (OP) · 7y ago

Thanks! Mostly Keynote, and lots of iteration.

I'll be honest, I personally found this figure puzzling. Still not 100% clear on it, but I don't believe it refers to the negative sampling approach. My best guess is that it's referring to earlier word2vec variants where the input in skipgram (or the sum of inputs in CBOW) is multiplied by a weights matrix that projects the input to an output vector.

hadsed · 7y ago

It shows the input/output pairs you would use to train the network. Projection is simply your fully connected layer, whose dimension is the embedding size you want (e.g., something like 300). The output column is what is being predicted by the model, for which you have the true data, and you'll calculate a loss and backprop as usual. In the CBOW case you take multiple context words and predict the middle word (as shown in your diagram), and skip-gram is the opposite approach: it takes the middle word and predicts the context words.
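A minimal sketch of those input/output pairs, assuming whitespace tokenization and a symmetric window of 2. The function names, the window size, and the example sentence are illustrative choices, not anything taken from the word2vec source:

```python
def cbow_pairs(tokens, window=2):
    """Yield (context_words, center_word): CBOW predicts the middle word."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            yield context, center


def skipgram_pairs(tokens, window=2):
    """Yield (center_word, context_word): skip-gram predicts each neighbor."""
    for context, center in cbow_pairs(tokens, window):
        for word in context:
            yield center, word


# Hypothetical example sentence, split on whitespace.
tokens = "thou shalt not make a machine".split()

for context, center in cbow_pairs(tokens):
    print("CBOW      ", context, "->", center)
for center, word in skipgram_pairs(tokens):
    print("skip-gram ", center, "->", word)
```

Both generators walk the same windows; the only difference is which side of the pair is treated as the input and which as the prediction target.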

elexhobby · 7y ago

Great post, thanks!

Is there a reason why the training is started off with two separate matrices - the embedding and the context matrix? If the context matrix is anyway discarded at the end, why not start and work with only the embedding matrix?

alexbilbie · 7y ago

Thank you so much for writing this. As a software developer with some familiarity with ML concepts and terminology (I’d heard of word2vec for example) I found this post really easy to follow along with.

ascavalcante80 · 7y ago

What great work, man! It makes ML way simpler to understand. For those interested in seeing similar content for learning advanced maths, here is a good YouTube channel: https://www.youtube.com/channel/UCYO_jab_esuFRV4b17AJtAw

wwarner · 7y ago

Then you definitely succeeded! Those are the parts where I learned the most.

obahareth · 7y ago

Another great article as always Jay.

Also thumbs up for the Dune references :)

focom · 7y ago

Thanks for this, beautiful work!

giacaglia · 7y ago

This is great!

microtonal · 7y ago

There are clear places where “king” and “queen” are similar to each other and distinct from all the others. Could these be coding for a vague concept of royalty?

This is a common misunderstanding and unfortunately strengthened by the example of 'personality embeddings'. It is easy to understand intuitively why this is normally not the case. If you rotate a vector/embedding space, all the cosine similarities between words are preserved. Suppose that component 20 encoded 'royalty', there is an infinite number of rotations of the vector space where 'royalty' is distributed among many dimensions. Consider e.g. the personality vector openness-extroversion as values between 0-1. Now we have two persons:

[0, 1] [1, 0]

These vectors are orthogonal, so the cosine similarity is 0. Now let's rotate the vector space by 45 degrees:

[-0.707, 0.707] [0.707, 0.707]

The cosine similarity between the two vectors is still 0. Clearly, the direct mapping of personality traits to vector components is lost (as can be seen in the second vector).
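The same arithmetic as a quick NumPy check. This is only a sketch of the two-person example above; the helper names `cosine` and `rotation` are made up for illustration:

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def rotation(theta):
    """2-D counter-clockwise rotation matrix for angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])


# The two "personality vectors" from the example above.
a = np.array([0.0, 1.0])
b = np.array([1.0, 0.0])

R = rotation(np.pi / 4)          # rotate the whole space by 45 degrees
a_rot, b_rot = R @ a, R @ b

print(np.round(a_rot, 3), np.round(b_rot, 3))
print(cosine(a, b), round(cosine(a_rot, b_rot), 12))
```

The rotated coordinates no longer line up with a single trait each, yet the cosine similarity between the two people is unchanged.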

Obviously, this is something that you generally do not want to do for personality vectors. However, there is nothing in the word2vec objectives that would prefer a vector space with meaningful dimensions. E.g. take the skip-gram model, which maximizes the log-likelihood of a focus word f co-occurring with a context word c. Shortened: p(1|w_f,w_c) = 𝜎(w_f·w_c). So, the objective in vanilla word2vec prefers vector spaces that maximize the inner product of words that co-occur and minimize the inner product of words that do not co-occur. Consequently, if we have an optimal parametrization of W and C (the word and context matrices), any rotation of the vector space is also an optimal solution. Which rotation you actually get depends on accidental factors, such as the initial (typically randomized) parameters.
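Spelled out, assuming R is any orthogonal matrix (so R^T R = I; the symbol R is introduced here only for illustration) applied to both the word and the context vectors:

```latex
\[
(R w_f) \cdot (R w_c)
  = w_f^\top R^\top R \, w_c
  = w_f^\top w_c
  = w_f \cdot w_c
\quad\Longrightarrow\quad
\sigma\big((R w_f) \cdot (R w_c)\big) = \sigma(w_f \cdot w_c).
\]
```

Every term of the objective depends on the vectors only through these inner products, so the rotated parameters score exactly the same as the original ones.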

Of course, it is possible to rotate the vector space such that dimensions become meaningful (see e.g. [1]), but with word2vec's default objective meaningful dimensions are purely accidental, and the meaning of vectors is defined in their relation to other vectors / their neighborhood.

[1] https://www.aclweb.org/anthology/D17-1041

6gvONxR4sf7o · 7y ago

This doesn't refute the point. If there's a royalty dimension and then you rotate it, there's still a royalty dimension. It just isn't a basis vector. In a blog post intended for introducing the idea, is that distinction really worth dwelling on? It could be a misunderstanding, or just a pedagogical simplification.

jalammar (OP) · 7y ago

Very interesting. I'll read the paper to wrap my head around the concept. Thanks for the feedback!

maffydub · 7y ago

I agree that there's no reason that these properties are axis-aligned.

Isn't the normal approach to look at whether

word2vec('king') - word2vec('man') ?= word2vec('queen') - word2vec('woman')
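A toy sketch of that check. The vectors below are hand-built, hypothetical stand-ins (real word2vec vectors are learned and only satisfy the relation approximately), and the "royal" and "female" offsets are deliberately not aligned with any axis:

```python
import numpy as np

# Hypothetical 3-d vectors constructed for illustration, not trained.
base = np.array([0.2, 0.1, 0.4])
royal = np.array([0.5, 0.5, 0.0])     # spread over two components
female = np.array([0.0, -0.3, 0.3])   # spread over two components

vec = {
    "man": base,
    "woman": base + female,
    "king": base + royal,
    "queen": base + royal + female,
    "apple": np.array([-0.4, 0.3, 0.1]),
}


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def nearest(target, exclude=()):
    """Word whose vector has the highest cosine similarity to target."""
    candidates = [w for w in vec if w not in exclude]
    return max(candidates, key=lambda w: cosine(target, vec[w]))


# Are the two difference vectors the same?
same_offset = np.allclose(vec["king"] - vec["man"], vec["queen"] - vec["woman"])

# The usual analogy query: king - man + woman ~= ?
target = vec["king"] - vec["man"] + vec["woman"]
answer = nearest(target, exclude={"king", "man", "woman"})
print(same_offset, answer)
```

With trained embeddings the equality is only approximate, which is why the usual test is a nearest-neighbor search around the target vector rather than an exact comparison.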

There's an entertaining investigation of this applied to Game of Thrones at https://towardsdatascience.com/game-of-thrones-word-embeddin...!

make3 · 7y ago

No one is saying that the "royalty" direction should be in the same angle as an axis, or that it should be in the same direction every time you train word2vec of course. It doesn't mean that that direction doesn't exist, and that word2vec doesn't code for such a Royalty direction (or region)

microtonal · 7y ago

Well, obviously, all royalty are going to have similar vectors. Skip-gram is just an implicit matrix factorization of a shifted PMI matrix, and most royalty will have similar co-occurrences. My point is that the vector components do not mean anything in isolation. There is no dimension directly encoding such properties. The king vector means 'royalty' because queen, prince, princess, etc. have similar directions.

feanaro · 7y ago

Related to this is factor analysis, the technique predominantly used in psychology to extract meaningful factors (analogue of components in principal component analysis).

Unlike PCA, it assumes meaningful "latent" factors (such as "royalty" above) and tries to find a rotation which best loads these onto the data. To achieve this, it doesn't attempt to encode the data perfectly but leaves room for error in the reduction to factors.

nullc · 7y ago

Has anyone tried a word2vec like training with an L1 norm minimizing regularization?

stared · 7y ago

Well, I think it is important to remember that dimensions of word2vec DO NOT have any specific meaning (unlike Extraversion etc. in the Big Five). All of it is "up to a rotation". Using it looks clunky at best. To be fair, I may be biased as I wrote a different intro to word2vec (http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html).

For implementation, I am surprised it leaves out https://adoni.github.io/2017/11/08/word2vec-pytorch/. There are many others, including in NumPy and TF, but I find the PyTorch one the most straightforward and didactic, by a large margin.

sidesentists · 7y ago

For some context, the "up to a rotation" argument is something that has gone on for decades in the psychological measurement literature.

It's true that the axes are arbitrary, but the clustering of points in space is not. So while the choice of axes is arbitrary, it becomes nonarbitrary if you're trying to choose the axes in such a way as to represent the clustering of points. This is why you end up with different rotations in factor analysis: they follow from different definitions of how to best represent clusterings.

I think there are some ties here to compressed sensing but that's getting a little tangential. My main point is that while it's true that the default word2vec embedding may lack meaning, if you define "meaning" in some way (even if in terms of information loss) you can rotate it to a meaningful embedding.

b_tterc_p · 7y ago

Well, sort of. They do have a meaning. It’s probably not an easily findable or understandable concept to humans. If you hypothetically had a large labeled corpus for a bunch of different features, you could create linear regressions over the embedding space to find vectors that do represent exactly (perhaps not uniquely) the meaning you’re looking for... and from that you could imagine a function that transforms the existing embedding space into an organized one with meaning.
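A rough sketch of that idea on synthetic data. Everything here is assumed for illustration: the embeddings are random stand-ins rather than trained vectors, and the per-word feature scores are generated from a hidden direction so the fit can be checked:

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, dim = 500, 20

# Stand-in for a trained embedding matrix (one row per word).
E = rng.normal(size=(n_words, dim))

# Hypothetical labeled feature (e.g. a "royalty" score per word), generated
# from a hidden direction plus a little noise so we know the right answer.
hidden = rng.normal(size=dim)
hidden /= np.linalg.norm(hidden)
labels = E @ hidden + 0.05 * rng.normal(size=n_words)

# Ordinary least squares: find w such that E @ w approximates the labels.
w, *_ = np.linalg.lstsq(E, labels, rcond=None)
direction = w / np.linalg.norm(w)

# Cosine between the recovered direction and the hidden one.
alignment = float(direction @ hidden)
print(round(alignment, 3))
```

The recovered direction is a vector in the embedding space, not a single component, which is consistent with the "up to a rotation" point made elsewhere in the thread.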

stared · 7y ago

No, it is not true. Everything is up to an orthogonal rotation. It is not an SVD (though, even for SVD, usually only the first few dimensions have a human interpretation).

Instead, you can:

- rotate it with SVD (works really well, when working on a subset of words)

- project it on given axes (e.g. "woman - man" and "king - man")
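Both options can be sketched in a few lines of NumPy. The vectors here are random placeholders for a small subset of words (an assumption made purely for illustration), so the output coordinates carry no real meaning:

```python
import numpy as np

rng = np.random.default_rng(1)
words = ["man", "woman", "king", "queen"]
X = rng.normal(size=(len(words), 50))   # placeholder vectors, one row per word
vec = dict(zip(words, X))

# Option 1: rotate with SVD, computed on this subset of words only.
# The rows of Vt are orthonormal and span the subset's vectors, so the new
# coordinates keep every dot product between these words intact.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
X_rot = X @ Vt.T                        # shape (4, 4), ordered by variance

# Option 2: project onto hand-picked axes.
axes = np.stack([
    vec["woman"] - vec["man"],          # a "gender" axis
    vec["king"] - vec["man"],           # a "royalty" axis
])
axes /= np.linalg.norm(axes, axis=1, keepdims=True)
coords = X @ axes.T                     # shape (4, 2): one 2-d point per word

for word, (g, r) in zip(words, coords):
    print(f"{word:>6}  gender={g:+.2f}  royalty={r:+.2f}")
```

Option 1 only changes the basis, while option 2 throws away everything outside the two chosen directions, which is what makes the result easy to plot and read.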

sixo · 7y ago

you could still interchange the dimensions arbitrarily. You can't say "dimension 1 = happiness", a re-training would not replicate that, and would not necessarily produce a dimension for "happiness" at all.

rmbryan · 7y ago

Excellent article, thank you. My snag in thinking about word2vec is how the vector model stores information about words with multiple, significantly different definitions, such as 'polish', Eastern Europe or glistening clean.

physicsyogi · 7y ago

Word2vec doesn't really address multiple meanings (polysemy). There has been some progress on this though. Sebastian Ruder has been tracking the state-of-the-art in this here: [1].

[1] https://nlpprogress.com/english/word_sense_disambiguation.ht...

Edit: formatting

yodon · 7y ago

In a sufficiently high-dimensional space, like those used in word2vec, concepts can have lots of neighbors along wildly different dimensions.

c256 · 7y ago

Clearly distinct meanings are usually fine, as there’s little chance of context overlap causing clashes. The subtle distinctions can cause more trouble, but those tend to occur in situations more refined than we expect systems like these to handle anyway. The section on interchangeability versus context might be helpful for illuminating this idea.

DLA · 7y ago

Thank you very much for writing this and for making such excellent visuals. This is the single best description of word2vec I've personally ever seen. Well done!

siavosh · 7y ago

Is word2vec still the cutting edge of NLP?

451mov · 7y ago

fantastic explanation
