Author here. I wrote this blog post attempting to visually explain the mechanics of word2vec's skipgram with negative sampling algorithm (SGNS). It's motivated by:
1- The need to develop more visual language around embedding algorithms.
2- The need for a gentle on-ramp to SGNS for people who use it for recommender systems, a use case I find very interesting (the post links to such applications).
I'm hoping it could also be useful if you wanted to explain to someone new to the field the value of vector representations of things. Hope you enjoy it. All feedback is appreciated!
Here's some more layman reading "from back when", for people interested in how word2vec compares to other methods and works technically:
- https://rare-technologies.com/making-sense-of-word2vec/ (my experiments with word2vec vs GloVe vs sparse SVD / PMI)
- https://www.youtube.com/watch?v=vU4TlwZzTfU&t=3s (my PyData talk on optimizing word2vec)
The BERT article [1] has 'em too!
I believe I understand the concepts of CBOW and skip-gram. But I'm a little bit stuck. I kind of don't understand this [0]. In fact I understand it so poorly that I can't even formulate a question around it.
Now what do we do?
[0] https://skymind.ai/images/wiki/word2vec_diagrams.png
Edit: An attempt at formulating a question: is it the process of feeding the model with the [context][context][output] vector that you are depicting?
I'll be honest, I personally found this figure puzzling. Still not 100% clear on it, but I don't believe it refers to the negative sampling approach. My best guess is that it's referring to earlier word2vec variants where the input in skipgram (or the sum of inputs in CBOW) is multiplied by a weights matrix that projects the input to an output vector.
Is there a reason why training starts off with two separate matrices, the embedding matrix and the context matrix? If the context matrix is discarded at the end anyway, why not start and work with only the embedding matrix?
Also thumbs up for the Dune references :)
This is a common misunderstanding, and unfortunately one that is strengthened by the example of 'personality embeddings'. It is easy to understand intuitively why this is normally not the case. If you rotate a vector/embedding space, all the cosine similarities between words are preserved. Suppose that component 20 encoded 'royalty': there is an infinite number of rotations of the vector space where 'royalty' is distributed among many dimensions. Consider e.g. a personality vector (openness, extroversion) with values between 0 and 1. Now we have two people:
[0, 1] [1, 0]
These vectors are orthogonal, so the cosine similarity is 0. Now let's rotate the vector space by 45 degrees:
[-0.707, 0.707] [0.707, 0.707]
The cosine similarity between the two vectors is still 0. Clearly, the direct mapping of personality traits to vector components is lost (as can be seen in the second vector).
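To make the rotation argument concrete, here is a minimal NumPy sketch of the two-person example. The helper names `cosine` and `rotation_2d` are made up for illustration; only the two input vectors and the 45 degree angle come from the comment above.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rotation_2d(degrees):
    """Counter-clockwise 2-D rotation matrix (hypothetical helper)."""
    t = np.deg2rad(degrees)
    return np.array([[np.cos(t), -np.sin(t)],
                     [np.sin(t),  np.cos(t)]])

# The two people from the example: (openness, extroversion).
people = np.array([[0.0, 1.0],
                   [1.0, 0.0]])

R = rotation_2d(45)
rotated = people @ R.T  # applies R to every row vector

before = cosine(people[0], people[1])
after = cosine(rotated[0], rotated[1])

print(np.round(rotated, 3))
print(before, round(after, 12))
```

Both similarities come out as 0 (up to floating point error), while neither rotated component still corresponds to a single trait.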
Obviously, this is something that you generally do not want to do for personality vectors. However, there is nothing in the word2vec objectives that would prefer a vector space with meaningful dimensions. E.g. take the skip-gram model, which maximizes the log-likelihood of a focus word f co-occurring with a context word c. Shortened: p(1|w_f,w_c) = 𝜎(w_f·w_c). So, the objective in vanilla word2vec prefers vector spaces that maximize the inner product of words that co-occur and minimize the inner product of words that do not co-occur. Consequently, if we have an optimal parametrization of W and C (the word and context matrices), any rotation of the vector space is also an optimal solution, since applying the same rotation to W and C leaves every inner product unchanged. Which rotation you actually get depends on accidental factors, such as the initial (typically randomized) parameters.
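A rough numerical check of that claim, as a sketch rather than the real word2vec training code: W and C are random stand-ins, the positive and negative pairs are invented, and `sgns_loss` is a hypothetical helper that just sums log 𝜎(w_f·w_c) over observed pairs and log 𝜎(-w_f·w_c) over sampled negatives.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(W, C, pos_pairs, neg_pairs):
    """Negative log-likelihood over observed and negatively sampled pairs."""
    pos = sum(np.log(sigmoid(W[f] @ C[c])) for f, c in pos_pairs)
    neg = sum(np.log(sigmoid(-(W[f] @ C[c]))) for f, c in neg_pairs)
    return -(pos + neg)

vocab, dim = 6, 4
W = rng.normal(size=(vocab, dim))  # stand-in word matrix
C = rng.normal(size=(vocab, dim))  # stand-in context matrix
pos_pairs = [(0, 1), (2, 3), (4, 5)]  # invented co-occurring pairs
neg_pairs = [(0, 4), (2, 5), (1, 3)]  # invented negative samples

# Random orthogonal matrix from a QR decomposition
# (a rotation, possibly combined with a reflection).
Q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))

loss_original = sgns_loss(W, C, pos_pairs, neg_pairs)
loss_rotated = sgns_loss(W @ Q, C @ Q, pos_pairs, neg_pairs)

print(loss_original, loss_rotated)
```

The two losses agree because (WQ)(CQ)^T = W Q Q^T C^T = W C^T, so the objective cannot tell the two parametrizations apart.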
Of course, it is possible to rotate the vector space such that dimensions become meaningful (see e.g. [1]), but with word2vec's default objective meaningful dimensions are purely accidental, and the meaning of vectors is defined in their relation to other vectors / their neighborhood.
Isn't the normal approach to look at whether
word2vec('king') - word2vec('man') ?= word2vec('queen') - word2vec('woman')
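For what it's worth, that kind of offset check does not depend on individual dimensions being meaningful. A toy sketch, using hand-made vectors (not real word2vec output) and a hypothetical helper `offsets_match`:

```python
import numpy as np

# Hand-made toy vectors, NOT trained embeddings.
# Axes by construction: [person, royal, female].
vec = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([1.0, 0.0, 1.0]),
    "king":  np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
}

def offsets_match(v, a, b, c, d):
    """True if v[a] - v[b] equals v[c] - v[d] up to floating point error."""
    return bool(np.allclose(v[a] - v[b], v[c] - v[d]))

# Scramble the axes with a random orthogonal matrix.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
rotated = {word: Q @ x for word, x in vec.items()}

match_original = offsets_match(vec, "king", "man", "queen", "woman")
match_rotated = offsets_match(rotated, "king", "man", "queen", "woman")

print(match_original, match_rotated)
```

The offsets still match after the rotation, even though no single component encodes 'royal' any more. With trained embeddings the equality only holds approximately, so in practice people usually look at the nearest neighbours of king - man + woman instead of testing exact equality.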
There's an entertaining investigation of this applied to Game of Thrones at https://towardsdatascience.com/game-of-thrones-word-embeddin...!
Unlike PCA, it assumes meaningful "latent" factors (such as "royalty" above) and tries to find a rotation which best loads these onto the data. To achieve this, it doesn't attempt to encode the data perfectly but leaves room for error in the reduction to factors.
For implementation, I am surprised it leaves out https://adoni.github.io/2017/11/08/word2vec-pytorch/. There are many others, including in NumPy and TF, but I find the PyTorch one the most straightforward and didactic, by a large margin.
This is true, but the clustering of points in space is not. So while the choice of axes is arbitrary, it becomes nonarbitrary if you're trying to choose the axes in such a way as to represent the clustering of points. This is why you end up with different rotations in factor analysis, because of different definitions of how to best represent clusterings.
I think there are some ties here to compressed sensing, but that's getting a little tangential. My main point is that while it's true that the default word2vec embedding may lack meaning, if you define "meaning" in some way (even if in terms of information loss) you can rotate it to a meaningful embedding.
Instead, you can:
- rotate it with SVD (works really well, when working on a subset of words)
- project it on given axes (e.g. "woman - man" and "king - man")
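A minimal sketch of both options in NumPy, under some assumptions: `E` is a random stand-in for a subset of embedding rows, rows 0 and 1 stand in for 'man' and 'woman', and `project` is a hypothetical helper. With real vectors you would substitute the actual rows.

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(50, 8))  # stand-in for a subset of word vectors

# 1) Rotate with SVD: the right singular vectors form an orthogonal basis,
#    so E @ Vt.T only rotates/reflects the space. Inner products and cosine
#    similarities are unchanged; the new axes are ordered by how much of
#    the data they carry.
U, S, Vt = np.linalg.svd(E, full_matrices=False)
rotated = E @ Vt.T

# 2) Project on a given axis, e.g. a difference of two word vectors.
def project(vectors, axis):
    """Scalar coordinate of every row along the direction of `axis`."""
    unit = axis / np.linalg.norm(axis)
    return vectors @ unit

axis = E[1] - E[0]  # stand-in for "woman - man"
scores = project(E, axis)

print(np.round((rotated ** 2).sum(axis=0), 2))  # energy per axis, descending
print(np.round(scores[:5], 3))
```

The SVD route gives axes ranked by how much of the data they explain, while the projection route gives axes you chose yourself, which is usually easier to interpret.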
[1] https://nlpprogress.com/english/word_sense_disambiguation.ht...
Edit: formatting