My HN submission of this endeavour received no love, but I think it's worthwhile nevertheless as the Python code is not only more concise, readable and extendable, but the training's actually faster too [2].
[1] https://github.com/piskvorky/gensim/blob/develop/gensim/mode...
[2] http://radimrehurek.com/2013/09/word2vec-in-python-part-two-...
That is some amazing work, thanks!
Mikolov said that he hoped word2vec would "significantly advance the state of the art" of NLP, but really the state of the art can only advance when people can understand and manipulate the code. You're making that possible. Thank you.
Are there really only 1000 independent concepts in the English language?
So with 1000 continuous dimensions (typically values between -1 and 1, stored as 32-bit floats) you can encode quite a lot of concepts and their nuances.
Note: the default dimensionality of word2vec is 100, not 1000. Apparently you can get better results with dim=300 and a very large training corpus. To benefit from higher dimensions you need both more CPU time to reach convergence and a lot more data to make use of the added model capacity.
FWIW, 26^5 < 2^24, so even a 1000-bit binary vector, with its 2^1000 possible states, has an expressive space more than 2^976 times larger than 26^5 (roughly all possible words up to 5 letters).
But yeah, the continuous dimensions can hide many more binary dimensions.
For example, 4-D rgba can be smashed into 1 continuous (or 64-bit) dimension, but that feels a bit like cheating.
So calling it 1000 64-bit dimensions feels like a bit of a trick: it's really 64000 1-bit dimensions.
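A quick sanity check of the counting argument in this sub-thread, for anyone who wants to play with the numbers. This is only a back-of-the-envelope sketch: the helper name `words_up_to` is made up for illustration, and "words" here means every possible letter string, not real English words.

```python
import math

def words_up_to(n, alphabet=26):
    """Count every letter string of length 1..n over the given alphabet."""
    return sum(alphabet ** k for k in range(1, n + 1))

short_words = words_up_to(5)                    # all strings of 1 to 5 letters
bits_for_short_words = math.log2(short_words)   # bits needed to index them

# A 1000-bit binary vector has 2**1000 states; compare in log2 space
# so we never have to print a 300-digit number.
ratio_log2 = 1000 - bits_for_short_words

# Raw bit budget of a 1000-dimensional float vector.
bits_float32 = 1000 * 32
bits_float64 = 1000 * 64

print(f"strings of up to 5 letters: {short_words}")
print(f"bits needed to index them:  {bits_for_short_words:.2f}")
print(f"2^1000 is about 2^{ratio_log2:.0f} times larger")
print(f"1000 float32 dims = {bits_float32} bits, float64 = {bits_float64} bits")
```

The raw bit count is an upper bound on capacity, of course; how much of it a trained model actually uses is a separate question.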
And the paper itself is a very worthwhile read: http://arxiv.org/abs/1301.3781
->math shopping reading science
I think shopping doesnt belong in this list!
->rain snow sleet sun
I think sun doesnt belong in this list!
etc.
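For the curious, here is one way an "odd one out" query like the ones above could be computed from word vectors. It's a minimal sketch and not necessarily what produced that output: the 3-d vectors are invented toy values (real word2vec vectors have 100+ dimensions and come from training), and `odd_one_out` is a hypothetical helper. The idea is to unit-normalise the vectors, average them, and pick the word least similar to that average.

```python
import numpy as np

def odd_one_out(words, vectors):
    """Return the word whose vector is least similar to the group's mean."""
    mat = np.array([vectors[w] for w in words], dtype=np.float32)
    mat /= np.linalg.norm(mat, axis=1, keepdims=True)  # unit length per row
    mean = mat.mean(axis=0)                            # centre of the group
    sims = mat @ mean                                  # similarity to the centre
    return words[int(np.argmin(sims))]

# Toy vectors, made up purely for illustration.
toy_vectors = {
    "rain":  [0.90, 0.10, 0.00],
    "snow":  [0.80, 0.20, 0.10],
    "sleet": [0.85, 0.15, 0.05],
    "sun":   [0.10, 0.90, 0.20],
}

words = ["rain", "snow", "sleet", "sun"]
result = odd_one_out(words, toy_vectors)
print(f"I think {result} doesn't belong in this list!")
```

With a trained model you would pass in its real vectors instead of `toy_vectors`; I believe gensim exposes a similar query as `doesnt_match`, but check its docs rather than taking my word for the details.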