word2vec in yhat: Word vector similarity

(danielfrg.github.io)

59 points · dfrodriguez143 · 12y ago · 15 comments

15 comments · 6 top-level

judk · 12y ago · 4 in thread

Word2vec seemed intuitively obvious to me, but I really have a hard time believing that it works in only 1000 dimensions, generating results beyond cherry-picked demo examples.

Are there really only 1000 independent concepts in the English language?

ogrisel · 12y ago

No, but with n binary dimensions (each with value 0 or 1) you can encode 2^n unique identifiers.

So with 1000 continuous dimensions (typically values between -1 and 1 coded on 32 bit floats) you can encode quite a bunch of concepts and their nuances.

Note: the default dimensionality of word2vec is 100, not 1000. Apparently you can get better results with dim=300 and a very large training corpus. To benefit from higher dimensions you need more CPU time to reach convergence and a lot more data to make use of the added model capacity.

gojomo · 12y ago

I'm still impressed it only takes 26 letters, in words of average size around 5! By comparison, 1000 continuous dimensions seem positively resplendent with expressiveness.

FWIW, 2^24 > 26^5, so even the binary vector space of 2^1000 is more than 2^976 times larger than 26^5 (roughly all possible words up to 5 letters).
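
A quick sketch to check the sizes being compared here. It counts letter strings over a 26-letter alphabet (both exactly 5 letters and up to 5 letters) and expresses them as powers of two. The variable names are just for illustration; the result is that 26^5 sits between 2^23 and 2^24, so the ratio to 2^1000 is close to 2^976.

```python
import math

ALPHABET = 26   # letters available
MAX_LEN = 5     # word length used in the comment

# Strings of exactly 5 letters, and strings of 1 to 5 letters.
exactly_five = ALPHABET ** MAX_LEN
up_to_five = sum(ALPHABET ** k for k in range(1, MAX_LEN + 1))

# Size of each set expressed in bits (log base 2).
bits_exactly = math.log2(exactly_five)
bits_up_to = math.log2(up_to_five)

# Exponent of the ratio 2^1000 / (number of words up to 5 letters).
ratio_exponent = 1000 - bits_up_to

print(exactly_five, up_to_five)
print(round(bits_exactly, 2), round(bits_up_to, 2))
print(round(ratio_exponent, 1))
```

Python integers are arbitrary precision, so the counts are exact; only the log2 values are floating point.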

judk · 12y ago

Yes, but there are exponentially more concepts than words. The words we have are a sparse set of labels for particularly relevant combinations.

But yeah, the continuous dimensions can hide many more binary dimensions.

For example, 4-D rgba can be smashed into 1 continuous (or 64-bit) dimension, but that feels a bit like cheating.

So it sort of feels like "1000 64-bit dimensions" is a tricky name. It's really 64000 1-bit dimensions.
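
To make the "smashing" concrete, here is a minimal sketch that packs four colour channels into a single integer and recovers them. It assumes 8 bits per channel (an assumption, not something stated above), so RGBA fits in 32 bits and comfortably inside one 64-bit value. `pack_rgba` and `unpack_rgba` are hypothetical helper names.

```python
def pack_rgba(r, g, b, a):
    """Pack four 8-bit channels into one integer (r in the top byte)."""
    for channel in (r, g, b, a):
        if not 0 <= channel <= 255:
            raise ValueError("each channel must be in 0..255")
    return (r << 24) | (g << 16) | (b << 8) | a


def unpack_rgba(packed):
    """Recover the four 8-bit channels from a packed integer."""
    return (
        (packed >> 24) & 0xFF,
        (packed >> 16) & 0xFF,
        (packed >> 8) & 0xFF,
        packed & 0xFF,
    )


packed = pack_rgba(12, 34, 56, 78)
channels = unpack_rgba(packed)
print(hex(packed), channels)

# Raw bit capacity of 1000 dimensions stored as 64-bit values.
total_bits = 1000 * 64
print(total_bits)
```

The round trip is lossless, which is why one wide dimension can stand in for several narrow ones: the raw capacity is set by the total bit count, not by the number of named dimensions.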

IanCal · 12y ago

I wouldn't be surprised if you could cover most basic English with 1000 concepts. That would give a lot of combinations.

Radim · 12y ago · 3 in thread

For people interested in a cleaned-up, commented and de-obfuscated word2vec, I recently ported the original C code to Python [1].

My HN submission of this endeavour received no love, but I think it's worthwhile nevertheless as the Python code is not only more concise, readable and extendable, but the training's actually faster too [2].

[1] https://github.com/piskvorky/gensim/blob/develop/gensim/mode...

[2] http://radimrehurek.com/2013/09/word2vec-in-python-part-two-...

dfrodriguez143 (OP) · 12y ago

Your submission receives no love but my one afternoon hack does... oh the humanity... lol

That is some amazing work, thanks!

bowyakka · 12y ago

It's sad you didn't get the love on your submission; your changes are very neat and having word2vec inside gensim feels like a really awesome feature.

rspeer · 12y ago

Well done!

Mikolov said that he hoped word2vec would "significantly advance the state of the art" of NLP, but really the state of the art can only advance when people can understand and manipulate the code. You're making that possible. Thank you.

gojomo · 12y ago · 2 in thread

Eventually computers will be talking about us behind our backs in these high-dimensional vectors, only occasionally translating down to English approximations, to humor us. "Goo goo, gah gah, human?"

seiji · 12y ago

Have you read the [Message Contains No Recognizable Symbols] series? It's pretty great: http://www.ssec.wisc.edu/~billh/g/mcnrs.html

gojomo · 12y ago

Haven't but will check it out, thanks!

3JPLW · 12y ago

Very cool. I missed the original word2vec software discussion back in August: https://news.ycombinator.com/item?id=6216044

And the paper itself is a very worthwhile read: http://arxiv.org/abs/1301.3781

dhammack · 12y ago

The vectors learned from word2vec are pretty amazing. A few days after the tool was released I wrote a script which uses the vector representations to figure out which word in a list isn't like the others [1]. Things like:

->math shopping reading science

I think shopping doesnt belong in this list!

->rain snow sleet sun

I think sun doesnt belong in this list!

etc.

[1] https://github.com/dhammack/Word2VecExample
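
For readers curious how such an "odd one out" check can work, here is a minimal sketch. It is not the script from the repository above: it assumes each word already has a vector, uses hand-made 3-dimensional toy vectors (`TOY_VECTORS`, invented purely for illustration) instead of trained word2vec vectors, and picks the word whose mean cosine similarity to the others is lowest.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def odd_one_out(words, vectors):
    """Return the word least similar, on average, to the other words."""
    def mean_similarity(word):
        others = [w for w in words if w != word]
        total = sum(cosine(vectors[word], vectors[o]) for o in others)
        return total / len(others)

    return min(words, key=mean_similarity)


# Hypothetical toy vectors; real word2vec vectors have 100+ dimensions.
TOY_VECTORS = {
    "rain": (0.9, 0.1, 0.0),
    "snow": (0.8, 0.2, 0.1),
    "sleet": (0.85, 0.15, 0.05),
    "sun": (0.1, 0.9, 0.2),
}

result = odd_one_out(["rain", "snow", "sleet", "sun"], TOY_VECTORS)
print(f"I think {result} doesn't belong in this list!")
```

With trained vectors the same function would be called with a lookup from word to vector in place of `TOY_VECTORS`; comparing against the mean of the other vectors rather than averaging pairwise similarities is a reasonable alternative design.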

gojomo · 12y ago

Cool web demo powered by word2vec, by Christopher Moody:

http://thisplusthat.me/
