Facebook's InferSent[1] has worked reasonably well for me on a variety of sentence-level tasks, but I don't have anything concrete I can point to showing that it's substantially better than averaging word embeddings.
More options are good.
(Also, is Kurzweil part of Google Brain or separate? He doesn't really have any background in NLP, does he?)
From Wikipedia: "Raymond "Ray" Kurzweil (/ˈkɜːrzwaɪl/ KURZ-wyl; born February 12, 1948) is an American author, computer scientist, inventor and futurist. Aside from futurism, he is involved in fields such as optical character recognition (OCR), text-to-speech synthesis, speech recognition technology, and electronic keyboard instruments.... Kurzweil was the principal inventor of... the first print-to-speech reading machine for the blind,[3] the first commercial text-to-speech synthesizer,[4]... and the first commercially marketed large-vocabulary speech recognition."
He's been in the general space of NLP for quite a while.
The reason people want better representations is for the applications where simpler ones fall short. For example, bag-of-words doesn't capture agreement vs. disagreement well, whereas better representations can.
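To make that concrete, here's a toy illustration (my own example, not from the paper): two sentences with opposite meanings can produce the exact same bag-of-words vector, because word counts ignore order.

```python
# Toy example: bag-of-words counts are identical for two sentences
# that mean opposite things, since only word frequencies are kept.
from collections import Counter

a = "the movie was good not bad"
b = "the movie was bad not good"

print(Counter(a.split()) == Counter(b.split()))  # True: same bag, opposite meaning
```

Any model that only sees those count vectors literally cannot distinguish the two.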
2. "by Ray Kurzweil's Team": although accurate, I find that kind of fetishization of certain stars pretty insulting to the other authors. We already have a convention for this, and it's "Cer et al. (2018)".
Personally I think the idea of this paper is pretty good, but the evaluation is weak.
Awesome. Now what does all that mean in English?
Well, simply put:
[ccebb 677ce 28f77 86558 2d7cc d67b4 e8f31 8c393 ae867 13593 aa869 3c265],
[c0021 72510 cee7a 31580 554d3 d49a6 306b9 c1f2c 60c1a 1157c f44c8 31273],
[682f2 6a4df dc970 3c106 2107c 3dfd5 1506a 6f1b5 af428 829f8 11d06 797dc],
[d6f84 25e73 76558 6feb0 c67d4 fcc73 b5c8d af4db 2f647 82247 852e7 fc010],
[f08a8 2ed8f c71bb 12043 5f0f9 190c8 f2ae8 7b30a 4a574 269d0 03be0 a363c],
[b38c2 10031 37ada 504a8 f2919 3b82b 258fc 5673f c939c a0ef1 46be5 a50d6],
[93fcd e19f7 0558f e01a6 8beb1 d54b9 9ad20 d6185 adf9b 876a1 a1a94 c9197],
[92b49 ed290 7a072 fdf1d a61a8 65124 a2025 27153 afa71 a27db 29a2a e5b47],
[2793f 7171f b18c9 e1945 d31d5 edb66 a1ee0 d9982 e8442 7795d bd4e4 30b41]

But, no. As it turns out, the very first problem you encounter when trying to implement ML on text is that you need to transform the text into some set of numbers (the "vectors"), with the number of elements in the set matching the number of nodes in your input layer.
This is a tricky thing to do. You're essentially trying to "hash" the text in a way which uniquely represents the text you're working with and also gives the neural net something it can operate on. Which is to say, you can't just use a common hashing algorithm, because the neural net won't be able to learn anything from the random output of the hashing algorithm.
There are several different approaches being used for this. One of them, mentioned elsethread, is "bag-of-words", where you build a big dictionary of word-to-number associations and then do some variety of transformations on that. Another is "feature extraction", where you might try to input a value representing properties like the length of the sentence, the number of words, the vocabulary level of the words, and so on. (This would probably be a bad approach for most ML goals on long text.)
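A minimal sketch of the bag-of-words idea mentioned above (my own toy code, not anything from the paper): build a word-to-index dictionary from a corpus, then map each sentence to a fixed-length count vector that a neural net's input layer could consume.

```python
# Minimal bag-of-words sketch: vocabulary -> fixed-length count vectors.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Build the word-to-index dictionary ("big dictionary of word-to-number
# associations" from the comment above).
vocab = {}
for sentence in corpus:
    for word in sentence.split():
        vocab.setdefault(word, len(vocab))

def bag_of_words(sentence):
    """Count vector over the vocabulary; out-of-vocabulary words are ignored."""
    vec = [0] * len(vocab)
    for word in sentence.split():
        if word in vocab:
            vec[vocab[word]] += 1
    return vec

# Vocab order here: the, cat, sat, on, mat, dog, log
print(bag_of_words("the cat sat on the mat"))  # [2, 1, 1, 1, 1, 0, 0]
```

Note the fixed vector length: every sentence, long or short, becomes a vector of `len(vocab)` numbers, which is what lets it line up with a fixed-size input layer.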
This paper presents another approach.
Singularity any day now
What is TF Hub? I assume it stands for TensorFlow Hub, but what is that?
Seems to be getting announced today at the TF Dev Summit this afternoon: https://www.tensorflow.org/dev-summit/schedule/
pip/GitHub links not yet activated: https://pypi.python.org/pypi/tensorflow-hub/0.1.0
If so, can someone explain how this project is related to NLP? Thanks!