Polyglot Word Embeddings Discover Language Clusters

(blog.shriphani.com)

27 points · shriphani · 6y ago · 5 comments

5 comments · 2 top-level

pattusk · 6y ago · 2 in thread

I read the title and got excited, thinking this would use embeddings to gather insights about language families. As in, if you ran k-means on the same corpus of n languages with k < n, how would, say, Finnish, Mongolian, Turkish and Japanese fall into the clusters? I'm also curious whether the results could be interpreted rigorously enough to draw scientifically valid linguistic conclusions.

Instead, it looks like this just performs language detection. Is there a significant advantage to this method over reusing one of the many existing open-source solutions based on simpler models, such as [1], and retraining it on a corpus that includes the language(s) that weren't supported? You offer a comparative table for FastText and GCP; how do you explain FastText's abysmal precision on English? The value seems far too low not to be a bug of some sort.

[1] https://code.google.com/archive/p/language-detection/
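To make the k < n experiment concrete, here is a minimal sketch of clustering document vectors from n = 4 languages into k = 2 groups and then checking which languages end up together. Everything in it is an assumption for illustration: the 2-D vectors are synthetic, the names `lang_A` to `lang_D` are placeholders rather than real languages, and the farthest-first initialisation is just a deterministic choice, not something taken from the post or the paper.

```python
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Lloyd's k-means with a deterministic farthest-first initialisation."""
    # Start from the first point, then repeatedly add the point that is
    # farthest from all centroids chosen so far (illustrative choice).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    assignment = []
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment

# Synthetic "document embeddings" for four placeholder languages (made-up values).
docs = [
    ("lang_A", (0.0, 0.0)), ("lang_A", (0.1, 0.1)),
    ("lang_B", (0.3, 0.2)), ("lang_B", (0.2, 0.3)),
    ("lang_C", (5.0, 5.0)), ("lang_C", (5.1, 4.9)),
    ("lang_D", (5.3, 5.2)), ("lang_D", (4.9, 5.3)),
]

labels = kmeans([vec for _, vec in docs], k=2)

# Majority cluster per language: languages sharing a cluster were "merged".
votes = {}
for (lang, _), cluster in zip(docs, labels):
    votes.setdefault(lang, Counter())[cluster] += 1
cluster_of = {lang: counts.most_common(1)[0][0] for lang, counts in votes.items()}
print(cluster_of)
```

With real embeddings the interesting part would be which languages share a cluster once k drops below n; whether such a grouping says anything about language families is exactly the open question raised above, and this toy data cannot answer it.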

shriphani (OP) · 6y ago

In the Indian subcontinent, most native content features a lot of code-mixing, so in a Bengali or Hindi document you'll see some English words for sure.

And the vast majority of native content is written in the Roman script (native script keyboards are poorly designed or unavailable, I suppose).

Thus a large chunk of content gets labeled as English. Granted, it won't be a high-confidence prediction, but it is still the label the model produces.

A corpus from the subcontinent can contain 10 to 12 languages. Take Rohingya, for instance: it is nearly impossible to find an annotator who speaks it. Being able to extract a monolingual corpus with zero annotation is quite useful.
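The point about low-confidence English labels can be sketched as follows. This is a hypothetical illustration, not FastText's or GCP's API: the `predictions` tuples, their confidence values, the `unknown` label, and the 0.8 threshold are all made up to show how a top-1 label is still emitted unless something downstream rejects it.

```python
def filter_by_confidence(predictions, threshold=0.8):
    """Keep a predicted label only if its confidence clears the threshold.

    `predictions` is a list of (text, label, confidence) tuples; anything
    below the threshold is relabeled "unknown" instead of trusting the label.
    """
    return [
        (text, label if confidence >= threshold else "unknown")
        for text, label, confidence in predictions
    ]

# Hypothetical detector output; the confidences are invented for illustration.
predictions = [
    ("the weather is nice today", "en", 0.98),
    # Romanized, code-mixed text that a detector might still tag as English.
    ("aaj mausam bahut accha hai yaar, weekend plan kya hai", "en", 0.41),
]

# Taking the top label as-is counts both documents as English.
raw_labels = [label for _, label, _ in predictions]

# Applying a threshold keeps only the confident prediction.
filtered = filter_by_confidence(predictions)
filtered_labels = [label for _, label in filtered]

print(raw_labels)
print(filtered_labels)
```

A threshold like this only discards the doubtful documents; it does not say which language they are actually in, which is the gap the annotation-free approach is aimed at.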

gumby · 6y ago

I misunderstood this the same way, but the work is still interesting. I'm also glad to see more South Asian language support, as some languages (e.g. Marathi), while spoken by a lot of people, aren't well supported by e.g. Google yet.

The multi-language detection is pretty important too; I have a cousin who posts on FB in a mixture of English, Malay, Marathi and Bengali (the last of which I don't understand at all), mostly in the Latin alphabet. My kid also switches languages mid-sentence. It's pretty common.

nl · 6y ago · 1 in thread

This is nice, but the blog post should point out that FastText has language identification built in[1].

The authors knew this, because the paper compares against it, but the post doesn't call it out!

Edit: just realised the link on popular "open source" goes to the FastText post I linked below. Still - I think it would have been good to explicitly note this!

[1] https://fasttext.cc/blog/2017/10/02/blog-post.html

shriphani (OP) · 6y ago

Sorry about that, I'll edit the post with an explicit mention right away.
