Instead it looks like this just performs language detection. Is there a significant advantage to that method as opposed to just reusing one of the many existing open source solutions based on simpler models, such as [1], and retraining them with a corpus that includes the language(s) that weren't supported? You offer a comparative table for FastText & GCP; how do you explain FastText's abysmal precision on English? The value just seems way too low to not be a bug of some sort.
And the vast majority of native content is written in the Roman script (native-script keyboards are poorly designed or unavailable, I suppose).
Thus a large chunk of content gets tossed in as English - granted it won't be a high confidence prediction but it still is the produced label.
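To make the arithmetic concrete, here is a toy sketch of how that drags English precision down. The counts are made up for illustration (they are not numbers from the paper or the post), and `precision` is just a hypothetical helper, not anything from FastText:

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of items labelled with a class that really belong to it."""
    predicted = true_positives + false_positives
    return true_positives / predicted if predicted else 0.0


# Hypothetical counts: everything below was labelled "en" by the detector.
actually_english = 200       # genuinely English text
romanized_native = 800       # native-language text typed in Roman script

# Every romanized item that falls into the English bucket is a false positive.
english_precision = precision(actually_english, romanized_native)
print(english_precision)  # 0.2
```

So a low English precision doesn't have to be a bug; it only takes a corpus where most of the Roman-script text isn't English.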
Corpora in the subcontinent can span 10 - 12 languages. Take the Rohingya language, for instance: it is nearly impossible to find an annotator who speaks it. Getting a monolingual corpus out with zero annotation is quite useful.
The multi-language detection is pretty important too; I have a cousin who posts on FB in a mixture of English, Malay, Marathi and Bengali (the last of which I don't understand at all), mostly in the Latin alphabet. My kid also switches languages mid-sentence. It's pretty common.
The authors knew this, because the paper compares against it, but the post doesn't call it out!
Edit: just realised the link on popular "open source" goes to the FastText post I linked below. Still - I think it would have been good to explicitly note this!