That’s exactly right. There’s too much of a bias in society toward thinking that if something isn’t perfect, why bother? Nothing is perfect, so with that attitude there can be no progress. Thank you for doing important work!
I speak a language that I've never seen any machine translation for... and when she translates it manually, my mum totally butchers the meaning lol.
Either way, any work in this area is more than welcome, but damn it's a hard problem.
As well as in the research paper: https://research.facebook.com/publications/no-language-left-...
We're at a point where it's now possible to determine the shape of every language, provided there are enough speakers of the language left who are both able and willing to help.
<Snark> Once done, Facebook can then commodify their dissent, and sell it back to them in their native language. </Snark>
It seems to me that the happy accident of doing this research at the start of getting all human knowledge digitized is part of the unreasonable effectiveness of this overall technique.
Had it happened in 200 years, it might not have worked, right?
>Translating Wikipedia for everyone
Hmmm.
While there is very definitely utility in doing things like this, I do kinda fear the "poisoning the well" effects of feeding (even partially) AI-generated data into extremely common AI data sources.
There's some info on it in a blog post[1] and the MediaWiki "Content translation" page[2], but does anyone know of any studies on the quality of the translations produced? I can absolutely see it being a huge time-saver for people who are essentially fluent in both (there's a lot of semi-mechanical drudgery in translating stuff like this that could be mostly eliminated)... but people are pretty darn good at choosing the easy option of trusting whatever they're given rather than being as careful as they should be. It kinda feels like it runs the risk of passively encouraging people to trust the machine's choice over their own, as long as it isn't obviously nonsense, and the cumulative effect could be rather large after a while.
[1]: https://diff.wikimedia.org/2021/11/16/content-translation-to...
[2]: https://www.mediawiki.org/wiki/Content_translation
A machine-translated Wikipedia would not be a trustworthy source of information at all, yet would look like one. I think that does significantly more harm than good.
[0] Suggestions for better alternatives welcomed.
(As an example, it would be absurd for Lithuanian Wikipedia to include sources in Japanese; that would be neither usable nor useful for the Wikipedia readers and editors...)
I currently host the largest collection of bilingual Manx[0] <-> English texts (~1MM words). How would I formally get in contact to chat about the steps to make machine translation available (and would there be grant opportunities available for further production of machine-readable data?)
Regarding grants: we have offered compute grants previously with the Workshop on Machine Translation (last year: https://www.statmt.org/wmt21/flores-compute-grants.html, this year: https://statmt.org/wmt22/large-scale-multilingual-translatio...) and we have an RFP, but it's currently focused on African languages: https://ai.facebook.com/research/request-for-proposals/trans...
I am a former professional translator (Japanese to English) and am now supervising research at the University of Tokyo on the use of machine translation in second-language education. As I have written in a few papers and essays [1], advances in MT have raised serious questions for language teachers. The ready availability of MT today, including on Facebook and Instagram, means that language students use it a lot while studying. We don’t know yet, though, how that use of MT might affect our students’ acquisition of other languages or their motivation to keep studying those languages.
One of the hurdles educators and researchers face is finding out how MT is being used in the real world. Most education in modern languages is focused on giving students language skills that they will be able to use later in work, education, and daily life, and textbooks and other learning materials are typically shaped around real-world situations. We are now struggling to adapt those materials for the age of MT, because data on the actual use of MT is very hard to get.
Like Google, Microsoft, Baidu, DeepL, and others, you at Meta must have huge amounts of data on how your users are using MT to communicate. Any information and insights about that MT usage that you can share with the world, just as you have generously shared your NLLB models, would be most welcome.
When I learned Spanish, I spent a lot of time chatting on Facebook with native speakers, and using Google Translate as "training wheels" to help me formulate sentences, and understand words and phrases I hadn't learned yet. It worked pretty well at the time (2012) except in cases of slang and typos that Google couldn't handle. I also used it a lot to help me translate blog posts from English to Spanish. Eventually, I graduated from the training wheels and was able to use Spanish fluently without the help of MT. More than once, while not using MT, I was told that I spoke Spanish with a "Google Translate accent", which I'm sure was more of a reference to my grammar than my accent, since my spoken practice was 100% with native speakers.
When I learned Hungarian (2019-now), at the beginning, Google Translate wasn't good enough to use for much more than getting a rough understanding of formal text, so I learned in a more traditional way, at a school and with native speakers. Then the pandemic prevented me from doing both of those. I started chatting with native speakers on Facebook, but it was very difficult without MT and involved a lot of asking my conversation partners for translations and explanations. Progress was frustratingly slow. Then I discovered DeepL's MT, which was extremely good with Hungarian. I started using it for chat conversations and emails, and people were shocked that I was managing to communicate with them so fluently. My progress in actually learning the language for myself accelerated dramatically. I've become conversational (B2/C1) in Hungarian in 2.5 years with very little in-person practice. Often it takes native English speakers 5 years of in-person practice to reach that level. I'm convinced that MT played a key role in my ability to learn quickly.
When I use MT, I have a simple rule: I have to understand each word of a translation before I send it. So I carefully read the translation, making sure that I understand each word. Sometimes that means I have to look up individual words or grammar before sending a message (I often use Wiktionary for that, because it shows etymology), and other times it means that I'll replace unfamiliar words or phrases in a translation with words and phrases from my own vocabulary. Over time I rely on MT less and less because my own vocabulary becomes stronger. I really believe that the key to learning a language quickly is to start USING the language as quickly as possible. Once you're using a language, your brain automatically starts picking up the skill. With traditional language learning, using a language can be very difficult in the beginning until you've reached a conversational level, but with MT, you can start using a language before you know everything.
For Spanish, I almost never use MT anymore. Sometimes I use it as a quick dictionary for an unfamiliar word, but my Spanish level is C2 and I use Spanish every day so it feels natural. I'm not ever translating in my head anymore.
For Hungarian, I'm still using MT often, but I don't need it during conversation (either written or spoken). Besides using it to translate things I don't know, I also find it useful for inputting Hungarian characters that are a pain to type with my US keyboard, and for conjugating words correctly when I know the root but am struggling for the correct ending. Often I'll know what I want to say in Hungarian, but I'll open DeepL and type in English, then adjust the translation to use the words I want before I copy and paste the Hungarian. I'm essentially using MT as a guide to help me craft my sentences even when I know what I want to say.
In summary, MT is awesome for language learning and for assisting language skills that are still developing.
The full list of currently supported languages can be found here: https://github.com/facebookresearch/flores/tree/main/flores2...
How far off are my intuitions on this? What are the costs of adding a new language to a model like this? Is there a ballpark dollar amount per language?
Importantly, models can learn from other languages that are similar. If we train a separate model for each direction on small amounts of data, the performance is significantly worse than grouping languages into one large multilingual model.
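For intuition, here's a toy sketch (not Meta's actual training code) of the standard multilingual-NMT trick: tag every training example with language codes so a single model trains on all directions at once, letting related low-resource languages share parameters with high-resource ones. The tag format is illustrative; the language codes follow FLORES-200.

    # Toy sketch, not Meta's code: tag each example with language codes
    # so one model handles every direction and related languages share
    # parameters.
    def tag_example(src_text, src_lang, tgt_lang):
        # The model learns to condition on the tags.
        return f"__{src_lang}__ __{tgt_lang}__ {src_text}"

    print(tag_example("Szia, hogy vagy?", "hun_Latn", "eng_Latn"))
    print(tag_example("Hello, how are you?", "eng_Latn", "hun_Latn"))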
The problem is that it increases the risk of monoculture to 100%. Without language barriers, cultural diversity is lost, not gained, since you have winner-take-all effects[0]. Instead of helping revive languages, it'll make American ideas, mores, morality (Puritanism), philosophies, and political values more dominant worldwide.
To be clear, this will increase economic opportunity, but will inevitably kill cultural diversity.
Is your team considering or studying this?
[0]: https://www.sampaxinos.com/mental-models/the-winner-takes-al... (or see Taleb's works)
>The affinity of languages allows one common model to be trained for their translation. That is, "under the hood" of the translator, the same neural network translates into Russian from Yakut, Tatar, Chuvash, and other Turkic languages. This approach is called many-to-one, that is, "from many languages into one." It is a more versatile tool than the classic bilingual neural network. And most importantly, it is the many-to-one approach that makes it possible to use knowledge about the structure and vocabulary of the Turkic languages, learned from the rich material of Turkish or Tatar, to translate languages like Chuvash or Yakut, which are less "resource-rich" but no less important for the cultural diversity of the planet.
>In order to create a unified model for translating Turkic languages, Yandex developed a synthetic common script. Every Turkic language is transliterated into it, so that, for example, the Tatar "дүрт" ("four"), written in Cyrillic, becomes similar to the Turkish dört ("four") not only to a human reader but also at the level of string similarity for a computer.
This way they added support for Turkic and Uralic languages, which are very underrepresented on the Internet. But I don't know what the quality of their translation is: even though I live in a region where Mari (an indigenous Uralic language) is spoken and my wife is Mari, neither of us, sadly, speaks the language.
[0] https://techno-yandex-ru.translate.goog/machine-translation/...
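To make the "synthetic common script" idea concrete, here is a toy sketch; Yandex's actual character mapping isn't public, so the table below is just an illustrative subset:

    # Toy illustration of a synthetic common script for Turkic languages:
    # map Cyrillic-written Tatar onto a Latin-based representation so that
    # cognates line up character-by-character with Turkish.
    TATAR_TO_COMMON = {
        "д": "d", "ү": "ü", "р": "r", "т": "t",
        "к": "k", "и": "i", "а": "a", "б": "b",
    }

    def to_common_script(text: str) -> str:
        # Unmapped characters pass through unchanged.
        return "".join(TATAR_TO_COMMON.get(ch, ch) for ch in text.lower())

    print(to_common_script("дүрт"))  # "dürt", one letter away from Turkish "dört"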
Regarding Mari: extremely interesting language, exciting to hear that you are from that region. We are interested in working on this one (likely in the "Hill Mari" variant), but currently do not support it.
So you have a language with some economic opportunity (a few million speakers in a fairly wealthy country) but no clearly defined written form, and an ambivalent attitude of many speakers towards the very idea of writing the language.
As an aside, the question of how to think about language standardization is really complex. We wrote some thoughts in Appendix A of our paper: https://research.facebook.com/publications/no-language-left-...
If the language is still spoken, then with enough budget a library of voices could be recorded. I think you’d prefer that collection to be gathered and maintained by a non-profit rather than by Meta.
The Facebook paper has some direct comparison to that work.
Some numbers (though you cannot exactly derive such aggregate figures from them): https://en.wikipedia.org/wiki/List_of_languages_by_total_num...
Some more numbers from here: https://www.sciencedirect.com/science/article/pii/S016763931...
"96% of the world’s languages are spoken by only 4% of its people."
Although this statement is more about the long tail of the roughly 7,000 languages.
For instance, Icelandic is not supported by any modern smartphone platform, which has led to Icelandic natives communicating with each other in English, and very little information is translated into Icelandic[1,2].
That being said, I am worried that translations that are "too good" could also accelerate language death, as keeping languages alive will seem less important (to non-language-nerds) if we can translate works written in a language into any other language from very small datasets. Luckily, I'm not convinced that AI models will be able to produce convincing and consistent translations for a long time; languages differ in so many ways that I can't see how adding more dimensions and parameters to a model would account for them.
[1]: https://youtu.be/qYlmFfsyLMo?t=141 [2]: https://www.nytimes.com/2017/04/22/world/europe/iceland-icel...
Furthermore, Facebook considers the internet to consist of Facebook and Wikipedia (Zero).
I view this as just another extension of their Next Billion initiative, an effort to ensure that another billion people are monopolised by Facebook.
That's the payoff.
"Skills required: United Nations translators are required to have a perfect command of their main language and an excellent knowledge of, in most cases, two other official languages"
On one hand, it makes their work easier: they can focus on correcting the AI-produced text and on the author's meaning, while eliminating lots of plumbing.
On the other hand, it has increased the amount of business, because far more text is translated now than at any other point in history, and those translations require validation in most business, legal, and even personal contexts. Without AI translation, most of them would not have happened in the first place.
Full story: https://abcnews.go.com/Business/wireStory/kill-facebook-fail...
These were submitted to test Facebook's systems, because there's a good reason not to trust their promises on this front. Facebook was used extensively to propagate hate speech in Myanmar during the crisis of 2017, with their moderation tools and hate speech detection system letting through a ton of hateful content with real-world consequences, in the course of an actual ethnic cleansing campaign.
Other references: "Facebook Admits It Was Used to Incite Violence in Myanmar" https://www.nytimes.com/2018/11/06/technology/myanmar-facebo... (2018)
"Violent hate speech continues to thrive on Facebook in Myanmar, AP report finds" https://www.cbsnews.com/news/myanmar-facebook-violent-hate-s... (9 months ago)
https://www.localizationlab.org/blog/2019/3/25/burmese-font-...
I see the mixture-of-experts model is ~300 GB and was trained on 256 GPUs.
I assume distilled versions can easily be run on one GPU.
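For anyone who wants to try: a minimal single-GPU sketch using the distilled checkpoints published on Hugging Face (assuming the facebook/nllb-200-distilled-600M variant and the transformers library; language codes follow FLORES-200):

    # Minimal sketch: run a distilled NLLB checkpoint on a single GPU.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    name = "facebook/nllb-200-distilled-600M"
    tokenizer = AutoTokenizer.from_pretrained(name, src_lang="eng_Latn")
    model = AutoModelForSeq2SeqLM.from_pretrained(name).to("cuda")

    inputs = tokenizer("The four books are on the table.", return_tensors="pt").to("cuda")
    out = model.generate(
        **inputs,
        # Force the decoder to start in the target language (Hungarian here).
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("hun_Latn"),
        max_length=64,
    )
    print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])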
So, to clarify, does this mean that companies cannot use these models in the course of business, or is it more about selling the translation results directly?
Also see Section 3, Table 1 of the paper: https://research.facebook.com/publications/no-language-left-...
We shrug off all the little quirks of machine-translated text because it usually gets the point across, and we recognize them as quirks because most of what we read was written by real people with no such quirks. But when most of what you read contains those quirks, I fear they will quickly become the standard way of writing and even speaking in those languages.
Point being, I'm not sure language purity is more valuable than functionally allowing people to interact with things they otherwise couldn't. Put another way, should we leave these people locked out of many online resources they can't read, for fear of corrupting their language? Give them the option and let them decide. Language evolves over time anyway.
In real-world instances (the proverbial 80%), it's more often transforming a 0.4 ("doesn't know much English") into a 0.7. And the people who get by with near-zero knowledge will usually have no critical need for translation, or will have access to other means (an actual translator, social help, etc.) when really needed.
My mental image is grandmas reading online news, for whom machine translation would be a blessing and a curse. Or grade-school kids looking for help on a topic; I'd wish they got more time with the original text to at least somewhat learn, rather than only a rough translation full of errors.
For interpersonal communication, people adjust, that’s what has been happening for centuries now.
I said nothing about purity, I said organic evolution, which this is an example of. If the actual speakers want to develop a pidgin, fine, I just think it should be a decision made by people and not models.
> These cookies are required to use Meta Products. They’re necessary for these sites to work as intended.
What cookies does Facebook "need" to serve a simple article?
But this is a shallow dismissal that doesn't add anything valuable to the discussion.
"Oh they made their _terrible_ (probably state of the art) machine translation _better_??! Those monsters!!"
(Edit: and speech-to-text models.)
Did the people at Meta think about the Signed Languages of the Deaf?
I didn't find a mention. Even Ctrl-F deaf didn't yield anything.
Understanding foreign culture is about reading automated translations of online comments into your native language. It has nothing to do with putting in the effort to learn a language and understand the nuances, current events, and issues of the culture it's embedded in.
The ESL (English as a single language) speakers over at Facebook don't even need to understand foreign cultures, because they already know everyone in the world needs to spend their lives staring into the Metaverse. So grateful that they are working on the world's fattest pipeline for exporting Anglophone culture to every corner of the planet!