That’s exactly right. There’s too much of a bias in society toward thinking that if something isn’t perfect, why bother? Nothing is perfect, so with that attitude there can be no progress. Thank you for doing important work!
I speak a language that I've never seen any machine translation for... and when she translates it manually, my mum totally butchers the meaning lol.
Either way, any work in this area is more than welcome, but damn it's a hard problem.
As well as in the research paper: https://research.facebook.com/publications/no-language-left-...
We're at a point where it's now possible to determine the shape of every language, provided there are enough speakers of the language left who are both able and willing to help.
<Snark> Once done, Facebook can then commodify their dissent, and sell it back to them in their native language. </Snark>
It seems to me that the happy accident of doing this research at the start of getting all human knowledge digitized is part of the unreasonable effectiveness of this overall technique.
Had it happened in 200 years, it might not have worked, right?
>Translating Wikipedia for everyone
Hmmm.
While there is very definitely utility in doing things like this, I do kinda fear the "poisoning the well" effects of feeding (even partially) AI-generated data into extremely common AI data sources.
There's some info on it in a blog post[1] and the MediaWiki "Content translation" page[2], but does anyone know of any studies on the quality of the translations produced? I can absolutely see it being a huge time-saver for people who are essentially fluent in both (there's a lot of semi-mechanical drudgery in translating stuff like this that could be mostly eliminated)... but people are pretty darn good at choosing the easy option of trusting whatever they're given rather than being as careful as they should be. It kinda feels like it runs the risk of passively encouraging people to trust the machine's choice over their own, as long as it isn't obviously nonsense, and the cumulative effect could be rather large after a while.
[1]: https://diff.wikimedia.org/2021/11/16/content-translation-to...
[2]: https://www.mediawiki.org/wiki/Content_translation
A machine-translated Wikipedia would not be a trustworthy source of information at all, yet would look like one. I think that does significantly more harm than good.
[0] Suggestions for better alternatives welcomed.
(As an example, it would be absurd for Lithuanian Wikipedia to include sources in Japanese; that would be neither usable nor useful for the Wikipedia readers and editors...)
I currently host the largest collection of bilingual Manx[0] <-> English texts (~1MM words). How would I formally get in contact to chat about the steps to make machine translation available (and would there be grant opportunities available for further production of machine-readable data?)
Regarding grants: we have offered compute grants previously with the Workshop on Machine Translation (last year: https://www.statmt.org/wmt21/flores-compute-grants.html, this year: https://statmt.org/wmt22/large-scale-multilingual-translatio...) and we have an RFP, but it's currently focused on African languages: https://ai.facebook.com/research/request-for-proposals/trans...
I am a former professional translator (Japanese to English) and am now supervising research at the University of Tokyo on the use of machine translation in second-language education. As I have written in a few papers and essays [1], advances in MT have raised serious questions for language teachers. The ready availability of MT today, including on Facebook and Instagram, means that language students use it a lot while studying. We don’t know yet, though, how that use of MT might affect our students’ acquisition of other languages or their motivation to keep studying those languages.
One of the hurdles educators and researchers face is finding out how MT is being used in the real world. Most education in modern languages is focused on giving students language skills that they will be able to use later in work, education, and daily life, and textbooks and other learning materials are typically shaped around real-world situations. We are now struggling to adapt those materials for the age of MT, because data on the actual use of MT is very hard to get.
Like Google, Microsoft, Baidu, DeepL, and others, you at Meta must have huge amounts of data on how your users are using MT to communicate. Any information and insights about that MT usage that you can share with the world, just as you have generously shared your NLLB models, would be most welcome.
When I learned Spanish, I spent a lot of time chatting on Facebook with native speakers, and using Google Translate as "training wheels" to help me formulate sentences, and understand words and phrases I hadn't learned yet. It worked pretty well at the time (2012) except in cases of slang and typos that Google couldn't handle. I also used it a lot to help me translate blog posts from English to Spanish. Eventually, I graduated from the training wheels and was able to use Spanish fluently without the help of MT. More than once, while not using MT, I was told that I spoke Spanish with a "Google Translate accent", which I'm sure was more of a reference to my grammar than my accent, since my spoken practice was 100% with native speakers.
When I learned Hungarian (2019-now), at the beginning, Google Translate wasn't good enough to use for much more than getting a rough understanding of formal text, so I learned in a more traditional way, at a school and with native speakers. Then the pandemic prevented me from doing both of those. I started chatting with native speakers on Facebook, but it was very difficult without MT and involved a lot of asking my conversation partners for translations and explanations. Progress was frustratingly slow. Then I discovered DeepL's MT, which was extremely good with Hungarian. I started using it for chat conversations and emails, and people were shocked that I was managing to communicate with them so fluently. My progress in actually learning the language for myself accelerated dramatically. I've become conversational (B2/C1) in Hungarian in 2.5 years with very little in-person practice. Often it takes native English speakers 5 years of in-person practice to reach that level. I'm convinced that MT played a key role in my ability to learn quickly.
When I use MT, I have a simple rule: I have to understand each word of a translation before I send it. So I carefully read the translation, making sure that I understand each word. Sometimes that means I have to look up individual words or grammar before sending a message (I often use Wiktionary for that, because it shows etymology), and other times it means that I'll replace unfamiliar words or phrases in a translation with words and phrases from my own vocabulary. Over time I rely on MT less and less because my own vocabulary becomes stronger. I really believe that the key to learning a language quickly is to start USING the language as quickly as possible. Once you're using a language, your brain automatically starts picking up the skill. With traditional language learning, using a language can be very difficult in the beginning until you've reached a conversational level, but with MT, you can start using a language before you know everything.
For Spanish, I almost never use MT anymore. Sometimes I use it as a quick dictionary for an unfamiliar word, but my Spanish level is C2 and I use Spanish every day so it feels natural. I'm not ever translating in my head anymore.
For Hungarian, I'm still using MT often, but I don't need it during conversation (either written or spoken). Besides using it to translate things I don't know, I also find it useful for inputting Hungarian characters that are a pain to type with my US keyboard, and for conjugating words correctly when I know the root but am struggling for the correct ending. Often I'll know what I want to say in Hungarian, but I'll open DeepL and type in English, then adjust the translation to use the words I want before I copy and paste the Hungarian. I'm essentially using MT as a guide to help me craft my sentences even when I know what I want to say.
In summary, MT is awesome for language learning and for assisting language skills that are still developing.
The full list of currently supported languages can be found here: https://github.com/facebookresearch/flores/tree/main/flores2...
How far off are my intuitions on this? What are the costs of adding a new language to a model like this? Is there a ballpark dollar amount per language?
Importantly, models can learn from other languages that are similar. If we train a separate model for each direction on small amounts of data, the performance is significantly worse than grouping languages into one large multilingual model.
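For intuition, here's a toy sketch (not Meta's actual training code) of the standard multilingual-NMT trick: tag every training example with language codes so a single model trains on all directions at once, letting related low-resource languages share parameters with high-resource ones. The tag format is illustrative; the language codes follow FLORES-200.

    # Toy sketch, not Meta's code: tag each example with language codes
    # so one model handles every direction and related languages share
    # parameters.
    def tag_example(src_text, src_lang, tgt_lang):
        # The model learns to condition on the tags.
        return f"__{src_lang}__ __{tgt_lang}__ {src_text}"

    print(tag_example("Szia, hogy vagy?", "hun_Latn", "eng_Latn"))
    print(tag_example("Hello, how are you?", "eng_Latn", "hun_Latn"))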
The problem is that it increases the risk of monoculture to 100%. Without language barriers, cultural diversity is lost, not gained, since you have winner-take-all effects[0]. Instead of helping revive languages, it'll make American ideas, mores, morality (Puritanism), philosophies, and political values more dominant worldwide.
To be clear, this will increase economic opportunity, but will inevitably kill cultural diversity.
Is your team considering or studying this?
[0]: https://www.sampaxinos.com/mental-models/the-winner-takes-al... (or see Taleb's works)
>The affinity of languages allows one common model to be trained for their translation. That is, "under the hood" of the translator, the same neural network translates into Russian from Yakut, Tatar, Chuvash, and other Turkic languages. This approach is called many-to-one, that is, "from many languages into one." It is a more versatile tool than the classic bilingual neural network. And most importantly, it is the many-to-one approach that makes it possible to use knowledge about the structure and vocabulary of the Turkic languages, learned from the rich material of Turkish or Tatar, to translate languages like Chuvash or Yakut, which are less "resource-rich" but no less important for the cultural diversity of the planet.
>In order to create a unified model for translating Turkic languages, Yandex developed a synthetic common script. Every Turkic language is transliterated into it, so that, for example, the Tatar "дүрт" ("four"), written in Cyrillic, becomes similar to the Turkish dört ("four") not only to a human reader but also at the level of string similarity for a computer.
This way they added support for Turkic and Uralic languages, which are very underrepresented on the Internet. But I don't know what the quality of their translation is: even though I live in a region where Mari (an indigenous Uralic language) is spoken and my wife is Mari, neither of us, sadly, speaks the language.
[0] https://techno-yandex-ru.translate.goog/machine-translation/...
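To make the "synthetic common script" idea concrete, here is a toy sketch; Yandex's actual character mapping isn't public, so the table below is just an illustrative subset:

    # Toy illustration of a synthetic common script for Turkic languages:
    # map Cyrillic-written Tatar onto a Latin-based representation so that
    # cognates line up character-by-character with Turkish.
    TATAR_TO_COMMON = {
        "д": "d", "ү": "ü", "р": "r", "т": "t",
        "к": "k", "и": "i", "а": "a", "б": "b",
    }

    def to_common_script(text: str) -> str:
        # Unmapped characters pass through unchanged.
        return "".join(TATAR_TO_COMMON.get(ch, ch) for ch in text.lower())

    print(to_common_script("дүрт"))  # "dürt", one letter away from Turkish "dört"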
Regarding Mari: extremely interesting language, exciting to hear that you are from that region. We are interested in working on this one (likely in the "Hill Mari" variant), but currently do not support it.
So you have a language with some economic opportunity (a few million speakers in a fairly wealthy country) but no clearly defined written form, and an ambivalent attitude of many speakers towards the very idea of writing the language.
As an aside, the question of how to think about language standardization is really complex. We wrote some thoughts in Appendix A of our paper: https://research.facebook.com/publications/no-language-left-...
If the language is still spoken, then with enough budget a library of voices could be recorded. I think you’d prefer that collection to be gathered and maintained by a non-profit rather than by Meta.
The Facebook paper has some direct comparison to that work.
Some numbers (though you cannot exactly derive such aggregate figures from them): https://en.wikipedia.org/wiki/List_of_languages_by_total_num...
Some more numbers from here: https://www.sciencedirect.com/science/article/pii/S016763931...
"96% of the world’s languages are spoken by only 4% of its people."
Although this statement is more about the long tail of the roughly 7,000 languages.
For instance, Icelandic is not supported by any modern smartphone platform, which has led to Icelandic natives communicating with each other in English, and very little information is translated into Icelandic[1,2].
That being said, I am worried that translations that are "too good" could also accelerate language death, as keeping languages alive will seem less important (to non-language-nerds) if we can translate works written in a language into any other language from very small datasets. Luckily, I'm not convinced that AI models will be able to produce convincing and consistent translations for a long time; languages differ in so many ways that I can't see how adding more dimensions and parameters to a model would account for them.
[1]: https://youtu.be/qYlmFfsyLMo?t=141 [2]: https://www.nytimes.com/2017/04/22/world/europe/iceland-icel...
Furthermore, Facebook considers the internet to consist of Facebook and Wikipedia (Zero).
I view this as just another extension of their Next Billion initiative, an effort to ensure that another billion people are monopolised by Facebook.
That's the payoff.
"Skills required: United Nations translators are required to have a perfect command of their main language and an excellent knowledge of, in most cases, two other official languages"
On one hand, it makes their work easier: they can focus on correcting the AI-produced text and on the author's meaning, while eliminating lots of plumbing.
On the other hand, it has increased the amount of business, because far more text is translated now than at any other point in history, and those translations require validation in most business, legal, and even personal contexts. Without AI translation, most of them would not have happened in the first place.
Full story: https://abcnews.go.com/Business/wireStory/kill-facebook-fail...
These were submitted to test Facebook's systems, because there's a good reason not to trust their promises on this front. Facebook was used extensively to propagate hate speech in Myanmar during the crisis of 2017, with their moderation tools and hate speech detection system letting through a ton of hateful content with real-world consequences, in the course of an actual ethnic cleansing campaign.
Other references: "Facebook Admits It Was Used to Incite Violence in Myanmar" https://www.nytimes.com/2018/11/06/technology/myanmar-facebo... (2018)
"Violent hate speech continues to thrive on Facebook in Myanmar, AP report finds" https://www.cbsnews.com/news/myanmar-facebook-violent-hate-s... (9 months ago)
https://www.localizationlab.org/blog/2019/3/25/burmese-font-...
I see the mixture-of-experts model is ~300 GB and was trained on 256 GPUs.
I assume distilled versions can easily be run on one GPU.
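For anyone who wants to try: a minimal single-GPU sketch using the distilled checkpoints published on Hugging Face (assuming the facebook/nllb-200-distilled-600M variant and the transformers library; language codes follow FLORES-200):

    # Minimal sketch: run a distilled NLLB checkpoint on a single GPU.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    name = "facebook/nllb-200-distilled-600M"
    tokenizer = AutoTokenizer.from_pretrained(name, src_lang="eng_Latn")
    model = AutoModelForSeq2SeqLM.from_pretrained(name).to("cuda")

    inputs = tokenizer("The four books are on the table.", return_tensors="pt").to("cuda")
    out = model.generate(
        **inputs,
        # Force the decoder to start in the target language (Hungarian here).
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("hun_Latn"),
        max_length=64,
    )
    print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])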
So, to clarify, does this mean that companies cannot use these models in the course of business, or is it more about selling the translation results directly?
Also see Section 3, Table 1 of the paper: https://research.facebook.com/publications/no-language-left-...
We shrug off all the little quirks of machine-translated text because it usually gets the point across, and we recognize them as quirks because most of what we read was written by real people with no such quirks. But when most of what you read contains those quirks, I fear they will quickly become the standard way of writing and even speaking in those languages.
Point being, I'm not sure language purity is more valuable than functionally allowing people to interact with things they otherwise couldn't. Put another way, should we leave these people locked out of many online resources they can't read, for fear of corrupting their language? Give them the option and let them decide. Language evolves over time anyway.
In real-world instances (the proverbial 80%), it's more often transforming a 0.4 ("doesn't know much English") into a 0.7. And the people who get by with near-zero knowledge will usually have no critical need for translation, or will have access to other means (an actual translator, social help, etc.) when really needed.
My mental image is grandmas reading online news, for whom machine translation would be a blessing and a curse. Or grade-school kids looking for help on a topic; I'd wish they got more time with the original text to at least somewhat learn, rather than only a rough translation full of errors.
For interpersonal communication, people adjust, that’s what has been happening for centuries now.
I said nothing about purity, I said organic evolution, which this is an example of. If the actual speakers want to develop a pidgin, fine, I just think it should be a decision made by people and not models.
> These cookies are required to use Meta Products. They’re necessary for these sites to work as intended.
What cookies does Facebook "need" to serve a simple article?
But this is a shallow dismissal that doesn't add anything valuable to the discussion.
"Oh they made their _terrible_ (probably state of the art) machine translation _better_??! Those monsters!!"
(Edit: and speech-to-text models.)
Did the people at Meta think about the Signed Languages of the Deaf?
I didn't find a mention. Even Ctrl-F deaf didn't yield anything.
Understanding foreign culture is about reading automated translations of online comments into your native language. It has nothing to do with putting in the effort to learn a language and understand the nuances, current events, and issues of the culture it's embedded in.
The ESL (English as a single language) speakers over at Facebook don't even need to understand foreign cultures, because they already know everyone in the world needs to spend their lives staring into the Metaverse. So grateful that they are working on the world's fattest pipeline for exporting Anglophone culture to every corner of the planet!