Omnilingual ASR: Advancing automatic speech recognition for 1600 languages

(ai.meta.com)

162 points · jean- · 7mo ago · 45 comments

HF Demo: https://huggingface.co/spaces/facebook/omniasr-transcription...

GitHub: https://github.com/facebookresearch/omnilingual-asr

41 comments · 15 top-level

____tom____ · 7mo ago · 6 in thread

What I really want to know is how well these could work for non-human languages. No, not aliens, but chimpanzees, dolphins, bonobos. We have hundreds or thousands of hours of recordings.

What would it take to start working on them?

benob · 7mo ago

Not tested on that particular model, but the idea has been flying around for some time: https://arxiv.org/abs/2509.04166v1

nshm · 7mo ago

You can check out this whale sound recognition project: https://arxiv.org/abs/2104.08614

netdevphoenix · 7mo ago

I think linguists don't deem animals to have languages, since you need human-level intelligence to use and understand some features of human languages, like communicating about things that are away from your current location in time and space. Animals have communication systems.

____tom____ · 7mo ago

I'm not asserting that bonobos, for example, have as complex a language as humans, just that it would be interesting to understand what language they do have.

"You haven't experienced Shakespeare until you've read him in the original Bonobo". :-)

akreal · 7mo ago

There is a dolphin language model project from Google and Georgia Tech: https://blog.google/technology/ai/dolphingemma/

____tom____ · 7mo ago

That's exactly the kind of thing I was hoping people were working on!

samat · 7mo ago · 5 in thread

How hard is it to make TTS out of this? A few independent journalists from Belarus asked for TTS in their language, but I am no expert; I was thinking about reusing Mozilla's work. What's the easiest way to get working TTS for a language?

woodson · 7mo ago

EDIT: My bad, please disregard; As akreal pointed out, the MMS TTS models aren’t using the SSL models.

Original post:

You can use the OmniASR SSL models instead of their older MMS models to create TTS models: https://github.com/ylacombe/finetune-hf-vits

akreal · 7mo ago

As far as I understand, the MMS TTS models are trained from scratch (section 7.1 of [1]); they do not employ any SSL models. So the OmniASR SSL models are not useful here.

What might be interesting is the newly released OmniASR data, because the MMS data, which was used for the MMS TTS, was never released.

Also, the OmniASR can be used to transcribe some untranscribed speech to train a TTS on it.

[1] MMS paper: https://arxiv.org/pdf/2305.13516

willwade · 7mo ago

Meta cheated with the MMS models. That is, they didn't use a phonemizer step. This means they just won't work, or will sound very strange. ASR data is usually not quite right for TTS. Anyhow, this doesn't really answer your question, but many of these languages are already done in MMS. Try them: https://huggingface.co/spaces/willwade/sherpa-onnx-tts

kulahan · 7mo ago

TFA says that it’s extremely easy to add new languages with just a few examples. I didn’t see specifics on how “few” it really is, though.

nl · 7mo ago

This is ASR not TTS though.

District5524 · 7mo ago · 3 in thread

I agree that this is very exciting and really crucial research, and I'm glad there is funding for it. But it's very strange that Hungarian is marked as "highly endangered" at https://aidemos.atmeta.com/omnilingualasr/language-globe

Highly endangered is supposed to mean "The language is used by grandparents and older generations; while the parent generation may still understand the language, they typically do not speak it to children or among themselves." Then why is Hungarian marked as such? That is obviously not true for a language with 14 million active speakers that ranks 20th by the amount of language resources published on the Internet.

Additionally, the feedback mechanism also seems to be broken ("There was an error submitting your feedback. Please try again.")

internet_points · 7mo ago

Finnish: "safe" – sounds right

South Estonian: "vulnerable" – sure, yeah

Karelian: "endangered" – seems correct

Swedish: also "endangered" – wat

Ghari (12k speakers): "safe" – :facepalm:

Are these really language-vulnerability ratings or did they just make a mapping from Trump's tariff rates?

yorwba · 7mo ago

The Ethnologue link in footnote 7 of the paper has utm_source=chatgpt.com at the end, so I suspect whoever was tasked with listing languages and determining their status thought this wasn't important enough to do it themselves and just had ChatGPT give them a list. FWIW, Ethnologue does say that Ghari is "Stable" https://www.ethnologue.com/language/gri/ Meanwhile Swedish is "Institutional," the highest possible level of vitality https://www.ethnologue.com/language/swe/

District5524 · 7mo ago

My new favourite mistake is Malayalam being highly endangered...

stuffoverflow · 7mo ago · 3 in thread

This seems like a massive improvement for openly available local ASR. Even the 300M model outperforms whisper-large-v3 according to the paper's benchmarks.

lostmsu · 7mo ago

Not sure, I recorded 3 seconds of voice (a single sentence) and the hf demo misrecognized about half of the words.

nshm · 7mo ago

This model is actually expected to be bad for popular languages. Just like the previous MMS, it is not accurate at all: it wins by supporting rare languages well, but it never had good ASR accuracy, even for Swedish etc. It is more a research thing than a real tool, unlike Whisper.

nshm · 7mo ago

Moreover, you cannot tune those models for practical applications. The model was originally trained on very clean data, so the lower layers are also not very stable for diverse inputs. To fine-tune, you have to update the whole model, not just the upper layers.

dSebastien · 7mo ago · 3 in thread

I'm going to test this with Voice AI to see how it works compared to Whisper and Parakeet

https://voice-ai.knowii.net

sipjca · 7mo ago

looks like a paid and closed source fork of the free and open source project Handy: https://github.com/cjpais/Handy

can't say for sure, but a lot of the UI (and text) is quite familiar. the history page is a near rip off which is a giveaway.

i believe the mit license should be distributed since it's almost certainly a derivative work.

"The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software."

I can't confirm if the license is in fact distributed, since I would have to pay $50, which quite frankly I'm not going to do.

a bit sad to see a ui reskin claimed as original work. the reskin is totally fine, but I believe the license must be distributed. i believe in the proliferation of this software so im happy to see this overall (it's good enough someone wants to charge for it! that's a big win!) but it's just a bit of a shame how this project has gone about it imo.

copypirate · 7mo ago

I thought it looked familiar! Looks like they only changed some of the UI/colors lol.

dSebastien · 7mo ago

I never claimed this was fully original work. My project indeed started as a fork of Handy. I've discussed this here: https://github.com/DeveloPassion/knowii-voice-ai-docs/issues... and https://www.knowii.net/c/announcements/knowii-voice-ai. I also mentioned it in the user documentation (FAQ page) and About page within the app.

I am trying to approach this with full transparency, honesty and respect for what the creator of Handy did. I'm not a grifter.

Please consider that my project is still very young. I didn't include the third-party licenses in my first few releases (I honestly didn't know this about the MIT license, my bad!), but will fix this asap with the next release (hopefully coming out in a few days), and I'll pull the previous releases to avoid distributing versions that don't include the licenses. I'll also add information about the other production dependencies that I'm using.

If you look at my announcement, you'll see that I'm being fully transparent about this and am not interested in cloning Handy at all. My code is already very distant from the initial version I started with and I'm exploring and building features that will probably never be included in Handy. For instance, my app's UI has been created from scratch (with a lot of inspiration from Handy), it is fully responsive and now works on Omarchy (Hyprland/Wayland), which Handy doesn't support at the moment. I have added various features for my own needs and for my first customers (e.g., . In the roadmap of the product, you can see some of the ideas I intend to develop.

I also intend to contribute back to Handy over time. I already have and will continue to do so.

ks2048 · 7mo ago · 2 in thread

Just killed my startup. https://6k.ai

Half joking - hopefully, we can still contribute something to this field. Looking forward to doing some tests with this.

internet_points · 7mo ago

what is the "Penguin" language?

Also, 1.6k < 6k, and I highly doubt this model is anywhere near as good as it is on EU languages for most of them.

ks2048 · 7mo ago

That's a dumb joke. Yes, I hope to look in detail at their performance on a couple of low-resource languages. Without lots of speakers and data, I think good metrics are hard to come by. I've found that with Meta's massively multilingual TTS, what looks impressive at first glance turns out to perform quite badly on smaller languages.

mcswell · 7mo ago · 1 in thread

First, let me say that this is impressive. And then let me pose some questions:

As a linguist, I would like to know more about the kinds of languages this works well with, or does not work well with. For example, half the world's languages are tone languages, and the way tones work varies greatly among these. Some just have high and low tones, while others are considerably more complicated; Thai has high, mid, low, rising and falling. Also, tone is relative, e.g. a man's high tone might be a woman's low tone. And some African languages have tones whose absolute frequencies vary across an utterance. So transcribing tone is a quite different problem from transcribing phonemes--and yet for many tone languages, the tone is crucial.

There are also rare(r) phonemes, like the clicks in many languages of southern Africa. Of course maybe they've already trained on some of these languages.

The HuggingFace demo says "Supported Languages[:] For this public demo, we've restricted transcription to low-resource languages with error rates below 10%." That's unclear: 10% word error rate, or character/ phoneme error rate? The meta.com page refers to character error rate (CER); a 10% character error rate can imply a much higher word error rate (WER), since most words contain several characters/ phonemes. That said, there are ways to get around that, like using a dictionary to select among different paths through possible character sequences so you only get known words, and adding to that a morphological parser for languages that have lots of affixes (meaning not all the word forms will be in the dictionary--think walk, walks, walked, walking--only the first will be in most dictionaries.)
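
To make the CER-versus-WER point concrete, here is a minimal sketch in Python. It is not taken from the paper or the demo: the sentence pair is made up, the function names are hypothetical, and the last helper assumes character errors are independent and evenly spread, which real ASR errors are not (they tend to cluster), so treat its result as a rough upper-end estimate.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: character edits divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def wer(ref: str, hyp: str) -> float:
    """Word error rate: word edits divided by number of reference words."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def word_error_estimate(char_error_rate: float, word_length: int) -> float:
    """Chance a word contains at least one wrong character.

    Assumption: character errors are independent (a simplification).
    """
    return 1.0 - (1.0 - char_error_rate) ** word_length


# Made-up example: one wrong character in each of two words.
reference = "the quick brown fox"
hypothesis = "the quack brown fax"

print(f"CER: {cer(reference, hypothesis):.3f}")
print(f"WER: {wer(reference, hypothesis):.3f}")
print(f"Estimated word error at 10% CER, 5-character words: "
      f"{word_error_estimate(0.10, 5):.3f}")
```

On this toy pair, two wrong characters out of 19 give a CER of about 10.5%, but two of the four words are wrong, so the WER is 50%. Under the independence assumption, a 10% CER on five-character words works out to roughly a 41% chance that any given word is wrong.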

Enquiring minds want to know!

aargh_aargh · 7mo ago

I'm not an expert but the rule of thumb is to expect something like this:

https://xkcd.com/1838/

cadamsdotcom · 7mo ago · 1 in thread

Only a few GB of weights will recognize speech in 1600+ languages.

Freely downloadable and usable by anyone for almost anything.

We truly live in the future.

prodigycorp · 7mo ago

Seeing the absurd number of languages made me think of the Norm Macdonald joke:

Music is the universal language, but one day soon it will be replaced by Chinese.

tmikaeld · 7mo ago · 1 in thread

Swedish

Status: Endangered

"The child-bearing generation can use the language among themselves, but it is seldom being transmitted to children."

What!? A lot must have changed in one generation..

District5524 · 7mo ago

Yes, there seem to be lots of mistakes and no easy way to flag them. Highly endangered: Malayalam (35 million speakers), Hungarian (14 million), Uighur (11 million), or Swedish as endangered... These are quite obvious mistakes even for a layperson.

meetpateltech · 7mo ago · 1 in thread

HF Demo: https://huggingface.co/spaces/facebook/omniasr-transcription...

GitHub: https://github.com/facebookresearch/omnilingual-asr

dang · 7mo ago

Thanks! I've added those links to the toptext as well.

momojo · 7mo ago

Does anyone else feel like they buried the lead?

> Omnilingual ASR was designed as a community-driven framework. People around the world can extend Omnilingual ASR to new languages by using just a few of their own samples.

The world just got smaller

oezi · 7mo ago

Unfortunately, I didn't find anything in the paper about improvements to timing/timestamping. In particular, unclean word boundaries are hard with wav2vec2.

And their use of LLMs as part of the transcription process makes it likely that they trained the model to correct mispronunciations by the speaker. This lowers CER because the human transcription often corrects for mispronunciations as well, but it reduces the model's ability to transcribe what was actually said.

benob · 7mo ago

> Bring Your Own Language

Few-shot new languages is going to be a game changer for linguists

tschellenbach · 7mo ago

any insights on latency?

AIorNot · 7mo ago

the global language explorer is fascinating - great work guys

https://aidemos.atmeta.com/omnilingualasr/language-globe

- we are getting closer to BabelFish.. at least for the Earth!
