AiOla open-sources ultra-fast ‘multi-head’ speech recognition model

(aiola.com)

71 points · cheptsov · 1y ago · 14 comments

Doohickey-d · 1y ago

I'm curious which of the Whisper derivatives is actually the fastest?

faster-whisper claims a 4x speedup over base Whisper, and I've found WhisperX to be faster still (for longer audio, where it can do batch inference), at least on consumer GPUs.

So with AiOla saying "50% speedup", is that actually noteworthy?

gronky_ · 1y ago

From my understanding faster-whisper optimizes the inference without changing the model itself. Here they seem to be changing the model architecture but not applying other optimizations.

50% on its own doesn’t make this the current best choice for production. But I imagine this could become the new base model that all of the inference optimizations are applied to.

Wonder if it’s plug and play or if faster-whisper and others would need to reimplement from scratch?

dloss · 1y ago

Is this even faster? https://github.com/Vaibhavs10/insanely-fast-whisper

If so, is the quality still acceptable?

tomp · 1y ago

Depends what you mean by “fast”.

I’ve tested WhisperLive, it’s basically real-time (i.e. low latency).

gunalx · 1y ago

Doesn't WhisperLive just use faster-whisper under the hood? Which can be way faster than real time.
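For anyone comparing numbers in this thread: "faster than real time" is usually stated as a real-time factor (processing time divided by audio duration), which measures throughput and says nothing about the streaming latency mentioned above. A minimal sketch follows; the function names are made up for illustration and are not part of any of the libraries discussed here.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration.

    Values below 1.0 mean the audio was transcribed faster than real time.
    (Hypothetical helper, not an API of faster-whisper or WhisperLive.)
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def is_faster_than_real_time(processing_seconds: float, audio_seconds: float) -> bool:
    """True when transcription finished in less time than the audio lasts."""
    return real_time_factor(processing_seconds, audio_seconds) < 1.0


# Illustrative numbers only: 60 s of audio transcribed in 15 s.
rtf = real_time_factor(15.0, 60.0)
print(f"RTF = {rtf:.2f}")
```

A low RTF on a finished file does not by itself tell you how long you wait for the first words of a live stream, so the two kinds of "fast" are worth measuring separately.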

qwertox · 1y ago

Nothing of interest here, it's an ad.

If you're interested, you might as well check out Gladia; at least they have a pricing section and let you use it as a developer, rather than just asking you to "Request a Demo".

And while a sibling comment links to the GitHub repository, their entire website does not contain such a link.

---

Edit: My bad, for some reason I first checked the website instead of the blog post. Looks much more interesting now.

cheptsov (OP) · 1y ago

They have shared the link to GitHub [1], HuggingFace repo [2], and the paper [3]:

1. https://github.com/aiola-lab/whisper-medusa

2. https://huggingface.co/aiola/whisper-medusa-v1

3. https://paperswithcode.com/method/multi-head-attention

nmstoker · 1y ago

Looks like they left out all the training code, presumably for commercial reasons (it only just came out, so it's conceivable they're still cleaning up that side of the code, but I doubt it). Totally their call, given they've put the effort in; just a shame.

BetterWhisper · 1y ago

Does it do speaker recognition/diarization? Can't see it in the repo README.

ukuina · 1y ago

I haven't found a single good (working, easy to deploy cross-platform on CPU/CUDA/Apple Silicon) implementation of streaming + diarization, and I have looked at everything from WhisperX to pyannote to WhisperKit.

Any suggestions would be very welcome!

gronky_ · 1y ago

GH repo: https://github.com/aiola-lab/whisper-medusa

phkahler · 1y ago

IIRC Whisper works on wave files. Can this do real-time, low-latency continuous ASR?
