
coder543 (OP) · 3y ago

Someone else in this thread[0] said Whisper was running at 17x real time for them. So, even a weak machine might be able to do an acceptable approximation of real time with Whisper.

Also, I feel like shipping audio to the cloud and back has been shown to be just as fast as on-device transcription in a lot of scenarios. Doing it on device is primarily a benefit for privacy and offline use, not necessarily latency. (Although increasingly powerful smartphone hardware is starting to give the latency edge to local processing.)

Siri's dictation has had such terrible accuracy for me (an American English speaker without a particularly strong regional accent) and everyone else I know for so many years that it is just a joke in my family. Google and Microsoft have much higher accuracy in their models. The bar is so low for Siri that I automatically wonder how much Whisper is beating Siri in accuracy... because I assume it has to be better than that.

I really wish there was an easy demo for Whisper that I could try out.

[0]: https://news.ycombinator.com/item?id=32928207


6 comments · 2 top-level

lunixbochs · 3y ago · 3 in thread

17x realtime on a 3090

I did some basic tests on CPU, the "small" Whisper model is in the ballpark of 0.5x realtime, which is probably not great for interactive use.

My models in Talon run closer to 100x realtime on CPU.
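To put those realtime factors in terms of waiting time, here is a rough sketch of the arithmetic. The factors are the ones quoted in this thread; the 10-second utterance and the `processing_seconds` helper are assumptions made up for illustration, not anything measured here.

```python
def processing_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Wall-clock seconds needed to transcribe a clip at a given realtime factor.

    A realtime factor of 2.0 means 2 seconds of audio are processed per
    second of wall-clock time, so the clip takes half its length to process.
    """
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


# Realtime factors quoted in the thread; the utterance length is an assumed example.
UTTERANCE_SECONDS = 10.0
QUOTED_FACTORS = [
    ("Whisper small on CPU", 0.5),
    ("Whisper on a 3090", 17.0),
    ("Talon on CPU", 100.0),
]

for label, factor in QUOTED_FACTORS:
    wait = processing_seconds(UTTERANCE_SECONDS, factor)
    print(f"{label}: {wait:.2f} s to process a {UTTERANCE_SECONDS:.0f} s utterance")
```

At 0.5x the transcription takes twice as long as the speech itself, which is why it is a poor fit for interactive use.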

coder543 (OP) · 3y ago

“CPU” isn’t necessarily the benchmark, though. Most smartphones going back years have ML inference accelerators built in, and both Intel and AMD are starting to build in instructions to accelerate inference. Apple’s M1 and M2 have the same inference accelerator hardware as their phones and tablets. The question is whether this model is a good fit for those inference accelerators, and how well it works there, or how well it works running on the integrated GPUs these devices all have.

Brute forcing the model with just traditional CPU instructions is fine, but… obviously going to be pretty slow.

I have no experience with the accuracy of Talon, but I’ve heard that most open source models are basically overfit to the test datasets… so their posted accuracy is often misleading. If Whisper is substantially better in the real world, that’s the important thing, but I have no idea if that’s the case.

lunixbochs · 3y ago

Ok, my test harness is ready. My A40 box will be busy until later tonight, but on an NVIDIA A2 [1], this is the batchsize=1 throughput I'm seeing. Common Voice, default Whisper settings, card is staying at 97-100% utilization:

  tiny.en: ~18 sec/sec
  base.en: ~14 sec/sec
  small.en: ~6 sec/sec
  medium.en: ~2.2 sec/sec
  large: ~1.0 sec/sec (fairly wide variance when ramping up as this is slow to process individual clips)

[1] https://www.nvidia.com/en-us/data-center/products/a2/
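For anyone wanting to reproduce sec/sec numbers like these, a minimal sketch of a batchsize=1 timing loop might look like the following. This is not the harness used above: `measure_throughput`, the `(clip, duration)` input format, and the stand-in `fake_transcribe` are all hypothetical, and a real run would pass in an actual model call instead.

```python
import time
from typing import Any, Callable, Iterable, Tuple


def measure_throughput(
    clips: Iterable[Tuple[Any, float]],
    transcribe: Callable[[Any], Any],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return seconds of audio processed per wall-clock second (sec/sec).

    `clips` yields (clip, duration_in_seconds) pairs and each clip is
    transcribed on its own, i.e. batch size 1.
    """
    total_audio = 0.0
    total_wall = 0.0
    for clip, duration in clips:
        start = clock()
        transcribe(clip)  # only the model call is timed, not data loading
        total_wall += clock() - start
        total_audio += duration
    if total_wall <= 0:
        raise ValueError("no measurable processing time recorded")
    return total_audio / total_wall


def fake_transcribe(clip: Any) -> str:
    """Hypothetical stand-in for a model call; sleeps briefly instead of decoding."""
    time.sleep(0.01)
    return ""


# Three pretend 5-second clips; swap in real audio and a real model to benchmark.
demo_clips = [(None, 5.0)] * 3
print(f"~{measure_throughput(demo_clips, fake_transcribe):.0f} sec/sec")
```

Timing per clip and summing, rather than timing the whole loop, keeps file loading and other overhead out of the figure. On a GPU the call may also need to be synchronized before the clock is read.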


lunixbochs · 3y ago

See https://news.ycombinator.com/item?id=32929029 re accuracy, I'm working on a wider comparison. My models are generally more robust than open-source models such as Vosk and Silero, but I'm definitely interested in how my stuff compares to Whisper on difficult held-out data.

> Brute forcing the model with just traditional CPU instructions is fine, but… obviously going to be pretty slow.

It's not that simple. Many of the mobile ML accelerators are more targeted for conv net image workloads, and current-gen Intel and Apple CPUs have dedicated hardware to accelerate matrix math (which helps quite a bit here, and these instructions were in use in my tests).

Also, not sure which model they were using at 17x realtime on the 3090. (If it's one of the smaller models, that bodes even worse for non-3090 performance.) The 3090 is one of the fastest ML inference chips in the world, so it doesn't necessarily set realistic expectations.

There are also plenty of optimizations that aren't applied to the code we're testing, but I think it's fairly safe to say the Large model is likely to be slow on anything but a desktop-gpu-class accelerator just due to the sheer parameter size.
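As a rough illustration of the parameter-size point, the sketch below estimates the memory needed just to hold the weights. The approximate parameter counts are the ones listed in the Whisper README, not figures from this thread, and the estimate ignores activations, decoding state, and framework overhead.

```python
# Approximate parameter counts in millions, as listed in the Whisper README
# (not measured in this thread).
PARAMS_MILLIONS = {
    "tiny": 39,
    "base": 74,
    "small": 244,
    "medium": 769,
    "large": 1550,
}


def weight_gib(params_millions: float, bytes_per_param: int) -> float:
    """GiB needed to store the weights alone at the given precision."""
    return params_millions * 1e6 * bytes_per_param / 2**30


for name, params in PARAMS_MILLIONS.items():
    fp32 = weight_gib(params, 4)  # 4 bytes per float32 parameter
    fp16 = weight_gib(params, 2)  # 2 bytes per float16 parameter
    print(f"{name:>6}: {fp32:.2f} GiB fp32, {fp16:.2f} GiB fp16")
```

Every one of those weights has to be read for each decoding step, so the large model is heavy on memory bandwidth as well as capacity.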

MacsHeadroom · 3y ago · 1 in thread

> I really wish there was an easy demo for Whisper that I could try out.

Like the Colab notebook linked on the official Whisper GitHub project page?

coder543 (OP) · 3y ago

Sure, but I did see one linked in another thread here on HN after posting that comment.
