undefined | Better HN

0 pointsfxtentacle3y ago0 comments

This AI has a 30 second delay on the audio processing because it needs to be able to "look into the future" to get these good results. That 30s delay would be unacceptable for Siri/Google/Cortana.

0 comments

coder5433y ago

A lot of models we currently use seem to do the same thing. The model will transcribe a "best effort" interpretation in real time, then as you can continue speaking, you'll see it go back and make corrections. I'm sure you can feed the first X seconds you have into the model, followed by (30-X) seconds of silence, and it will do real time transcription just fine... it would be weird if this broke anything. Then, as you get more speech, you continue getting better transcription of the first 30 seconds, then you switch to a 30 second sliding window.

Maybe I'm missing something, but I don't see the problem here.

fxtentacleOP3y ago

Yes, that's because Whisper - like pretty much all of them - uses a Transformer encoder with Attention layers. And the Attention layers learn to look into the future.

And yes, what you describe could be done. But no, it won't reduce latency that much, because the model itself learns to delay the prediction w.r.t. the audio stream. That's why ASR-generated subtitles usually need to be re-aligned after the speech recognition step. And that's why there is research such as the FastEmit paper to prevent that, but then it is a trade-off between latency and quality again.

Also, running your "low-latency" model with 1s chunks means you now need to evaluate the AI 30x as often as if you'd be using 30s chunks.

coder5433y ago

You just said the models pretty much all work the same way, then you said doing what I described won't help. I'm confused. Apple and Google both offer real time, on device transcription these days, so something clearly works. And if you say the models already all do this, then running it 30x as often isn't a problem anyways, since again... people are used to that.

I doubt people run online transcription for long periods of time on their phone very often, so the battery impact is irrelevant, and the model is ideally running (mostly) on a low power, high performance inference accelerator anyways, which is common to many SoCs these days.

1 more reply

j / k navigate · click thread line to collapse

0 comments

coder5433y ago

Maybe I'm missing something, but I don't see the problem here.

fxtentacleOP3y ago

Yes, that's because Whisper - like pretty much all of them - uses a Transformer encoder with Attention layers. And the Attention layers learn to look into the future.

Also, running your "low-latency" model with 1s chunks means you now need to evaluate the AI 30x as often as if you'd be using 30s chunks.

coder5433y ago

1 more reply

j / k navigate · click thread line to collapse