0 points · refulgentis · 1y ago · 0 comments

I did some core work on TTS at Google, at several layers, and I've never quite understood what people mean by streaming vs. not.

In each and every case I'm familiar with, streaming means "send the whole audio thus far to the inference engine, run inference on it, and send back the transcription."

I have a Flutter library that does the same flow as this (though via ONNX, so I can cover all platforms), and Whisper + Silero is ~identical to the interfaces I used at Google.

If the idea is that streaming is when each audio byte is only sent once to the server, there's still an audio buffer accumulated -- it's just on the server.
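A minimal sketch of the flow described above, assuming a hypothetical `transcribe` callable that stands in for whatever inference engine is used (it is not the API of Whisper or of any Google service). Each new chunk is appended to a growing buffer, and the whole buffer is transcribed again to produce the next partial result.

```python
from typing import Callable, Iterable, Iterator, List

Samples = List[float]


def partials_by_reinference(
    chunks: Iterable[Samples],
    transcribe: Callable[[Samples], str],
) -> Iterator[str]:
    """Yield one partial transcript per chunk by re-running inference
    on all audio received so far."""
    buffer: Samples = []
    for chunk in chunks:
        buffer.extend(chunk)            # the buffer only ever grows
        yield transcribe(list(buffer))  # full re-inference on every chunk


def fake_transcribe(audio: Samples) -> str:
    """Hypothetical stand-in for a real model; reports how much audio it saw."""
    return f"<{len(audio)} samples>"


if __name__ == "__main__":
    chunks = [[0.0] * 160 for _ in range(3)]
    for partial in partials_by_reinference(chunks, fake_transcribe):
        print(partial)
```

Whether the buffer lives on the client or on the server, the model sees the full audio each time; only the place where the accumulation happens changes.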

10 comments · 3 top-level

opprobium · 1y ago · 4 in thread

Streaming doesn't matter for TTS, but for speech-to-text it is more meaningful in interactive cases. In that case the user's speech is arriving in real time, and streaming can mean things at several levels:

- Overlap compute with the user speaking: Not having to wait until all the speech has been acquired can massively reduce latency at the end of speech and allow a larger model to be used. This doesn't have to be the whole system; for instance, an encoder can run in this fashion over the audio as it comes in even if the final step of the system then runs in a non-streaming fashion.

- Produce partial results while the user is speaking: This can be just a UI nice-to-have, but it can also be much deeper, e.g., a system can be activating on words or phrases in the input before the user is finished speaking, which can dramatically change latency.

- Better segmentation: Whisper + Silero is just using VAD to make segments for Whisper; this is not at all the best you can do if you are actually decoding while you go. Looking at the results as you go allows you to make much better and faster segmentation decisions.
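The first point above (overlapping compute with the user's speech) can be sketched roughly as follows. This is a toy under stated assumptions, not how Google ASR or Whisper is implemented: `IncrementalEncoder`, `push`, and `decode` are hypothetical names, and the "encoding" is just a mean absolute amplitude per frame. What it shows is that each frame is encoded exactly once as it arrives, so only the final decode step remains when speech ends.

```python
from typing import List, Sequence


class IncrementalEncoder:
    """Toy stand-in for a streaming encoder: every frame is encoded once,
    when it arrives, and the resulting features are kept as state."""

    def __init__(self) -> None:
        self.features: List[float] = []
        self.frames_encoded = 0  # counts how much encoder work was done

    def push(self, frames: Sequence[Sequence[float]]) -> None:
        for frame in frames:
            # Hypothetical "feature": mean absolute amplitude of the frame.
            feature = sum(abs(s) for s in frame) / max(len(frame), 1)
            self.features.append(feature)
            self.frames_encoded += 1


def decode(features: Sequence[float]) -> str:
    """Hypothetical final, non-streaming step run once at end of speech."""
    return f"{len(features)} frames, peak feature {max(features):.2f}"


if __name__ == "__main__":
    encoder = IncrementalEncoder()
    incoming = [[[0.1, 0.1]], [[0.5, -0.5]], [[0.2, 0.2]]]  # 3 chunks, 1 frame each
    for chunk in incoming:
        encoder.push(chunk)          # work happens while the user is talking
    print(decode(encoder.features))  # only this runs after end of speech
    print(encoder.frames_encoded)
```

For n equally sized chunks, this encodes n chunks in total, whereas re-encoding the whole buffer on every chunk would encode n(n+1)/2.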

refulgentis (OP) · 1y ago

The only models that do what you're poking at holistically are 4o (claimed) and the 7B one from that French company. They're also bleeding edge, either unreleased or released and way wilder; e.g., the French one interrupts too much and occasionally screams back in an alien language.

Until these, you'd use echo cancellation to try to allow interruptible dialogue, and that's unsolved: you need a consistently cooperative chipset vendor for that (read: it wasn't possible even at scale, with carrots, presumably sticks, and much cajoling. So it works consistently on iPhones.)

The partial results are obtained by running inference on the entire audio so far, and silence is determined by VAD, on every stack I've seen that is described as streaming.

I find it hard to believe that Google and Apple specifically, and every other audio stack I've seen, are choosing to do "not the best they can at all."
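For readers unfamiliar with the VAD half of that flow, here is a toy energy-threshold segmenter. It is only a stand-in for a real detector such as Silero (it is not Silero's algorithm or API), and the `threshold` and `hangover` values are assumptions chosen for the example. A segment is closed once `hangover` consecutive quiet frames have been seen, which is the point at which a VAD-driven stack would hand the segment to the recognizer.

```python
from typing import List, Sequence, Tuple


def frame_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one frame (0.0 for an empty frame)."""
    return sum(s * s for s in frame) / len(frame) if frame else 0.0


def vad_segments(
    frames: Sequence[Sequence[float]],
    threshold: float = 0.01,
    hangover: int = 2,
) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices of speech segments, end exclusive."""
    segments: List[Tuple[int, int]] = []
    start = None      # index of the first frame of the open segment
    silent_run = 0    # consecutive quiet frames since the last loud one
    for i, frame in enumerate(frames):
        if frame_energy(frame) >= threshold:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= hangover:
                # Close the segment just after the last loud frame.
                segments.append((start, i - silent_run + 1))
                start = None
                silent_run = 0
    if start is not None:
        # Audio ended mid-segment; trim any trailing quiet frames.
        segments.append((start, len(frames) - silent_run))
    return segments


if __name__ == "__main__":
    loud, quiet = [0.5] * 4, [0.0] * 4
    frames = [quiet, loud, loud, quiet, loud, quiet, quiet, loud]
    print(vad_segments(frames))
```

Note that the single quiet frame in the middle of the first segment does not split it; the `hangover` setting is what trades end-of-speech latency against false cut-offs.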

opprobium · 1y ago

This is exactly what Google ASR does. Give it a try and watch how the results flow back to you, it certainly is not waiting for VAD segment breaking. I should know.

Streaming used to be something people cared about more. VAD is always part of those systems as well, you want to use it to start segments and to hard cut-off, but it is just the starting off point. It's kind of a big gap (to me) that's missing in available models since Whisper came out, partly I think because it does add to the complexity of using the model, and latency has to be tuned/traded-off with quality.

r2_pilot · 1y ago

Thank you for your insight. It confirms some of my suspicions working in this area (you wouldn't happen to know anybody who makes anything more modern than the Respeaker 4-mic array?). My biggest problem is even with AEC, the voice output is triggering the VAD and so it continually thinks it's getting interrupted by a human. My next attempt will be to try to only signal true VAD if there's also sound coming from anywhere but behind, where the speaker is. It's been an interesting challenge so far though.
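The direction-gating idea in the last comment could look roughly like this. It is a sketch that assumes something upstream (the mic array's firmware or a separate direction-of-arrival estimator) already supplies an angle in degrees; the function names, the 180-degree loudspeaker position, and the 45-degree exclusion width are all hypothetical values chosen for the example.

```python
def angular_distance(a_deg: float, b_deg: float) -> float:
    """Smallest absolute difference between two bearings, in degrees (0..180)."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def gated_vad(
    vad_active: bool,
    doa_deg: float,
    speaker_deg: float = 180.0,
    exclusion_half_width_deg: float = 45.0,
) -> bool:
    """Report speech only when the VAD fires AND the sound does not come
    from the cone around the loudspeaker's bearing."""
    from_speaker = angular_distance(doa_deg, speaker_deg) <= exclusion_half_width_deg
    return vad_active and not from_speaker


if __name__ == "__main__":
    print(gated_vad(True, doa_deg=0.0))     # sound from the front: kept
    print(gated_vad(True, doa_deg=170.0))   # sound from behind: suppressed
    print(gated_vad(False, doa_deg=0.0))    # VAD silent: nothing to report
```

A fixed exclusion cone will also drop a person who happens to stand behind the device, so the width is a trade-off rather than a fix.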

Nimitz14 · 1y ago

This is a complete non sequitur lol. FYI Whisper is not a streaming model, though it can, with some work, be adapted to be one.

iamjackg · 1y ago · 2 in thread

I think in practical terms (at least for me):

- streaming == I talk and the text appears as I talk

- batched == I talk, and after I'm done talking some processing happens and the text gets populated

refulgentis (OP) · 1y ago

Gotcha. Then it's "not even wrong," in the Pauli sense, to say Whisper isn't streaming.

opprobium · 1y ago

It is not streaming in the way people normally use this term. It's a fuzzy notion but typically streaming means something encompassing:

- Processing and emitting results on something closer to word by word level

- Allowing partial results while the user is still speaking and mid-segment

- Not relying on an external segmenter to determine the chunking (and therefore also latency) of the output.

flax · 1y ago · 1 in thread

"Streaming" in this case is like another reply said: transcriptions appear as I talk. That's in contrast to non-streaming, in which the service waits for silence, then processes the captured speech, then returns some transcription.

Is your Flutter library available? And does it run locally? I'm looking for a good Flutter streaming (in the sense above) speech recognition library. Vosk looks good, but it's lacking some configurability, such as selecting the audio source.

refulgentis (OP) · 1y ago

FONNX. I haven't gone out of my way to make it trivial[1], but it's very good and battle-tested on every single platform. (And yes, it runs locally.)

[1] The example app shows how to do everything and there's basic documentation, but man, the amount of nonsense you need to know to pull it all together is just too hard to document without a specific question. Do feel free to file an issue.
