
yujonglee · 10mo ago · 0 points · 0 comments

I use VAD to chunk audio.
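As a rough illustration of VAD-based chunking, here is a minimal energy-threshold sketch. It is not the detector used by any tool in this thread; the function name `vad_chunks`, the threshold, and the frame sizes are all assumptions chosen for the example, and real VADs are usually model-based rather than a plain energy gate.

```python
from typing import List, Sequence, Tuple


def vad_chunks(
    samples: Sequence[float],
    sample_rate: int = 16000,
    frame_ms: int = 30,
    threshold: float = 0.01,       # assumed mean-square energy cutoff
    min_silence_frames: int = 10,  # assumed: ~300 ms of silence closes a chunk
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of regions classified as speech."""
    frame_len = sample_rate * frame_ms // 1000
    chunks: List[Tuple[int, int]] = []
    start = None   # sample index where the current speech region began
    silence = 0    # consecutive silent frames seen inside a speech region

    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(x * x for x in frame) / len(frame)

        if energy >= threshold:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                # Close the chunk where the silent run started.
                end = i - (silence - 1) * frame_len
                chunks.append((start, end))
                start = None
                silence = 0

    # Audio ended while a region was still open: close it at the end,
    # which may include a short trailing silence.
    if start is not None:
        chunks.append((start, len(samples)))
    return chunks
```

Each returned `(start, end)` pair can then be sliced out of the sample buffer and handed to the model as one chunk, so cuts land on pauses rather than at fixed offsets.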

Whisper and Moonshine both work on chunks, but for Moonshine:

> Moonshine's compute requirements scale with the length of input audio. This means that shorter input audio is processed faster, unlike existing Whisper models that process everything as 30-second chunks. To give you an idea of the benefits: Moonshine processes 10-second audio segments 5x faster than Whisper while maintaining the same (or better!) WER.
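To make the padding argument concrete, here is a back-of-the-envelope sketch that only counts input samples. It assumes 16 kHz audio and a fixed 30-second window for the Whisper-style case; the function names are made up for the example. It does not model actual compute and does not reproduce the 5x figure quoted above, which comes from Moonshine's own measurements.

```python
import math

SAMPLE_RATE = 16_000   # assumed input sample rate (Hz)
FIXED_WINDOW_S = 30    # fixed window length for the Whisper-style case


def samples_fixed_window(duration_s: float) -> int:
    """Samples fed to a model that pads every input to whole 30 s windows."""
    windows = max(1, math.ceil(duration_s / FIXED_WINDOW_S))
    return windows * FIXED_WINDOW_S * SAMPLE_RATE


def samples_variable_length(duration_s: float) -> int:
    """Samples fed to a model that accepts the audio at its real length."""
    return int(duration_s * SAMPLE_RATE)


def padding_ratio(duration_s: float) -> float:
    """How many times more samples the fixed-window model has to consume."""
    return samples_fixed_window(duration_s) / samples_variable_length(duration_s)
```

For a 10-second segment the fixed-window model consumes three times as many input samples, most of them padding, which is why short VAD chunks benefit more from a variable-length model.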

Also, for Kyutai, we can feed continuous audio in and get continuous text out.

- https://github.com/moonshine-ai/moonshine
- https://docs.hyprnote.com/owhisper/configuration/providers/k...


6 comments · 2 top-level

mijoharas · 10mo ago · 4 in thread

Something like that, in a cli tool, that just gives text to stdout would be perfect for a lot of use cases for me!

(maybe with an `owhisper serve` somewhere else to start the model running or whatever.)

ctbellmar · 10mo ago

I wrote a tool that may be just the thing for you:

https://github.com/bikemazzell/skald-go/

Just speech to text, CLI only, and it can paste into whatever app you have open.

mijoharas · 10mo ago

Oh, this does sound cool. Couple of questions that aren't clear from the readme (to me).

What exactly does the silence detection mean? Does it wait until a pause, then send the audio off to Whisper, return the output, and stop the process? Same question for continuous mode: does that just mean it keeps going until CTRL+C?

Nvm, answered my own question: looks like yes for both[0][1]. Cool, this seems pretty great actually.

[0] https://github.com/bikemazzell/skald-go/blob/main/pkg/skald/...

[1] https://github.com/bikemazzell/skald-go/blob/main/pkg/skald/...

yujonglee (OP) · 10mo ago

Are you thinking about the realtime use-case or batch use-case?

For just transcribing file/audio,

`owhisper run <MODEL> --file a.wav` or

`curl https://something.com/audio.wav | owhisper run <MODEL>`

might make sense.

mijoharas · 10mo ago

Agreed, both of those make sense, but I was thinking realtime. (Pipes can stream data, and I'd find it useful to have something that can stream speech-to-text output to stdout in realtime.)
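A minimal sketch of that pipe-friendly shape: read raw audio bytes from a stream in small frames, pass each frame to a transcriber, and flush any text immediately so downstream consumers see it as it arrives. The `transcribe` callback is hypothetical and stands in for whatever model is running; it is not an owhisper API, and the 3200-byte frame size (100 ms of 16-bit mono audio at 16 kHz) is an assumption.

```python
import io
from typing import BinaryIO, Callable, Iterator, TextIO


def read_frames(stream: BinaryIO, frame_bytes: int) -> Iterator[bytes]:
    """Yield successive blocks of up to frame_bytes until the stream ends."""
    while True:
        data = stream.read(frame_bytes)
        if not data:
            return
        yield data


def stream_transcribe(
    src: BinaryIO,
    out: TextIO,
    transcribe: Callable[[bytes], str],  # hypothetical: frame in, text (or "") out
    frame_bytes: int = 3200,             # assumed: 100 ms of 16 kHz 16-bit mono
) -> None:
    """Write text for each frame as soon as it is available."""
    for frame in read_frames(src, frame_bytes):
        text = transcribe(frame)
        if text:
            out.write(text)
            out.flush()  # flush per frame so a pipe reader is not left waiting


# In-memory demonstration with a stand-in transcriber that reports frame sizes.
demo_out = io.StringIO()
stream_transcribe(io.BytesIO(b"\x00" * 8000), demo_out, lambda f: f"[{len(f)}]")
```

In a real CLI the same function would be called with `sys.stdin.buffer` and `sys.stdout`, so audio piped in from another process produces text as the input arrives instead of at end of input.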


zveyaeyv3sfye · 10mo ago

Having used Whisper and noticed the useless quality caused by its 30-second chunks, I would stay far away from software working on even shorter durations.

The short duration effectively means that the transcription will start producing nonsense as soon as a sentence is cut up in the middle.
