We built OWhisper for 2 reasons: (Also outlined in https://docs.hyprnote.com/owhisper/what-is-this)
(1). While working with on-device, realtime speech-to-text, we found no existing tooling for downloading and running models in a practical way.
(2). Also, we got frequent requests to provide a way to plug custom STT endpoints into the Hyprnote desktop app, just as you can with OpenAI-compatible LLM endpoints.
The (2) part is still kind of WIP, but we spent some time writing docs so you'll get a good idea of what it will look like if you skim through them.
For (1) - You can try it now. (https://docs.hyprnote.com/owhisper/cli/get-started)
```bash
brew tap fastrepl/hyprnote && brew install owhisper
owhisper pull whisper-cpp-base-q8-en
owhisper run whisper-cpp-base-q8-en
```
If you're tired of Whisper, we also support Moonshine :)
Give it a shot (`owhisper pull moonshine-onnx-base-q8`).
We're here and looking forward to your comments!
These are the local models it supports:
- whisper-cpp-base-q8
- whisper-cpp-base-q8-en
- whisper-cpp-tiny-q8
- whisper-cpp-tiny-q8-en
- whisper-cpp-small-q8
- whisper-cpp-small-q8-en
- whisper-cpp-large-turbo-q8
- moonshine-onnx-tiny
- moonshine-onnx-tiny-q4
- moonshine-onnx-tiny-q8
- moonshine-onnx-base
- moonshine-onnx-base-q4
- moonshine-onnx-base-q8
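The names in this list look like they follow a `backend-size[-quantization][-en]` pattern. As a rough sketch (the pattern is inferred from the list above, not taken from the OWhisper docs, and `ModelName` / `parse_model_name` are hypothetical helpers), a name can be split like this, assuming `-en` marks an English-only variant and `q4` / `q8` mark a quantization level:

```python
import re
from typing import NamedTuple, Optional


class ModelName(NamedTuple):
    backend: str          # e.g. "whisper-cpp" or "moonshine-onnx"
    size: str             # e.g. "tiny", "base", "large-turbo"
    quant: Optional[str]  # e.g. "q8", or None when no suffix is present
    english_only: bool    # True when the name ends in "-en" (assumed meaning)


def parse_model_name(name: str) -> ModelName:
    """Split a model identifier using the pattern inferred from the list."""
    parts = name.split("-")
    if len(parts) < 3:
        raise ValueError(f"unexpected model name: {name!r}")
    backend = "-".join(parts[:2])
    rest = parts[2:]
    # A trailing "en" is treated as an English-only marker (assumption).
    english_only = rest[-1] == "en"
    if english_only:
        rest = rest[:-1]
    # A trailing "q<digits>" token is treated as the quantization level.
    quant = None
    if rest and re.fullmatch(r"q\d+", rest[-1]):
        quant = rest[-1]
        rest = rest[:-1]
    if not rest:
        raise ValueError(f"no size component in model name: {name!r}")
    return ModelName(backend, "-".join(rest), quant, english_only)
```

If the assumption about `-en` holds, a multilingual use case would want one of the names for which `english_only` is false.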
To me, STT should take a continuous audio stream and output a continuous text stream.
Whisper and Moonshine both work on chunks, but for Moonshine:
> Moonshine's compute requirements scale with the length of input audio. This means that shorter input audio is processed faster, unlike existing Whisper models that process everything as 30-second chunks. To give you an idea of the benefits: Moonshine processes 10-second audio segments 5x faster than Whisper while maintaining the same (or better!) WER.
Also, with Kyutai, we can feed continuous audio in and get continuous text out.
- https://github.com/moonshine-ai/moonshine
- https://docs.hyprnote.com/owhisper/configuration/providers/k...
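To make the 30-second chunk point from the quote concrete, here is a small back-of-the-envelope sketch. It only counts how many seconds of input a fixed-window model has to process compared with the real clip length. It says nothing about actual model compute and does not reproduce the 5x figure, which is Moonshine's own claim. The function names are made up for illustration.

```python
import math

WHISPER_WINDOW_S = 30.0  # fixed window length mentioned in the Moonshine quote


def fixed_window_seconds(clip_s: float, window_s: float = WHISPER_WINDOW_S) -> float:
    """Seconds of input processed when a clip is padded up to whole windows."""
    if clip_s <= 0:
        return 0.0
    return math.ceil(clip_s / window_s) * window_s


def padding_overhead(clip_s: float, window_s: float = WHISPER_WINDOW_S) -> float:
    """Ratio of processed seconds to real seconds (1.0 means no padding)."""
    if clip_s <= 0:
        raise ValueError("clip length must be positive")
    return fixed_window_seconds(clip_s, window_s) / clip_s
```

For a 10-second segment, `fixed_window_seconds(10)` is 30 seconds, so the fixed-window model sees three times as much input as a model whose cost scales with the real clip length.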
But the base-q8 works (and works quite well!). The TUI is really nice. Speaker diarization would make it almost perfect for me. Thanks for building this.
I was actually integrating some whisper tools yesterday. I was wondering if there was a way to get a streaming response, and was thinking it'd be nice if you can.
I'm on Linux, so I don't think I can test out owhisper right now, but is that a thing that's possible?
Also, it looks like the `owhisper run` command gives its output as a TUI. Is there an option for a plain text response so that we can just pipe it to other programs? (Maybe just `kill`/`CTRL+C` to stop the recording and finalize the words.)
Same question for streaming: is there a way to get a streaming text output from owhisper? (It looks like you said you created a Deepgram-compatible API. I had a quick look at the API docs, but I don't know how easy it is to hook into it and get some nice streaming text while speaking.)
Oh yeah, and diarisation (available with a flag?) would be awesome. It's one of the things that's missing from most of the easiest-to-run tools I can find.
I haven't tested on Linux yet, but we have a Linux build: http://owhisper.hyprnote.com/download/latest/linux-x86_64
> also, it looks like the `owhisper run` command gives it's output as a tui. Is there an option for a plain tex
`owhisper run` is more of a way to quickly try it out. But I think piping is definitely something that should work.
> Same question for streaming, is there a way to get a streaming text output from owhisper?
You can use a Deepgram client to talk to `owhisper serve` (https://docs.hyprnote.com/owhisper/deepgram-compatibility). So the best resource might be the Deepgram client SDK docs.
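For the plain-text piping question, here is a minimal sketch of turning Deepgram-style streaming messages into lines of text. The fields used (`type`, `is_final`, `channel.alternatives[].transcript`) come from Deepgram's streaming response format. It is an assumption, based only on the compatibility claim, that `owhisper serve` emits the same shape. The function names are hypothetical, and the WebSocket connection itself is left out.

```python
import json
from typing import Iterable, Iterator, Optional


def final_transcript(raw_message: str) -> Optional[str]:
    """Return the transcript of a finalized Deepgram-style result, else None."""
    try:
        msg = json.loads(raw_message)
    except json.JSONDecodeError:
        return None  # ignore anything that is not JSON
    if not isinstance(msg, dict) or msg.get("type") != "Results":
        return None  # metadata or other event types
    if not msg.get("is_final"):
        return None  # interim hypothesis that may still change
    alternatives = msg.get("channel", {}).get("alternatives", [])
    if not alternatives:
        return None
    text = alternatives[0].get("transcript", "").strip()
    return text or None  # treat empty transcripts (silence) as nothing to emit


def plain_text_lines(messages: Iterable[str]) -> Iterator[str]:
    """Yield one plain-text line per finalized, non-empty transcript."""
    for raw in messages:
        text = final_transcript(raw)
        if text is not None:
            yield text
```

Feeding `plain_text_lines` with the messages received from whatever WebSocket client you use, and printing each yielded line, would give output that can be piped to other programs. Dropping interim results trades latency for stable text.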
> diarisation
yeah on the roadmap
Great work on this! Excited to keep an eye on things.
Can you help me find where the code you've built is? I can see the folder on GitHub[0], but I can't see the code for the CLI, for instance. Unless I'm blind?
[0] https://github.com/fastrepl/hyprnote/blob/8bc7a5eeae0fe58625...
Ultimately, I chose a cloud-based GPU setup, as the highest-performing diarization models required a GPU to process properly. Happy to share more if you’re going that route.
https://github.com/ggml-org/whisper.cpp/tree/master/examples...
- It supports other models like Moonshine.
- It also works as a proxy for cloud model providers.
- It can expose local models as a Deepgram-compatible API server.
I just spent last week researching the options (especially for my M1!) and was left wishing for a standard, full-service (live) transcription server for Whisper, like Ollama has been for LLMs.
I’m excited to try this out and see your API (there seems to be a standards vacuum here because OpenAI doesn't have a real-time transcription service, which I find to be a bummer)!
Edit: They seem to emulate the Deepgram API (https://developers.deepgram.com/reference/speech-to-text-api...), which seems like a solid choice. I’d definitely like to see a standard emerging here.
Let me know how it goes!
When I find the time to set it up I’d like to contribute to the documentation to answer the questions I had, but I couldn’t even find information on how to do that (there is no docs folder in the repo, and contribution.md, which the AI assistant also points me towards, doesn’t contain information about adding to the docs).
In general I find it a bit distracting that the OWhisper code is inside the hyprnote repository. For discoverability and “real project” purposes, I think it probably deserves its own.
I see that you are also using llama.cpp's code? That's cool, but make sure you become a member of that community, not an abuser.
Link to the repo - https://github.com/m-bain/whisperX
EDIT: typo
also fyi - https://docs.hyprnote.com/owhisper/configuration/providers/o...
EDIT: Ah, I see this was already answered.
But I was hoping a couple of features would be supported:
1. Multilingual support. Even if I use a multilingual model like whisper-cpp-large-turbo-q8, the application seems to assume I am speaking English.
2. Translate feature. Probably already supported, but I didn't see the option.
Though, with a twist that it would transcribe it with IPA :)