Running it locally makes way more sense for an open source project, because why would you pay and be dependent upon a 3rd party if you don't have to be?
It also makes way more sense for a service because then _you_ don't have to give all your money to OpenAI and skim off of what's left.
This is just... bewildering. I really wanted to use it, but I'm not going to pay OpenAI to transcribe podcasts for me when I can literally use the exact same language model and do it locally with free open source code.
I'm hoping someone will fork this and teach it to run whisper locally.
[edit: getting the exact right version of Python, PyTorch, and dependencies to make Whisper run was a pain, but now I've got it set up and it's a trivial command to transcribe every mp3 I feel like transcribing]
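For anyone wondering what that kind of "trivial command" can look like, here is a minimal sketch. It assumes the `whisper` command-line tool from the openai/whisper package is installed and on the PATH. The helper names, the `medium.en` default, and the `transcripts` output folder are illustrative choices, not the setup described in the edit above.

```python
from pathlib import Path
import subprocess


def build_whisper_command(audio_path, model="medium.en", output_dir="transcripts"):
    """Return the argv list for one run of the openai/whisper CLI."""
    return [
        "whisper", str(audio_path),
        "--model", model,            # assumption: English-only medium model
        "--output_dir", str(output_dir),
        "--output_format", "srt",    # write subtitle-style output with timestamps
    ]


def transcribe_folder(folder, run=subprocess.run):
    """Transcribe every .mp3 in `folder`, one CLI call per file.

    `run` is injectable so the command list can be inspected without
    actually invoking Whisper.
    """
    commands = [build_whisper_command(p) for p in sorted(Path(folder).glob("*.mp3"))]
    for cmd in commands:
        run(cmd, check=True)  # raises if a transcription fails
    return commands
```

Calling `transcribe_folder("podcasts")` would then run one Whisper process per file; passing a different `run` callable makes it a dry run.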
https://github.com/ggerganov/whisper.cpp/tree/master/example...
What ratio are you getting (podcast length to transcription time) and does it error out memory wise as others suggest?
Whisper v2 costs $0.006 per minute of transcribed audio: https://openai.com/pricing
If you had meetings every working hour, you'd have up to ~160 hours of audio per month to transcribe. For most people, this is a gross overestimate.
Throwing this audio at OpenAI's API would cost $57.60 per month, and it also frees you from having to set up and maintain local inference.
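The arithmetic behind that figure, as a quick sketch. The price and the 160-hour upper bound are the numbers quoted above; the function name is just for illustration.

```python
PRICE_PER_MINUTE_USD = 0.006  # API price quoted above
HOURS_PER_MONTH = 160         # upper-bound estimate quoted above


def monthly_cost(hours, price_per_minute=PRICE_PER_MINUTE_USD):
    """Cost in USD of transcribing `hours` of audio at a per-minute price."""
    return round(hours * 60 * price_per_minute, 2)


# 160 h * 60 min/h * $0.006/min
print(f"${monthly_cost(HOURS_PER_MONTH):.2f} per month")
print(f"${monthly_cost(HOURS_PER_MONTH) * 12:.2f} per year")
```

Someone with a single one-hour meeting per working day (about 20 hours a month) would land at roughly an eighth of that.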
Convenience: yes, it's a nicer interface, but the current state of the "geeky" version is: type a command on the command line, with the path to the file. The end. Unless you're really afraid of the command line, it's not that much more convenient.
The text line being highlighted while you listen is nice, but a) we wrote something that did it at the word level (as opposed to sentence-ish level) nearly 20 years ago, and b) in this context it's not actually that useful. With video, sure... you can click the text and go to the right place in the video. With spoken text (what this is best at) you click and go to the point... where they're saying what you just read. Unless you really want to hear what you just read, there's not a lot of added value.
Would it be good for podcasts to use an interface like this for playback? Absolutely. It'd be a massive upgrade, but that's not what this is offering.
Maybe someone will extract that code and let us combine the MP3 and timestamped text file on a web site (if that doesn't already exist). That'd be cool.
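The core of that "combine the MP3 and the timestamped text" idea is a lookup from playback time to the line being spoken. A minimal sketch, assuming the transcript is a list of segments sorted by start time with `start`, `end` (seconds) and `text` keys; that shape is an assumption here, not something taken from the project being discussed.

```python
from bisect import bisect_right


def active_segment(segments, t):
    """Return the index of the segment playing at time `t` (seconds), or None.

    Assumes `segments` is sorted by start time and does not overlap.
    """
    starts = [seg["start"] for seg in segments]
    i = bisect_right(starts, t) - 1  # last segment starting at or before t
    if i >= 0 and t < segments[i]["end"]:
        return i
    return None  # before the first segment, or in a gap between segments


segments = [
    {"start": 0.0, "end": 2.5, "text": "Welcome to the show."},
    {"start": 3.0, "end": 5.0, "text": "Today we talk about transcription."},
]
print(active_segment(segments, 3.2))
```

A player would call this on every time update and highlight the returned line; clicking a line would seek to that segment's `start`.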
But, the cost you propose is way too much for most people, especially in countries that aren't rich. In many places $400 a month is a really good salary. So yeah, if you're rich $700 a year is not a big deal, but...
Everything starts small.
Also, the most important thing about a service is attracting customers, not the tech stack under it. Facebook was made with PHP; Twitter famously failed constantly while struggling with user growth.
I’d much rather have tons of users with a tech stack that is a wrapper for a bunch of other stuff, than have super impressive in-house tech and no users.
I don't think people are excited about making API calls. They see a land grab and are clamoring for their piece. As for what I'm doing, I work on my own products that, I hope, push the envelope, at least slightly. And I have seen AI companies that are doing good work using OpenAI's tools, but this isn't one of them.
Sometimes "good looking gift wrapping" is a huge value unto itself. Also, it isn't fair to good UX and UI developers to imply that that isn't also really hard work to get right. It's just different work using a different form of thinking. Not lesser in any way. And... without the people who could make the "good looking gift wrapping" most apps would suck a lot harder than they already do.
This hype is going to eventually subside with lots of losers and a tiny minority of winners when the price increases come in.
The only winner of this race to the bottom is Stability.ai, who are already open sourcing everything, while OpenAI cannot afford to open source their flagship AI product(s) for free.
The current AI hype cycle has driven companies to slap AI somewhere in their offering so they can call themselves an AI company - even if it's an API key and an intern spending half a day with an API wrapper.
Gatekeeping is always risky but in my mind if you're not at least touching an ML framework you're not an "AI company" - which is already IMO a pretty low bar. That said it starts to get really hazy when you look at things like SageMaker and other offerings where you're doing abstracted model development or substantial amounts of fine-tuning/training on a custom dataset, etc.
Does such a thing exist? I would gladly donate to a kickstarter project for this before trying to build one myself.
If you own a GPU, use this one: https://github.com/openai/whisper
If you don't own a GPU, use this one: https://github.com/ggerganov/whisper.cpp (this one is very very slow)
https://github.com/ahmetoner/whisper-asr-webservice
For your requirements the medium.en model (max) should be satisfactory.
Don't most devs most likely already have a powerful GPU? Maybe I am biased for also being a gamer or having worked in game-development, which requires a powerful GPU anyway.
My understanding is that the only reason OpenAI even set up the paid API is because it "can also be hard to run [sic]" [0]. Personally, I'm skeptical. I'm not knocking them for it, but I could see how this is just brand capitalization.
[0]: https://openai.com/blog/introducing-chatgpt-and-whisper-apis...
It's fairly easy and quick to run Whisper for free either locally in an Anaconda environment with Python or the command-line interface or, even better, in a Google Colab notebook.
Here's a sample notebook that builds on a notebook by Pete Warden.
https://colab.research.google.com/drive/1sxsey3n0jd09MjUd9Ky...
Why is it hard to see that not every organization has the capability to set up their own translation cluster, provision GPUs, frontends, scaling, on-call rotations, regularly update models..? It's not just "brand capitalization". An API that you can call to transcribe/translate a recording with zero extra work is absolutely essential to have for most.
- Run Voice Activity Detection for better timestamp output
- Transcribe with Whisper
- Run Forced Alignment to get per-word timestamps
- Create better segmented SRT
- Translate (with multiple APIs: implemented DeepL, Google Translate, Baidu and a couple more)
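The SRT step of a pipeline like this can be sketched in a few lines. This is an illustration only, assuming segments arrive as dicts with `start`, `end` (seconds) and `text`; it is not the code of the tool described above, and it skips the smarter re-segmentation that tool mentions.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(segments):
    """Render segments as SRT: numbered cues separated by blank lines."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)


print(to_srt([
    {"start": 0.0, "end": 1.25, "text": " Hello there."},
    {"start": 1.5, "end": 3.0, "text": " General Kenobi."},
]))
```

Per-word timestamps from forced alignment would let you cut cues at natural pauses instead of wherever the model ended a segment, which is presumably what "better segmented" refers to.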
A better description would be "A PHP based web app which calls OpenAI's Whisper API to transcribe speech"
The title describes what it does. I think you're making a mountain out of a molehill.
It's usually all in the details and delivery (and ya'know we're lazy and lack time to set up stuff locally)
Though I wouldn't really knock anything free and open source either way.
It's technically the domain for Anguilla, a literal British colony in the Caribbean.
It appears to be managed by some random guy; check out the .ai registration FAQ: http://whois.ai/faq.html
If you are going to use .ai, just be aware the top level of the domain appears to be managed by some dude with a gmail account. It's not necessarily bad, but something to consider if you're planning to host your billion dollar AI startup on it.
I always thought they would need to be vetted by the government of the country that the ccTLD represents.
https://en.wikipedia.org/wiki/.ai https://en.wikipedia.org/wiki/Vince_Cate
A colorful character to say the least, and exactly the kind of person I'd expect to be running the ccTLD of a small Caribbean island.
I didn't know OpenAI had an API for that as well, but now I was able to try it out and it's orders of magnitude better: perfect spelling and only 1 wrong word in 2 minutes of audio (an abbreviation), and I was still able to understand it. It even filters out filler words!
You just saved me literally hours of work by showing the powers of OpenAI!
(Reading this back it sounds like an ad, but I'm in no way affiliated with any of those services. I'm just very happy.)
So I would suggest using this project to try things out, then setting up Whisper locally :)
But what do I know, I'm just in the ether...
See https://platform.openai.com/docs/guides/speech-to-text/quick...
BUT if my brother (an accountant) needs something like this, he wouldn't be able to install Whisper: he wouldn't even be able to open GitHub. So I think frontend GUIs that run models behind the scenes are always welcome.
I think this would be much better if it ran Whisper on their server instead of using an external API, but that's their decision.
There are a lot of older series/movies where the speech is hard to discern but no subtitles are available for download.
I have been thinking about creating an AutoSubtitle app for years, but haven't had a free day to tackle it - hope someone else beats me to it.
insane.
Free and open source.
Thank you!