Being the go-getter she is, she tried various platforms like Happyscribe and Notta. Unfortunately, the transcription quality for Asian languages was terrible, and the translation quality was even worse. That’s when I decided to step in, get my hands dirty, and fix it.
First Attempt
The solution seemed straightforward to me. First, get the video and extract the audio. Then, transcribe the audio into subtitles with high accuracy. Finally, translate those subtitles and watch the video with the synced, translated subtitles.
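The first step, pulling the audio out of the video, might look something like the sketch below. It assumes ffmpeg is installed and on the PATH; the tool choice, the function names, and the 16 kHz mono settings are illustrative assumptions, not necessarily what was used here.

```python
import subprocess
from typing import List


def build_extract_audio_command(video_path: str, audio_path: str) -> List[str]:
    """Build an ffmpeg command that extracts a mono 16 kHz audio track.

    Hypothetical helper: the paths and audio settings are assumptions.
    """
    return [
        "ffmpeg",
        "-y",             # overwrite the output file if it already exists
        "-i", video_path, # input video
        "-vn",            # drop the video stream, keep audio only
        "-ac", "1",       # downmix to one channel (mono)
        "-ar", "16000",   # resample to 16 kHz, a common rate for speech models
        audio_path,
    ]


def extract_audio(video_path: str, audio_path: str) -> None:
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_extract_audio_command(video_path, audio_path), check=True)
```

Calling `extract_audio("episode01.mp4", "episode01.wav")` would then produce a WAV file ready for transcription (both file names are made up for the example).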
With the AI boom happening, I came across OpenAI's Whisper automatic speech recognition model and started thinking about how I could integrate it with ChatGPT.
For those who aren’t familiar, Whisper is known for its accuracy in most commonly used languages, though it struggles with unclear audio, which can lead to some hilarious hallucinations.
Setting it up was straightforward. It’s easy to use Whisper to transcribe audio into text with timestamps. I then used GPT for translation, synced the subtitle file with the original video in Adobe Premiere Pro for a quick check, and exported the video. This is how I translated and localized an episode of a Chinese TV series. The first show I experimented with was "Joy of Life."
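To show what "text with timestamps" turns into, here is a minimal sketch that converts timestamped segments into an SRT subtitle file. It assumes each segment is a dict with `start` and `end` in seconds plus a `text` field, which is the shape the open-source Whisper package returns under `result["segments"]`. The function names are hypothetical, and the translation step is left out.

```python
from typing import Iterable, Mapping


def format_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: Iterable[Mapping]) -> str:
    """Render segments as SRT: index line, time range line, text, blank line."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        text = seg["text"].strip()  # segment text often has leading spaces
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    # Joining with "\n" leaves one blank line between consecutive cues.
    return "\n".join(blocks)


if __name__ == "__main__":
    # Made-up segments for illustration only.
    demo = [
        {"start": 0.0, "end": 2.5, "text": " Hello there."},
        {"start": 2.5, "end": 5.0, "text": " How are you?"},
    ]
    print(segments_to_srt(demo))
```

Because the timestamps stay attached to each segment, translating only the `text` field keeps the subtitles in sync with the original video.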
By the way, watching a few episodes together was quite enjoyable.
However, Whisper ran slowly on my M1 Pro, and since I was also experimenting with generating images using Stable Diffusion, I decided to set up a PC with an NVIDIA RTX 4090. I wrote a few scripts to download a video, transcribe it, and translate the result in one go, which sped up the whole process significantly.
One day, while my wife was watching an episode, she suggested, "Why don’t you just make this into a product? Others might find it useful too, and you could even make some extra money for baby formula."
It was a eureka moment for me.
At the time, AI products were emerging rapidly, and with my previous project in the maintenance stage, I was eager to create something new. I invited two friends to join me, and we formed a small team. Thus, a new product was born.
Choices to Make
PC or Mobile: I chose to develop for PC instead of mobile. Subtitle editing and translation typically involve long videos, which are better handled on a PC.
Web or Client: I opted for a web-based interface over a client-side application. Client applications need to be compatible with both Windows and Mac, and different versions of these systems. Moreover, I was fed up with Whisper's slow speed and various model limitations. Offloading the computation to the cloud allows users with any computer configuration to use this service smoothly.
I decided to name the product "SubEasy" because it makes creating subtitles easy.
So, it's SubEasy.ai
see comments below to continue...