But I am not convinced it will be another GPT-4 moment. Seems like big focus on tacking together multi-modal clever tricks vs straight better intelligence AI.
Hope they prove me wrong!
Multimodal is another way to stave off the inevitable, because these AI companies already are training multiple models on different piles of information. If you have to train a text model and an image model, why split your training data in half when you could train a combined model on a combined dataset?
[0] For starters, most YouTube videos aren't manually captioned, so you're feeding GPT the output of Google's autocaptioning model, so it's going to start learning artifacts of what that model can't process.
I'd bet a lot of YouTubers are using LLMs to write and/or edit content. So we pass that through a human presentation. Then introduce some errors in the form of transcription. Turn feed the output in as part of a training corpus ... we plateaued real quick.
It seems like it's hard to get past a level of human intelligence at which there's a large enough corpus of training data or trainers?
Anyone know of any papers on breaking this limit to push machine learning models to super-human intelligence levels?
Whisper models are better than anything google has. In fact the higher quality whisper models are better than humans when it comes to transcribing text with punctuation.
I would expect they’re using their own t2s which is still a model but way better quality and potentially customizable to better suit their needs