It's interesting that OpenAI is highlighting the Elo score instead of showing results for the many benchmarks on which all models are stuck at 50-70% success.
[1] https://twitter.com/LiamFedus/status/1790064963966370209
I don't really care whether it's stronger than gpt-4-turbo or not. The direct real-time video and audio capabilities are absolutely magical and stunning. The responses in voice mode are now instantaneous, you can interrupt the model, you can talk to it while showing it a video, and it understands (and uses) intonation and emotion.
Really, just watch the live demo. I linked directly to where it starts.
Importantly, this makes the interaction a lot more "human-like".
This model isn't about benchmark chasing or being a better code generator; it's explicitly focused on pushing prior results into the frame of multi-modal interaction.
It's still a WIP; most of the videos show awkwardness where its capacity to understand the "flow" of human speech is still vestigial. It doesn't yet understand how humans pause and give one another space within those pauses.
But it does have a genuinely magical ability to share a deictic frame of reference.
I have been waiting for this specific advance, because it is going to significantly quiet the "stochastic parrot" line of wilfully myopic criticism.
It is very hard to make blustery claims about "glorified Markov token generation" when the system uses language in a way that requires both a shared world model and an understanding of interlocutor intent, focus, etc.
This is edging closer to the moment when it becomes very hard to argue that the system does not have some form of self-model and a world model within which self, other, and other objects and environments exist with inferred and explicit relationships.
This is just the beginning. It will be very interesting to see how strong its current abilities are in this domain; it's one thing to have object classification, and another thing entirely to infer "scripts, plans, goals..." and things like intent and deixis. E.g. how well does it now understand "us" and "them", and "this" vs. "that"?
Exciting times. Scary times. Yee hawwwww.
So local modelling (completely offline but per-speaker aware and responsive), with a really flexible application API. Sort of the GTK or Qt equivalent for voice interactions. Also custom naming, so instead of "Hey Siri" or "Hey Google" I could say, "Hey idiot" :-) A rough sketch of what that could look like is below.
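Nothing like this ships today, so purely as a thought experiment, here's a minimal Python sketch of what such a per-speaker-aware, wake-word-configurable application API might look like. Every name in it (VoiceSession, Utterance, the callback registration) is invented for illustration and is not any real library's API.

    # Hypothetical sketch of a GTK/Qt-style application API for local,
    # per-speaker-aware voice interaction. All names are invented.
    from dataclasses import dataclass, field
    from typing import Callable


    @dataclass
    class Utterance:
        speaker_id: str        # stable ID from local speaker diarization
        text: str              # locally transcribed text
        is_interruption: bool  # True if the user cut the assistant off


    @dataclass
    class VoiceSession:
        wake_word: str = "hey idiot"  # custom wake word instead of "Hey Siri"
        handlers: list[Callable[[Utterance], str | None]] = field(default_factory=list)

        def on_utterance(self, handler: Callable[[Utterance], str | None]) -> None:
            """Register an app callback, much like connecting a GTK/Qt signal."""
            self.handlers.append(handler)

        def feed(self, utterance: Utterance) -> list[str]:
            """Drive the session with a recognized utterance; returns replies to speak."""
            replies = []
            for handler in self.handlers:
                reply = handler(utterance)
                if reply is not None:
                    replies.append(reply)
            return replies


    if __name__ == "__main__":
        session = VoiceSession(wake_word="hey idiot")

        def greet(u: Utterance) -> str | None:
            if u.is_interruption:
                return None  # yield the floor instead of talking over the user
            return f"Hi {u.speaker_id}, you said: {u.text}"

        session.on_utterance(greet)
        print(session.feed(Utterance("alice", "what's on my calendar?", False)))

The point of the sketch is the shape of the API: the app registers callbacks and gets speaker identity and interruption state from the local stack, the same way a GUI toolkit hands you events rather than raw input.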
Definitely some interesting tech here.
We'll have to see when end users actually get access to the voice features "in the coming weeks".
Thanks for this.
Skinner: "Yes."
Chalmers: "May I see it?"
Skinner: "No."
But I am not convinced it will be another GPT-4 moment. It seems like a big focus on tacking together clever multi-modal tricks rather than straightforwardly more intelligent AI.
Hope they prove me wrong!
Improving the instruction tuning, the RLHF step, the training set size, the multilingual capabilities, etc. makes sense as a way to improve quality, but I think increasing model size doesn't. Being able to advertise a big breakthrough may make sense in terms of marketing, but I don't believe it's going to happen, for two reasons:
- you don't release intermediate steps when you want to be able to advertise big gains, because doing so raises the baseline and reduces the marketing impact of your "big gains".
- I don't think they would benefit from an arms race with Meta, trying to keep a significant edge. Meta is likely to catch up eventually on performance, but they are not much of a threat in terms of business. Focusing on keeping a performance edge instead of making their business viable would be a strategic blunder.
Seems to me that performance is converging and we might not see a significant jump until we have another breakthrough.
This model had been tested under the code name 'gpt2-chatbot', but it is very much a new GPT-4+-level model, with new multimodal capabilities and apparently some impressive work on inference speed.
Highlighting this so people don't get the impression it's just OpenAI slapping a new label on something a generation out of date.
(text input in web version)
Maybe it's programmed to completely ignore swearing, but how could I not swear after it repeatedly gave me info about you.com when I tried to address it in the second person?
The improvements they seem to be hyping are in multimodality and speed (also price – half that of GPT-4 Turbo – though pricing is their choice and could be promotional; I expect it is, at least in part, like speed, a consequence of greater efficiency), not so much in producing better output for the same pure-text inputs.
And the prompt wasn't a monstrosity, and it wasn't even that good; it was just one line, "I need help to categorize these expenses", and off it went. I hope it won't get enshittified like Turbo, because this finally feels as great as 3.5 was for goal seeking.
The "gpt2-chatbot" was the worst of the three.