So much for a plateau lol.
It’s been really interesting to watch all the internet pundits’ takes on the plateau… as if the two years since the release of GPT-3.5 are somehow enough data for an armchair ponce to predict the performance characteristics of an entirely novel technology that no one understands.
This is so insane that I can't help but be skeptical. I know the FM answer key is private, but they have to send the questions to OpenAI in order to score the models. And a significant jump on this benchmark sure would increase a company's valuation...
Happy to be wrong on this.
OpenAI and Epoch AI are both startups with every incentive to peddle this narrative, especially when no one else can independently verify the results.
These new reasoning models are taking things in a new direction, basically by adding search (inference-time compute) on top of the basic LLM. So the capabilities of the models are still improving, but the new variable is how deep of a search you want to do (how much compute to throw at it at inference time). Do you want your chess engine to do a 10-ply search or a 20-ply search? What kind of real-world business problems will benefit from this?
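To make the chess-engine analogy concrete, here's a toy depth-limited minimax over a synthetic game tree (not anything resembling a real model or engine; the state encoding and leaf scoring are made up). The point is just the knob: deeper search visits exponentially more nodes, i.e. burns more inference-time compute, in exchange for potentially better answers.

```python
def minimax(state, depth, maximizing, branch=3, visited=None):
    """Depth-limited minimax over a toy tree; visited[0] counts nodes (compute)."""
    if visited is None:
        visited = [0]
    visited[0] += 1
    if depth == 0:
        # Cheap synthetic leaf evaluation -- stands in for a position score.
        return state % 7, visited[0]
    children = [state * branch + i + 1 for i in range(branch)]
    scores = [minimax(c, depth - 1, not maximizing, branch, visited)[0]
              for c in children]
    best = max(scores) if maximizing else min(scores)
    return best, visited[0]

# Same starting position, two compute budgets:
_, nodes_shallow = minimax(0, depth=2, maximizing=True)
_, nodes_deep = minimax(0, depth=4, maximizing=True)
print(nodes_shallow, nodes_deep)  # prints 13 121
```

With branching factor 3, going from depth 2 to depth 4 roughly 9x's the node count (13 vs 121) for one decision: that's the "how much compute at inference time" dial in miniature.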
They found a way to make test-time compute a lot more effective, and that is an advance, but the idea is not new and the architecture is not new.
And the vast majority of people convinced LLMs had plateaued reached that conclusion without accounting for test-time compute.
A plain LLM does not use variable compute: it has a fixed number of transformer layers, and spends a fixed amount of compute on every token it generates.