In math it shares the top spot with o1 and is just a few points behind (well within error bars). In creative writing it is basically ex aequo with the latest ChatGPT-4o, and in coding it's actually significantly ahead of everyone else and represents a new SOTA.
In the nicest way possible, I'm saying this form of preference testing is ultimately useless: primarily because its voter base is dilettantes with more free time than knowledge parading around as subject-matter experts, and secondarily because of presumed malfeasance. The latter is becoming apparent to more of the masses (those who don't blindly believe any leaderboard they see) now that access to the model itself is more widespread and people are seeing that the performance doesn't match the promised "revolution" [0]. If you're still confused about why selecting a model based on a glorified Hot-or-Not application is flawed, perhaps ask yourself why other evals exist in the first place (hint: some tests are harder than others).
[0] One such instance of someone competent testing it and realizing it's not even close to the "best" model out: https://www.youtube.com/watch?v=WVpaBTqm-Zo
How would the math change after factoring in that OpenAI isn't even covering the entirety of its opex with the subscription anyway, and/or that people find tying their money to their Twitter accounts weird, and/or that this thing is supposedly running on a bigger cluster than OpenAI's?
lmarena has also become less and less useful over time for comparing frontier models as all frontier models are able to saturate the performance needed for the kind of casual questions typically asked there. For the harder questions, o1 (not even o1-pro) still appears to be tied for 1st place with several other models... which is yet another indication of just how saturated that benchmark is.
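To make the saturation point concrete: preference leaderboards like lmarena aggregate pairwise human votes into Elo-style ratings. A minimal sketch of the update rule (illustrative only; the actual site fits a Bradley-Terry model offline rather than running online Elo, and the K-factor here is an assumption) shows why the benchmark stops separating models once voters can no longer tell their answers apart:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return both models' new ratings after one head-to-head vote."""
    e_a = expected_score(r_a, r_b)
    score = 1.0 if a_won else 0.0
    delta = k * (score - e_a)
    return r_a + delta, r_b - delta

# When both frontier models answer casual questions equally well, votes
# land ~50/50, the per-vote deltas cancel out in expectation, and the
# ratings stay pinned together -- a tie at the top, exactly as observed.
```

The takeaway: once every frontier model saturates the difficulty level of the typical query, the rating dynamics themselves guarantee a cluster of statistically indistinguishable "1st place" models.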
“Grok 3 + Thinking feels somewhere around the state of the art territory of OpenAI's strongest models (o1-pro, $200/month)”.
"[...] though of course we need actual, real evaluations to look at."
His own tests are better than nothing, but hardly definitive.
The official source says "Starts at $22/month or $229/year on web", https://help.x.com/en/using-x/x-premium
This is pretty much what I paid a couple of months ago, as a Canadian.
Also visible here: https://help.x.com/en/using-x/x-premium#tbpricing-bycountry
This plan is 75 days old. I didn't know it existed until last week.
OpenAI is starting to try to bring in a little more realistic revenue; Grok is acquiring customers.
Do we have a way to tell if one model is smarter than another at that point?
Here's a real world intelligence test. Take on each AI as a remote intern/new-hire, and try to train it to become a useful team member (solving math puzzles or manufacturing paperclips does not count).
Ask them to design a ranking mechanism for you. They are superhuman, after all.
(I really don't think we're going to have to worry about this).