In math it shares the top spot with o1 and is just a few points behind (well within error bars). In creative writing it is basically ex aequo with the latest ChatGPT-4o, and in coding it's actually significantly ahead of everyone else and represents a new SOTA.
In the nicest way possible, I'm saying this form of preference testing is ultimately useless: primarily because its voter base is dilettantes with more free time than knowledge parading around as subject-matter experts, and secondarily because of presumed malfeasance. The latter is becoming apparent to more of the masses (those who don't blindly believe any leaderboard they see) now that access to the model itself is more widespread and people are seeing that the performance doesn't match the promised "revolution" [0]. If you're still confused about why selecting a model based on a glorified Hot-or-Not application is flawed, perhaps ask yourself why other evals exist in the first place (hint: some tests are harder than others).
[0] One such instance of someone competent testing it and realizing it's not even close to the "best" model out: https://www.youtube.com/watch?v=WVpaBTqm-Zo
How would the math change after factoring in that OpenAI isn't even covering the entirety of its opex with the subscription anyway, and/or that people find tying their money to their Twitter accounts weird, and/or that this thing is supposedly running on a bigger cluster than OpenAI's?
lmarena has also become less and less useful over time for comparing frontier models as all frontier models are able to saturate the performance needed for the kind of casual questions typically asked there. For the harder questions, o1 (not even o1-pro) still appears to be tied for 1st place with several other models... which is yet another indication of just how saturated that benchmark is.
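To make the saturation point concrete: preference leaderboards like lmarena aggregate pairwise human votes into Elo-style ratings. A minimal sketch of the update rule (illustrative only; the actual site fits a Bradley-Terry model offline rather than running online Elo, and the K-factor here is an assumption) shows why the benchmark stops separating models once voters can no longer tell their answers apart:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return both models' new ratings after one head-to-head vote."""
    e_a = expected_score(r_a, r_b)
    score = 1.0 if a_won else 0.0
    delta = k * (score - e_a)
    return r_a + delta, r_b - delta

# When both frontier models answer casual questions equally well, votes
# land ~50/50, the per-vote deltas cancel out in expectation, and the
# ratings stay pinned together -- a tie at the top, exactly as observed.
```

The takeaway: once every frontier model saturates the difficulty level of the typical query, the rating dynamics themselves guarantee a cluster of statistically indistinguishable "1st place" models.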
“Grok 3 + Thinking feels somewhere around the state of the art territory of OpenAI's strongest models (o1-pro, $200/month)”.
"[...] though of course we need actual, real evaluations to look at."
His own tests are better than nothing, but hardly definitive.
The official source says "Starts at $22/month or $229/year on web", https://help.x.com/en/using-x/x-premium
This is pretty much what I paid a couple of months ago, as a Canadian.
Also visible here: https://help.x.com/en/using-x/x-premium#tbpricing-bycountry
This plan is 75 days old. I didn't know it existed until last week.
OpenAI is starting to try to bring in a little more realistic revenue; Grok is acquiring customers.
Do we have a way to tell if one model is smarter than another at that point?
Here's a real world intelligence test. Take on each AI as a remote intern/new-hire, and try to train it to become a useful team member (solving math puzzles or manufacturing paperclips does not count).
Ask them to design a ranking mechanism for you. They are superhuman, after all.
(I really don't think we're going to have to worry about this).