
0 points · TacticalCoder · 2mo ago · 0 comments

> Given that for a number of these benchmarks, it seems to be barely competitive with the previous gen

We're not reading the same numbers, I think. Compared to Opus 4.6, it's a big jump in nearly every single bench GP posted. They're "only" catching up to Google's Gemini on GPQA and MMMLU, but they're still beating their own Opus 4.6 results on these two.

This sounds like a much better model than Opus 4.6.


ninjagoo · 2mo ago · 4 in thread

> We're not reading the same numbers I think.

We must not be.

That's why I listed out the ones where it is barely competitive from @babelfish's table, which itself is extracted from pp. 186 & 187 of the System Card, which has the comparison with Opus 4.6, GPT 5.4 and Gemini 3.1 Pro.

Sure, it may be better than Opus 4.6 on some of those, but it barely achieves a small increase over GPT-5.4 on the ones I called out.

nl · 2mo ago

> barely competitive

It's higher than all other models, except against Gemini 3.1 Pro on MMMLU.

MMMLU is generally thought to be maxed out, as in it might not be possible to score higher than those scores.

> Overall, they estimated that 6.5% of questions in MMLU contained an error, suggesting the maximum attainable score was significantly below 100%[1]

Other models get close on GPQA Diamond, but it wouldn't be surprising to anyone if the max possible on that was around the 95% the top models are scoring.

[1] https://en.wikipedia.org/wiki/MMLU

lostmsu · 2mo ago

You are reading the percentages wrong.

Because 100% is the maximum, you should be looking at error rates instead. GPT has a 25% error rate on Terminal Bench and the new model has 18%, almost a 1.4x reduction.
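
To make that arithmetic concrete, here is a minimal sketch using only the two error rates quoted in this comment (25% and 18%). The function names are made up for illustration and are not from the System Card or any benchmark tooling.

```python
def error_reduction_factor(baseline_error_pct: float, new_error_pct: float) -> float:
    """How many times smaller the new error rate is than the baseline."""
    if new_error_pct <= 0:
        raise ValueError("new error rate must be positive")
    return baseline_error_pct / new_error_pct


def relative_error_reduction(baseline_error_pct: float, new_error_pct: float) -> float:
    """Fraction of the baseline errors that the new model removes."""
    if baseline_error_pct <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline_error_pct - new_error_pct) / baseline_error_pct


# Error rates as quoted in the comment above (assumed, not re-checked).
gpt_error = 25.0
new_model_error = 18.0

print(f"{error_reduction_factor(gpt_error, new_model_error):.2f}x")   # 1.39x
print(f"{relative_error_reduction(gpt_error, new_model_error):.0%}")  # 28%
```

Seen as scores, the same gap is 75% vs 82%, which looks like a 7-point bump; seen as errors, it is roughly 28% fewer failures.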

nimchimpsky · 2mo ago

Barely competitive? The Mythos column is the first column.

You are the only person with this take on Hacker News; everyone else is saying "this is a massive jump". FWIW, the data you list shows the biggest jump I remember for Mythos.

devmor · 2mo ago

The biggest jump in the numbers they quoted is 6%.

Please look at the columns OTHER than Opus as well.
