This is actually the result that I find far more impressive. Elite mathematicians consider these problems challenging and thought they were years away from being solvable by AI.
You're right; I was wrong to say "most challenging," as there have been harder ones coming out recently. I think the correct statement would be "most challenging long-standing benchmark," as I don't believe any other test designed in 2019 has resisted progress for so long. FrontierMath is only a month old. And of course the real key feature of ARC is that it is easy for humans; FrontierMath is (intentionally) not.
They should put some famous unsolved problems in the next edition so ML researchers do some actually useful work while they're "gaming" the benchmarks :)