So much for a plateau lol.
It’s been really interesting to watch all the internet pundits’ takes on the plateau… as if the two years since the release of GPT-3.5 are somehow enough data for an armchair ponce to predict the performance characteristics of an entirely novel technology that no one understands.
This is so insane that I can't help but be skeptical. I know the FM answer key is private, but they have to send the questions to OpenAI in order to score the models. And a significant jump on this benchmark sure would increase a company's valuation...
Happy to be wrong on this.
OpenAI and Epoch AI are both startups with every incentive to peddle this narrative, especially when no one else can independently verify the results.
These new reasoning models are taking things in a new direction, basically by adding search (inference-time compute) on top of the basic LLM. So the capabilities of the models are still improving, but the new variable is how deep of a search you want to do (how much compute to throw at it at inference time). Do you want your chess engine to do a 10-ply search or a 20-ply search? What kind of real-world business problems will benefit from this?
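To make the chess-engine analogy concrete, here's a toy depth-limited minimax over a synthetic game tree (not anything resembling a real model or engine; the state encoding and leaf scoring are made up). The point is just the knob: deeper search visits exponentially more nodes, i.e. burns more inference-time compute, in exchange for potentially better answers.

```python
def minimax(state, depth, maximizing, branch=3, visited=None):
    """Depth-limited minimax over a toy tree; visited[0] counts nodes (compute)."""
    if visited is None:
        visited = [0]
    visited[0] += 1
    if depth == 0:
        # Cheap synthetic leaf evaluation -- stands in for a position score.
        return state % 7, visited[0]
    children = [state * branch + i + 1 for i in range(branch)]
    scores = [minimax(c, depth - 1, not maximizing, branch, visited)[0]
              for c in children]
    best = max(scores) if maximizing else min(scores)
    return best, visited[0]

# Same starting position, two compute budgets:
_, nodes_shallow = minimax(0, depth=2, maximizing=True)
_, nodes_deep = minimax(0, depth=4, maximizing=True)
print(nodes_shallow, nodes_deep)  # prints 13 121
```

With branching factor 3, going from depth 2 to depth 4 roughly 9x's the node count (13 vs 121) for one decision: that's the "how much compute at inference time" dial in miniature.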
They found a way to make test-time compute a lot more effective, and that is an advance, but the idea is not new and the architecture is not new.
And the vast majority of people convinced LLMs had plateaued reached that conclusion without accounting for test-time compute.
A plain LLM does not use variable compute: it has a fixed number of transformer layers, and spends a fixed amount of compute on every token it generates.