AgentMatrixAI 7mo ago

I'm not really convinced. The benchmark blunder was really strange and the demos were quite underwhelming, and this appears to be reflected in a huge correction in the betting markets on who will have the best AI by the end of the year.

What excites me now is that Gemini 3.0 or some answer from Google is coming soon, and that will be the one I actually end up using. It seems like being the last mover in the LLM race is an advantage.

Buttons840 7mo ago

Polymarket bettors are not impressed. Based upon the market odds, OpenAI had a 35% chance to have the best model (at year end), but those odds have dropped to 18% today.

(I'm mostly making this comment to document what happened for the history books.)

https://polymarket.com/event/which-company-has-best-ai-model...

vessenes 7mo ago

After a few hours with gpt-5, I'd trade that spread. Not that I think oAI will win end of year. But I think gpt5 is better than it looks on the benchmark side. It is very very good at something we don't have a lot of benchmarks for -- keeping track of where it's at. codex is vassstly better in practice than claude code or gemini cli right now.

On the chat side, it's also quite different, and I wouldn't be surprised if people need some time to get a taste and a preference for it. I ask most models to help me build a macbook pro charger in 15th century florence with the instructions that I start with only my laptop and I can only talk for four hours of chat before the battery dies -- 5 was notable in that it thought through a bunch of second order implications of plans and offered some unusual things, including a list of instructions for a foot-treadle-based split ring commutator + generator in 15th century florentine italian(!). I have no way of verifying if the italian was correct.

Upshot - I think they did something very special with long context and iterative task management, and I would be surprised if they don't keep improving 5, based on their new branding and marketing plan.

That said, to me this is one of the first 'product release' moments in the frontier model space. 5 is not so much a model release as a polished-up, holes-fixed, annoyances-reduced/removed, 10x faster type of product launch. Google (current polymarket favorite) is remarkably bad at those product releases.

Back to betting - I bet there's a moment this year where those numbers change 10% in oAI's favor.

ttroyr 7mo ago

I would agree. I am a big fan of Claude and I've used Claude Code a bunch, but after testing Codex & GPT-5 extensively, I find it gets stuck in a rut far less often and is much more often able to pinpoint issues & fixes in the codebase.

apetresc 7mo ago

How on Earth does that market have Anthropic at 2%, in a dead heat with the likes of Meta? If the market was about yesterday rather than 5 months from now I think Claude would be pretty clearly the front runner. Why does the market so confidently think they’ll drop to dead last in the next little while?

degrews 7mo ago

It's because those markets are based on the LLM Arena leaderboard (https://lmarena.ai/), where Claude has historically done poorly.

That eval has also become a lot less relevant (it's considered not very indicative of real-world performance), so it's unlikely Anthropic will prioritize optimizing for it in future models.

Buttons840 7mo ago

How is Claude doing on the benchmark that market is based on? Maybe not so good? Idk. Just because Claude is good for real world use doesn't mean it's winning the benchmark, but the benchmark is all that matters for the Polymarket.

tedk-42 7mo ago

I'm a fan of Anthropic for this reason. I use Claude and it's very good most of the time for my coding requirements.

Generally, when you have a lot of companies competing to show whose product X does the best at Y, there are a lot of monetary incentives to manipulate the products to perform well specifically on those types of tests.

vasco 7mo ago

If you think it's wrong, participate. That's the only way prediction markets end up predicting anything.

sinuhe69 7mo ago

I think they also based their expectation on the release cycles and speed of updates. Anthropic is known for a more conservative release cycle and incremental updates. Google, on the other hand, has accelerated recently. It also seems that other actors are better at benchmark cheating ;)

epiccoleman 7mo ago

I find this confusing too. I dropped my OpenAI subs for Claude a while back and I don't feel like I'm missing much.

I need to spend some more time with Gemini too though. I was using that as a backend for Cursor for a while and had some good results there too.

globular-toast 7mo ago

I mean, if you feel strongly enough that it will be #1 at the end of year then $100 now would net you $3000 end of year... Do bear in mind what my sibling said about the specific benchmark that is being used, though.
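For anyone checking the arithmetic on that payout, here is a rough sketch. It assumes the usual binary-market convention that a "yes" share bought at price p (between 0 and 1) pays $1 if the market resolves yes, and it ignores fees and slippage. The function names are made up for illustration and are not part of any Polymarket API. Under those assumptions, $100 at the 2% mentioned upthread would pay out about $5000, while a $3000 payout corresponds to a price of roughly 3.3 cents.

```python
def payout_if_yes(stake: float, price: float) -> float:
    """Gross payout if `stake` dollars buy yes-shares at `price` and each share pays $1.

    Assumption: no fees, no slippage, the whole stake fills at one price.
    """
    if not 0 < price < 1:
        raise ValueError("price must be strictly between 0 and 1")
    return stake / price


def implied_price(stake: float, payout: float) -> float:
    """Share price that turns `stake` into `payout` under the same assumptions."""
    if stake <= 0 or payout <= 0:
        raise ValueError("stake and payout must be positive")
    return stake / payout


if __name__ == "__main__":
    # $100 staked at a 2% market price
    print(round(payout_if_yes(100, 0.02)))
    # price implied by "$100 now nets $3000"
    print(round(implied_price(100, 3000), 3))
```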

jstummbillig 7mo ago

That bet does not seem to be very illuminating. Winner is likely who happens to release closest to end of year, no?

croemer 7mo ago

Looking at LMArena, which Polymarket uses, I'm not surprised. Based on the little data there is (3k duels), it's possibly worse than Gemini: it lost more to Gemini 2.5 Pro than it won in direct duels. Not sure why the Elo is still higher; possibly GPT-5 did more clearly better against bad models, which I don't care about.
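To illustrate how a model can lose its head-to-head and still rank higher overall, here is a minimal sketch using a Bradley-Terry fit, which is closely related to Elo. Whether the leaderboard computes its scores this way is an assumption on my part, and the win counts below are invented purely for illustration: "A" stands in for a model that narrowly loses to "B" directly but beats a weak model "W" more decisively.

```python
# Hypothetical win counts, invented for illustration only.
# wins[x][y] = number of duels in which x beat y (100 duels per pair).
wins = {
    "A": {"B": 45, "W": 95},
    "B": {"A": 55, "W": 75},
    "W": {"A": 5, "B": 25},
}


def fit_bradley_terry(wins, iterations=200):
    """Fit Bradley-Terry strengths with the standard iterative (MM) update.

    Under the model, P(i beats j) = s_i / (s_i + s_j).
    """
    players = list(wins)
    strength = {p: 1.0 for p in players}
    for _ in range(iterations):
        updated = {}
        for i in players:
            total_wins = sum(wins[i].values())
            # Games played against each opponent, weighted by current strengths.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                for j in players
                if j != i
            )
            updated[i] = total_wins / denom
        # Normalise so the strengths sum to 1; only ratios matter.
        norm = sum(updated.values())
        strength = {p: s / norm for p, s in updated.items()}
    return strength


strength = fit_bradley_terry(wins)

if __name__ == "__main__":
    for name, s in sorted(strength.items(), key=lambda kv: -kv[1]):
        print(name, round(s, 3))
```

With these made-up numbers, A loses the direct matchup 45 to 55 yet ends up with the higher fitted strength, because its margin over the weak model outweighs the head-to-head deficit.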

roflyear 7mo ago

The Musk effect is pretty crazy. Or is there another explanation for why x can compete with Google?

ENGNR 7mo ago

Elon's Y Combinator interview was pretty good. He seemed more in his element back amongst the hacker crowd (rather than dirty politics), and seemed to be doing hackery things at X, like renting generators and mobile cooling vans and just putting them in the car park outside a warehouse to train Grok, since there were no data centres available and he was told it would take 2 years to set it all up properly.

I think he's just good at attracting good talent, and letting them focus on the right things to move fast initially, while cutting the supporting infra down to zero until it's needed.

deltaburnt 7mo ago

Thinking more cynically: political corruption and connections I'm guessing? Just a couple months ago Musk was treating the US government like his personal playground.

raincole 7mo ago

Because they started so late but somehow managed to make something close to SOTA?

Either that or people think Trump will just give Elon a 500B government contract...

Davidzheng 7mo ago

They have a lot of compute already and Grok 4 was pretty strong?

boringg 7mo ago

You don't actually hold polymarket odds with any significant weighting on actual outcomes do you?

m3kw9 7mo ago

It's not that they are not impressed, it's just that Google came out with steerable video gen.

Buttons840 7mo ago

That was a few days ago. The big drop in that Polymarket I mentioned all happened today. It was a reaction to GPT-5 specifically.

riku_iki 7mo ago

> Polymarket bettors are not impressed. Based upon the market odds, OpenAI had a 35% chance to have the best model (at year end)

Who will decide the winner to resolve bets?

joshmlewis 7mo ago

I am convinced. I've been giving it tasks the past couple hours that Opus 4.1 was failing on and it not only did them but cleaned up the mess Opus made. It's the real deal.

diego_sandoval 7mo ago

In that same vein, I had just tried Opus 4.1 yesterday, and it successfully completed tasks that Sonnet 4 and Opus 4 failed at.

joshmlewis 7mo ago

When it came out on Tuesday I wanted to throw my laptop out of the window. I don't know what happened but results were total garbage earlier this week. It got better the past couple days but so far with gpt-5 being able to solve problems without as much correction I'm going to use it more.

alfalfasprout 7mo ago

Interesting, I've had the complete opposite experience. Opus 4.1 feels like a generational improvement compared to GPT-5.

joshmlewis 7mo ago

It is funny how it can be like this sometimes. I think a lot depends on coding styles, languages, prompting, etc.

energy123 7mo ago

And it's almost 10x cheaper via flex, and in #1 position on lmarena. It's not even close.

boomfunky 7mo ago

The real last mover is Apple, because boy are they not moving.

manmal 7mo ago

As an iOS dev, I really hope they acquire Anthropic before it’s too expensive.

amelius 7mo ago

As a Linux developer, I hope they do not.

BoiledCabbage 7mo ago

Wow would that be horrible

echelon 7mo ago

I really don't want the already trillion dollar mega monopoly to own the world.

blitzar 7mo ago

I would rather the already trillion dollar mega monopoly own the world than "Open"Ai

roxolotl 7mo ago

Yea maybe it’s naive but I’ve started leaning towards preferring the devil I know. It also helps that Gemini is great.

someuser54541 7mo ago

Which betting markets were you referring to and where can they be viewed?

rendang 7mo ago

One that comes to mind is

https://polymarket.com/event/which-company-has-best-ai-model...

zamadatix 7mo ago

Polymarket has a whole AI category of markets: https://polymarket.com/search/ai?_sort=volume

retinaros 7mo ago

The demos were awful. It felt like watching sloppy vibe-coded CSS UIs.

m3kw9 7mo ago

GPT-5 high reasoning is a big step up from o3.
