There's not much explanation yet of why GPT-5 warrants a major version bump. As usual, judging the model (and potentially OpenAI as a whole) will come down to output vibe checks.
How is this sustainable?
Not that this makes it useless, just that we don't seem to "be there" yet for the standard tasks software engineers do every day.
Exactly. Too many videos and too little real data / benchmarks on the page. I'll wait for a vibe check from simonw and others.
https://openai.com/gpt-5/?video=1108156668
2:40 "I do like how the pelican's feet are on the pedals." "That's a rare detail that most of the other models I've tried this on have missed."
4:12 "The bicycle was flawless."
5:30 Re generating documentation: "It nailed it. It gave me the exact information I needed. It gave me full architectural overview. It was clearly very good at consuming a quarter million tokens of rust." "My trust issues are beginning to fall away"
Edit: ohh he has blog post now: https://news.ycombinator.com/item?id=44828264
People knew that GPT-5 wouldn't be AGI, or even close to it. It's just an updated version. GPT-N will likely become more or less an annual release.
Pretty par-for-the-course evals-at-launch setup.
https://chatgpt.com/share/6895d5da-8884-8003-bf9d-1e191b11d3...
GPT-5 pricing: $10/Mtok out
What am I missing?
See comparison between GPT-5, 4.1, and o3 tool calling here: https://promptslice.com/share/b-2ap_rfjeJgIQsG.
I'm not sure when they slashed the o3 pricing, but the GPT-5 pricing looks like they set it to be identical to Gemini 2.5 Pro.
If you scroll down on this page you can see what different models cost when 2.5 Pro was released: https://deepmind.google/models/gemini/pro/
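To make the per-Mtok numbers above concrete, here is a minimal sketch of what a single large request costs under this kind of pricing. The $10/Mtok output price is the one quoted upthread; the $1.25/Mtok input price and the token counts are assumptions for illustration only.

```python
# Rough cost of one API call given per-million-token ("Mtok") pricing.
def request_cost(input_tokens, output_tokens,
                 in_price_per_mtok, out_price_per_mtok):
    """Dollar cost for one request: tokens scaled by per-Mtok prices."""
    return (input_tokens * in_price_per_mtok
            + output_tokens * out_price_per_mtok) / 1_000_000

# Example: feeding a quarter-million-token codebase and getting back
# a ~2,000-token architectural overview, at an assumed $1.25/Mtok in
# and the quoted $10/Mtok out:
cost = request_cost(250_000, 2_000, 1.25, 10.0)
print(f"${cost:.4f}")  # $0.3325
```

At these rates the bill is dominated by the input side for documentation-style tasks, which is why the input price (not shown on most headline comparisons) matters as much as the $10/Mtok output figure.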
(Not to undermine progress in the foundation model space, but there is a lack of appreciation for the democratization of domain-specific models among HNers.)