
0 points · iceKirin · 2d ago · 0 comments

I feel that recent iterations of LLMs haven't delivered an intuitive, qualitative leap. Have they hit a bottleneck period so quickly?


olao99 · 2d ago

For what it's worth, I find GPT 5.5 qualitatively different from 5.4 and 5.3.

If I had to collapse the nature of the difference into one sentence, it'd be that 5.5 does more of what I'm asking it to do, versus doing a small aspect of what I'm asking and then stopping.

5.4 required a lot of "continue" encouragement. 5.5 just "gets it" a bit more.

What it boils down to for me is that even though it's more expensive, I would much rather use 5.5 on low than 5.4/5.3 on high/medium.

darqis · 19h ago

5.5 overcomplicates things. Where the solution is, e.g., changing some OIDC auth URL, it goes around verifying and checking and building this and that before it eventually changes the URL and then writes a summary.

It is unable to do K.I.S.S. Instead of just adding an endpoint, it creates a service, middleware, a config reader, and finally an endpoint.

LLMs are nowhere near being good developers. The only thing they have is speed. Because of this speed they create the illusion of a good developer, the "whoa" moment: whoa, it would've taken me 2 months to implement this. Yeah, but then again you would not make such silly mistakes, and you would've reused that OIDC client instead of reinventing the wheel every single time.

varispeed · 1d ago

They must have changed something recently, because when 5.5 first dropped I was unable to make it do anything. It would say it would implement something but never actually do it, no matter how many times I told it what it needed to do. It would acknowledge what needed to be done, even create a step-by-step plan, and then ask if it should go ahead. I would confirm, and then it would just go around reiterating the plan and saying that this time it would start. Annoying and funny. Now it doesn't seem to be doing that anymore.

Wingy · 1d ago

I think that's a failure mode of using the legacy Completions API rather than the new Responses API. With the Responses API, the agent actually goes and does the things it's supposed to do.

nothinkjustai · 1d ago

They probably just tell it to do more in the prompt lmao

barrell · 1d ago

Azure recently discontinued the gpt-4.1 model. I had to move off of this model, and moving to any gpt-5* model was both worse (more failures and lower accuracy) and more expensive. I had to rewrite the entire system from high-school-level prompts to lower-elementary-school-level prompts using non-GPT models.

I would say models entered a bottleneck a long time ago. My personal opinion is that they are now overfitting newer models on coding and "agentic" capabilities at great expense to general abilities in other domains.

GorbachevyChase · 1d ago

I am wondering if everyone is moving to an IPO and striking these bizarre circular deals because they’ve hit the ceiling on what can be done with more compute until a major architectural innovation happens.

Still amazing, but 5.5 does feel like incremental progress with a massive upcharge.

32dsfa · 1d ago

Ofc they have hit a ceiling. Why do you think OAI has shut down many of its projects, like the research one called Prism?

The reality is that both Anthropic and OAI have converged on LLMs as a tool for software production - that's where the majority of their revenue is coming from.

gaflo · 1d ago

Can you elaborate on what kind of system you built? I'm curious which specific prompts are getting worse responses with the newer models.

toponijo · 1d ago

I actually think it makes sense to hone models for coding and agentic capabilities. Those models will be specialized for those tasks, and the results will be cheaper and better. We can still have a general model alongside specialized models.

2ndorderthought · 2d ago

I am delighted to see the ceiling on small models increase exponentially. I think the "make models unsustainably large because the benchmark improved by 1%" practice is ending. I think the thing boosting small models will be the thing that makes LLMs actually useful. The main thing is research.

aurareturn · 2d ago

They likely entered the same compute constraint scenario as Anthropic.

I.e., they have 100 compute units but demand is 200 units, so they have to do some combination of buying more compute, increasing prices, lowering limits, etc.

cyanydeez · 1d ago

Capitalism convinced you that the line goes up unless you don't let it eat all the resources.

GorbachevyChase · 1d ago

Yes, Cuba definitely doesn’t have such wild delusions to the benefit of its residents.

Please stop. Critical theory is easy. Something about “X” sucks. Got it. What is the alternative? It’s the completely unserious philosophy of the peanut gallery.

eiekek11 · 1d ago

Bunch of nonsense.

If that were true, they would all invest resources in projects that yield more efficient use of compute. The most efficient producer would then gain a huge cost advantage AND the capacity to serve more… so yeah, that logic doesn’t hold.

jere · 1d ago

You mean the company that just doubled their rate limits? https://www.anthropic.com/news/higher-limits-spacex

NitpickLawyer · 1d ago

They only did that after they "found" ~300k H100-equivalent compute. Before signing that deal they were severely compute constrained. It was especially visible when EU time zones were still active and US East would wake up.


helloplanets · 2d ago

Are you running gpt-5.5 on xhigh reasoning? Because I'm seeing a clear difference between that and gpt-5.4 on xhigh.

auspiv · 1d ago

GPT-5.5 is a solid leap with Codex or other harnesses. I still don't understand how people use Opus 4.7... I tried it for a day or two, have tried it for a few hours every week or so since release, and still use 4.6 as my daily driver (with xhi thinking).

throwaway219450 · 1d ago

As with these daily opinion threads, ymmv. I find GPT's code to be competent, but its voice isn't great. If Claude can be a little too cool, GPT-5.x often reads like 90s-era movie-hacker technobabble. This has got to be RLHF/alignment and the sort of tone that people like. Also, anecdotally, I used xhigh for a while and turned it down to medium because it would take so long to do even simple jobs. The instruction following is quite good with 5.5, so there isn't too much need to let it wander off.

benterix · 1d ago

Call me cynical, but for me these are mostly pricing changes; the change in quality is imperceptible. I believe that after a few iterations we will be closer to the real cost.

patates · 2d ago

For my use case (web apps), there was already nothing I couldn't do with Opus 4.5. The same will become true, or already was true, for more people with other releases, and at some point, which may have already passed, most people will stop noticing qualitative leaps.

This doesn't always mean that there is a bottleneck in terms of raw power. It may also mean that your use cases (or the lower-hanging fruit among them) are already covered.

gchamonlive · 2d ago

My take is that demand is also increasing, so maybe they are making incremental improvements to model quality while focusing on reducing inference costs. Prices are increasing, though, because even if they achieve a very efficient model, they are still selling at a loss.

sroerick · 1d ago

I do a lot of OCaml and I found 5.5 to be much better, but that's kind of an esoteric-language thing.

captainbland · 1d ago

In fairness, I think these last few iterations have done reasonably well, considering they are largely optimising, fine-tuning, and enhancing multimodal integrations in existing foundation models rather than building new ones. At some point the next big foundation models will come out.

We'll probably see another stair-step change, followed by another plateauing curve of incremental improvements, when that happens.

wahnfrieden · 1d ago

5.4 and 5.5 were each a big jump for Codex use.

AussieWog93 · 1d ago

I remember thinking the same thing shortly after GPT-5 came out, then Opus 4.5 dropped.

Some releases are just "meh", but I wouldn't rule out exciting new stuff for 2026 just because Opus 4.7 sucked.

SecretDreams · 2d ago

> Have they entered a bottleneck period so quickly?

So quickly - this industry has had trillions thrown around to get here so quickly, heh.

But, yes, capability seems somewhat stagnant. It's now about cost improvements at iso-performance, or performance improvements at iso-cost, plus agentic capabilities.

cyanydeez · 1d ago

It's a sigmoid, not a bottleneck.
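
A minimal sketch of the sigmoid point, assuming progress follows a plain logistic curve. The function names, the parameters, and the idea of treating `t` as "effort spent" are illustrative assumptions, not anything measured or claimed in the thread. It shows that each extra step of effort buys less once you are past the midpoint, even though nothing is actually blocked.

```python
import math


def logistic(t, ceiling=1.0, rate=1.0, midpoint=0.0):
    """Logistic (sigmoid) curve: slow start, rapid middle, flattening top."""
    return ceiling / (1.0 + math.exp(-rate * (t - midpoint)))


def gain(t, step=1.0):
    """Improvement from moving one more step along the curve, starting at t."""
    return logistic(t + step) - logistic(t)


if __name__ == "__main__":
    # Hypothetical effort values; the units are arbitrary.
    for t in (-4, -2, 0, 2, 4):
        print(f"t={t:+d}  level={logistic(t):.3f}  next-step gain={gain(t):.3f}")
```

On this curve the per-step gain rises on the way to the midpoint and shrinks afterwards, which from the inside can look like a bottleneck even when it is just the flattening part of the same curve.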
