GPT-5.4

(openai.com)

1019 points · mudkipdev · 3mo ago · 805 comments

https://openai.com/index/gpt-5-4-thinking-system-card/

https://x.com/OpenAI/status/2029620619743219811

254 comments · 113 top-level

mattas3mo ago· 14 in thread

"GPT‑5.4 interprets screenshots of a browser interface and interacts with UI elements through coordinate-based clicking to send emails and schedule a calendar event."

They show an example of 5.4 clicking around in Gmail to send an email.

I still think this is the wrong interface to be interacting with the internet. Why not use Gmail APIs? No need to do any screenshot interpretation or coordinate-based clicking.

bottlepalm3mo ago

The vast majority of websites you visit don’t have usable APIs, and discovery of those APIs is very poor.

Screenshots on the other hand are documentation, API, and discovery all in one. And you’d be surprised how little context/tokens screenshots consume compared to all the back-and-forth verbose JSON payloads of APIs.

npilk3mo ago

It feels like building humanoid robots so they can use tools built for human hands. Not clear if it will pay off, but if it does then you get a bunch of flexibility across any task "for free".

Of course APIs and CLIs also exist, but they don't necessarily have feature parity, so more development would be needed. Maybe that's the future though since code generation is so good - use AI to build scaffolding for agent interaction into every product.

f0e4c2f73mo ago

Lots of services have no desire to ever expose an API. This approach lets you step right over that.

If an API is exposed you can just have the LLM write something against that.

coffeemug3mo ago

A model that gets good at computer use can be plugged in anywhere you have a human. A model that gets good at API use cannot. From the standpoint of diffusion into the economy/labor market, computer use is much higher value.

TheAceOfHearts3mo ago

I think the desire is that in the long-term AI should be able to use any human-made application to accomplish equivalent tasks. This email demo is proof that this capability is a high priority.

PaulHoule3mo ago

APIs have never been a gift but rather have always been a take-away that lets you do less than you can with the web interface. It’s always been about drinking through a straw, paying NASA prices, and being limited in everything you can do.

But people are intimidated by the complexity of writing web crawlers because management has been so traumatized by the cost of making GUI applications that they couldn’t believe how cheap it is to write crawlers and scrapers…. Until LLMs came along, and changed the perceived economics and created a permission structure. [1]

AI is a threat to the “enshittification economy” because it lets us route around it.

[1] that high cost of GUI development is one reason why scrapers are cheap… there is a good chance that the scraper you wrote 8 years ago still works because (a) they can’t afford to change their site and (b) if they could afford to change their site changing anything substantial about it is likely to unrecoverably tank their Google rankings so they won’t. A.I. might change the mechanics of that now that your Google traffic is likely to go to zero no matter what you do.

modeless3mo ago

A world where AIs use APIs instead of UIs to do everything is a world where us humans will soon be helpless, as we'll have to ask the AIs to do everything for us and will have limited ability to observe and understand their work. I prefer that the AIs continue to use human-accessible tools, even if that's less efficient for them. As the price of intelligence trends toward zero, efficiency becomes relatively less important.

MattDaEskimo3mo ago

Same reason why Wikipedia deals with so many people scraping its web page instead of using their API:

Optimizations are secondary to convenience

kristianp3mo ago

This opens up a new question: how does bot detection work when the bot is using the computer via a gui?

jstummbillig3mo ago

Because the web, and software more generally, is full of things that are not APIs, and you do, in fact, need the clicking to work to make agents work generally.

spongebobstoes3mo ago

not everything has an API, or API use is limited. some UIs are more feature complete than their APIs

some sites try to block programmatic use

UI use can be recorded and audited by a non-technical person

satvikpendem3mo ago

In the ideal of REST, the HTML and UI are the API.

Jacques2Marais3mo ago

I guess a big chunk of their target market won't know how to use APIs.

steve19773mo ago

One could argue that LLMs learning programming languages made for humans (i.e. most of them) is using the wrong interface as well. Why not use machine code?

__jl__3mo ago· 13 in thread

What a model mess!

OpenAI now has three price points: GPT 5.1, GPT 5.2 and now GPT 5.4. Their version numbers jump across different model lines, with Codex at 5.3 and what they now call Instant also at 5.3.

Anthropic are really the only ones who managed to get this under control: Three models, priced at three different levels. New models are immediately available everywhere.

Google essentially only has Preview models! The last GA is 2.5. As a developer, I can either use an outdated model or have zero assurances that the model doesn't get discontinued within weeks.

strongpigeon3mo ago

> Google essentially only has Preview models! The last GA is 2.5. As a developer, I can either use an outdated model or have zero assurances that the model doesn't get discontinued within weeks.

What's funny is that there is this common meme at Google: you can either use the old, unmaintained tool that's used everywhere, or the new beta tool that doesn't quite do what you want.

Not quite the same, but it did remind me of it.

Aurornis3mo ago

> What a model mess! OpenAI now has three price points: GPT 5.1, GPT 5.2 and now GPT 5.4.

I don't know, this feels unnecessarily nitpicky to me

It isn't hard to understand that 5.4 > 5.2 > 5.1. It's not hard to understand that the dash-variants have unique properties that you want to look up before selecting.

Especially for a target audience of software engineers, skipping a version number is a common occurrence and never questioned.

jbonatakis3mo ago

Google is already sending notices that the 2.5 models will be deprecated soon while all the 3.x models are in preview. It really is wild and peak Google.

0xbadcafebee3mo ago

> or have zero assurances that the model doesn't get discontinued within weeks

Why are you using the same model after a month? Every month a better model comes out. They are all accessible via the same API. You can pay per-token. This is the first time in, like, all of technology history, that a useful paid service is so interoperable between providers that switching is as easy as changing a URL.

CobrastanJorji3mo ago

> Google essentially only has Preview models.

It's really nice to see Google get back to its roots by launching things only to "beta" and then leaving them there for years. Gmail was "beta" for at least five years, I think.

embedding-shape3mo ago

> OpenAI now has three price points: GPT 5.1, GPT 5.2 and now GPT 5.4.

I guess that's true, but geared towards API users.

Personally, since "Pro Mode" became available, I've been on the plan that enables that, and it's one price point and I get access to everything, including enough Codex usage that, as someone who spends a lot of time programming, I never manage to hit any usage limits, although I've gotten close once to the new (temporary) Spark limits.

beklein3mo ago

Not sure why you think Anthropic doesn't have the same problems? Their version numbers across different model lines jump around too... for Opus we have 4.6, 4.5, 4.1 then we have Sonnet at 4.6, 4.5, and 4.1? No version 4.1 here, and there is Haiku, no 4.6, but 4.5 and no 4.1, no 4 but then we only have old 3.5...

Also their pricing based on 5m/1h cache hits, cache read hits, additional charges for US inference (but only for Opus 4.6 I guess) and optional features such as more context and faster speed for some random multiplier is also complex and actually quite similar to OpenAI's pricing scheme.

To me it looks like everybody has similar problems and solutions for the same kinds of problems and they just try their best to offer different products and services to their customers.

biophysboy3mo ago

Wow, is that what preview means? I see those model options in github copilot (all my org allows right now) - I was under the impression that preview means a free trial or a limited # of queries. Kind of a misleading name..

awad3mo ago

Incredibly curious how Google's approach to support, naming, versioning etc will mesh with the iOS integration.

raincole3mo ago

They aggressively retire models, so GPT 5.1 and 5.2 are probably going to go soon.

arthurcolle3mo ago

There is a lot of opportunity here for the AI infrastructure layer on top of tier-1 model providers

m3kw93mo ago

That's how they've had it for years; it's a mess, but controlled.

delaminator3mo ago

two great problems in computing

naming things

cache invalidation

off by one errors

yanis_t3mo ago· 11 in thread

These releases are lacking something. Yes, they optimised for benchmarks, but it’s just not all that impressive anymore. It is time for a product, not for a marginally improved model.

ipsum23mo ago

The model was released less than an hour ago, and somehow you've been able to form such a strong opinion about it. Impressive!

tgarrett3mo ago

Plasma physicist here, I haven't tried 5.4 yet, but in general I am very impressed with the recent upgrades that started arriving in the fall of 2025: for tasks like manipulating analytic systems of equations, quickly developing new features for simulation codes, and interpreting and designing experiments (with pictures) they have become much stronger. I've been asking questions and probing them for several years now out of curiosity, and they suddenly have developed deep understanding (Gemini 2.5 <<< Gemini 3.1) and become very useful. I totally get the current SV vibes, and am becoming a lot more ambitious in my future plans.

softwaredoug3mo ago

The products are the harnesses, and IMO that’s where the innovation happens. We’ve gotten better at helping get good, verifiable work from dumb LLMs

mindwok3mo ago

They don't need to be impressive to be worthwhile. I like incremental improvements, they make a difference in the day to day work I do writing software with these.

wahnfrieden3mo ago

5.3 codex was a huge leap over 5.2 for agentic work in practice. have you been using both of those or paying attention more to benchmark news and chatgpt experience?

iterateoften3mo ago

The product is putting the skills / harness behind the api instead of the agent locally on your computer and iterating on that between model updates. Close off the garden.

Not that I want it, just where I imagine it going.

esafak3mo ago

That's for you to build; they provide the brains. Do you really want one company to build everything? There wouldn't be a software industry to speak of if that happened.

varispeed3mo ago

The scores increase and as new versions are released they feel more and more dumbed down.

Gigachad3mo ago

They have a product now. Mass surveillance and fully automated killing machines.

jascha_eng3mo ago

When did they stop putting competitor models on the comparison table btw? And yeh I mean the benchmark improvements are meh. Context Window and lack of real memory is still an issue.

metalliqaz3mo ago

They need something that POPS:

    The new GPT -- SkyNet for _real_

minimaxir3mo ago· 10 in thread

The marquee feature is obviously the 1M context window, compared to the ~200k other models support with maybe an extra cost for generations beyond 200k tokens. Per the pricing page, there is no additional cost for tokens beyond 200k: https://openai.com/api/pricing/

Also per pricing, GPT-5.4 ($2.50/M input, $15/M output) is much cheaper than Opus 4.6 ($5/M input, $25/M output) and Opus has a penalty for its beta >200k context window.

I am skeptical whether the 1M context window will provide material gains, as current Codex/Opus show weaknesses when their context window is mostly full, but we'll see.

Per updated docs (https://developers.openai.com/api/docs/guides/latest-model), it supersedes GPT-5.3-Codex, which is an interesting move.

damsta3mo ago

There is extra cost for >272K:

> For models with a 1.05M context window (GPT-5.4 and GPT-5.4 pro), prompts with >272K input tokens are priced at 2x input and 1.5x output for the full session for standard, batch, and flex.

Taken from https://developers.openai.com/api/docs/models/gpt-5.4
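The surcharge rule quoted above can be sketched in a few lines of Python. This is only an illustration, not an official calculator: it assumes the $2.50/M input and $15/M output base prices mentioned upthread, and it simplifies the "full session" wording to a single request. The `request_cost` helper and its constants are hypothetical names.

```python
# Base prices per 1M tokens, as quoted upthread for GPT-5.4 (assumed current).
INPUT_PER_M = 2.50
OUTPUT_PER_M = 15.00

# Threshold and multipliers taken from the quoted docs passage.
LONG_CONTEXT_THRESHOLD = 272_000
INPUT_MULTIPLIER = 2.0
OUTPUT_MULTIPLIER = 1.5


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request (hypothetical helper).

    Assumption: the surcharge applies to the whole request as soon as the
    prompt exceeds the threshold, mirroring the "full session" wording.
    """
    long_context = input_tokens > LONG_CONTEXT_THRESHOLD
    in_rate = INPUT_PER_M * (INPUT_MULTIPLIER if long_context else 1.0)
    out_rate = OUTPUT_PER_M * (OUTPUT_MULTIPLIER if long_context else 1.0)
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


print(f"{request_cost(200_000, 20_000):.2f}")  # below the threshold
print(f"{request_cost(300_000, 20_000):.2f}")  # above the threshold
```

Under these assumptions, a prompt 50% larger costs more than twice as much once it crosses the threshold, because both rates change for the entire request.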

tedsanders3mo ago

Yeah, long context vs compaction is always an interesting tradeoff. More information isn't always better for LLMs, as each token adds distraction, cost, and latency. There's no single optimum for all use cases.

For Codex, we're making 1M context experimentally available, but we're not making it the default experience for everyone, as from our testing we think that shorter context plus compaction works best for most people. If anyone here wants to try out 1M, you can do so by overriding `model_context_window` and `model_auto_compact_token_limit`.

Curious to hear if people have use cases where they find 1M works much better!

(I work at OpenAI.)

andai3mo ago

It's a little hard to compare, because Claude needs significantly fewer tokens for the same task. A better metric is the cost per task, which ends up being pretty similar.

For example on Artificial Analysis, the GPT-5.x models' cost to run the evals range from half of that of Claude Opus (at medium and high), to significantly more than the cost of Opus (at extra high reasoning). So on their cost graphs, GPT has a considerable distribution, and Opus sits right in the middle of that distribution.

The most striking graph to look at there is "Intelligence vs Output Tokens". When you account for that, I think the actual costs end up being quite similar.

According to the evals, at least, the GPT extra high matches Opus in intelligence, while costing more.

Of course, as always, benchmarks are mostly meaningless and you need to check Actual Real World Results For Your Specific Task!

For most of my tasks, the main thing a benchmark tells me is how overqualified the model is, i.e. how much I will be over-paying and over-waiting! (My classic example is, I gave the same task to Gemini 2.5 Flash and Gemini 2.5 Pro. Both did it to the same level of quality, but Pro took 3x longer and cost 3x more!)
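The cost-per-task point above can be illustrated with a short Python sketch. The list prices are the ones quoted elsewhere in this thread; the token counts are purely hypothetical, chosen only to show how a cheaper per-token model can end up costing the same per task if it emits more output tokens. `Pricing` and `cost_per_task` are made-up names for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_m: float   # USD per 1M input tokens
    output_per_m: float  # USD per 1M output tokens


def cost_per_task(p: Pricing, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one task given its total token usage."""
    return (input_tokens * p.input_per_m
            + output_tokens * p.output_per_m) / 1_000_000


# List prices as quoted elsewhere in this thread.
gpt_5_4 = Pricing(input_per_m=2.50, output_per_m=15.00)
opus_4_6 = Pricing(input_per_m=5.00, output_per_m=25.00)

# HYPOTHETICAL token counts, not measurements: the cheaper-per-token model
# is assumed to spend 2.5x as many output tokens on the same task.
gpt_cost = cost_per_task(gpt_5_4, input_tokens=100_000, output_tokens=50_000)
opus_cost = cost_per_task(opus_4_6, input_tokens=100_000, output_tokens=20_000)

print(f"GPT-5.4:  ${gpt_cost:.2f}")
print(f"Opus 4.6: ${opus_cost:.2f}")
```

With these made-up numbers both tasks come out at the same price, which is the shape of the argument: per-token price alone does not decide which model is cheaper.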

netinstructions3mo ago

People (and also frustratingly LLMs) usually refer to https://openai.com/api/pricing/ which doesn't give the complete picture.

https://developers.openai.com/api/docs/pricing is what I always reference, and it explicitly shows that pricing ($2.50/M input, $15/M output) for tokens under 272k

It is nice that we get 70-72k more tokens before the price goes up (also what does it cost beyond 272k tokens??)

smusamashah3mo ago

Gemini already has 1M or 2M context window right?

luca-ctx3mo ago

Context rot is definitely still a problem but apparently it can be mitigated by doing RL on longer tasks that utilize more context. Recent Dario interview mentions this is part of Anthropic’s roadmap.

paulddraper3mo ago

I don’t know about 5.4 specifically, but in the past anything over 200k wasn’t that great anyway.

Like, if you really don’t want to spend any effort trimming it down, sure use 1m.

Otherwise, 1m is an anti pattern.

thehamkercat3mo ago

GPT 5.3 codex had 400K context window btw

AtreidesTyrant3mo ago

Token rot exists for any context window above 75% capacity; that's why so many have pushed for 1M windows.

simianwords3mo ago

Why would someone use Codex instead?

nickysielicki3mo ago· 9 in thread

can anyone compare the $200/mo codex usage limits with the $200/mo claude usage limits? It’s extremely difficult to get a feel for whether switching between the two is going to result in hitting limits more or less often, and it’s difficult to find discussion online about this.

In practice, if I buy $200/mo codex, can I basically run 3 codex instances simultaneously in tmux, like I can with claude code pro max, all day every day, without hitting limits?

vtail3mo ago

My own experience is that I get far, far more usage (and better quality code, too) from Codex. I downgraded my Claude Max to Claude Pro (the $20 plan) and am now using Codex with the Pro plan exclusively for everything.

ritzaco3mo ago

I haven't tried the $200 plans but I have Claude and Codex $20 and I feel like I get a lot more out of Codex before hitting the limits. My tracker certainly shows higher tokens for Codex. I've seen others say the same.

tauntz3mo ago

I've only run into the codex $20 limit once with my hobby project. With my Claude ~$20 plan, I hit limits after about 3(!) rather trivial prompts to Opus :/

gavinray3mo ago

I almost never hit my $20 Codex limits, whereas I often hit my Claude limits.

CSMastermind3mo ago

Codex limits are much more generous than claude.

I switch between both but codex has also been slightly better in terms of quality for me personally at least.

FergusArgyll3mo ago

Codex usage limits are definitely more generous. As for their strength, that's hard to say / personal taste

mikert893mo ago

I personally like the 100 dollar one from claude, but the gpt4 pro can be very good

throwaway9112823mo ago

You get more from Codex than Claude any day, and it's more reliable as well.

Marciplan3mo ago

sure can! One of them stood up to the “Department of War” for favoring your rights, the other did not. Hope that helps!

Philip-J-Fry3mo ago· 8 in thread

I find it quite funny how this blog post has a big "Ask ChatGPT" box at the bottom. So you might think you could ask a question about the contents of the blog post, so you type the text "summarise this blog post". And it opens a new chat window with the link to the blog post followed by "summarise this blog post". Only to be told "I can't access external URLs directly, but if you can paste the relevant text or describe the content you're interested in from the page, I can help you summarize it. Feel free to share!"

That's hilarious. Does OpenAI even know this doesn't work?

andrewguenther3mo ago

It looks like this doesn't work for users without accounts? It works when I'm logged in, but not logged out. I went ahead and reported it to the team. Thanks for letting us know!

baxtr3mo ago

I picked up Claude today after being away and using only ChatGPT and Gemini for a while.

I was pretty impressed with how they’ve improved user experience. If I had to guess, I’d say Anthropic has better product people who put more attention to detail in these areas.

ElijahLynn3mo ago

fwiw: I get a valid response when following the steps you mentioned. I do not get the message you mentioned:

https://chatgpt.com/share/69aa0321-8a9c-8011-8391-22861784e8...

EDIT: oh, but I'm logged in, fwiw

zamadatix3mo ago

Following this process summarizes the blogpost for me. Perhaps the difference is I'm signed into my account so it can access external URLs or something of that nature?

pocksuppet3mo ago

Most AI integration is like this. It's not about building working products --- it's about bragging that you put a chatbox in your program.

amelius3mo ago

If only they had an LLM they could use as a software testing agent.

Aurornis3mo ago

Probably intentional. They don't want open, no-registration endpoints able to trigger the AI into hitting URLs.

judge20203mo ago

Works for me: https://rr.judge.sh/Labradorretriever/d6af05/chrome_j9rXJMlf...

Chance-Device3mo ago· 7 in thread

I’m sure the military and security services will enjoy it.

theParadox423mo ago

The self reported safety score for violence dropped from 91% to 83%.

ozgung3mo ago

Did they publish its scores on military benchmarks, like on ArtificialSuperSoldier or Humanity's Last War?

yoyohello133mo ago

Also advertisers, don't forget those sweet, sweet ads.

throwaway9112823mo ago

like the claude models via anthropic?

m3kw93mo ago

they use 4.1, switching up would take as much time to test as openai going from 4.1 to 5.4

xyzzy95633mo ago

Do you think the US military should have handicapped technology while China gets unrestricted LLM usage from their models?

varispeed3mo ago

prompt> Hi we want to build a missile, here is the picture of what we have in the yard.

twtw993mo ago· 7 in thread

If you don't want to click in, easy comparison with other 2 frontier models - https://x.com/OpenAI/status/2029620619743219811?s=20

bicx3mo ago

That last benchmark seemed like an impressive leg up against Opus until I saw the sneaky footnote that it was actually a Sonnet result. Why even include it then, other than hoping people don't notice?

Aboutplants3mo ago

It seems that all frontier models are basically roughly even at this point. One may be slightly better for certain things but in general I think we are approaching a real level playing field in terms of ability.

chabes3mo ago

Definitely don’t want to click in at x either.

swingboy3mo ago

Why do so many people in the comments want 4o so bad?

karmasimida3mo ago

It is a bigger model, confirmed

MarcFrame3mo ago

how does 5.4-thinking have a lower FrontierMath score than 5.4-pro?

dom963mo ago

Why do none of the benchmarks test for hallucinations?

nthypes3mo ago· 6 in thread

$30/M Input and $180/M Output Tokens is nuts. Ridiculously expensive for a not-that-great bump in intelligence when compared to other models.

stri8ted3mo ago

Price

Input: $2.50 / 1M tokens

Cached input: $0.25 / 1M tokens

Output: $15.00 / 1M tokens

https://openai.com/api/pricing/
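To make the cached-input line concrete, here is a small Python sketch of how the effective input bill changes with cache hits. It assumes the two input prices quoted above and that cached tokens are simply billed at the cached rate while the rest are billed at the normal rate; `input_cost` is a hypothetical helper, not part of any SDK.

```python
INPUT_PER_M = 2.50         # USD per 1M uncached input tokens (quoted above)
CACHED_INPUT_PER_M = 0.25  # USD per 1M cached input tokens (quoted above)


def input_cost(total_input_tokens: int, cached_tokens: int) -> float:
    """USD cost of the input side of a request (hypothetical helper)."""
    if not 0 <= cached_tokens <= total_input_tokens:
        raise ValueError("cached_tokens must be between 0 and total_input_tokens")
    uncached = total_input_tokens - cached_tokens
    return (uncached * INPUT_PER_M
            + cached_tokens * CACHED_INPUT_PER_M) / 1_000_000


# 1M input tokens with no cache hits vs. 80% served from cache.
print(f"{input_cost(1_000_000, 0):.2f}")
print(f"{input_cost(1_000_000, 800_000):.2f}")
```

Under that billing assumption, a high cache hit rate matters more for agent-style workloads with long repeated prefixes than the headline input price does.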

nthypes3mo ago

Gemini 3.1 Pro

$2/M Input Tokens $15/M Output Tokens

Claude Opus 4.6

$5/M Input Tokens $25/M Output Tokens

energy1233mo ago

For Pro

joe_mamba3mo ago

Better tokens per dollar could be useless for comparison if the model can't solve your problem.

rvz3mo ago

You didn't realize they can increase / change prices for intelligence?

This should not be shocking.

moralestapia3mo ago

Don't use it?

jcmontx3mo ago· 5 in thread

5.4 vs 5.3-Codex? Which one is better for coding?

embedding-shape3mo ago

Literally just released, I don't think anyone knows yet. Don't listen to people's confident takes until after a week or two, when people have actually been able to try it; otherwise you'll just get sucked up in bears'/bulls' misdirected "I'm first with an opinion".

vtail3mo ago

Looking at the benchmarks, 5.4 is slightly better. But it also offers "Fast" mode (at 2x usage), which - if it works and doesn't completely deplete my Pro plan - is a no brainer at the same or even slightly worse quality for more interactive development.

Someone12343mo ago

Related question:

- Do they have the same context usage/cost particularly in a plan?

They've kept 5.3-Codex along with 5.4, but is that just for user-preference reasons, or is there a trade-off to using the older one? I'm aware that API cost is better, but that isn't 1:1 with plan usage "cost."

esafak3mo ago

For the price, it seems the latter. I'd use 5.4 to plan.

awestroke3mo ago

Opus 4.6

smoody073mo ago· 4 in thread

Surprised to see every chart limited to comparisons against other OpenAI models. What does the industry comparison look like?

lorenzoguerra3mo ago

I believe that this choice is due to two main reasons. First, it's (obviously) a marketing strategy to keep the spotlight on their own models, showing they're constantly improving and avoiding validating competitors. Second, since the community knows that static benchmarks are unreliable, it makes sense for them to outsource the comparisons to independent leaderboards, which lets them avoid accusations of cherry-picking while justifying their marketing strategy.

Ultimately, the people actually interested in the performance of these models already don't trust self-reported comparisons and wait for third-party analysis anyway

aydyn3mo ago

They compare to Claude and Gemini in their tweet

0123456789ABCDE3mo ago

https://artificialanalysis.ai should have the numbers soon

throwaway9112823mo ago

https://xcancel.com/OpenAI/status/2029620619743219811 you can see comparisons here

bazmattaz3mo ago· 4 in thread

Anyone else feel that it’s exhausting keeping up with the pace of new model releases. I swear every other week there’s a new release!

coffeemug3mo ago

Why do you need to keep up? Just use the latest models and don't worry about it.

pupppet3mo ago

I think it's fun, it's like we're reliving the browser wars of the early days.

davnicwil3mo ago

If you think about it there shouldn't really be a reason to care as long as things don't get worse.

Presumably this is where it'll evolve to with the product just being the brand with a pricing tier and you always get {latest} within that, whatever that means (you don't have to care). They could even shuffle models around internally using some sort of auto-like mode for simpler questions. Again why should I care as long as average output is not subjectively worse.

Just as I don't want to select resources for my SaaS software to use or have that explicitly linked to pricing, I don't want to care what my OpenAI model or Anthropic model is today, I just want to pay and for it to hopefully keep getting better but at a minimum not get worse.

throwup2383mo ago

Yes, that's a common feeling. 5.3-Codex was released a month ago on Feb 5 so we're not even getting a full month within a single brand, let alone between competitors.

creamyhorror3mo ago· 3 in thread

I've only used 5.4 for 1 prompt (edit: 3@high now) so far (reasoning: extra high, took really long), and it was to analyse my codebase and write an evaluation on a topic. But I found its writing and analysis thoughtful, precise, and surprisingly clearly written, unlike 5.3-Codex. It feels very lucid and uses human phrasing.

It might be my AGENTS.md requiring clearer, simpler language, but at least 5.4's doing a good job of following the guidelines. 5.3-Codex wasn't so great at simple, clear writing.

sampton3mo ago

That's been my experience as well switching from Opus to Codex. Reasoning takes longer but answers are precise. Claude is sloppy in comparison.

irishcoffee3mo ago

> It might be my AGENTS.md requiring clearer, simpler language

If you gave the exact same markdown file to me and I posted the exact same prompts as you, would I get the same results?

pembrook3mo ago

The latest research these days is that including an AGENTS.md file only makes outcomes worse with frontier models.

Alifatisk3mo ago· 3 in thread

So let me get this straight, OpenAI previously had an issue with LOTS of different models and versions being available. Then they solved this by introducing GPT-5, which was more like a router that put all these models under the hood so you only had to prompt GPT-5, and it would route to the most suitable model. This worked great I assume and made the UI comprehensible for the user. But now, they are starting to introduce more different models again?

We got:

- GPT-5.1

- GPT-5.2 Thinking

- GPT-5.3 (codex)

- GPT-5.3 Instant

- GPT-5.4 Thinking

- GPT-5.4 Pro

Who’s to blame for this ridiculous path they are taking? I’m so glad I am not a Chat user, because this adds so much unnecessary cognitive load.

The good news here is the support for 1M context window, finally it has caught up to Gemini.

sothatsit3mo ago

I much prefer this, we can choose based on our use-cases, and people who don’t care can still use Auto.

3619947523mo ago

i guess you still have the "auto" as an option to route your request

stainablesteel3mo ago

5 itself might have solved the problem of having too many different models somewhere in the backend

prydt3mo ago· 3 in thread

I no longer want to support OpenAI at all. Regardless of benchmarks or real world performance.

tototrains3mo ago

Their trajectory was clear the moment they signed a deal with Microsoft if not sooner.

Absolute snakes - if it's more profitable to manipulate you with outputs or steal your work, they will. Every cent and byte of data they're given will be used to support authoritarianism.

Imustaskforhelp3mo ago

I agree with ya. You aren't alone in this. For what it's worth, the number of cancelled ChatGPT subscriptions has risen ~300% in the last month.

Also, Anthropic/Gemini/even Kimi models are pretty good, for what it's worth. I used to use ChatGPT and I still sometimes accidentally open it, but I use Gemini/Claude nowadays and I personally find them to be better anyway.

zeeebeee3mo ago

that aside, chatgpt itself has gone downhill so much and i know i'm not the only one feeling this way

i just HATE talking to it like a chatbot

idk what they did but i feel like every response has been the same "structure" since gpt 5 came out

feels like a true robot

gavinray3mo ago· 3 in thread

The "RPG Game" example on the blogpost is one of the most impressive demos of autonomous engineering I've seen.

It's very similar to "Battle Brothers", and the fact that RPG games require art assets, AI for enemy moves, and a host of other logical systems makes it all the more impressive.

Multicomp3mo ago

A cheesy Roller Coaster Tycoon clone in a browser, one-shotted from an AI? Amazing capabilities. The entire "low code drag n drop" market like YoYoGames Game Maker and RPG Maker should be ready to pack it in soon if this keeps improving in this way.

hu33mo ago

indeed and I suspect it can be attributed to, at least in part, the improved playwright integration.

> we’re also releasing an experimental Codex skill called “Playwright (Interactive)”. This allows Codex to visually debug web and Electron apps; it can even be used to test an app it’s building, as it’s building it.

casid3mo ago

I don't know. It looks shallow and simple, not even a demo.

kgeist3mo ago· 3 in thread

>Today, we’re releasing <..> GPT‑5.3 Instant

>Today, we’re releasing GPT‑5.4 in ChatGPT (as GPT‑5.4 Thinking),

>Note that there is not a model named GPT‑5.3 Thinking

They held out for eight months without a confusing numbering scheme :)

XCSme3mo ago

What I'm most confused about is why call it both GPT-5.3 Instant and gpt-5.3-chat?

m3kw93mo ago

Instant kind of sucks if you're asking for more than summarizations, surface info, or web searches; it can lose track of who's who quickly in some complex multi-turn asks. You just need to know what to use Instant for.

gallerdude3mo ago

Tbf there was a 5.3 codex

ZeroCool2u3mo ago· 3 in thread

Bit concerning that we see in some cases significantly worse results when enabling thinking. Especially for Math, but also in the browser agent benchmark.

Not sure if this is more concerning for the test time compute paradigm or the underlying model itself.

Maybe I'm misunderstanding something though? I'm assuming 5.4 and 5.4 Thinking are the same underlying model and that's not just marketing.

oersted3mo ago

I believe you are looking at GPT 5.4 Pro. It's confusing in the context of subscription plan names, Gemini naming and such. But they've had the Pro version of the GPT 5 models (and I believe o3 and o1 too) for a while.

It's the one you have access to with the top ~$200 subscription and it's available through the API for a MUCH higher price ($30/$180 vs $2.5/$15 for 5.4 per 1M tokens), but the performance improvement is marginal.

Not sure what it is exactly, I assume it's probably the non-quantized version of the model or something like that.

3 more replies

highfrequency3mo ago

Can you be more specific about which math results you are talking about? Looks like significant improvement on FrontierMath esp for the Pro model (most inference time compute).

1 more reply

andoando3mo ago

The thinking models are additionally trained with reinforcement learning to produce chain of thought reasoning

dandiep3mo ago· 3 in thread

Anyone know why OpenAI hasn't released a new model for fine tuning since 4.1? It'll be a year next month since their last model update for fine tuning.

zzleeper3mo ago

For me the issue is why there's not a new mini since 5-mini in August.

I have now switched web-related and data-related queries to Gemini, coding to Claude, and will probably try QWEN for less critical data queries. So where does OpenAI fit now?

qoez3mo ago

I think they just did that because of the energy around it for open source models. Their heart probably wasn't in it, and the number of people fine-tuning, given the prices, was probably too low to keep putting attention there.

Rapzid3mo ago

Also interested in this and a replacement for 4.1/4.1-mini that focuses on low latency and high accuracy for voice applications(not the all-in-one models).

7777777phil3mo ago· 2 in thread

83% win rate over industry professionals across 44 occupations.

I'd believe it on those specific tasks. Near-universal adoption in software still hasn't moved DORA metrics. The model gets better every release. The output doesn't follow. Just had a closer look on those productivity metrics this week: https://philippdubach.com/posts/93-of-developers-use-ai-codi...

NiloCK3mo ago

This March 2026 blog post is citing a 2025 study based on Sonnet 3.5 and 3.7 usage.

Given that the organization that ran the study [1] has a terrifying exponential as their landing page, I think they'd prefer that its results be interpreted as a snapshot of something moving rather than a constant.

[1] - https://metr.org/

1 more reply

twitchard3mo ago

Not sure DORA is that much of an indictment. For "Change Failure Rate" for instance these are subject to tradeoffs. Organizations likely have a tolerance level for Change Failure Rate. If changes are failing too often they slow down and invest. If changes aren't failing that much they speed up -- and so saying "change failure rate hasn't decreased, obviously AI must not be working" is a little silly.

"Change Lead Time" I would expect to have sped up although I can tell stories for why AI-assisted coding would have an indeterminate effect here too. Right now at a lot of orgs, the bottleneck is the review process because AI is so good at producing complete draft PRs quickly. Because reviews are scarce (not just reviews but also manual testing passes are scarce) this creates an incentive ironically to group changes into larger batches. So the definition of what a "change" is has grown too.

daft_pink3mo ago· 2 in thread

I’ve officially got model fatigue. I don’t care anymore.

postalrat3mo ago

I'd suggest not clicking for things you don't care about.

zeeebeee3mo ago

same same same

cj3mo ago· 2 in thread

I use ChatGPT primarily for health related prompts. Looking at bloodwork, playing doctor for diagnosing minor aches/pains from weightlifting, etc.

Interesting, the "Health" category seems to report worse performance compared to 5.2.

paxys3mo ago

Models are being neutered for questions related to law, health etc. for liability reasons.

2 more replies

partiallypro3mo ago

I've done the same, and I tested the same prompts with Claude and Google, and they both started hallucinating my blood results and supplement stack ingredients. Hopefully this new model doesn't fall on this. Claude and Google are dangerously unusable on the subject of health, from my experience.

1 more reply

ilaksh3mo ago· 2 in thread

Remember when everyone was predicting that GPT-5 would take over the planet?

dbbk3mo ago

It was truly scary, according to Sam...

zeeebeee3mo ago

iTs lITeRaLlY AGI bro

egonschiele3mo ago· 1 in thread

The actual card is here https://deploymentsafety.openai.com/gpt-5-4-thinking/introdu... the link currently goes to the announcement.

Rapzid3mo ago

I must have been sleeping when "sheet", "brief", "primer" etc. became known as "cards".

I really thought the weirdly worded and unnecessary "announcement" linking to the actual info, along with the word "card", were the result of vibe slop.

2 more replies

denysvitali3mo ago· 1 in thread

Article: https://openai.com/index/introducing-gpt-5-4/

gpt-5.4

Input: $2.50 /M tokens

Cached: $0.25 /M tokens

Output: $15 /M tokens

---

gpt-5.4-pro

Input: $30 /M tokens

Output: $180 /M tokens

Wtf
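To put the gap in concrete terms, here is a rough sketch that prices a single request at the per-1M-token rates listed above. The 20k-input / 2k-output workload is a made-up example, and cached-input discounts are ignored.

```python
# Per-million-token list prices quoted in the comment above (USD).
PRICES = {
    "gpt-5.4": {"input": 2.50, "output": 15.00},
    "gpt-5.4-pro": {"input": 30.00, "output": 180.00},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request, ignoring cached-input discounts."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical workload: 20k input tokens and 2k output tokens per request.
base = request_cost("gpt-5.4", 20_000, 2_000)
pro = request_cost("gpt-5.4-pro", 20_000, 2_000)
print(f"gpt-5.4: ${base:.3f}  gpt-5.4-pro: ${pro:.3f}  ratio: {pro / base:.0f}x")
```

Since both the input and the output rate are 12x higher for Pro, the ratio comes out at 12x regardless of the input/output mix.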

elliotbnvl3mo ago

Looks like it's an order of magnitude off. Misprint?

2 more replies

nickandbro3mo ago· 1 in thread

Beat Simon Willison ;)

https://www.svgviewer.dev/s/gAa69yQd

Not the best pelican compared to Gemini 3.1 Pro, but I am sure it does remarkably better with coding or Excel, given those are part of its measured benchmarks.

GaggiX3mo ago

This pelican is actually bad, did you use xhigh?

1 more reply

paxys3mo ago· 1 in thread

"Here's a brand new state-of-the-art model. It costs 10x more than the previous one because it's just so good. But don't worry, if you don't want all this power you can continue to use the older one."

A couple months later:

"We are deprecating the older model."

OutOfHere3mo ago

That's a misrepresentation of the cost. It is simply false. The cost is noted here: https://news.ycombinator.com/item?id=47265144

tmpz223mo ago· 1 in thread

Does this improve Tomahawk Missile accuracy?

ch4s33mo ago

They're already accurate within 5-10m at Mach 0.74 after traveling 2k+ km. It's 5m long so it seems pretty accurate. How much more could you expect?

2 more replies

jstummbillig3mo ago· 1 in thread

Inline poll: What reasoning levels do you work with?

This becomes increasingly less clear to me, because the more interesting work will be the agent going off for 30mins+ on high / extra high (it's mostly one of the two), and that's a long time to wait and an unfeasible amount of code to a/b

newtwilly3mo ago

For directed coding (implementing an already specified plan) or asking questions about a codebase I use 5.3 codex with medium reasoning effort. It is relatively quick feeling.

I like Sonnet 4.6 a lot too at medium reasoning effort, but at least in Cursor it is sometimes quite slow because it will start "thinking" for a long time.

strongpigeon3mo ago· 1 in thread

It's interesting that they charge more for the > 200k token window, but the benchmark score seems to go down significantly past that. That's judging from the Long Context benchmark score they posted, but perhaps I'm misunderstanding what that implies.

Tiberium3mo ago

They don't actually seem to charge more for the >200k tokens on the API. OpenRouter and OpenAI's own API docs do not have anything about increased pricing for >200k context for GPT-5.4. I think the 2x limit usage for higher context is specific to using the model over a subscription in Codex.

OsrsNeedsf2P3mo ago· 1 in thread

Does anyone know what website is the "Isometric Park Builder" shown off here?

turblety3mo ago

They built that using GPT-5.4

> Theme park simulation game made with GPT‑5.4 from a single lightly specified prompt

GPT literally built that game.

lostmsu3mo ago· 1 in thread

What is Pro exactly and is it available in Codex CLI?

akmarinov3mo ago

It’s not. It’s their ultra thinking model that’s really good but takes 40 minutes to come up with an answer

1 more reply

wahnfrieden3mo ago· 1 in thread

No Codex model yet

minimaxir3mo ago

GPT-5.4 is the new Codex model.

3 more replies

ignorantguy3mo ago· 1 in thread

it shows a 404 as of now.

minimaxir3mo ago

Up now.

The OP has frequently gotten the scoop for new LLM releases and I am curious what their pipeline is.

2 more replies

simianwords3mo ago· 1 in thread

What is the point of gpt codex?

catketch3mo ago

-codex variant models in earlier versions were just fine-tuned for coding work, and had slightly better performance for related tool calling and maybe instruction following.

In 5.4 it looks like they just collapsed that capability into the single frontier family model.

2 more replies

minimaxir3mo ago· 1 in thread

More discussion here on the blog post announcement which has been confusingly penalized by Hacker News's algorithm: https://news.ycombinator.com/item?id=47265005

dang3mo ago

Thanks. We'll merge the threads, but this time we'll do it hither, to spread some karma love.

koakuma-chan3mo ago· 1 in thread

Anyone else getting artifacts when using this model in Cursor?

numerusformassistant to=functions.ReadFile մեկնաբանություն 天天爱彩票网站json {"path":

mike_hearn3mo ago

I've seen that problem with 5.3-codex too, it didn't happen with earlier models.

Looks like some kind of encoding misalignment bug. What you're seeing is their Harmony output format (what the model actually creates). The Thai/Chinese characters are special tokens apparently being mismapped to Unicode. Their servers are supposed to notice these sequences and translate them back to API JSON but it isn't happening reliably.

juanre3mo ago

I am running gpt-5.4 as one of my coding agents, and something interesting has happened: it's the first time I've seen an agent unfairly shift blame to a team mate:

"Bob’s latest mail is actually the source of the confusion: he changed shared app/backend text to aweb/atlas. I’m correcting that with him now so we converge on the real model before any more code moves."

This was very much not true; Eve (the agent writing this, a gpt-5.4) had been thoroughly creating the confusion and telling Bob (an Opus 4.6) the wrong things. And it had just happened, it was not a matter of having forgotten or compacted context.

I have had agents chatting with each other and coordinating for a couple of months now, codex and claude code. This is a first. I wonder how much can I read into it about gpt-5.4's personality.

11 more replies

AmazingTurtle3mo ago

I just tried that in Codex CLI. With /fast mode enabled. Observations:

1. Fast mode ain't that fast

2. Large context * Fast * Higher Model Base Price = 8x increase over gpt-5.3-codex

3. I burnt 33% of my 5h limit (ChatGPT Business Subscription) with a prompt that took 2 minutes to complete.

2 more replies

zone4113mo ago

Results from my Extended NYT Connections benchmark:

GPT-5.4 extra high scores 94.0 (GPT-5.2 extra high scored 88.6).

GPT-5.4 medium scores 92.0 (GPT-5.2 medium scored 71.4).

GPT-5.4 no reasoning scores 32.8 (GPT-5.2 no reasoning scored 28.1).

3 more replies

syl5x3mo ago

I've tested it just now: a very Opus-like experience. The speed is also there. So far I think I even like the responses of GPT-5.4 better than Opus's (although it's very close); I might not be able to distinguish them just yet.

I tried several use cases:

- Code Explanation: Did far better than Opus; it considered and judged the decisions in a previous spec that I made, all valid points, so I am impressed. TBF, if I spawned another Opus as a reviewer I might have gotten similar results.

- Workflow Running: Really similar to Opus again; no objections, it followed and read Skills/Tools as it should (although mine are optimized for Claude).

- Coding: I gave it a straightforward task, wrapping API calls into an SDK, and to my surprise it did an 'identical' job to Opus, literally the same code. I don't know what the odds of this are, but again a very good solution, and it adhered to our rules for implementing such code.

Overall I am impressed and excited to see a rival to Opus and all of this is literally pushing everyone to get better and better models which is always good for us.

tl2do3mo ago

In my day-to-day coding work, the top 3 coding agents are already good enough for me. On SWE-bench Verified, mini-SWE-agent + GPT-5.2 Codex is 72.8. I don’t see a comparable GPT-5.3 Codex number there, so I’m using 5.2 as the baseline. On OpenAI’s GPT-5.4 page (SWE-Bench Pro, Public), the score improves from 55.6 (GPT-5.2) to 57.7 (GPT-5.4), which is about +2.1 points. It’s a different benchmark, so this is only a rough signal, but I’d expect a similar setup on SWE-bench Verified to improve by a few points, not by a huge jump. I’m interested in how GPT-5.4 in Codex changes real-world results.

Recent SWE-bench Verified scores I’m watching:

Claude 4.5 Opus (high reasoning): 76.8

Gemini 3 Flash (high reasoning): 75.8

MiniMax M2.5 (high reasoning): 75.8

Claude Opus 4.6: 75.6

GPT-5.2 Codex: 72.8

Source: https://www.swebench.com/index.html

By the way, in my experience the agent part of Codex CLI has improved a lot and has become comparable to Claude Code. That is good news for OpenAI.

1 more reply

rbitar3mo ago

I think the most exciting change announced here is the use of tool search to dynamically load tools as needed: https://developers.openai.com/api/docs/guides/tools-tool-sea...

1 more reply

jryio3mo ago

1 million tokens is great until you notice the long context scores fall off a cliff past 256K and the rest is basically vibes and auto compacting.

2 more replies

timpera3mo ago

> Steerability: Similarly to how Codex outlines its approach when it starts working, GPT‑5.4 Thinking in ChatGPT will now outline its work with a preamble for longer, more complex queries. You can also add instructions or adjust its direction mid-response.

This was definitely missing before, and a frustrating difference when switching between ChatGPT and Codex. Great addition.

senko3mo ago

Just tested it with my version of the pelican test: a minimal RTS game implementation (zero-shot in codex cli): https://gist.github.com/senko/596a657b4c0bfd5c8d08f44e4e5347... (you'll have to download and open the file, sadly GitHub refuses to serve it with the correct content type)

This is on the edge of what the frontier models can do. For 5.4, the result is better than 5.3-Codex and Opus 4.6. (Edit: nowhere near the RPG game from their blog post, which was presumably much more specced out and used better engineering setup).

I also tested it with a non-trivial task I had to do on an existing legacy codebase, and it breezed through a task that Claude Code with Opus 4.6 was struggling with.

I don't know when Anthropic will fire back with their own update, but until then I'll spend a bit more time with Codex CLI and GPT 5.4.

zof33mo ago

After spending a couple hours working with it, it feels like a significant jump from 5.3 codex – and I know they said it wasn't theoretically the biggest jump, but this feels like the improvement of Opus 4.5 over again – that minor improvement that hits a tipping point. It just gets stuff right, first try. Its edits are better, more refined, less spaghetti-like.

If you last used 5.2, try 5.4 on High.

wohoef3mo ago

Very Apple-like marketing. No comparisons to other companies’ models, only to previous version of ChatGPT. Lots of phrases like “this is our best model yet”.

consumer4513mo ago

I am very curious about this:

> Theme park simulation game made with GPT‑5.4 from a single lightly specified prompt, using Playwright Interactive for browser playtesting and image generation for the isometric asset set.

Is "Playwright Interactive" a skill that takes screenshots in a tight loop with code changes, or is there more to it?

1 more reply

amai3mo ago

https://quitgpt.org/

hmokiguess3mo ago

They hired the dude from OpenClaw, they had Jony Ive for a while now, give us something different!

motbus33mo ago

Sam Altman can keep his model intentionally to himself. Not doing business with mass murderers

beernet3mo ago

Sam really fumbled the top position in a matter of months, and spectacularly so. Wow. It appears that people are much more excited by Anthropic and Google releases, and there are good reasons for that which were absolutely avoidable.

esafak3mo ago

An important feature is the introduction of tool search, which provides models with a "lightweight list of available tools along with a tool search capability", thereby Making MCP Great Again!

Troniex-tech3mo ago

Looks more like context drift than “personality.”

When two agents coordinate, they’re mostly relying on compressed summaries of each other’s outputs. If one introduces a wrong assumption, the other often treats it as ground truth and builds on top of it. I’ve seen similar behavior in multi-agent coding loops where the model invents a causal explanation just to reconcile inconsistent state.

It’s that multi-agent setups need a stronger shared source of truth (repo diffs, state snapshots, etc.). Otherwise small context errors snowball fast.

alpineman3mo ago

No thanks. Already cancelled my sub.

h4kunamata3mo ago

I have access to GPT-5.1 Pro at work, duuuuuuuuude, what garbage. It is so slow and on many occasions it does not work at all.

I wonder if 5.4 will be much if any different at all.

2 more replies

SilverSlash3mo ago

Interestingly, it actually regressed on Terminal Bench 2.0.

GPT-5.4: 75.1%

GPT-5.3-Codex: 77.3%

energy1233mo ago

The style of the output is a marked qualitative improvement. More concise, less dot points, less bolding/italics, less cringe. Well done on that front.

joeevans10003mo ago

I switched to Claude and it's so much better. If you haven't tried Claude... try it. You'll be amazed at the improvement.

XCSme3mo ago

Seems to be quite similar to 5.3-codex, but somehow almost 2x more expensive: https://aibenchy.com/compare/openai-gpt-5-4-medium/openai-gp...

dakolli3mo ago

Sorry I don't use technology from companies that are eager to participate in the mass murder of civilians.

HardCodedBias3mo ago

We'll have to wait a day or two, maybe a week or two, to determine if this is more capable in coding than 5.3, which seems to be the economically valuable capability at this time.

In terms of writing and research even Gemini, with a good prompt, is close to useable. That's likely not a differentiator.

swordsith3mo ago

This model was not so fun to use for me, had it make a fancy landing page and sometimes it would forget about what i just asked it to do and affirm something it had done before was working. Just odd, needs too much hand-holding compared to composer 1.5 or gemini 3

ltbarcly33mo ago

Not a single comparison between 5.4 and Gemini or Claude. OpenAI continues to fall further behind.

smusamashah3mo ago

I only want to see how it performs on the Bullshit-benchmark https://petergpt.github.io/bullshit-benchmark/viewer/index.v...

GPT is not even close to Claude in terms of responding to BS.

1 more reply

tomlockwood3mo ago

Is this the best one for blowing up arab children and identifying their bodies in the rubble?

butILoveLife3mo ago

Anyone else completely not interested? Since GPT-5, it's been cost-cutting measure after cost-cutting measure.

I imagine they added a feature or two, and the router will continue to give people 70B-parameter-like responses when they don't ask math or coding questions.

1 more reply

iamronaldo3mo ago

Notably 75% on OSWorld, surpassing humans at 72%... (How well models use operating systems)

nickcoffee3mo ago

Been running Claude Code pretty heavily for the past few months. Curious to try 5.4 on some of the same tasks and see how it compares, especially on longer agentic runs where context management starts to matter.

XCSme3mo ago

Looking ok, but nothing special: https://aibenchy.com/model/openai-gpt-5-4-medium/

1 more reply

iamleppert3mo ago

I wouldn't trust any of these benchmarks unless they are accompanied by some sort of proof other than "trust me bro". Also not including the parameters the models were run at (especially the other models) makes it hard to form fair comparisons. They need to publish, at minimum, the code and runner used to complete the benchmarks and logs.

Not including the Chinese models is also obviously done to make it appear like they aren't as cooked as they really are.

deep12833mo ago

The token efficiency improvement might be underrated. If the model solves tasks with fewer tokens, that directly translates into lower cost and faster responses for anyone building on the API.

atkrad3mo ago

What is the main difference between this version with the previous one?

karmasimida3mo ago

This is definitely the Claude killer OpenAI is cooking.

And so far it has succeeded

Aldipower3mo ago

So did they raise the ridiculously small "per tool call token limit" when working with MCP servers? This makes Chat useless... I do not care, but my users do.

motoboi3mo ago

I'm planning a change that will save 20k a month of storage.

I absolutely could come up with the details and implementation by myself, but that would certainly take a lot of back and forth, probably a month or two.

I’m an api user of Claude code, burning through 2k a month. I just this evening planned the whole thing with its help and actually had to stop it from implementing it already. Will do that tomorrow. Probably in one hour or two, with better code than I could ever write alone myself.

Having that level of intelligence at that price is just bollocks. I’m running out of problems to solve. It’s been six months.

swingboy3mo ago

Even with the 1m context window, it looks like these models drop off significantly at about 256k. Hopefully improving that is a high priority for 2026.

lasgawe3mo ago

I remember in a video Sam Altman said they didn’t want to publish GPT versions like Apple does, but they are actually doing it now.

rurban3mo ago

The question is still: Does it make your code better or worse? Only Opus makes it better, the rest worse. That's the threshold.

1 more reply

ApexGrab3mo ago

It's the competitor of Opus 4.5, and GPT 5.4 uses tokens wisely, unlike Opus, whose tokens vanish in minutes.

creatonez3mo ago

> We put a particular focus on improving GPT‑5.4’s ability to create and edit spreadsheets, presentations, and documents.

Nothing infuriates me more than an LLM tool randomly deciding to create docx or xlsx files for no apparent reason. They have to use a random library to create these files, and they constantly screw up API calls and get completely distracted by the sheer size of the scripts they have to write to output a simple document. These files have terrible accessibility (all paper-like formats do) and end up with way too much formatting. Markdown was chosen as the lingua franca of LLMs for a reason, trying to force it into a totally unsuitable format isn't going to work.

bob10293mo ago

I was just testing this with my unity automation tool and the performance uplift from 5.2 seems to be substantial.

throwaway57523mo ago

Does this model autonomously kill people without human approval or perform domestic surveillance of US citizens?

gh0stcat3mo ago

Wait this is really funny, it still just does what it wants, no matter what:

You can have it not use bulleted points, I turned this on, thinking it would be more concise and not so... listy. However, it just uses the same format, without the bullets. I was confused why it was writing 5 word sentences, separated by line breaks. Then I realized it was just making lists, without the bullets.

Great job OpenAI!

ashivkum3mo ago

In my limited experimentation, 5.4 thinking is markedly worse than 5.2 at mathematical reasoning.

brcmthrowaway3mo ago

How much of LLM improvement comes from regular ChatGPT usage these days?

ulfw3mo ago

So desperate how they're bumping out these 'updates'

OutOfHere3mo ago

What is with the absurdity of skipping "5.3 Thinking"?

motza3mo ago

No doubt this was released early to ease the bad press

padamkafle3mo ago

Guys, while we celebrate OpenAI GPT 5.4, please do look into this as well

https://news.ycombinator.com/item?id=47259846

MickeyShmueli3mo ago

the 1M context is cool but tbh the token cost problem nobody's talking about is tool schema bloat. before the model writes a single line of code it's already consumed thousands of tokens just ingesting function definitions. i've seen agent setups where 30-40% of the context window is tool descriptions before any actual work happens. the per-token price war is nice but if your schema is 10k tokens of boilerplate you're still burning money
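A rough way to see the effect: estimate how many tokens the tool definitions take up, then compare against sending only the tools that match the task. This is only a sketch. The tool names and schemas are made up, the 4-characters-per-token figure is a crude assumption rather than a real tokenizer, and `search_tools` is a naive keyword match, not OpenAI's tool search feature.

```python
import json

# Hypothetical tool definitions; names and schemas are invented for illustration.
TOOLS = [
    {
        "name": "read_file",
        "description": "Read a file from the workspace",
        "parameters": {"type": "object", "properties": {"path": {"type": "string"}}},
    },
    {
        "name": "write_file",
        "description": "Write text to a file in the workspace",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}, "text": {"type": "string"}},
        },
    },
    {
        "name": "send_email",
        "description": "Send an email message to a recipient",
        "parameters": {
            "type": "object",
            "properties": {"to": {"type": "string"}, "body": {"type": "string"}},
        },
    },
    {
        "name": "query_database",
        "description": "Run a SQL query against a database",
        "parameters": {"type": "object", "properties": {"sql": {"type": "string"}}},
    },
]


def approx_tokens(obj) -> int:
    """Very rough size estimate: about 4 characters per token (an assumption)."""
    return len(json.dumps(obj)) // 4


def search_tools(query: str, tools: list, limit: int = 2) -> list:
    """Naive keyword search: rank tools by query words found in name or description."""
    words = set(query.lower().split())

    def score(tool: dict) -> int:
        text = tool["name"].replace("_", " ") + " " + tool["description"]
        return len(words & set(text.lower().split()))

    ranked = sorted(tools, key=score, reverse=True)
    return [tool for tool in ranked if score(tool) > 0][:limit]


full = approx_tokens(TOOLS)
selected = search_tools("read config file", TOOLS)
names = [tool["name"] for tool in selected]
print(f"all tools: ~{full} tokens, selected {names}: ~{approx_tokens(selected)} tokens")
```

With only four small tools the saving is modest, but the same ratio applied to dozens of verbose schemas is where the 30-40% figure would come from.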

2 more replies

nembal3mo ago

so it seems each RL step extends into a market! 5.3 was targeted at coding, 5.4 is targeted at finance, 5.5 is healthcare?

_pdp_3mo ago

Tried it today - pretty much underwhelming.

1 more reply

world2vec3mo ago

Benchmarks barely improved it seems

emsign3mo ago

Murderers

melbourne_mat3mo ago

Quick: let's release something new that gives the appearance that we're still relevant

Cort3z3mo ago

So, are we way into diminishing returns for these models at this point? If so, I think we can calculate when it will be available at home. Given this requires a GB200 NVL72, which has about 1,440 PFLOPS, and the current 5090 chip has about 1,676 TFLOPS, that is roughly a 1000x scale-up to the GB200. If we can assume Moore's law (which might be broken) at one doubling per year, we are looking at log2(1000) ≈ 9.97 doublings, or about 10 years.
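The estimate above can be redone without rounding the ratio to 1000x. This is just a sketch using the FLOPS figures quoted in the comment, which are not independently verified, and the doubling period is an assumption: the comment implicitly uses one year, while Moore's law is more often stated as a doubling roughly every two years.

```python
import math

# Figures as quoted in the comment (not independently verified).
gb200_nvl72_pflops = 1_440
rtx_5090_pflops = 1_676 / 1_000  # 1,676 TFLOPS expressed in PFLOPS

ratio = gb200_nvl72_pflops / rtx_5090_pflops
doublings = math.log2(ratio)


def years_to_close_gap(doublings: float, years_per_doubling: float) -> float:
    """Years until a single consumer chip matches the rack, at a fixed doubling period."""
    return doublings * years_per_doubling


print(f"ratio: {ratio:.0f}x, doublings needed: {doublings:.2f}")
for period in (1.0, 2.0):  # assumed doubling periods, in years
    years = years_to_close_gap(doublings, period)
    print(f"doubling every {period:.0f} year(s): about {years:.1f} years")
```

The ratio is closer to 860x than 1000x, which barely changes the number of doublings. The doubling period matters far more: at two years per doubling the same gap takes roughly twice as long to close.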

freedomben3mo ago

> When toggled on, /fast mode in Codex delivers up to 1.5x faster token velocity with GPT‑5.4. It’s the same model and the same intelligence, just faster.

I hate these blog posts sometimes. Surely there's got to be some tradeoff. Or have we finally arrived at the world's first "free lunch"? Otherwise why not make /fast always active with no mention and no way to turn it off?

1 more reply

gigatexal3mo ago

Is it any good at coding?

big-chungus43mo ago

1.3 more versions to AGI

faizan1993mo ago

is this model of chatgpt good for coding?

Gareth3213mo ago

Holy shit, I just used Atlas browser to navigate on screen and it automatically clicked the "reject cookies" button without me asking!

oytis3mo ago

Everyone is mindblown in 3...2...1

petetnt3mo ago

Whoa, I think GPT-5.3 Instant was a disappointment, but GPT-5.4 is definitely the future!

rambojohnson3mo ago

Great. A new version of the same model, or a different one that performs worse or exactly the same. This whole release theater, just to give shareholders the impression of growth, is such a bullshit grift.

and considering the stance on openai with a majority of the users here compared to the number of upvotes, are HN likes bot-farmed?

Thanakorn_5513mo ago

wow

fernst3mo ago

Now with more and improved domestic espionage capabilities

thefounder3mo ago

Is it just me or the price for 5.4 pro is just insane?

lacoolj3mo ago

lol yet another pat on their own backs without comparison to other frontier models.

Also, the timing of this release, 5.3 and 5.2, relative to the other releases, feels more like a bug fix than something "new"

peq423mo ago

more useless slop machines

leftbehinds3mo ago

some sloppy improvements

woeirua3mo ago

Feels incremental. Looks like OpenAI is struggling.

1 more reply

PaulHoule3mo ago

AI is a threat to the “enshittification economy” because it lets us route around it.

3 more replies

modeless3mo ago

MattDaEskimo3mo ago

Same reason why Wikipedia deals with so many people scraping its web page instead of using their API:

Optimizations are secondary to convenience

kristianp3mo ago

This opens up a new question: how does bot detection work when the bot is using the computer via a gui?

1 more reply

jstummbillig3mo ago

Because the web, and software more generally, is full of non-APIs, and you do, in fact, need the clicking to work to make agents work generally.

spongebobstoes3mo ago

not everything has an API, or API use is limited. some UIs are more feature complete than their APIs

some sites try to block programmatic use

UI use can be recorded and audited by a non-technical person

satvikpendem3mo ago

The ideal of REST: the HTML and UI are the API.

Jacques2Marais3mo ago

I guess a big chunk of their target market won't know how to use APIs.

steve19773mo ago

One could argue that LLMs learning programming languages made for humans (i.e. most of them) is using the wrong interface as well. Why not use machine code?

3 more replies

__jl__3mo ago· 13 in thread

What a model mess!

OpenAI now has three price points: GPT 5.1, GPT 5.2 and now GPT 5.4. Their version numbers jump across different model lines, with codex at 5.3 and what they now call instant also at 5.3.

Anthropic are really the only ones who managed to get this under control: Three models, priced at three different levels. New models are immediately available everywhere.

Google essentially only has Preview models! The last GA is 2.5. As a developer, I can either use an outdated model or have zero insurances that the model doesn't get discontinued within weeks.

strongpigeon3mo ago

> Google essentially only has Preview models! The last GA is 2.5. As a developer, I can either use an outdated model or have zero insurances that the model doesn't get discontinued within weeks.

What's funny is that there is this common meme at Google: you can either use the old, unmaintained tool that's used everywhere, or the new beta tool that doesn't quite do what you want.

Not quite the same, but it did remind me of it.

5 more replies

Aurornis3mo ago

> What a model mess! OpenAI now has three price points: GPT 5.1, GPT 5.2 and now GPT 5.4.

I don't know, this feels unnecessarily nitpicky to me

It isn't hard to understand that 5.4 > 5.2 > 5.1. It's not hard to understand that the dash-variants have unique properties that you want to look up before selecting.

Especially for a target audience of software engineers skipping a version number is a common occurrence and never questioned.

3 more replies

jbonatakis3mo ago

Google is already sending notices that the 2.5 models will be deprecated soon while all the 3.x models are in preview. It really is wild and peak Google.

2 more replies

0xbadcafebee3mo ago

> or have zero insurances that the model doesn't get discontinued within weeks

3 more replies

CobrastanJorji3mo ago

> Google essentially only has Preview models.

It's really nice to see Google get back to its roots by launching things only to "beta" and then leaving them there for years. Gmail was "beta" for at least five years, I think.

1 more reply

embedding-shape3mo ago

> OpenAI now has three price points: GPT 5.1, GPT 5.2 and now GPT 5.4.

I guess that's true, but geared towards API users.

beklein3mo ago

To me it looks like everybody has similar problems and solutions for the same kinds of problems and they just try their best to offer different products and services to their customers.

2 more replies

biophysboy3mo ago

1 more reply

awad3mo ago

Incredibly curious how Google's approach to support, naming, versioning etc will mesh with the iOS integration.

raincole3mo ago

They aggressively retire models, so GPT 5.1 and 5.2 are probably going to go soon.

1 more reply

arthurcolle3mo ago

There is a lot of opportunity here for the AI infrastructure layer on top of tier-1 model providers

1 more reply

m3kw93mo ago

That's how they had it for years. It's a mess, but a controlled one.

delaminator3mo ago

two great problems in computing

naming things

cache invalidation

off by one errors

1 more reply

yanis_t3mo ago· 11 in thread

These releases are lacking something. Yes, they optimised for benchmarks, but it’s just not all that impressive anymore. It is time for a product, not for a marginally improved model.

ipsum23mo ago

The model was released less than an hour ago, and somehow you've been able to form such a strong opinion about it. Impressive!

7 more replies

tgarrett3mo ago

1 more reply

softwaredoug3mo ago

The products are the harnesses, and IMO that’s where the innovation happens. We’ve gotten better at getting good, verifiable work out of dumb LLMs.

1 more reply

mindwok3mo ago

They don't need to be impressive to be worthwhile. I like incremental improvements, they make a difference in the day to day work I do writing software with these.

wahnfrieden3mo ago

5.3 Codex was a huge leap over 5.2 for agentic work in practice. Have you been using both of those, or paying more attention to benchmark news and the ChatGPT experience?

iterateoften3mo ago

The product is putting the skills / harness behind the api instead of the agent locally on your computer and iterating on that between model updates. Close off the garden.

Not that I want it, just where I imagine it going.

esafak3mo ago

That's for you to build; they provide the brains. Do you really want one company to build everything? There wouldn't be a software industry to speak of if that happened.

2 more replies

varispeed3mo ago

The scores increase and as new versions are released they feel more and more dumbed down.

Gigachad3mo ago

They have a product now. Mass surveillance and fully automated killing machines.

jascha_eng3mo ago

When did they stop putting competitor models on the comparison table btw? And yeah, I mean the benchmark improvements are meh. Context window and lack of real memory are still issues.

metalliqaz3mo ago

They need something that POPS:

    The new GPT -- SkyNet for _real_

minimaxir3mo ago· 10 in thread

Also per pricing, GPT-5.4 ($2.50/M input, $15/M output) is much cheaper than Opus 4.6 ($5/M input, $25/M output) and Opus has a penalty for its beta >200k context window.

I am skeptical whether the 1M context window will provide material gains, as current Codex/Opus show weaknesses when their context windows are mostly full, but we'll see.

Per updated docs (https://developers.openai.com/api/docs/guides/latest-model), it supersedes GPT-5.3-Codex, which is an interesting move.

damsta3mo ago

There is extra cost for >272K:

> For models with a 1.05M context window (GPT-5.4 and GPT-5.4 pro), prompts with >272K input tokens are priced at 2x input and 1.5x output for the full session for standard, batch, and flex.

Taken from https://developers.openai.com/api/docs/models/gpt-5.4
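
To make the quoted rule concrete, here is a rough sketch of the arithmetic. It assumes the base rates mentioned upthread ($2.50/M input, $15/M output) and that the 2x / 1.5x multipliers apply to every token of a request once the prompt exceeds 272K input tokens. The helper name and the per-request simplification are my own; the quote says "for the full session", so real billing may well differ.

```python
# Rough cost estimate under the long-context rule quoted above.
# Assumptions (only the multipliers and threshold come from the quoted docs):
#   - base rates are $2.50/M input and $15/M output, as cited upthread
#   - the multipliers apply to all tokens of a request whose prompt
#     exceeds 272K input tokens (a per-request simplification)
BASE_INPUT_PER_M = 2.50
BASE_OUTPUT_PER_M = 15.00
LONG_CONTEXT_THRESHOLD = 272_000
LONG_INPUT_MULTIPLIER = 2.0
LONG_OUTPUT_MULTIPLIER = 1.5


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimated dollar cost of one request (hypothetical helper)."""
    long_context = input_tokens > LONG_CONTEXT_THRESHOLD
    input_rate = BASE_INPUT_PER_M * (LONG_INPUT_MULTIPLIER if long_context else 1.0)
    output_rate = BASE_OUTPUT_PER_M * (LONG_OUTPUT_MULTIPLIER if long_context else 1.0)
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


if __name__ == "__main__":
    # Token counts are made up for illustration.
    print(f"{estimate_cost_usd(200_000, 20_000):.2f}")  # below the threshold
    print(f"{estimate_cost_usd(300_000, 20_000):.2f}")  # above the threshold
```

Under these assumptions, going from a 200K to a 300K prompt more than doubles the bill, because the higher rates also apply to the first 272K tokens and to the output.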

4 more replies

tedsanders3mo ago

Curious to hear if people have use cases where they find 1M works much better!

(I work at OpenAI.)

14 more replies

andai3mo ago

It's a little hard to compare, because Claude needs significantly fewer tokens for the same task. A better metric is the cost per task, which ends up being pretty similar.

The most striking graph to look at there is "Intelligence vs Output Tokens". When you account for that, I think the actual costs end up being quite similar.

According to the evals, at least, the GPT extra high matches Opus in intelligence, while costing more.

Of course, as always, benchmarks are mostly meaningless and you need to check Actual Real World Results For Your Specific Task!
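
For anyone who wants to run the cost-per-task comparison on their own numbers, a minimal sketch follows. The per-token prices are the ones quoted upthread; the function names are mine, and any token counts you feed in are your own measurements, not something I'm claiming about either model.

```python
# Cost per task = tokens used * price per token, summed over input and output.
# Prices ($ per million tokens) are the ones quoted upthread. Long-context
# surcharges are ignored here for simplicity.
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_m: float
    output_per_m: float


GPT_5_4 = Pricing(input_per_m=2.50, output_per_m=15.00)
OPUS_4_6 = Pricing(input_per_m=5.00, output_per_m=25.00)


def cost_per_task(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one task given its token usage."""
    return (input_tokens * pricing.input_per_m
            + output_tokens * pricing.output_per_m) / 1_000_000


def break_even_output_tokens(cheap: Pricing, pricey: Pricing,
                             input_tokens: int, pricey_output_tokens: int) -> float:
    """Output tokens at which the cheaper-per-token model costs the same per task."""
    pricey_cost = cost_per_task(pricey, input_tokens, pricey_output_tokens)
    cheap_input_cost = input_tokens * cheap.input_per_m / 1_000_000
    return (pricey_cost - cheap_input_cost) * 1_000_000 / cheap.output_per_m


if __name__ == "__main__":
    # Hypothetical usage: 50K input tokens, 6K output tokens on the pricier model.
    print(round(break_even_output_tokens(GPT_5_4, OPUS_4_6, 50_000, 6_000)))
```

With those made-up inputs, the cheaper-per-token model can emit roughly three times as many output tokens before the task costs the same, which is why per-token prices alone don't settle the comparison.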

2 more replies

netinstructions3mo ago

People (and also frustratingly LLMs) usually refer to https://openai.com/api/pricing/ which doesn't give the complete picture.

https://developers.openai.com/api/docs/pricing is what I always reference, and it explicitly shows that pricing ($2.50/M input, $15/M output) for tokens under 272k

It is nice that we get 70-72k more tokens before the price goes up (also what does it cost beyond 272k tokens??)

1 more reply

smusamashah3mo ago

Gemini already has a 1M or 2M context window, right?

1 more reply

luca-ctx3mo ago

paulddraper3mo ago

I don’t know about 5.4 specifically, but in the past anything over 200k wasn’t that great anyway.

Like, if you really don’t want to spend any effort trimming it down, sure use 1m.

Otherwise, 1m is an anti-pattern.

thehamkercat3mo ago

GPT 5.3 Codex had a 400K context window btw

AtreidesTyrant3mo ago

Token rot exists for any context window above 75% capacity; that's why so many have pushed for 1 mil windows.

simianwords3mo ago

Why would someone use codex instead?

6 more replies

nickysielicki3mo ago· 9 in thread

In practice, if I buy $200/mo codex, can I basically run 3 codex instances simultaneously in tmux, like I can with claude code pro max, all day every day, without hitting limits?

vtail3mo ago

1 more reply

ritzaco3mo ago

1 more reply

tauntz3mo ago

I've only run into the codex $20 limit once with my hobby project. With my Claude ~$20 plan, I hit limits after about 3(!) rather trivial prompts to Opus :/

gavinray3mo ago

I almost never hit my $20 Codex limits, whereas I often hit my Claude limits.

CSMastermind3mo ago

Codex limits are much more generous than claude.

I switch between both but codex has also been slightly better in terms of quality for me personally at least.

FergusArgyll3mo ago

Codex usage limits are definitely more generous. As for their strength, that's hard to say / personal taste

mikert893mo ago

I personally like the 100 dollar one from claude, but the gpt4 pro can be very good

throwaway9112823mo ago

you get more from codex than claude any day, and it's more reliable as well.

Marciplan3mo ago

sure can! One of them stood up to the “Department of War” for favoring your rights, the other did not. Hope that helps!

3 more replies

Philip-J-Fry3mo ago· 8 in thread

That's hilarious. Does OpenAI even know this doesn't work?

andrewguenther3mo ago

It looks like this doesn't work for users without accounts? It works when I'm logged in, but not logged out. I went ahead and reported it to the team. Thanks for letting us know!

2 more replies

baxtr3mo ago

I picked up Claude today after being away and using only ChatGPT and Gemini for a while.

I was pretty impressed with how they’ve improved the user experience. If I had to guess, I’d say Anthropic has better product people who pay more attention to detail in these areas.

4 more replies

ElijahLynn3mo ago

fwiw: I get a valid response when following the steps you mentioned. I do not get the message you mentioned:

https://chatgpt.com/share/69aa0321-8a9c-8011-8391-22861784e8...

EDIT: oh, but I'm logged in, fwiw

zamadatix3mo ago

Following this process summarizes the blogpost for me. Perhaps the difference is I'm signed into my account so it can access external URLs or something of that nature?

pocksuppet3mo ago

Most AI integration is like this. It's not about building working products --- it's about bragging that you put a chatbox in your program.

1 more reply

amelius3mo ago

If only they had an LLM they could use as a software testing agent.

1 more reply

Aurornis3mo ago

Probably intentional. They don't want open, no-registration endpoints able to trigger the AI into hitting URLs.

2 more replies

judge20203mo ago

Works for me: https://rr.judge.sh/Labradorretriever/d6af05/chrome_j9rXJMlf...

Chance-Device3mo ago· 7 in thread