They show an example of 5.4 clicking around in Gmail to send an email.
I still think this is the wrong interface to be interacting with the internet. Why not use Gmail APIs? No need to do any screenshot interpretation or coordinate-based clicking.
Screenshots on the other hand are documentation, API, and discovery all in one. And you’d be surprised how little context/tokens screenshots consumer compared to all the back and forth verbose json payloads of APIs
Of course APIs and CLIs also exist, but they don't necessarily have feature parity, so more development would be needed. Maybe that's the future though since code generation is so good - use AI to build scaffolding for agent interaction into every product.
If an API is exposed you can just have the LLM write something against that.
But people are intimidated by the complexity of writing web crawlers because management has been so traumatized by the cost of making GUI applications that they couldn’t believe how cheap it is to write crawlers and scrapers…. Until LLMs came along, and changed the perceived economics and created a permission structure. [1]
AI is a threat to the “enshittification economy” because it lets us route around it.
[1] that high cost of GUI development is one reason why scrapers are cheap… there is a good chance that the scraper you wrote 8 years ago still works because (a) they can’t afford to change their site and (b) if they could afford to change their site changing anything substantial about it is likely to unrecoverably tank their Google rankings so they won’t. A.I. might change the mechanics of that now that you Google traffic is likely to go to zero no matter what you do.
Optimizations are secondary to convenience
some sites try to block programmatic use
UI use can be recorded and audited by a non-technical person
OpenAI now has three price points: GPT 5.1, GPT 5.2 and now GPT 5.4. There version numbers jump across different model lines with codex at 5.3, what they now call instant also at 5.3.
Anthropic are really the only ones who managed to get this under control: Three models, priced at three different levels. New models are immediately available everywhere.
Google essentially only has Preview models! The last GA is 2.5. As a developer, I can either use an outdated model or have zero insurances that the model doesn't get discontinued within weeks.
What's funny is that there is this common meme at Google: you can either use the old, unmaintained tool that's used everywhere, or the new beta tools that doesn't quite do what you want.
Not quite the same, but it did remind me of it.
I don't know, this feels unnecessarily nitpicky to me
It isn't hard to understand that 5.4 > 5.2 > 5.1. It's not hard to understand that the dash-variants have unique properties that you want to look up before selecting.
Especially for a target audience of software engineers skipping a version number is a common occurrence and never questioned.
Why are you using the same model after a month? Every month a better model comes out. They are all accessible via the same API. You can pay per-token. This is the first time in, like, all of technology history, that a useful paid service is so interoperable between providers that switching is as easy as changing a URL.
It's really nice to see Google get back to its roots by launching things only to "beta" and then leaving them there for years. Gmail was "beta" for at least five years, I think.
I guess that's true, but geared towards API users.
Personally, since "Pro Mode" became available, I've been on the plan that enables that, and it's one price point and I get access to everything, including enough usage for codex that someone who spends a lot of time programming, never manage to hit any usage limits although I've gotten close once to the new (temporary) Spark limits.
Also their pricing based on 5m/1h cache hits, cash read hits, additional charges for US inference (but only for Opus 4.6 I guess) and optional features such as more context and faster speed for some random multiplier is also complex and actually quiet similar to OpenAI's pricing scheme.
To me it looks like everybody has similar problems and solutions for the same kinds of problems and they just try their best to offer different products and services to their customers.
naming things
cache invalidation
off by one errors
Not that I want it, just where I imagine it going.
The new GPT -- SkyNet for _real_Also per pricing, GPT-5.4 ($2.50/M input, $15/M output) is much cheaper than Opus 4.6 ($5/M input, $25/M output) and Opus has a penalty for its beta >200k context window.
I am skeptical whether the 1M context window will provide material gains as current Codex/Opus show weaknesses as its context window is mostly full, but we'll see.
Per updated docs (https://developers.openai.com/api/docs/guides/latest-model), it supercedes GPT-5.3-Codex, which is an interesting move.
> For models with a 1.05M context window (GPT-5.4 and GPT-5.4 pro), prompts with >272K input tokens are priced at 2x input and 1.5x output for the full session for standard, batch, and flex.
Taken from https://developers.openai.com/api/docs/models/gpt-5.4
For Codex, we're making 1M context experimentally available, but we're not making it the default experience for everyone, as from our testing we think that shorter context plus compaction works best for most people. If anyone here wants to try out 1M, you can do so by overriding `model_context_window` and `model_auto_compact_token_limit`.
Curious to hear if people have use cases where they find 1M works much better!
(I work at OpenAI.)
For example on Artificial Analysis, the GPT-5.x models' cost to run the evals range from half of that of Claude Opus (at medium and high), to significantly more than the cost of Opus (at extra high reasoning). So on their cost graphs, GPT has a considerable distribution, and Opus sits right in the middle of that distribution.
The most striking graph to look at there is "Intelligence vs Output Tokens". When you account for that, I think the actual costs end up being quite similar.
According to the evals, at least, the GPT extra high matches Opus in intelligence, while costing more.
Of course, as always, benchmarks are mostly meaningless and you need to check Actual Real World Results For Your Specific Task!
For most of my tasks, the main thing a benchmark tells me is how overqualified the model is, i.e. how much I will be over-paying and over-waiting! (My classic example is, I gave the same task to Gemini 2.5 Flash and Gemini 2.5 Pro. Both did it to the same level of quality, but Gemini took 3x longer and cost 3x more!)
https://developers.openai.com/api/docs/pricing is what I always reference, and it explicitly shows that pricing ($2.50/M input, $15/M output) for tokens under 272k
It is nice that we get 70-72k more tokens before the price goes up (also what does it cost beyond 272k tokens??)
Like, if you really don’t want to spend any effort trimming it down, sure use 1m.
Otherwise, 1m is an anti pattern.
In practice, if I buy $200/mo codex, can I basically run 3 codex instances simultaneously in tmux, like I can with claude code pro max, all day every day, without hitting limits?
I switch between both but codex has also been slightly better in terms of quality for me personally at least.
That's hilarious. Does OpenAI even know this doesn't work?
I was pretty impressed with how they’ve improved user experience. If I had to guess, I’d say Anthropic has better product people who put more attention to detail in these areas.
https://chatgpt.com/share/69aa0321-8a9c-8011-8391-22861784e8...
EDIT: oh, but I'm logged in, fwiw
$2/M Input Tokens $15/M Output Tokens
Claude Opus 4.6
$5/M Input Tokens $25/M Output Tokens
This should not be shocking.
- Do they have the same context usage/cost particularly in a plan?
They've kept 5.3-Codex along with 5.4, but is that just for user-preference reasons, or is there a trade-off to using the older one? I'm aware that API cost is better, but that isn't 1:1 with plan usage "cost."
Ultimately, the people actually interested in the performance of these models already don't trust self-reported comparisons and wait for third-party analysis anyway
Presumably this is where it'll evolve to with the product just being the brand with a pricing tier and you always get {latest} within that, whatever that means (you don't have to care). They could even shuffle models around internally using some sort of auto-like mode for simpler questions. Again why should I care as long as average output is not subjectively worse.
Just as I don't want to select resources for my SaaS software to use or have that explictly linked to pricing, I don't want to care what my OpenAI model or Anthropic model is today, I just want to pay and for it to hopefully keep getting better but at a minimum not get worse.
It might be my AGENTS.md requiring clearer, simpler language, but at least 5.4's doing a good job of following the guidelines. 5.3-Codex wasn't so great at simple, clear writing.
If you gave the exact same markdown file to me and I posted ed the exact same prompts as you, would I get the same results?
We got:
- GPT-5.1
- GPT-5.2 Thinking
- GPT-5.3 (codex)
- GPT-5.3 Instant
- GPT-5.4 Thinking
- GPT-5.4 Pro
Who’s to blame for this ridiculous path they are taking? I’m so glad I am not a Chat user, because this adds so much unnecessary cognitive load.
The good news here is the support for 1M context window, finally it has caught up to Gemini.
Absolute snakes - if it's more profitable to manipulate you with outputs or steal your work, they will. Every cent and byte of data they're given will be used to support authoritarianism.
Also, Anthropic/Gemini/even Kimi models are pretty good for what its worth. I used to use chatgpt and I still sometimes accidentally open it but I use Gemini/Claude nowadays and I personally find them to be better anyways too.
i just HATE talking to it like a chatbot
idk what they did but i feel like every response has been the same "structure" since gpt 5 came out
feels like a true robot
It's very similar to "Battle Brothers", and the fact that RPG games require art assets, AI for enemy moves, and a host of other logical systems makes it all the more impressive.
> we’re also releasing an experimental Codex skill called “Playwright (Interactive) (opens in a new window)”. This allows Codex to visually debug web and Electron apps; it can even be used to test an app it’s building, as it’s building it.
>Today, we’re releasing GPT‑5.4 in ChatGPT (as GPT‑5.4 Thinking),
>Note that there is not a model named GPT‑5.3 Thinking
They held out for eight months without a confusing numbering scheme :)
Not sure if this is more concerning for the test time compute paradigm or the underlying model itself.
Maybe I'm misunderstanding something though? I'm assuming 5.4 and 5.4 Thinking are the same underlying model and that's not just marketing.
It's the one you have access to with the top ~$200 subscription and it's available through the API for a MUCH higher price ($2.5/$15 vs $30/$180 for 5.4 per 1M tokens), but the performance improvement is marginal.
Not sure what it is exactly, I assume it's probably the non-quantized version of the model or something like that.
I have now switched web-related and data-related queries to Gemini, coding to Claude, and will probably try QWEN for less critical data queries. So where does OpenAI fits now?
I'd believe it on those specific tasks. Near-universal adoption in software still hasn't moved DORA metrics. The model gets better every release. The output doesn't follow. Just had a closer look on those productivity metrics this week: https://philippdubach.com/posts/93-of-developers-use-ai-codi...
Given that organization who ran the study [1] has a terrifying exponential as their landing page, I think they'd prefer that it's results are interpreted as a snapshot of something moving rather than a constant.
[1] - https://metr.org/
"Change Lead Time" I would expect to have sped up although I can tell stories for why AI-assisted coding would have an indeterminate effect here too. Right now at a lot of orgs, the bottle neck is the review process because AI is so good at producing complete draft PRs quickly. Because reviews are scarce (not just reviews but also manual testing passes are scarce) this creates an incentive ironically to group changes into larger batches. So the definition of what a "change" is has grown too.
Interesting, the "Health" category seems to report worse performance compared to 5.2.
I really thought weirdly worded and unnecessary "announcement" linking to the actual info along with the word "card" were the results of vibe slop.
gpt-5.4
Input: $2.50 /M tokens
Cached: $0.25 /M tokens
Output: $15 /M tokens
---
gpt-5.4-pro
Input: $30 /M tokens
Output: $180 /M tokens
Wtf
https://www.svgviewer.dev/s/gAa69yQd
Not the best pelican compared to gemini 3.1 pro, but I am sure with coding or excel does remarkably better given those are part of its measured benchmarks.
A couple months later:
"We are deprecating the older model."
This becomes increasingly less clear to me, because the more interesting work will be the agent going off for 30mins+ on high / extra high (it's mostly one of the two), and that's a long time to wait and an unfeasible amount of code to a/b
I like Sonnet 4.6 a lot too at medium reasoning effort, but at least in Cursor it is sometimes quite slow because it will start "thinking" for a long time.
> Theme park simulation game made with GPT‑5.4 from a single lightly specified prompt
GPT literally built that game.
The OP has frequently gotten the scoop for new LLM releases and I am curious what their pipeline is.
in 5.4 it looks like the just collapsed that capability into the single frontier family model
numerusformassistant to=functions.ReadFile մեկնաբանություն 天天爱彩票网站json {"path":
Looks like some kind of encoding misalignment bug. What you're seeing is their Harmony output format (what the model actually creates). The Thai/Chinese characters are special tokens apparently being mismapped to Unicode. Their servers are supposed to notice these sequences and translate them back to API JSON but it isn't happening reliably.
"Bob’s latest mail is actually the source of the confusion: he changed shared app/backend text to aweb/atlas. I’m correcting that with him now so we converge on the real model before any more code moves."
This was very much not true; Eve (the agent writing this, a gpt-5.4) had been thoroughly creating the confusion and telling Bob (an Opus 4.6) the wrong things. And it had just happened, it was not a matter of having forgotten or compacted context.
I have had agents chatting with each other and coordinating for a couple of months now, codex and claude code. This is a first. I wonder how much can I read into it about gpt-5.4's personality.
1. Fast mode ain't that fast
2. Large context * Fast * Higher Model Base Price = 8x increase over gpt-5.3-codex
3. I burnt 33% of my 5h limit (ChatGPT Business Subscription) with a prompt that took 2 minutes to complete.
GPT-5.4 extra high scores 94.0 (GPT-5.2 extra high scored 88.6).
GPT-5.4 medium scores 92.0 (GPT-5.2 medium scored 71.4).
GPT-5.4 no reasoning scores 32.8 (GPT-5.2 no reasoning scored 28.1).
I tried several use cases: - Code Explanation: Did far much better than Opus, considered and judged his decision on a previous spec that I made, all valid points so I am impressed. TBF if I spawned another Opus as a reviewer I might got similar results. - Workflow Running: Really similar to Opus again, no objections it followed and read Skills/Tools as it should be (although mine are optimized for Claude) - Coding: I gave it a straightforward task to wrap an API calls to an SDK and to my surprise it did 'identical' job with Opus, literally the same code, I don't know what the odds are to this but again very good solution and it adhered our rules of implementing such code.
Overall I am impressed and excited to see a rival to Opus and all of this is literally pushing everyone to get better and better models which is always good for us.
Recent SWE-bench Verified scores I’m watching:
Claude 4.5 Opus (high reasoning): 76.8
Gemini 3 Flash (high reasoning): 75.8
MiniMax M2.5 (high reasoning): 75.8
Claude Opus 4.6: 75.6
GPT-5.2 Codex: 72.8
Source: https://www.swebench.com/index.html
By the way, in my experience the agent part of Codex CLI has improved a lot and has become comparable to Claude Code. That is good news for OpenAI.
This was definitely missing before, and a frustrating difference when switching between ChatGPT and Codex. Great addition.
This is on the edge of what the frontier models can do. For 5.4, the result is better than 5.3-Codex and Opus 4.6. (Edit: nowhere near the RPG game from their blog post, which was presumably much more specced out and used better engineering setup).
I also tested it with a non-trivial task I had to do on an existing legacy codebase, and it breezed through a task that Claude Code with Opus 4.6 was struggling with.
I don't know when Anthropic will fire back with their own update, but until then I'll spend a bit more time with Codex CLI and GPT 5.4.
If you last used 5.2, try 5.4 on High.
> Theme park simulation game made with GPT‑5.4 from a single lightly specified prompt, using Playwright Interactive for browser playtesting and image generation for the isometric asset set.
Is "Playwright Interactive" a skill that takes screenshots in a tight loop with code changes, or is there more to it?
When two agents coordinate, they’re mostly relying on compressed summaries of each other’s outputs. If one introduces a wrong assumption, the other often treats it as ground truth and builds on top of it. I’ve seen similar behavior in multi-agent coding loops where the model invents a causal explanation just to reconcile inconsistent state.
It’s that multi-agent setups need a stronger shared source of truth (repo diffs, state snapshots, etc.). Otherwise small context errors snowball fast.
I wonder if 5.4 will be much if any different at all.
GPT-5.4: 75.1%
GPT-5.3-Codex: 77.3%
In terms of writing and research even Gemini, with a good prompt, is close to useable. That's likely not a differentiator.
GPT is not even close yo Claude in terms of responding to BS.
I imagine they added a feature or two, and the router will continue to give people 70B parameter-like responses when they dont ask for math or coding questions.
Not including the Chinese models is also obviously done to make it appear like they aren't as cooked as they really are.
And so far it has succeeded
I absolutely could come up with the details and implementation by myself, but that would certainly take a lot of back and forth, probably a month or two.
I’m an api user of Claude code, burning through 2k a month. I just this evening planned the whole thing with its help and actually had to stop it from implementing it already. Will do that tomorrow. Probably in one hour or two, with better code than I could ever write alone myself.
Having that level of intelligence at that price is just bollocks. I’m running out of problems to solve. It’s been six months.
Nothing infuriates me more than an LLM tool randomly deciding to create docx or xlsx files for no apparent reason. They have to use a random library to create these files, and they constantly screw up API calls and get completely distracted by the sheer size of the scripts they have to write to output a simple documents. These files have terrible accessibility (all paper-like formats do) and end up with way too much formatting. Markdown was chosen as the lingua franca of LLMs for a reason, trying to force it into a totally unsuitable format isn't going to work.
You can have it not use bulleted points, I turned this on, thinking it would be more concise and not so... listy. However, it just uses the same format, without the bullets. I was confused why it was writing 5 word sentences, separated by line breaks. Then I realized it was just making lists, without the bullets.
Great job OpenAI!
I hate these blog posts sometimes. Surely there's got to be some tradeoff. Or have we finally arrived at the world's first "free lunch"? Otherwise why not make /fast always active with no mention and no way to turn it off?
and considering the stance on openai with a majority of the users here compared to the number of upvotes, are HN likes bot-farmed?
Also, the timing of this release, 5.3 and 5.2, relative to the other releases, feels more like a bug fix than something "new"