GPT-5.5 | Better HN

1033 comments

tedsanders17d ago

Just as a heads up, even though GPT-5.5 is releasing today, the rollout in ChatGPT and Codex will be gradual over many hours so that we can make sure service remains stable for everyone (same as our previous launches). You may not see it right away, and if you don't, try again later in the day. We usually start with Pro/Enterprise accounts and then work our way down to Plus. We know it's slightly annoying to have to wait a random amount of time, but we do it this way to keep service maximally stable.

(I work at OpenAI.)

endymi0n17d ago

Did you guys do anything about GPT's motivation? I tried to use the GPT-5.4 API (at xhigh) for my OpenClaw after the Anthropic Oauthgate, but I just couldn't drag it to do its job. I had the most hilarious dialogues along the lines of "You stopped, X would have been next." - "Yeah, I'm sorry, I failed. I should have done X next." - "Well, how about you just do it?" - "Yep, I really should have done it now." - "Do X, right now, this is an instruction." - "I didn't. You're right, I have failed you. There's no apology for that."

I literally wasn’t able to convince the model to WORK, on a quick, safe and benign subtask that later GLM, Kimi and Minimax succeeded on without issues. Had to kick OpenAI immediately unfortunately.

vlovich12317d ago

Conceivably you could have a public-facing dashboard of the rollout status to reduce confusion or even make it visible directly in the UI that the model is there but not yet available to you. The fanciest would be to include an ETA but that's presumably difficult since it's hard to guess in case the rollout has issues.

Grp117d ago

Congrats on the release! Is Images 2.0 rolling out inside ChatGPT as well, or is some of the functionality still going to be API/Playground-only for a while?

fragmede17d ago

Are you able to say something about the training you've done to 5.5 to make it less likely to freak out and delete projects in what can only be called shame?

dandiep17d ago

Will GPT 5.5 fine tuning be released any time soon?

rev4n17d ago

Looks good, but I’m a little hesitant to try it in Codex as a Plus user since I’m not sure how much it would eat into the usage cap.

qsort17d ago

Great stuff! Congrats on the release!

wslh17d ago

Just a tip: add [translated] subtitles to the top video.

dhruv300616d ago

Yep - it's taking some time.

lr197016d ago

When I ask GPT-5.5 about its knowledge cutoff date it says "August 2025". Really?

motoboi17d ago

Please next time start with azure foundry lol thanks!

fHr17d ago

LETS GO CODEX #1

dude25071117d ago

With Anthropic, newer models often lead to quality degradation. Will you keep GPT 5.4 available for some time?

pixel_popping17d ago

can't wait! Thanks guys. PS: when you drop a new model, it would be smart to reset weekly or at least session limits :)

simonw17d ago

This doesn't have API access yet, but OpenAI seem to approve of the Codex API backdoor used by OpenClaw these days... https://twitter.com/steipete/status/2046775849769148838 and https://twitter.com/romainhuet/status/2038699202834841962

And that backdoor API has GPT-5.5.

So here's a pelican: https://simonwillison.net/2026/Apr/23/gpt-5-5/#and-some-peli...

I used this new plugin for LLM: https://github.com/simonw/llm-openai-via-codex

UPDATE: I got a much better pelican by setting the reasoning effort to xhigh: https://gist.github.com/simonw/a6168e4165a258e4d664aeae8e602...

DrProtic17d ago

That pelican you posted yesterday from a local model looks nicer than this one.

Edit: this one has crossed legs lol

stingraycharles16d ago

OpenAI hired the guy behind OpenClaw, so it makes sense that they’re more lenient towards its usage.

GistNoesis17d ago

Isn't it awful? After 5.5 versions it still can't draw a basic bike frame. How is the front wheel supposed to turn sideways?

postalcoder17d ago

I made pelicans at different thinking efforts:

https://hcker.news/pelican-low.svg

https://hcker.news/pelican-medium.svg

https://hcker.news/pelican-high.svg

https://hcker.news/pelican-xhigh.svg

Someone needs to make a pelican arena, I have no idea if these are considered good or not.

droidjj17d ago

It's... like no pelican I've ever seen before.

matt321016d ago

The pelican doesn’t really matter anymore since models are tuned for it knowing people will ask.

XCSme17d ago

Is this direct API usage allowed by their terms? I remember Anthropic really not liking such usage.

deflator17d ago

Hmm. Any idea why it's so much worse than the other ones you have posted lately? Even the open weight local models were much better, like the Qwen one you posted yesterday.

Schlagbohrer17d ago

That's amazing that the default did that much in just 39 "reasoning tokens" (no idea what a reasoning token is but that's still shockingly few tokens)

mannanj16d ago

Is OpenAI actually acting open for once here and allowing use of their model via a subscription, in contrast to Anthropic banning such use in OpenClaw?

noonething17d ago

Thank you for doing all this. It's appreciated.

SkyBelow17d ago

Wait, I thought we were onto racoons on e-scooters to avoid (some of) the issues with Goodhart's Law coming into play.

zerop16d ago

So pelican must have become the mandatory test case to pass for all model providers before launch.

andriy_koval17d ago

What is your setup for drawing the pelican? Do you ask the model to check the generated image, find issues, and iterate on it, which would demonstrate the model's real abilities?

gpm17d ago

I for one delight in bicycles where neither wheel can turn!

It continues to amaze me that these models, which definitely know what bicycle geometry actually looks like somewhere in their weights, produce such implausibly bad geometry.

Also mildly interesting, and generally consistent with my experience with LLMs, that it produced the same obvious geometry issue both times.

singingtoday16d ago

Thank you for continuing to post these! Very interesting benchmark.

rolymath17d ago

Exciting. Another Pelican post.

sjdv198217d ago

At some point, OpenAI is going to cheat and hardcode a pelican on a bicycle into the model. 3D modelling has Suzanne and the teapot; LLMs will have the pelican.

dakolli17d ago

You know they are 1000% training these models to draw pelicans, this hasn't been a valid benchmark for 6 months +

jfkimmes17d ago

Everyone talked about the marketing stunt that was Anthropic's gated Mythos model with an 83% result on CyberGym. OpenAI just dropped GPT 5.5, which scores 82% and is open for anybody to use.

I recommend anybody in offensive/defensive cybersecurity to experiment with this. This is the real data point we needed - without the hype!

Never thought I'd say this but OpenAI is the 'open' option again.

tpurves17d ago

The real 'hype' was the oh-snap realization that OpenAI would absolutely release a model competitive with Mythos within weeks of Anthropic announcing theirs, and that Sam would not gate access to it. So the panic was that the cyber world had only a projected 2 weeks to harden against all these new zero-days before Sam would inevitably create open season for blackhats to discover and exploit a deluge of zero-days.

concinds17d ago

> Never thought I'd say this but OpenAI is the 'open' option again.

Compared to Anthropic, they always have been. Anthropic has never released any open models. Never released Claude Code's source, willingly (unlike Codex). Never released their tokenizer.

unsupp0rted17d ago

Doesn't OpenAI get mad if you ask cybersecurity questions and force you to upload a government ID, otherwise they'll silently route you to a less capable model?

> Developers and security professionals doing cybersecurity-related work or similar activity that could be mistaken by automated detection systems may have requests rerouted to GPT-5.2 as a fallback.

https://developers.openai.com/codex/concepts/cyber-safety

https://chatgpt.com/cyber

mafriese16d ago

From my experience OpenAI has become very sensitive when it comes to using their tools for security research. I am using MCP servers for tools like IDA Pro or Ghidra (for malware analysis) and recently received a warning:

> OpenAI's terms and policies restrict the use of our services in a number of areas. We have identified activity in your OpenAI account that is not permitted under our policies for: - Cyber Abuse

I raised an appeal which got denied. To be fair, I think it's close to impossible for someone who is looking at the chat history to differentiate between legitimate research and malicious intent. I have also applied for the security research program that OpenAI is offering but didn't get any reply on that.

tnkuehne17d ago

Isn't it the case that cyber questions are being routed to dumber models at OpenAI?

willsmith7216d ago

Being "more" open than something totally closed doesn't make you open. The name is still bs

mannanj16d ago

Seems like OpenAI only acts open for theatrical and attention-seeking purposes though, i.e. when backed into a corner and it's for their image.

ur-whale17d ago

> Anthropic's gated Mythos model

aka the perfect marketing ploy

attentive16d ago

it's still somewhat gated behind "trusted access" for cyber, see https://chatgpt.com/cyber

_the_inflator17d ago

I ignore any hype news.

Anthropic is the embodiment of bullshitting to me.

I read Cialdini many decades ago and I am bored by Anthropic.

OpenAI is very clever. With the advent of Claude OpenAI disappeared from the headlines. Who or what was this Sam again all were talking about a year ago?

OpenAI has a massive user advantage so that they can simply follow Anthropic’s release cycle to ridicule them.

I think it is really brutal for Anthropic how easily they are getting passed by OpenAI, and it is getting worse for Anthropic with every new GPT version.

OpenAI owns them.

Someone123417d ago

I'd like to draw people's attention to this section of this page:

https://developers.openai.com/codex/pricing?codex-usage-limi...

Note the Local Messages between 5.3, 5.4, and 5.5. And, yes, I did read the linked article and know they're claiming that 5.5's new efficiency should make it break even with 5.4, but the point stands: tighter limits/higher prices.

puppystench17d ago

For API usage, GPT-5.5 is 2x the price of GPT-5.4, ~4x the price of GPT-5.1, and ~10x the price of Kimi-2.6.

Unfortunately I think the lesson they took from Anthropic is that devs get really reliant and even addicted on coding agents, and they'll happily pay any amount for even small benefits.

keyle16d ago

I did one review job that sent off three subagents and I blew the second half of my daily limit in 10 mins 13 seconds. Fun times.

raincole16d ago

It's such a vague table for pricing information. 30-150 messages...? What?

minimaxir17d ago

The more interesting part of the announcement than "it's better at benchmarks":

> To better utilize GPUs, Codex analyzed weeks’ worth of production traffic patterns and wrote custom heuristic algorithms to optimally partition and balance work. The effort had an outsized impact, increasing token generation speeds by over 20%.

The ability for agentic LLMs to improve computational efficiency/speed is a highly impactful domain I wish was more tested than with benchmarks. From my experience Opus is still much better than GPT/Codex in this aspect, but given that OpenAI is getting material gains out of this type of performancemaxxing and they have an increasing incentive to continue doing so given cost/capacity issues, I wonder if OpenAI will continue optimizing for it.

xiphias217d ago

There's already KernelBench which tests CUDA kernel optimizations.

On the other hand, all companies know that optimizing their own infrastructure / models is the critical path for "winning" against the competition, so you can bet they are serious about it.

xtracto17d ago

So, I'm working on some high-performance data processing in Rust. I had hit some performance walls and needed improvements on the scale of 100x or more.

I remembered the famous FizzBuzz Intel codegolf optimizations, and gave it to gemini pro, along with my code and instructions to "suggest optimizations similar to those, maybe not so low level, but clever" and its suggestions were very cool.

LLMs do not stop amazing me every day.

amrrs17d ago

Honestly, the problem with these is how empirical they are: how can someone reproduce this? I love when labs go beyond traditional benchies like MMLU and friends, but these kinds of statements don't help much either, unless it's a proper controlled study!

astlouis4417d ago

A playable 3D dungeon arena prototype built with Codex and GPT models. Codex handled the game architecture, TypeScript/Three.js implementation, combat systems, enemy encounters, HUD feedback, and GPT‑generated environment textures. Character models, character textures, and animations were created with third-party asset-generation tools

The game that this prompt generated looks pretty decent visually. A big part of this is likely due to the fact that the meshes were created using a separate tool (probably meshy, tripo.ai, or similar) and not generated by 5.5 itself.

It really seems like we could be at the dawn of a new era similar to Flash, where any gamer or hobbyist can generate game concepts quickly and instantly publish them to the web. Three.js in particular is really picking up as the primary way to design games with AI, in spite of the fact it's not even a game engine, just a web rendering library.

dataviz100017d ago

LLMs cannot do spatial reasoning. I haven't tried with GPT; however, Claude cannot solve a Rubik's Cube no matter how much I try with prompt engineering. I got Opus 4.6 to get ~70% of the puzzle solved, but it got stuck. At $20 a run, it's prohibitively expensive.

The point is that if we can prompt an LLM to reason about 3 dimensions, we will likely be able to apply that to math problems which it isn't able to solve currently.

I should release my Rubik's Cube MCP server with the challenge to see if someone can write a prompt to solve a Rubik's Cube.

0x6217d ago

FWIW I've been experimenting with Three.js and AI for the last ~3 years, and noticed a significant improvement in 5.4 - the biggest single generation leap for Three.js specifically. It was most evident in shaders (GLSL), but also apparent in structuring of Three.js scenes across multiple pages/components.

It still struggles to create shaders from scratch, but is now pretty adequate at editing existing shaders.

In 5.2 and below, GPT really struggled with "one canvas, multiple page" experiences, where a single background canvas is kept rendered over routes. In 5.4, it still takes a bit of hand-holding and frequent refactor/optimisation prompts, but is a lot more capable.

Excited to test 5.5 and see how it is in practice.

vunderba17d ago

I’ve had a lot of success using LLMs to help with my Three.js based games and projects. Many of my weird clock visualizations relied heavily on it.

It might not be a game engine, but it’s the de facto standard for doing WebGL 3D. And since it’s been around forever, there’s a massive amount of training data available for it.

Before LLMs were a thing, I relied more on Babylon.js, since it’s a bit higher level and gives you more batteries included for game development.

kingstnap17d ago

The meshes look interesting, but the gameplay is very basic. The tank one seems more sophisticated with the flying ships and whatnot.

What's strange is that this Pietro Schirano dude seems to write incredibly cargo cult prompts.

  Game created by Pietro Schirano, CEO of MagicPath

  Prompt: Create a 3D game using three.js. It should be a UFO shooter where I control a tank and shoot down UFOs flying overhead.
  - Think step by step, take a deep breath. Repeat the question back before answering.
  - Imagine you're writing an instruction message for a junior developer who's going to go build this. Can you write something extremely clear and specific for them, including which files they should look at for the change and which ones need to be fixed?
  -Then write all the code. Make the game low-poly but beautiful.
  - Remember, you are an agent: please keep going until the user's query is completely resolved before ending your turn and yielding back to the user. Decompose the user's query into all required sub-requests and confirm that each one is completed. Do not stop after completing only part of the request. Only terminate your turn when you are sure the problem is solved. You must be prepared to answer multiple queries and only finish the call once the user has confirmed they're done.
  - You must plan extensively in accordance with the workflow steps before making subsequent function calls, and reflect extensively on the outcomes of each function call, ensuring the user's query and related sub-requests are completely resolved.

nemo44x17d ago

It’s like all these things though - it’s not a real production-worthy product. It’s a super-demo. It looks amazing until you realize there are many months of work left to make it something of quality and value.

I think people are starting to catch on to where we really are right now. Future models will be better, but we are entering a trough of disillusionment and this attitude will be widespread in a few months.

peder17d ago

> It really seems like we could be at the dawn of a new era similar to flash

We've been there for a while.... creativity has been the primary bottleneck

mindhunter17d ago

A friend is building Jamboree[1] (prev name "Spielwerk") for iOS. An app to build and share games. They're all web based so they're easy to share.

[1] https://apps.apple.com/uz/app/jamboree-game-maker/id67473110...

ZeWaka17d ago

I personally don't think the gameplay itself is that impressive.

6thbit17d ago

                          Mythos     5.5
    SWE-bench Pro          77.8%*   58.6%
    Terminal-bench-2.0     82.0%    82.7%*
    GPQA Diamond           94.6%*   93.6%
    H. Last Exam           56.8%*   41.4%
    H. Last Exam (tools)   64.7%*   52.2%    
    BrowseComp             86.9%    84.4%  (90.1% Pro)*
    OSWorld-Verified       79.6%*   78.7%

Still far from Mythos on SWE-bench but quite comparable otherwise. Source for mythos values: https://www.anthropic.com/glasswing

aliljet17d ago

Mythos is only real when it's actually available. If you're using Opus 4.7 right now, you know how incredibly nerfed the Opus autonomy is in service of perceived safety. I'm not so confident this will be as great as Anthropic wants us to believe..

XCSme17d ago

They mentioned in their release page, that the Claude team noticed memorization of the SWE-bench test, so the test is actually in the training data.

Here: https://www.anthropic.com/news/claude-opus-4-7#:~:text=memor...

kaonashi-tyc-0117d ago

I did some study on Verified, not Pro, but the Mythos number there raises a lot of questions on my end.

If you look at the SWEBench official submissions: https://github.com/SWE-bench/experiments/tree/main/evaluatio..., filter all models after Sonnet 4, and aggregate ALL models' submissions across the 500 problems, I found that the aggregated resolution rate is 93% (sharp).

Mythos gets 93.7%, meaning it solves problems that no other model could ever solve. I took a look at those problems and became even more suspicious: for the remaining 7% of problems, it is almost impossible to resolve the issues without looking at the testing patch ahead of time, because the solution deviates so drastically from the problem statement that it almost feels like it is trying to solve a different problem.

Not that I am saying Mythos is cheating, but it might be capable enough to remember all states of said repos, so that it is able to reverse engineer the TRUE problem statement by diffing within its own internal memory. I think it could be a unique phenomenon of evaluation awareness. Otherwise I genuinely can't think of how it could be this precise in deciphering such unspecific problem statements.
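
The aggregation described above (pool every model's resolved instances, then ask what fraction of the problems at least one model solved) can be sketched roughly as follows. This is only a minimal Python sketch on made-up toy data: the function names and instance IDs are hypothetical, and it is neither the commenter's actual script nor part of the SWE-bench tooling.

```python
from typing import Dict, Set


def union_resolution_rate(resolved_by_model: Dict[str, Set[str]], total_problems: int) -> float:
    """Fraction of problems resolved by at least one model."""
    solved_by_any: Set[str] = set()
    for resolved in resolved_by_model.values():
        solved_by_any |= resolved  # union across all submissions
    return len(solved_by_any) / total_problems


def never_resolved(all_problems: Set[str], resolved_by_model: Dict[str, Set[str]]) -> Set[str]:
    """Problems that no submission resolved (the suspicious tail)."""
    remaining = set(all_problems)
    for resolved in resolved_by_model.values():
        remaining -= resolved
    return remaining


# Toy data, NOT real SWE-bench results.
problems = {"p1", "p2", "p3", "p4"}
toy_submissions = {
    "model_a": {"p1", "p2"},
    "model_b": {"p2", "p3"},
    "model_c": {"p1"},
}

rate = union_resolution_rate(toy_submissions, total_problems=len(problems))
unsolved = never_resolved(problems, toy_submissions)
print(rate, sorted(unsolved))
```

A single model scoring above the pooled rate must be resolving at least some instances from the never-resolved set, which is what makes that tail worth inspecting by hand.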

alansaber17d ago

A single benchmark is meaningless, you always get quirky results on some benchmarks.

silvertaza17d ago

Still huge hallucination rate, unfortunately at 86%. To compare, Opus sits at 36%.

Source: https://artificialanalysis.ai/models?omniscience=omniscience...

dubcanada17d ago

grok is 17%? And that's the lowest, most models are like 80%+?

While hallucination is probably closer to 100% depending on the question. This benchmark makes no sense.

simianwords17d ago

There's something off with this because Haiku should not be that good.

dakolli17d ago

This indicates they want this behavior. They know the person asking the question probably doesn't understand the problem entirely (or why would they be asking), so they'd prefer a confident response, regardless of outcomes, because the point is to sell the technology's competence (and the perception thereof), not its capabilities, to a bunch of people that have no clue what they're talking about.

LLMs will ruin your product. Have fun trusting a billionaire's thinking machine that they swear is capable of replacing your employees if you just pay them 75% of your labor budget.

mudkipdev17d ago

This is 3x the price of GPT-5.1, released just 6 months ago. Is no one else alarmed by the trend? What happens when the cheaper models are deprecated/removed over time?

Night_Thastus17d ago

This is entirely expected. The low prices of using LLMs early on was totally and completely unsustainable. The companies providing such services were (and still are) burning money by the truckload.

The hope is to get a big userbase who eventually become dependent on it for their workflow, then crank up the price until it finally becomes profitable.

The price for all models by all companies will continue to go up, and quickly.

energy12317d ago

Look at cost per intelligence or cost per task instead of cost per token.

Schlagbohrer17d ago

As others have mentioned you're ignoring the long tail of open-weights models which can be self hosted. As long as that quasi-open-source competition keeps up the pace, it will put a cap on how expensive the frontier models can get before people have to switch to self-hosting.

That's a big if, though. I wish Meta were still releasing top of the line, expensively produced open-weights models. Or if Anthropic, Google, or X would release an open mini version.

dannyw17d ago

It's far more meaningful to look at the actual cost to successfully complete something. The token efficiency of GPT-5.5 is real, as is it just being far better for work.
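
As a rough illustration of "cost per successful task" versus "cost per token", here is a minimal Python sketch. Every number in it is hypothetical (these are not real prices, token counts, or success rates for any model), and it assumes failed attempts are simply retried independently, so the expected number of attempts is 1 / success_rate.

```python
def cost_per_success(
    input_tokens: int,
    output_tokens: int,
    in_price_per_m: float,   # price per 1M input tokens (hypothetical)
    out_price_per_m: float,  # price per 1M output tokens (hypothetical)
    success_rate: float,     # fraction of attempts that complete the task
) -> float:
    """Expected spend to get one successful completion, assuming independent retries."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    cost_per_attempt = (
        input_tokens * in_price_per_m + output_tokens * out_price_per_m
    ) / 1_000_000
    return cost_per_attempt / success_rate


# Hypothetical "cheap per token" model: uses more tokens and fails more often.
cheap = cost_per_success(200_000, 40_000, 1.0, 4.0, success_rate=0.5)

# Hypothetical "2x per token" model: half the tokens, higher success rate.
pricey = cost_per_success(100_000, 20_000, 2.0, 8.0, success_rate=0.8)

print(f"cheap-per-token model:  {cheap:.2f} per success")
print(f"pricey-per-token model: {pricey:.2f} per success")
```

With these made-up inputs the model that costs twice as much per token comes out cheaper per completed task; with different inputs the comparison can just as easily go the other way.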

operatingthetan17d ago

We know they cost much more than this for OpenAI. Assume prices will continue to climb until they are making money.

dandaka17d ago

SOTA models get distilled to open source weights in ~6 months. So paying premium for bleeding edge performance sounds like a fair compensation for enormous capex.

scotty7916d ago

When technology settles rent seeking beancounters will squeeze out of this the last drop of value they can. In theory it just needs to be cheaper than human doing the same job. In practice it will have to be cheaper than a human using open source model on his own hardware, that can do the job.

typs17d ago

GPT-4 cost 6x as much on input tokens and 2x as much on output tokens when it was released, compared to GPT-5.5.

kuatroka17d ago

Not really a big problem. Switch to KIMI, Qwen, GLM. You’ll get 95% of the quality of GPT or Anthropic for a tenth of the price. I feel like the real dependency is more mental, more of a habit, but if you actually dip your toes outside OpenAI, Anthropic, Gemini from time to time, you realise that the actual difference in code is not huge if prompted in a good way. Maybe you’ll have to tell it to do something twice and it won’t be a one shot, but it’s really not an issue at all.

msdz17d ago

Such an increase tracks the company's valuation trend, which they constantly, somehow have to justify (let alone break even on costs).

michaelcampbell15d ago

> What happens when the cheaper models are deprecated/removed over time?

Fewer people will use it.

thrawa838733616d ago

Apparently the cost/price is 20x in the major providers. Not clear how it is a business

jeffybefffy51916d ago

Surely it's just the same model, just allowed to do more work...??

applfanboysbgon17d ago

If there's a bingo card for model releases, "our [superlative] and [superlative] model yet" is surely the free space.

tom133717d ago

Do "our [superlative] and [superlative] [product] yet" and you have pretty much every product launch

xnx17d ago

"our newest and most expensive model yet"

wiseowise17d ago

"Best iPhone ever"

ertgbnm17d ago

can't wait for "our worst and dumbest model yet"

vthallam17d ago

This model is great at long horizon tasks, and Codex now has heartbeats, so it can keep checking on things. Give it your hardest problem that would take hours with verifiable constraints, you will see how good this is:)

*I work at OAI.

dannyw17d ago

It's genuinely so great at long horizon tasks! GPT-5.5 solved many long-horizon frontier challenges, for the first time for an AI model we've tested, in our internal evals at Canva :) Congrats on the launch!

bkyan17d ago

Sorry, what is "heartbeats", exactly?

dandaka17d ago

Could be a great feature, can't wait to test! Tired of other models (looking at you Opus) constantly stuck mid-task lately.

spaceman_202016d ago

Is there any task that actually doesn't require human intervention in between, even if it's just to set up stuff?

Like I will get Opus to make me an app, but it will stop partway because I need to set up the db and plug in the API keys, and Opus really can't do that on its own yet.

thereeldeel16d ago

Will Codex App support new context window, rather than compaction, for "unrelated" sub-tasks during long horizon tasks?

aliljet17d ago

I've found myself so deeply embedded in the Claude Max subscription that I'm worried about potentially making a switch. How are people making sure they stay nimble enough not to get trapped by one company's ecosystem over another? For what it's worth, Opus 4.7 has not been a step up, and it comes with enormously higher usage of the subscription Anthropic offers, making the entire offering doubly worse.

gck117d ago

Start building your own lightweight "harness" that does the things you need. Ignore all functionality of clients like CC or Codex and just implement whatever you start missing in your harness.

You can replace pretty much everything - skills system, subagents, etc with just tmux and a simple cli tool that the official clients can call.

Oh and definitely disable any form of "memory" system.

Essentially, treat all tooling that wraps the models as dumb gateways to inference. Then provider switch is basically a one line config change.
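
One way to picture the "dumb gateway, one line config change" idea is the minimal Python sketch below. The provider names, CLI names, and flags (`provider-a-cli`, `--prompt`, and so on) are all hypothetical placeholders, not the real command lines of Codex, Claude Code, or any other tool, and the sketch only builds the command without running it.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Provider:
    """A provider is nothing more than 'how to send a prompt to a CLI'."""
    argv_prefix: Tuple[str, ...]  # hypothetical command and flags


# Hypothetical registry; real tools have their own commands and flags.
PROVIDERS: Dict[str, Provider] = {
    "provider_a": Provider(("provider-a-cli", "--prompt")),
    "provider_b": Provider(("provider-b-cli", "run")),
}

# The "one line config change": switch providers by editing this value.
ACTIVE_PROVIDER = "provider_a"


def build_command(prompt: str, provider_name: Optional[str] = None) -> List[str]:
    """Return the argv that would send `prompt` to the chosen provider."""
    provider = PROVIDERS[provider_name or ACTIVE_PROVIDER]
    return [*provider.argv_prefix, prompt]


print(build_command("summarize the failing tests"))
print(build_command("summarize the failing tests", "provider_b"))
```

Everything else in such a harness (skills, subagents, history) would live in your own scripts and sit on top of `build_command`, so none of it depends on a particular vendor's client.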

type417d ago

I have a directory of skills that I symlink to Codex/Claude/pi. I make scripts that correspond with them to do any heavy lifting, I avoid platform specific features like Claude's hooks. I also symlink/share a user AGENTS.md/CLAUDE.md

MCPs aren't as smooth, but I just set them up in each environment.
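
The shared-directory symlink setup described above might look something like this minimal Python sketch. The directory names (`shared-skills`, `tool-a`, `tool-b`) are hypothetical stand-ins rather than the real config locations of any of these tools, and the demo runs inside a temporary directory so it touches nothing in your home folder. It assumes a platform where creating symlinks is permitted.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def link_shared_dir(shared: Path, link_paths: Iterable[Path]) -> None:
    """Point every path in `link_paths` at the single `shared` directory."""
    for link in link_paths:
        link.parent.mkdir(parents=True, exist_ok=True)
        if link.is_symlink():
            link.unlink()  # replace a stale link
        elif link.exists():
            # Refuse to clobber a real directory or file.
            raise FileExistsError(f"{link} exists and is not a symlink")
        link.symlink_to(shared, target_is_directory=True)


# Demo in a throwaway directory with hypothetical tool folders.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    shared = root / "shared-skills"
    shared.mkdir()
    (shared / "review.md").write_text("# hypothetical skill\n")

    links = [root / "tool-a" / "skills", root / "tool-b" / "skills"]
    link_shared_dir(shared, links)

    # Each tool-specific path now shows the same shared files.
    seen = [sorted(p.name for p in link.iterdir()) for link in links]
    print(seen)
```

Editing a skill in the shared directory then shows up for every tool at once, which is the point of avoiding per-tool copies.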

threecheese17d ago

Anecdotally, I get the same wall time with my Max x5 (100$) and my ChatGPT Teams (30$) subscriptions.

chis17d ago

It's surprisingly simple to switch. I mean both products offer basically identical coding CLI experiences. Personally I've been paying for Claude max $100, and ChatGPT $20, and then just using ChatGPT to fill in the gaps. Specifically I like it for code review and when Claude is down.

hx816d ago

I use Open Code as my harness. It's open source, bring your own API Key or OAuth token or self-hosted model. I've jumped from Opus 4.6 to Opus 4.7 to GPT 5.5 in the last 7 days. No big deal, intelligence is just a commodity in 2026.

The actual harness is great, very hackable, very extendable.

zackify16d ago

I use pi.dev.

I get openai team plan at work.

Claude enterprise too.

I have openrouter for myself.

I use minimax 2.7. Kimi 2.6. And gpt 5.5 and opus 4.7. I can toggle between them in an open source interface; that's how I avoid getting trapped.

Minimax is so cheap and for personal stuff it works fine. So I'm always toggling between the new releases.

pdntspa17d ago

As a rule I've been symlinking or referencing generic "agents" versions of claude workflow files instead of placing those files directly in claude's purview

AGENTS.md / skills / etc

beering17d ago

What is the switching cost besides launching a different program? Don’t you just need to type what you want into the box?

cube222217d ago

Small tip, at least for now you can switch back to Opus 4.6, both in the ui and in Claude Code.

rane17d ago

This might be the opposite of staying nimble as my workflows are quite tied to Claude Code specifically, however I've been experimenting with using OpenAI models in CC and it works surprisingly well.

babelfish17d ago

I use Conductor which lets me flip trivially between OpenAI/Anthropic models

dannyw17d ago

It’s good to just keep trying different ones from time to time.

dogline17d ago

Except for history, I don’t find much that stops you from switching back and forth on the CLI. They both use tools, each has a different voice, but they both work. Have it summarize your existing history into a markdown file, and read it in with any engine.

The APIs are pretty interchangeable too. Just ask to convert from one to the other if you need to.

karlosvomacka17d ago

use copilot and have access to all models

dheera17d ago

Coding models are effectively free. They are capable of making money and supporting themselves given access to the right set of things. That is what I do

basisword17d ago

I switched a couple of weeks ago just to see how it went. Codex is no better or worse. They’re both noticeably better at different things. I burn through my tokens much much faster on Codex though. For what it’s worth I’m sticking with Codex for now. It seems to be significantly better at UI work although has some really frustrating bad habits (like loading your UI with annoying copywriting no sane person would ever do).

_alternator_17d ago

> One engineer at NVIDIA who had early access to the model went as far as to say: "Losing access to GPT‑5.5 feels like I've had a limb amputated.”

This quote is more sinister than I think was intended; it likely applies to all frontier coding models. As they get better, we quickly come to rely on them for coding. It's like playing a game on God Mode. Engineers become dependent; it's truly addictive.

This matches my own experience and unease with these tools. I don't really have the patience to write code anymore because I can one shot it with frontier models 10x faster. My role has shifted, and while it's awesome to get so much working so quickly, the fact is, when the tokens run out, I'm basically done working.

It's literally higher leverage for me to go for a walk if Claude goes down than to write code because if I come back refreshed and Claude is working an hour later then I'll make more progress than mentally wearing myself out reading a bunch of LLM generated code trying to figure out how to solve the problem manually.

Anyway, it continues to make me uneasy, is all I'm saying.

noosphr17d ago

LLMs upend a few centuries of labor theory.

The current market is predicated on the assumption that labor is atomic and has little bargaining power (minus unions). While capital has huge bargaining power and can effectively put whatever price it wants on labor (in markets where labor is plentiful, which is most of them).

What happens to a company used to extracting surplus value from labor when the labor is provided by another company which is not only bigger but, unlike traditional labor, can withhold its labor indefinitely (because labor is now just another form of capital and capital doesn't need to eat)?

Anyone not using in house models is signing up to find out.

andai16d ago

A while ago I was at the supermarket. I suddenly became curious about some fact, and reached into my pocket to Google it.

I found my pocket empty, and the specific pain I felt in that moment was the feeling of not being able to remember something.

I thought it was interesting, because in this case, I was trying to "remember" something I had never learned before -- by fetching it from my second brain (hypertext).

L1 cache miss, L2 missing.

sharts17d ago

One might argue that it’s not too too different from higher level abstractions when using libraries. You get things done faster, write less code, library handles some internal state/memory management for you.

Would one be more uneasy about calling a library to do stuff than about manually messing around with pointers and malloc()? For some, yes. For others, it’s a bit freeing as you can do more high-level architecture without getting mired in and context switched by low-level nuances.

tshaddox17d ago

Assuming that local models are able to stay within some reasonably fixed capability delta of the cutting edge hosted models (say, 12 months behind), and assuming that local computing hardware stays relatively accessible, the only risk is that you'll lose that bit of capability if the hosted models disappear or get too expensive.

Note that neither of these assumptions are obviously true, at least to me. But I can hope!

Alex_L_Wood17d ago

Well, they obviously are going to say that; they have a vested interest in OpenAI's, and thus Nvidia's, stock price growing.

Also, I honestly can’t believe the 10x mantra is still being repeated.

jstummbillig17d ago

> This quote is more sinister than I think was intended; it likely applies to all frontier coding models. As they get better, we quickly come to rely on them for coding. It's like playing a game on God Mode. Engineers become dependent; it's truly addictive.

What's the worst potential outcome, assuming that all models get better, more efficient and more abundant (which seems to be the current trend)? The goal of engineering has always been to build better things, not to make it harder.

jnpnj17d ago

Who else is trying to leverage the situation so that they don't dig their own grave too fast?

    - I often don't ask the LLM for precompiled answers, I ask for a standalone cli / tool
    - I often ask how it reached its conclusions, so I can extend my own perspective
    - I often ask it to describe its own metadata-level categorization too

I'm trying to use it to pivot and improve my own problem solving skills, especially for large code base where the difficulty is not conceptual but more reference-graph size

HasKqi17d ago

This engineer had their brain amputated once they started using AI. All the AI-addicted can do is tinker with the AI computer game and feel "productive". They could as well play Magic The Gathering.

__alexs17d ago

I feel like most engineers I talk to still haven't realised what this is going to mean for the industry. The power loom for coding is here. Our skills still matter, but differently.

matheusmoreira17d ago

It's very addictive indeed. Since I subscribed to Claude, I've been in a sort of hypomanic state where I just want to do stuff constantly. It essentially cured my ADHD. My ability to execute things and bring ideas to fruition skyrocketed. It feels good but I'm genuinely afraid I'll crash and burn once they rug pull the subscriptions.

And I'm being very cautious. I'm not vibecoding entire startups from scratch, I'm manually reviewing and editing everything the AI is outputting. I still got completely hooked on building things with Claude.

alansaber17d ago

That's the path we've been going down for a few years now. The current hedge is that frontier labs are actively competing to win users. The backup hedge is that open source LLMs can provide cheap compute. There will always be economical access to LLMs, but the provider with the best models will be able to charge basically whatever they want and still have buyers.

littlestymaar17d ago

That's why local models are important.

Of course they aren't an alternative to the current frontier model, and as such you cannot easily jump from the latter to the former, but they aren't that far behind either: for coding, Qwen3.5-122B is comparable to what Sonnet was less than a year ago.

So assuming the trend continues, if you can stop following the latest release and stick with what you're already using for 6 or 9 months, you'll be able to liberate yourself from the dependency on a cloud provider.

Personally I think the freedom is worth it.

neya17d ago

You are 100% right to be cautious about this. That's why as stupid as it sounds, I've purposely made my workflow with AI full of friction:

1. I only have ONE SOTA model integrated into the IDE (I am mostly on Elixir, so I use Gemini). I make sure I use it sparingly, for issues I don't really have time to invest in or that are basically rabbit holes (e.g. anything to do with JavaScript or its ecosystem). My job is mostly on the backend anyway.

2. For actual backend architecture, I always do the high-level architecture myself (e.g. DDD). Then I literally open up gemini.google.com or claude.ai in the browser, copy-paste the existing code base into the chat, and physically leave my chair to go make coffee or a quick snack. This forces me to mentally process that using AI is a chore.

Previously, I was on tight Codex integration and, leaving the licensing fears aside, it became so good at writing Elixir code that it really stopped me from "thinking", aka using my brain. It felt good for the first few weeks, but I later realised the dependence it created. So I said fuck it and completely cancelled my subscription because it was too good at my job. I believe this is the only way that we won't end up like in Wall-E, sitting in front of giant screens, becoming mere blobs of flesh.

chrismarlow917d ago

I use local models on a Mac mini for most things and fall back to the hosted ones when they can't get the job done. Of course you have to break the work into smaller pieces yourself that a local model can understand. One good side effect of this is that you end up actually learning the code and how it's structured.

eitally17d ago

I have found something similar. I am easily distractible and if I don't have a written task backlog in front of me at all times, I find that when Claude is spinning I'll stop being productive. This is disconcerting for a number of reasons. Overall, I think training young people & new hires on agentic workflows -- and how to use agentic "human augmentation" productivity systems is critical. If it doesn't happen, that same couple of classes that lost academic progress during covid are going to suffer a double-whammy of being unprepared for workplace expectations.

Fwiw, I haven't spoken with any management-level colleague in the past 9 months who hasn't noted that asking about AI-comfort & usage is a key interview topic. For any role type, business or technical.

wiseowise17d ago

> It's literally higher leverage for me to go for a walk

Touching grass while you're outside might yield highest leverage.

i_love_retros17d ago

It makes me uneasy because my role now, which is prompting copilot, isn't worth my salary.

lumost17d ago

Out of curiosity, why do you not refill tokens in this case? When I'm actively working on a project I'm prone to spending a few hundred dollars per day, or a few thousand during the initial buildout of a new module etc.

dannyw17d ago

You’re still the one that’s controlling the model though and steering it with your expertise. At least that’s what I tell myself at night :)

I haven’t really thought about this before, but you’re right, it feels a bit uneasy for me too.

cco17d ago

Will the foundation for a skyscraper ever be dug with shovels again?

sigil17d ago

"Every augmentation is also an amputation." – McLuhan

https://driverlesscrocodile.com/technology/neal-stephenson-o...

bwhiting235617d ago

You are now a manager. If your minions are out sick, project is delayed, not the end of the world.

William_BB16d ago

> I'll make more progress than mentally wearing myself out reading a bunch of LLM generated code trying to figure out how to solve the problem manually.

I feel sorry for whoever has to work on that codebase. This is the literal definition of tech debt.

goosejuice17d ago

> than mentally wearing myself out reading a bunch of LLM generated code trying to figure out how to solve the problem manually.

That's probably a bad sign. Skills will atrophy, but we should be building systems that are still easy to understand.

jmole17d ago

The meta here is to use LLMs to make things simpler and easier, not to make things harder.

Turning tokens into a well-groomed and maintainable codebase is what you want to do, not "one shot prompt every new problem I come across".

rebolek17d ago

Have a pet project never touched by LLM. Once the tokens run out, go back to it and flourish it like your secret garden. It will move slowly but it will keep your sanity and your ability to review LLM code.

epolanski17d ago

I actually don't mind the coding part, but the information digging across the project is definitely by orders of magnitude slower if I do it on my own.

Melatonic17d ago

Suspect it will be like turn-by-turn directions for driving - soon we will have a whole group of people who can barely operate a vehicle without it

Bridged775617d ago

Not sure what you're doing then, or what kind of jobs you all work in where you can or do just brainlessly prompt LLMs. Don't you review the code? Don't you know what you want to do before you begin? This is such a non issue. Baffling that any engineer is just opening PRs with unreviewed LLM slop.

davmar17d ago

i wonder if this is how engineers felt when the first electronic calculators came out and engineers stopped doing math by hand.

did we feel uneasy that a new generation of builders didn't have to solve equations by hand because a calculator could do them?

i'm not sure it's the same analogy but in some ways it holds.

konfusinomicon17d ago

soooooo about Claude going down. we're gonna need you to sign in on Saturday and make up for lost time or unfortunately we're going to have to deduct the time lost from your paycheck. and as an aside your TPS reports have been sub-par as of late..is everything OK?

piokoch16d ago

Soon, very soon, AI tool providers will figure that out. And raise prices accordingly.

drusepth16d ago

> It's literally higher leverage for me to go for a walk if Claude goes down than to write code because if I come back refreshed and Claude is working an hour later then I'll make more progress than mentally wearing myself out reading a bunch of LLM generated code trying to figure out how to solve the problem manually.

Taking more breaks and "not working" during the work day sounds like something we should probably be striving to work towards more as a society.

keybored17d ago

Help. They’re constantly trying to make me try crack cocaine on the front page.

gip17d ago

Totally. That is why it is key to have open source and sovereign models that will always be accessible to all.

At the end of the day, all these closed models are being built by companies that pumped all the knowledge from the internet without giving much back. But competition and open source will make sure most of the value returns to most of the people.

ransom153817d ago

"when the tokens run out, I'm basically done working."

Oh stop the drama. Open source models can handle 99% of your questions.

deadbabe17d ago

Given that it’s so easy, would you still do this same job if paid half as much?

singingtoday16d ago

Very well put, and it mirrors my own thoughts.

Mauneam16d ago

You are that guy in early 1900s who would rather ride a horse than get in a car because cars "continued to make him uneasy."

simianwords17d ago

eh this kind of FUD needs to stop because it is kind of normal and expected and in fact good to have relation like this with technology.

BrokenCogs17d ago

I'm here for the pelicans and I'm not leaving until I see one!

qingcharles17d ago

I've come to prompt pelicans and chew gum, and I'm all outta gum!

pixel_popping17d ago

That's a true CTO right there.

bytesandbits17d ago

I know a 10x engineer when i see one.

RomanPushkin17d ago

Ctrl+F: pelican

F5

tantalor17d ago

simonw pls

h14h17d ago

This seems huge for subscription customers. Looking at the Artificial Analysis numbers, 5.5 at medium effort yields roughly the intelligence as 5.4 (xhigh) while using less than a fifth the tokens.

As long as tokens count roughly equally towards subscription plan usage between 5.5 & 5.4, you can look at this as effectively a 5x increase in usage limits.
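The arithmetic behind that estimate can be sketched in a few lines. This is only an illustration: it assumes plan usage is metered purely by token count and that the "less than a fifth" figure holds for your workload, and `effective_usage_multiplier` is a hypothetical helper, not anything OpenAI exposes.

```python
def effective_usage_multiplier(token_ratio: float) -> float:
    """Return how many times more tasks fit into a fixed token budget.

    token_ratio is (tokens the new model needs) / (tokens the old model needs)
    for the same task. Assumption: the plan meters usage by raw token count.
    """
    if token_ratio <= 0:
        raise ValueError("token_ratio must be positive")
    return 1.0 / token_ratio


# A ratio of one fifth, the upper bound quoted above.
multiplier = effective_usage_multiplier(0.2)
print(f"about {multiplier:.0f}x more tasks per budget")
```

If the plan weights 5.5 tokens more heavily than 5.4 tokens, the multiplier shrinks by that weighting factor.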

gausswho17d ago

As someone who always leaves intelligence at default, and am ok with existing models, should I be shifting gears more manually as providers sell us newer models? Is medium or lower better than free/cheaper models?

CompleteSkeptic17d ago

Is this the first time OpenAI has published comparisons to other labs?

Seems so to me - see GPT-5.4[1] and 5.2[2] announcements.

Might be a tacit admission of being behind.

[1] https://openai.com/index/introducing-gpt-5-4/ [2] https://openai.com/index/introducing-gpt-5-2/

oliver23616d ago

beautiful!!

gallerdude17d ago

If GPT-5.5 Pro really was Spud, and two years of pretraining culminated in one release, WOW, you cannot feel it at all from this announcement. If OpenAI wants to know why they feel like they’ve fallen behind the vibes of Anthropic, they need to look no further than their marketing department. This makes everything feel like a completely linear upgrade in every way.

I_am_tiberius17d ago

Clearly they felt a big backlash when version 5 was released. Now they are afraid of another response like this. And effectively, for the user it will likely only be a small update.

jimbob4517d ago

Also the naming department. You can tell that this is the AI company Microsoft chose to back because their naming scheme is as bad as .NET's.

jryio17d ago

Their 'Preparedness Framework'[1] is 20 pages and looks ChatGPT generated, I don't feel prepared reading it.

https://cdn.openai.com/pdf/18a02b5d-6b67-4cec-ab64-68cdfbdde...

sosodev17d ago

I hope the industry starts competing more on highest scores with lowest tokens like this. It's a win for everybody. It means the model is more intelligent, is more efficient to inference, and costs less for the end user.

So much bench-maxxing is just giving the model a ton of tokens so it can inefficiently explore the solution space.

an0malous17d ago

The premise of the trillion dollars in AI investments is not that it’ll be as good as it currently is but cheaper. It’s AGI or bust at this point.

louiereederson17d ago

For a 56.7 score on the Artificial Analysis Intelligence Index, GPT 5.5 used 22m output tokens. For a score of 57, Opus 4.7 used 111m output tokens.

The efficiency gap is enormous. Maybe it's the difference between GB200 NVL72 and an Amazon Trainium chip?

swyx17d ago

why would chip affect token quantity. this is all models.

dist-epoch17d ago

If it's a new pretrain, the token embeddings could be wider - you can pack more info into a token making its way through the system.

Like Chinese versus English - you need fewer Chinese characters to say something than if you write that in English.

So this model internally could be thinking in much more expressive embeddings.

AtNightWeCode17d ago

You need to compare total cost. Token count is irrelevant.

karmasimida17d ago

Chips don’t impact output quality to this magnitude

ativzzz17d ago

I like that they waited for opus 4.7 to come out first so they had a few days to find the benchmarks that gpt 5.5 is better at

eknkc17d ago

Well, anecdotally, 5.4 was already better than opus 4.7 so it should not have been hard.

wahnfrieden17d ago

I like that Anthropic rushed 4.7 out to get a couple days of coverage before 5.5 hit

khutorni16d ago

> One engineer at NVIDIA who had early access to the model went as far as to say: "Losing access to GPT‑5.5 feels like I've had a limb amputated."

That's a wild statement to put into your announcement. Are LLM providers now openly bragging about our collective dependency on their models?

Manik_agg16d ago

Recently started using Codex and Chatgpt again due to claude model getting nerfed or rate limits.

Tried gpt5.5 and so far good. Zapier also shared an automation benchmark where 5.5 came on top in the leaderboard https://zapier.com/benchmarks

cynicalpeace17d ago

It's possible that "smarter" AI won't lead to more productivity in the economy. Why?

Because software and "information technology" generally didn't increase productivity over the past 30 years.

This has been long known as Solow's productivity paradox. There's lots of theories as to why this is observed, one of them being "mismeasurement" of productivity data.

But my favorite theory is that information technology is mostly entertainment, and rather than making you more productive, it distracts you and makes you more lazy.

AI's main application has been information space so far. If that continues, I doubt you will get more productivity from it.

If you give AI a body... well, maybe that changes.

ewrs17d ago

It's quite possible the use of LLMs means that we are using less effort to produce the same output. This seems good.

But exerting less effort also conditions you to be weaker, and less able to connect deeply with the brain to grind as hard as you once did. This is bad.

Which effect dominates? Difficult to say.

Of course this is absolutely possible. Ultimately there was a time where physical exertion was a thing and nobody was over-weight. That isn't the case anymore is it.

hol4b16d ago

25 years of shipping software, and IT absolutely increased productivity - just not for everyone, not everywhere. Some workflows got 10x faster, others got slower from meetings about the new tools.

AI feels the same. I'm shipping indie apps solo now that would have needed a small team five years ago. But in bigger orgs I see people spending 20 minutes verifying 15-minute AI output that used to be a 30-minute task they'd just do. Depends where you sit.

aerhardt17d ago

> "information technology" generally didn't increase productivity

Do you think it'd be viable to run most businesses on pen and paper? I'll give you email and being able to consume informational websites - rest is pen and paper.

aiaiai17717d ago

Downvoted by the AI Nazis. They are running a tight ship before the IPOs.

losvedir17d ago

> It excels at ... researching online

How does this work exactly? Is there like a "search online" tool that the harness is expected to provide? Or does the OpenAI infra do that as part of serving the response?

I've been working on building my own agent, just for fun, and I conceptually get using a command line, listing files, reading them, etc, but am sort of stumped how I'm supposed to do the web search piece of it.

Given that they're calling out that this model is great at online research - to what extent is that a property of the model itself? I would have thought that was a harness concern.

wincy17d ago

I’ve noticed when writing little bedtime stories that require specific research (my kids like Pokemon stories and they’ve been having an episodic “pokemon adventure” with them as the protagonists) ChatGPT has done a fantastic job of first researching the moves the pokemon have, then writing the actual story. The only mistake it consistently makes is when I summarize and move from a full context session, it thinks that Gyarados has to swim and is incapable of flying.

It definitely seems like it does all the searching first, with a separate model, loads that in, then does the actual writing.

dist-epoch17d ago

It's a property of the model in the sense that it has great Google Fu.

The harness provides the search tool, but the model provides the keywords to search for, etc.
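That split can be sketched as a small tool loop. Everything here is hypothetical and for illustration only: `fake_model` stands in for the LLM, `fake_search` stands in for whatever search backend a harness would call, and the `web_search` tool name is made up rather than taken from any vendor's API.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class ToolCall:
    name: str   # which tool the model wants the harness to run
    query: str  # the search keywords the model came up with


@dataclass
class FinalAnswer:
    text: str


def fake_search(query: str) -> str:
    """Hypothetical stand-in for a real search backend owned by the harness."""
    corpus = {"gyarados type": "Gyarados is a Water/Flying-type."}
    return corpus.get(query.lower(), "no results")


def fake_model(transcript: list[str]) -> Union[ToolCall, FinalAnswer]:
    """Hypothetical stand-in for the LLM: search first, then answer."""
    results = [line for line in transcript if line.startswith("tool:")]
    if not results:
        return ToolCall(name="web_search", query="Gyarados type")
    return FinalAnswer(text="Based on search: " + results[-1][len("tool:"):].strip())


def run_agent(
    question: str,
    model: Callable[[list[str]], Union[ToolCall, FinalAnswer]],
    tools: dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> str:
    transcript = [f"user: {question}"]
    for _ in range(max_steps):
        step = model(transcript)
        if isinstance(step, FinalAnswer):
            return step.text
        # The harness executes the tool; the model only chose the name and query.
        transcript.append(f"tool: {tools[step.name](step.query)}")
    return "gave up"


print(run_agent("Can Gyarados fly?", fake_model, {"web_search": fake_search}))
```

In a real agent the model would be an API call and the search function would hit an actual search service; the loop shape stays the same, which is why query quality ends up being a property of the model.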

100ms17d ago

It's literally a distinct model with a different optimisation goal compared to normal chat. There's a ton of public information around how they work and how they're trained

2001zhaozhao17d ago

Pricing: $5/1M input, $30/1M output

(same input price and 20% more output price than Opus 4.7)

tedsanders17d ago

Yep, it's more expensive per token.

However, I do want to emphasize that this is per token, not per task.

If we look at Opus 4.7, it uses smaller tokens (1-1.35x more than Opus 4.6) and it was also trained to think longer. https://www.anthropic.com/news/claude-opus-4-7

On the Artificial Analysis Intelligence Index eval for example, in order to hit a score of 57%, Opus 4.7 takes ~5x as many output tokens as GPT-5.5, which dwarfs the difference in per-token pricing.

The token differential varies a lot by task, so it's hard to give a reliable rule of thumb (I'm guessing it's usually going to be well below ~5x), but hope this shows that price per task is not a linear function of price per token, as different models use different token vocabularies and different amounts of tokens.

We have raised per-token prices for our last couple models, but we've also made them a lot more efficient for the same capability level.

(I work at OpenAI.)
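A rough sketch of the per-task arithmetic, using only figures quoted in this thread: 22m vs 111m output tokens for near-identical Artificial Analysis scores, $30 per 1M output tokens for GPT-5.5, and $25 per 1M for Opus 4.7. The $25 figure is an inference from the "20% more output price" remark upthread, not a quoted price, and input tokens are ignored entirely.

```python
def task_cost_usd(output_tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of the output tokens only; input tokens are ignored in this sketch."""
    return output_tokens / 1_000_000 * usd_per_million_tokens


# Token counts as quoted in this thread for the Artificial Analysis eval run.
gpt55 = task_cost_usd(22_000_000, 30.0)
# $25/1M is inferred from "20% more output price than Opus 4.7" (assumption).
opus47 = task_cost_usd(111_000_000, 25.0)

print(f"GPT-5.5: ${gpt55:,.0f}  Opus 4.7: ${opus47:,.0f}  ratio: {opus47 / gpt55:.1f}x")
# prints "GPT-5.5: $660  Opus 4.7: $2,775  ratio: 4.2x"
```

Under those assumptions the higher per-token price is more than offset by the lower token count on this one eval; other tasks may show a much smaller gap.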

sergiotapia17d ago

That pricing is extremely spicy, wow.

oh_no17d ago

yes but as far as i know gpt tokenizer is about the same as opus 4.6's, where 4.7 is seeing something in the ballpark of a 30% increase. this should still be cheaper even disregarding the concerns around 4.7 thinking burning tokens

blixt17d ago

Releases keep shifting from API forward to product forward, with API now lagging behind proprietary product surface and special partnerships.

I'd not be surprised if this is the year where some models simply stop being available as a plain API, while foundation model companies succeed at capturing more use cases in their own software.

baalimago17d ago

Worth the 100% price increase over GPT-5.4?

cbg017d ago

For less than 10% bump across the benchmarks? Probably not, but if your employer is paying (which is probably what OAI is counting on) it's all good.

It's kind of starting to make sense that they doubled the usage on Pro plans - if the usage drains twice as fast on 5.5 after that promo is over a lot of people on the $100 plan might have to upgrade.

vessenes17d ago

Yay. 5.4 was a frustrating model - moments of extreme intelligence (I liked it very much for code review) - but also a sort of idiocy/literalism that made it very unsuited for prompting in a vague sense. I also found its openclaw engagement wooden and frustrating. Which didn’t matter until anthropic started charging $150 a day for opus for openclaw.

Anyway - these benchmarks look really good; I’m hopeful on the qualitative stuff.

NitpickLawyer17d ago

> Across all three evals, GPT‑5.5 improves on GPT‑5.4’s scores while using fewer tokens.

Yeah, this was the next step. Have RLVR make the model good. Next iteration start penalising long + correct and reward short + correct.
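As a toy illustration of what "penalise long + correct, reward short + correct" could look like, here is one possible reward shape. This is a guess at the general idea, not OpenAI's training recipe; the function name, the token `budget`, and the `penalty` weight are all made up.

```python
def length_penalized_reward(
    correct: bool, n_tokens: int, budget: int, penalty: float = 0.5
) -> float:
    """Hypothetical reward: 0 if wrong, otherwise 1 minus a length penalty.

    The penalty grows linearly with tokens used and is capped once the
    answer reaches the budget, so a correct answer never scores below
    1 - penalty.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    if not correct:
        return 0.0
    fraction_used = min(n_tokens / budget, 1.0)
    return 1.0 - penalty * fraction_used


short = length_penalized_reward(True, n_tokens=100, budget=1000)
long_ = length_penalized_reward(True, n_tokens=1000, budget=1000)
wrong = length_penalized_reward(False, n_tokens=100, budget=1000)
print(short, long_, wrong)
```

Keeping `penalty` below 1 means a long correct answer still beats a wrong one, so the model is not pushed to give up accuracy for brevity.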

> CyberGym 81.8%

Mythos was self reported at 83.1% ... So not far. Also it seems they're going the same route with verification. We're entering the era where SotA will only be available after KYC, it seems.

toraway17d ago

Isn't Mythos limited to a selected group of companies/organizations Anthropic chose themselves? If the OpenAI announcement for GPT-5.5 is accurate the "trusted cyber access" just requires an open, seemingly straightforward identity verification step.

https://openai.com/index/scaling-trusted-access-for-cyber-de...

  > We are expanding access to accelerate cyber defense at every level. We are making our cyber-permissive models available through Trusted Access for Cyber, starting with Codex, which includes expanded access to the advanced cybersecurity capabilities of GPT‑5.5 with fewer restrictions for verified users meeting certain trust signals at launch.

  > Broad access is made possible through our investments in model safety, authenticated usage, and monitoring for impermissible use. We have been working with external experts for months to develop, test and iterate on the robustness of these safeguards. With GPT‑5.5, we are ensuring developers can secure their code with ease, while putting stronger controls around the cyber workflows most likely to cause harm by malicious actors.

  > Organizations who are responsible for defending critical infrastructure  can apply to access cyber-permissive models like GPT‑5.4‑Cyber, while meeting strict security requirements to use these models for securing their internal systems.

"GPT‑5.4‑Cyber" is something else and apparently needs some kind of special access, but that CyberGym benchmark result seems to apply to the more or less open GPT-5.5 model that was just released.

cbg017d ago

Isn't CyberGym an open benchmark so trivial to benchmaxx anyway?

mattas17d ago

Not good for employees that are being measured by their token usage.

thinkindie17d ago

This reminds me of when Chrome and Firefox were racing to release a new “major version” (at least from the semver POV) without adding significantly new functionality, at a time when browsers were already becoming a commodity. Just as we no longer care about a new Chrome or Firefox version, we won't care about the release of a new model version.

k2xl17d ago

Surprised to see SWE-Bench Pro only a slight improvement (57.7% -> 58.6%) while Opus 4.7 hit 64.3%. I wonder what Anthropic is doing to achieve higher scores on this - and also what makes this test particularly hard to do well in compared to Terminal Bench (which 5.5 seemed to have a big jump in).

vexna17d ago

There's an asterisk right below that table stating that:

> *Anthropic reported signs of memorization on a subset of problems

And from the Anthropic's Opus 4.7 release page, it also states:

> SWE-bench Verified, Pro, and Multilingual: Our memorization screens flag a subset of problems in these SWE-bench evals. Excluding any problems that show signs of memorization, Opus 4.7’s margin of improvement over Opus 4.6 holds.

conradkay17d ago

Was 4.7 distilled off Mythos (which got 77.8%)? Interesting how mythos got 82% on terminal-bench 2.0 compared to 82.7% for GPT-5.5.

Also notice how they state just for SWE-Bench Pro: "*Anthropic reported signs of memorization on a subset of problems"

kburman17d ago

What a time. I am back here genuinely wishing for OpenAI to release a great model, because without stiff competition, it feels like Anthropic has completely lost its mind.

victor900016d ago

Care to elaborate? I jumped ship when 5.4 first released, have things gotten worse?

nickvec17d ago

I'm conflicted about whether I should keep my Claude Max 5x subscription at this point or switch back to GPT/Codex... anyone else in a similar position? I'd rather not be paying for two AI providers and context switching between the two, though I'm having a hard time gauging if Claude Code is still the "cream of the crop" for SWE work. I haven't played around with Codex much.

the_sleaze_17d ago

I have experienced 0 friction swapping between the 2 models; in fact, pitting them against each other has resulted in the highest success rate for me so far.

mpaepper17d ago

I switched from CC to Codex a few days ago. I get limited much less and the code quality is similar, so not looking back

slawr180517d ago

I was all in on Claude code as my daily driver for web development. And love it. But I enjoy using pi as my harness more and have never run out of tokens with Codex yet. Claude code almost always runs out for me with the same amount of usage.

After migrating for the token and harness issues, I was pleasantly surprised that Codex seems to perform as well or better too!

Things change so often in this field, but I prefer Codex now even though Anthropic seems to have so much more hype for coding.

scottyah17d ago

Every time I've followed the hype and tried OpenAI models I've found them lacking for the most part. It might just be that I prefer the peer-programming vs spec-ing out the task and handing it off, but I've never been as productive as I am with Claude. Also, I'm still caught up on the DoD ethics stuff.

meetpateltech17d ago

GPT-5.5 System Card:

https://deploymentsafety.openai.com/gpt-5-5

ZeroCool2u17d ago

Benchmarks are favorable enough they're comparing to non-OpenAI models again. Interesting that tokens/second is similar to 5.4. Maybe there's some genuine innovation beyond bigger model better this time?

qsort17d ago

It's behind Opus 4.7 in SWE-Bench Pro, if you care about that kind of thing. It seems on-trend, even though benchmarks are less and less meaningful for the stuff we expect from models now.

Will be interesting to try.

vanillameow16d ago

Because Opus is kind of degrading lately, I said "fuck it" and made a new OAI account and used the month free trial. I put one query into ChatGPT using 5.5 thinking - the frustrating thing was that it did put more effort into getting correct answers than Opus, which is just guessing. Specifically, I asked about the coding harness pi, and despite explicitly referring to it as a harness, Opus 4.7, 4.6 and Sonnet 4.6 all fell back to telling me about Aider or OpenCode and ignored my query completely, while ChatGPT said "I'll assume pi is a harness" and then did in fact find the harness.

However the language of ChatGPT is still the same slop as years ago, so many headings, so many emojis, so many "the important thing nobody mentions". 10 paragraphs of text for what should be a two paragraph response. Even with custom instructions (keep answers short and succinct) and using their settings (less list, less emoji, less fluff) it's still NOTICEABLY worse than Claude on base settings.

I've yet to test Codex, will get to that this weekend, but in terms of research or general Q&A I have no idea how anyone could prefer this to Claude. Unfortunately Claude has seemingly stopped giving a fuck about researching entirely.

jdw6417d ago

GPT is really great, but I wish the GPT desktop app supported MCP as well.

You can kind of use connectors like MCP, but having to use ngrok every time just to expose a local filesystem for file editing is more cumbersome than expected.

throwaway91128217d ago

Use codex app

M4R5H4LL17d ago

I am a heavy Claude Code user. I just tried using Codex with 5.4 (as a Plus user I don't have access to 5.5 yet), and it was quite underwhelming. The app stopped regularly much earlier than what I wanted. It also claimed to have fixed issues when it did not; this is not a hallmark of GPT, and Opus has similar issues, but Claude will not make the same mistake three times in a row. It is unusable at the moment, while Claude allows me to get real work done on a daily basis. Until then...

bhu817d ago

Gpt-5.3-codex is miles better than 5.4 in that regard. It’s better at orchestration, and does the things that it said it did. Haven’t tested 5.5 yet but using 5.4 for exploration + brainstorming and handing over the findings to 5.3-codex works pretty well

thimabi17d ago

Will we also see a GPT-5.5-Codex version of this model? Or will the same version of it be served both in the web app and in Codex?

Uehreka17d ago

After 5.1, we haven’t seen a -codex-max model, presumably because the benefits of the special training gpt-5.1-codex-max got to improve long context work filtered into gpt-5.2-codex, making the variant no longer necessary (my personal experience accords with this). I’ve been using gpt-5.4 in Codex since it came out, it’s been great. I’ve never back-to-back tested a version against its -codex variant to figure out what the qualitative difference is (this would take a long time to get a really solid answer), but I wouldn’t be surprised if at some point the general-purpose model no longer needs whatever extra training the -codex model gets and they just stop releasing them.

I thought it was weird that for almost the entire 5.3 generation we only had a -codex model, I presume in that case they were seeing the massive AI coding wave this winter and were laser focused on just that for a couple months. Maybe someday someone will actually explain all of this.

jumploops17d ago

> GPT‑5.5 improves on GPT‑5.4’s scores while using fewer tokens.

This might be great if it translates to agentic engineering and not just benchmarks.

It seems some of the gains from Opus 4.6 to 4.7 required more tokens, not less.

Maybe more interesting is that they’ve used codex to improve model inference latency. iirc this is a new (expectedly larger) pretrain, so it’s presumably slower to serve.

beering17d ago

With Opus it’s hard to tell what was due to the tokenizer changes. Maybe using more tokens for the same prompt means the model effectively thinks more?

conradkay17d ago

They say latency is the same as 5.4 and 5.5 is served on GB200 NVL72, so I assume 5.4 was served on Hopper.

Rapzid17d ago

In Copilot where it's easy to switch models Opus 4.6 was still providing, IMHO, better stock results than GPT-5.4.

Particularly in areas outside straight coding tasks. So analysis, planning, etc. Better and more thorough output. Better use of formatting options (tables, diagrams, etc).

I'm hoping to see improvements in this area with 5.5.

svara16d ago

Do we know if this is another post training fine tune or based on a much larger new pretraining run (which I believe they were calling 'Spud' internally)?

The large price bump might indicate the latter.

cscheid17d ago

I know this is irrelevant in the grand scheme of things, but that WebGL animation is really quite wrong. That is extra funny given the "ensure it has realistic orbital mechanics." phrase in the prompt.

I prescribe 20 hours of KSP to everyone involved, that'll set them right.

williamcotton16d ago

One-shot converted my game from a 2D board to a 3D board along with all entities and animations. Sold!

https://github.com/williamcotton/space-trader/commit/0859c65...

gcanyon17d ago

Once upon a time humans had to memorize log tables.

Once upon a time humans had to manually advance the spark ignition as their car's engine revved faster.

Once upon a time humans had to know the architecture of a CPU to code for it.

History is full of instances of humans meeting technology where it was, accommodating for its limitations. We are approaching a point where machines accommodate to our limitations -- it's not a point, really, but a spectrum that we've been on.

It's going to be a bumpy ride.

kordlessagain16d ago

If anyone wants Codex CLI containers with various MCP tools available, I built this: https://deepbluedynamics.com/nemesis

pants217d ago

Labs still aren't publishing ARC-AGI-3 scores, even though it's been out for some time. Is it because the numbers are too embarrassing?

tedsanders16d ago

Honest answer is that it isn't done running yet. It takes some human bandwidth and time to run, so results weren't ready by this morning. We don't know what the score will be, but it will probably go up on the leaderboard sometime soon. I personally don't put a lot of stock in the ARC-AGI evals, as they're not relevant to most work that people do, but it should still be interesting to see as a measure of reasoning ability.

(I work at OpenAI.)

stonecauldron11d ago

Because they want to keep the narrative that they'll achieve AGI with LLMs alive.

AG2517d ago

GPT-5.5 was just released and OpenAI didn't mention ARC-AGI-3 at all; their score probably sucks.

kilroy12317d ago

To be fair, there's not much to report. Isn't it pretty much at 0?

nullbyte17d ago

82.7% on Terminal Bench is crazy

toephu217d ago

Is it? There are 5 other models near ~80% and it was achieved in March... which in AI-world seems like a century ago.

https://www.tbench.ai/leaderboard/terminal-bench/2.0

bradley1317d ago

"our strongest set of safeguards to date"

How much capability is lost, by hobbling models with a zillion protections against idiots?

Every prompt gets evaluated, to ensure you are not a hacker, you are not suicidal, you are not a racist, you are not...

Maybe just...leave that all off? I know, I know, individual responsibility no longer exists, but I can dream.

iugtmkbdfil83417d ago

This is my personal pet peeve as well. Like, I accept maybe everything shouldn't be offered to everyone, but maybe just gatekeep it behind a credit card (but I know that is a market penetration no-no). It feels like such a waste of power (electrical, and the potential we might be missing out on).

maxdo17d ago

With such huge progress from OpenAI and Anthropic, how do Chinese open-source providers even expect to make comparable money? I have a few friends in China and they all use Claude. Training the model costs the same, but the return from an open-source model is, I'd imagine, 1000 times less. Money flow for them outside of China is abysmal.

objektif17d ago

Are there faster mini/nano versions as well?

tedsanders17d ago

Not this time, no.

abi17d ago

Usually, those get released a few weeks later.

niklasd16d ago

Just burned through my 5 hour window in Codex (Business plan) in 10 minutes with GPT-5.5. Was excited to use it, but I guess I have to wait 5 hours now (it's not yet available in the API, so I can't switch there).

bandrami17d ago

Cool. Now there will be a week of "this is the greatest model ever and I think mine just gained sentience", followed by a week of "I think they must have just nerfed it because it's not as good as it was a week ago", followed by three weeks of smart people cargo culting the specific incantations they then convince themselves make it work best.

extr17d ago

Seems like a continuation of the current meta where GPT models are better in GPT-like ways and Claude models are better in Claude-like ways, with the differences between each slightly narrowing with each generation. 5.5 is noticeably better to talk to, 4.7 is noticeably more precise. Etc etc.

benjx8817d ago

Good job on the release notice. I appreciate that it isn't just marketing fluff, but actually includes the technical specs for those of us who care and aren't focused on coding agents only.

I hope GPT 5.5 Pro is not cutting corners and neutered from the start; you've got the compute for it not to be.

rarisma17d ago

I like that it's more consistent than in the 4o and o4 days, but 5.4, 5.3, 5.2, etc. are still a mess; for example, 5.2 and 5.1 don't have mini models and 5.3 was codex only.

Anthropic is slightly better, but where is 4.6 or 4.7 haiku, or 4.7 sonnet, etc.?

nickandbro17d ago

Very impressive! Interesting how it seems to surpass Opus 4.7 on all other benchmarks except SWE-Bench Pro (Public). You would think that, doing so well at Cyber, it would naturally possess more abilities there. Wonder what makes up the actual difference there.

GenerWork17d ago

Looking at the space/game/earthquake tracker examples makes me hopeful that OpenAI is going to focus a bit more on interface visual development/integration from tools like Figma. This is one area where Anthropic definitely reigns supreme.

impulser_17d ago

What is the reason behind OpenAI being able to release new models very fast?

Since Feb when we got Gemini 3.1, Opus 4.6, and GPT-5.3-Codex we have seen GPT-5.4 and GPT-5.5 but only Opus 4.7 and no new Gemini model.

Both of these are pretty decent improvements.

minimaxir17d ago

Competition.

literalAardvark17d ago

Anthropic is really tiny, and Google is just being Google, their models are just to show that they're hip with what the kids are doing.

wmf17d ago

I wonder if it's the same model and they just keep adding more post-training.

tantalor17d ago

They aren't new models.

Flow16d ago

These new models consume so many tokens. I’m very satisfied with GPT-5.2 on High. I hope they keep that one for many years

YmiYugy17d ago

So according to the benchmarks somewhere in between Opus 4.7 and Mythos

jorl1717d ago

GPT 5.4 is already better than Opus 4.7 to me. But, then again, Opus 4.7 is a massive disappointment. I hope they don't discontinue 4.6.

aetherspawn17d ago

Umm yeah but this is like every release in the last 3 years.

The big question is: does it still just write slop, or not?

Fool me once, fool me twice, fool me for the 32nd time, it’s probably still just slop.

enraged_camel17d ago

Is this the first time OpenAI compared their new release to Anthropic models? Previously they were comparing only to GPT's own previous versions.

ionwake17d ago

is there anywhere I can try it? ( I just stopped my pro sub ) but was wondering if there is a playground or 3rd party so i can just test it briefly?

k2xl17d ago

ARC-AGI 3 is missing on this list - given that the SOTA before 5.5 was <1% if I recall, I wonder if this didn't make meaningful progress.

redox9917d ago

It's a silly benchmark anyways.

w10-117d ago

NYTimes article - on the same day?

  https://www.nytimes.com/2026/04/23/technology/openai-new-model.html

I can see how some model releases would meet the NY Times news-worthy threshold if they demonstrated significance to users - i.e., if most users were astir and competitors were re-thinking their situation.

However, this same-day article came out before people really looked at it. It seems largely intended to contrast OpenAI with Anthropic's caution, before there has been any evidence that the new model has cyber-security implications.

It's not at all clear that the broader discourse is helping, if even the NY Times is itself producing slop just to stoke questions.

kaant16d ago

The '.5' models are always the actual production-ready versions. GPT-5 was for the mainstream hype, 5.5 is for the developers. I don't need it to be magically smarter; just give me lower latency, cheaper API tokens, and reliable tool-calling without hallucinations.

AbuAssar17d ago

This is the first time OpenAI has included competing models in their benchmarks; previously they always included only OpenAI models.

tantalor17d ago

> A playable 3D dungeon arena

Where's the demo link?

deaux17d ago

ctrl+f "cutoff", 0 results

Surely it doesn't still have the same ancient data cutoff as 5.4 did?

zerotosixty17d ago

Those who are using gpt5.5 how does it compare to Opus 4.6 / 4.7 in terms of code generation?

cmrdporcupine17d ago

Not rolled out to my Codex CLI yet, but some users on Reddit claiming it's on theirs.

mondojesus17d ago

I'm still using 5.3 in codex. Are 5.4 and 5.5 better than 5.3 in concrete ways?

cbg017d ago

The benchmarks say so, but try it out with actual tasks and be the judge.

amiune16d ago

Will there ever be ChatGPT 6.0 or Claude 5.0?

faxmeyourcode17d ago

How does it compare to mythos?

renecito17d ago

Why do the stats of every AI on every release look about the same?

Are the tests getting harder and harder so the older AIs look worse and the new ones look like they are "almost there"?

arjunthazhath16d ago

Is it better than claude code?

adam1217d ago

"Sometime with GPT-5.5 I become lazy"

I don't want to be lazy.

phillipcarter17d ago

... sigh. I realize there's little that can be done about this, but I just got through a real-world session determining if Opus 4.7 is meaningfully better than Opus 4.6 or GPT 5.4, and now there's another one to try things with. These benchmark results generally mean little to me in practice.

Anyways, still exciting to see more improvements.

cchrist17d ago

Which is better GPT-5.5 or Opus 4.7? And for what tasks?

senko17d ago

I might just be following too many AI-related people on X, but omg the media blitz around 5.5 is aggressive.

Soo many unconvincing "I've had access for three weeks and omg it's amazing" takes, it actually primes me for it to be a "meh".

I prefer to see for myself, but the gradual rollout, combined with full-on marketing campaign, is annoying.

user3428316d ago

I used it last night for iOS app development and it felt like a noticeable improvement.

With the Pro plan it was available in both Codex and ChatGPT already when I first checked, which was within an hour of the release.

egorfine17d ago

> We are releasing GPT‑5.5 with our strongest set of safeguards to date

...

> we’re deploying stricter classifiers for potential cyber risk which some users may find annoying initially

So we should be expecting to not be able to check our own code for vulnerabilities, because inherently the model cannot know whether I'm feeding my code or someone else's.

dannyw17d ago

Hopefully not, because checking your codebase for vulnerabilities is really valuable.

I hope it’s just limits on pentesting and stuff, and not for code analysis and review.

Manik_agg16d ago

OpenAI finally catching up with claude

onepiecenaruto16d ago

had issues using this model on my codex

Schlagbohrer17d ago

entering this comments area wondering if it will be full of complaints about the new personality, as with every single LLM update

vardump17d ago

I just can't bear to use services from this company after what they did to the global DRAM markets.

I'm not trying to make any kind of moral statement, but the company just feels toxic to me.

woeirua17d ago

Nice to see them openly compare to Opus-4.7… but they don’t compare it against Mythos which says everything you need to know.

The LinkedIn/X influencers who hyped this as a Mythos-class model should be ashamed of themselves, but they’ll be too busy posting slop content about how “GPT-5.5 changes everything”.

A_D_E_P_T17d ago

Almost nobody can actually use Mythos, though?

I_am_tiberius17d ago

I'd really like to see improvements like these:
- Some technical proof that data is never read by OpenAI.
- Proof that no logs of my data or derived data are saved.
etc...

anematode17d ago

I don't think this is technically possible without something like homomorphic encryption, which poses too large of a runtime cost for usage in LLMs

throwaway202717d ago

Good timing I had just renewed my subscription.

numbers17d ago

I've stopped trusting these "trust me bro" benchmarks and just started going to LM Arena and looking for the actual benchmark comparisons.

https://arena.ai/leaderboard/code

stri8ted17d ago

I doubt this is representative of real world usage. There is a difference between a few turns on a web chatbot, vs many-turn cli usage on a real project.

nba456_17d ago

This is not any better of a benchmark

theihtisham16d ago

I just installed Codex and gave GPT 5.5 a try. It's good compared to the previous one.

ace2pace17d ago

I hear it's as good as Opus 4.7.

The battle has just begun

swrrt16d ago

I heard someone say it is better than Opus 4.7. Recently, a lot of my friends have complained about performance degradation in Opus 4.7 and previous models.

dmd16d ago

How do people feel about using Altman’s company’s stuff considering what we now know about him? I switched to Anthropic months ago because of it, but Anthropic’s product has been on a total shitshow decline train since then, and I’m starting to be tempted back in spite of the evil.

nickandbro17d ago

I just prompted GPT-5.5 Pro "Solve Nuclear Fusion" and it one shotted it (kidding obviously)

debba17d ago

Cannot see it in Codex CLI

boring-human17d ago

Did you upgrade the tool binaries? I also couldn't see it until after the upgrade.

c0rruptbytes17d ago

literally cannot launch the codex app anymore

neuroelectron16d ago

Are they using RTX 5090s now?

RayVR16d ago

My first experience with 5.5 via ChatGPT was immensely disappointing. It was a massive reduction in quality compared to 5.4, which already had issues.

Pooge16d ago

Up until now I only paid LLM subscriptions to Anthropic but I'm going to give ChatGPT a chance when my current subscription runs out next month.

elAhmo17d ago

Is Codex receiving a 5.4 or 5.5 release?

I am still using Codex 5.3 and haven't switched to GPT 5.4 as I don't like the 'it's automatic bro, trust us' approach, so I'm wondering whether Codex is going to get these specific releases at all in the future.

jawiggins17d ago

What is the major and minor semver meaning for these models? Is each minor release a new fine-tuning with a new subset of example data while the major releases are made from scratch? Or do they even mean anything at this point?

gck117d ago

Nothing. The next major increment is going to happen when marketing department is confident they can sell it as a major improvement without everyone laughing at them. Which at this point seems like never.

I think Anthropic's fearmongering and "leaks" of Mythos were them testing the ground for 5.x, which seems to have backfired.

PilotJeff16d ago

So exhausted from all this endless bs... Keep releasing; this reminds me of all the .com software during that era, where wow, we are already at version 3.0 and it’s only been 60 days.

journal17d ago

does it have cached pricing?

aussieguy123417d ago

If SWE-Bench Verified is no longer a good measure of agentic coding abilities, what benchmark now is?

jedisct117d ago

GPT-5.4 is already an incredible model for code reviews and security audits with the swival.dev /audit command.

The fact that GPT-5.5 is apparently even better at long-running tasks is very exciting. I don’t have access to it yet, but I’m really looking forward to trying it.

wslh17d ago

Related and insightful: "GPT-5.5: Mythos-Like Hacking, Open to All" [1].

[1] https://news.ycombinator.com/item?id=47879330

ant6n17d ago

My impression has been that ChatGPT-5.4 has been getting dumber and more exhausting in the last couple of weeks. Like it makes a lot of obvious mistakes, ignores (parts of) prompts, and keeps forgetting important facts or requirements.

Maybe this is a crazy theory, but I sometimes feel like they gimp their existing models before a big release so you'll notice more of a "step".

varispeed17d ago

I am sceptical. The generation after 4o models have become crappier and crappier. Hope this one changes the trend. 5.4 is unusable for complex coding work.

mannanj16d ago

This might not be the place to discuss this press release by the company, though here it is. I feel like companies like OpenAI have lost their integrity and honor from past actions and activities, and then just pretend that didn't happen and use media and influence to shift focus onto denying their past. There's so much distasteful and IMO outright harmful conduct that has occurred with this company: an OpenAI employee murdered before a large testimony, and that employee's mom actively sharing posts that paint Altman in a distrustful light (pointing to the CEO clearly not demonstrating proper responsibility towards this matter); there's the large number of resignations, many recent; the whole board matter, where the coup and leveraging Microsoft and large company relationships and threatening to destroy the company brought Altman back in (Anthropic forming as a result of all that). How can I trust them when they employ the same controversial, manipulative, abusive tactics as every other large company?

luqtas17d ago

they are using ethical training weights this time!!! /j

xnx17d ago

Next up: Google I/O on May 19?

I have to imagine they'll go to Gemini 3.5 if only for marketing reasons.

throwaw1217d ago

If anyone tried it already, how do you feel?

Numbers look too good, wondering if it is benchmaxxed or not

i_love_retros17d ago

Oh shiiiiit boy! An incrementation dropped!!

damnitbuilds16d ago

Woop woop !

Now, after all this time, this must shurely be the release that does all software developers out of a job ?

Or has Dirty Sam been caught lying, again ?

Cos I've still got a programming job, and GPT can't do it for shit.

yuvrajmalgat17d ago

finally

immanuwell16d ago

Big claims from OpenAI as usual - GPT-5.5 sounds impressive on paper, but we've been down this road before, so I'll believe the 'no speed tradeoff' part when I see it in the wild

baxuz17d ago

Ah yes, the next "trust me bro"

XCSme17d ago

2x the price for 1-5% performance gain

justonepost217d ago

the attenuation of man nears

< 5 years until humans are buffered out of existence tbh

may the light of potentia spread forth beyond us

coderssh17d ago

Great model, I have been using Codex and it's awesome. Let's see what GPT-5.5 does to it

MagicMoonlight17d ago

Two hundred pages of shilling and it’s a 1% improvement in the benchmarks. They’re dead in the water.

Imagine spending 100m on some of these AI “geniuses” and this is the best they can do.
