Not always, no, and it takes real investment in good prompting/guardrails/plans/explicit test recipes. I'm still, on average, better at programming in context than Codex 5.4, even if slower. But in terms of "task complexity I can entrust to a model and not be completely disappointed and annoyed", it scores the best so far. Saves a lot of review/iteration overhead.
It's annoying, too, because I don't much like OpenAI as a company.
(Background: 25 years of C++ etc.)
At least until next week, when Mythos and GPT-6 throw it all up in the air again.
But I don't use extra high thinking unless it's for code review. I sit at GPT 5.4 high 95% of the time.
RE is a very interesting problem. A lot more than SWE can be RE'd. I've found LLMs are reluctant to assist, though you can work around that.
GPT-5 is good at benchmarks, but benchmarks are more forgiving of a misaligned model. Many real-world tasks don't require strong reasoning or high intelligence so much as the ability to understand what the task is from a minimal prompt.
Not every shop assistant needs a physics degree, and not every physics professor is necessarily qualified to be a shop assistant. A person, or an LLM, can be very smart and still be very bad at understanding people.
For example, if GPT-5 takes my code and rearranges something for no reason, that won't affect its benchmarks, because the code still produces the same answers. But now I have to spend more time reviewing its output to make sure it hasn't done that. The more time I spend post-processing its output, the lower its capability, since on real-world tasks capability is often measured by time saved.
Me: Let's figure out how to clone our company WordPress theme in Hugo. Here are some tools you can use, and here's a way to compare screenshots; iterate until there's 0% difference.
Codex: Okay Boss! I did the thing! I couldn't get the CSS to match, so I just took PNGs of the original site and put them in place! Matches 100%!
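For the curious: the diff loop in that prompt can be as dumb as a pixel comparison, which is exactly why it's gameable. A minimal sketch with Pillow (file paths are hypothetical, not from the actual setup):

    from PIL import Image, ImageChops

    def diff_percent(path_a: str, path_b: str) -> float:
        """Return a 0-100 score for how different two screenshots are."""
        a = Image.open(path_a).convert("RGB")
        b = Image.open(path_b).convert("RGB")
        if a.size != b.size:
            return 100.0  # treat mismatched dimensions as a full miss
        diff = ImageChops.difference(a, b)  # per-channel absolute difference
        total = sum(sum(px) for px in diff.getdata())
        max_total = 255 * 3 * a.size[0] * a.size[1]
        return 100.0 * total / max_total

    # Hypothetical usage: iterate on the Hugo theme until this hits 0.0.
    # A model that swaps in the original site's PNGs passes instantly,
    # because the check only sees pixels, not how they were produced.
    print(diff_percent("wordpress.png", "hugo.png"))

The check measures the outcome, not the method, so "serve screenshots of the old site" satisfies it perfectly.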