Cost per token is a bit misleading because, as others have noted, different models use tokens in different ways. (Aside: this is also why TPS isn't a great metric.)
We found 5.5 to be about 1.5-2x more expensive overall. On a "Pareto" basis, only 5.5 at xhigh is worth it; at the lower reasoning levels, 5.4 still edges it out on cost/performance.
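To make the per-token-price point concrete, here's a minimal sketch of the arithmetic. All prices, token counts, and turn counts below are hypothetical, picked only to illustrate the shape of the tradeoff, not drawn from any real model's pricing:

```python
# Illustrative sketch (all numbers hypothetical): why per-token price
# alone is misleading. A model with a higher per-token price can still
# be cheaper per task if it emits fewer tokens and finishes in fewer turns.

def cost_per_task(price_per_mtok: float, tokens_per_turn: int, turns: int) -> float:
    """Total cost of one task in dollars, given price per million tokens."""
    return price_per_mtok * (tokens_per_turn * turns) / 1_000_000

# Hypothetical model A: cheaper per token, but chattier and needs more turns.
a = cost_per_task(price_per_mtok=2.0, tokens_per_turn=8_000, turns=12)

# Hypothetical model B: 2x the per-token price, but terser and more decisive.
b = cost_per_task(price_per_mtok=4.0, tokens_per_turn=5_000, turns=6)

print(f"A: ${a:.3f} per task, B: ${b:.3f} per task")
```

Under these made-up numbers, B costs less per completed task despite double the per-token price, which is why overall spend is the number worth comparing.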
We take a spec-driven approach and mostly work in TS (on product development), so if you use a more steer-y approach, or work in a different domain, YMMV.
[0]: https://aibenchy.com/compare/openai-gpt-5-4-medium/openai-gp...
Stronger models needing fewer turns to achieve a task feels like a prime source of efficiency gains for agentic coding, more so than individual responses being shorter.
Rankings at https://gertlabs.com/rankings?mode=agentic_coding. See the efficiency chart at the bottom.
And here I am daily driving Sonnet 4.6 with medium or high thinking. I'm actually thoroughly satisfied with the work it does. Perhaps it has to do with the bite-sized pieces of work I give it, which fit better with my workflow.
For personal use I switched to coding plans with GLM 5.1, Kimi K2.6, and Xiaomi MiMo V2.5 Pro, and I've never been happier. I said goodbye to both Claude Max and Cursor.
That's got to be a very tricky analysis given how subjective quality is. But I'm sure there are people trying to pin it down.