Cost per token is a bit misleading because, as others have noted, different models use tokens in different ways. (Aside: this is also why TPS isn't a great metric.)
We found 5.5 to be about 1.5-2x more expensive overall. On a "Pareto" basis, only 5.5 at xhigh is worth it; at the lower reasoning levels, 5.4 still edges it out on cost/performance.
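To make the per-token-price point concrete, here's a minimal sketch of the arithmetic. All prices, token counts, and turn counts below are hypothetical, picked only to illustrate the shape of the tradeoff, not drawn from any real model's pricing:

```python
# Illustrative sketch (all numbers hypothetical): why per-token price
# alone is misleading. A model with a higher per-token price can still
# be cheaper per task if it emits fewer tokens and finishes in fewer turns.

def cost_per_task(price_per_mtok: float, tokens_per_turn: int, turns: int) -> float:
    """Total cost of one task in dollars, given price per million tokens."""
    return price_per_mtok * (tokens_per_turn * turns) / 1_000_000

# Hypothetical model A: cheaper per token, but chattier and needs more turns.
a = cost_per_task(price_per_mtok=2.0, tokens_per_turn=8_000, turns=12)

# Hypothetical model B: 2x the per-token price, but terser and more decisive.
b = cost_per_task(price_per_mtok=4.0, tokens_per_turn=5_000, turns=6)

print(f"A: ${a:.3f} per task, B: ${b:.3f} per task")
```

Under these made-up numbers, B costs less per completed task despite double the per-token price, which is why overall spend is the number worth comparing.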
We take a spec-driven approach and mostly work in TS (on product development), so if you use a more steer-y approach, or work in a different domain, YMMV.
[0]: https://aibenchy.com/compare/openai-gpt-5-4-medium/openai-gp...
Stronger models needing fewer turns to achieve a task feels like a prime source of efficiency gains for agentic coding, more so than individual responses being shorter.
Rankings at https://gertlabs.com/rankings?mode=agentic_coding. See the efficiency chart at the bottom.
And here I am daily driving Sonnet 4.6 with medium or high thinking. I'm actually thoroughly satisfied with the work it does. Perhaps it has to do with the bite-sized pieces of work I give it, which fit better with my workflow.
For personal use I switched to coding plans with GLM 5.1, Kimi K2.6, and Xiaomi MiMo V2.5 Pro, and I've never been happier. I said goodbye to both Claude Max and Cursor.
That's got to be a very tricky analysis given how subjective quality is. But I'm sure there are people trying to pin it down.