However, o3 resides in the ChatGPT app, which is still superior to the other chat apps in many ways, particularly the internet search implementation works very well.
Your model rankings are spot on. I’m hesitant to make the jump to top tier premium models as daily drivers, so I hang out with sonnet 4 and/or Gemini 2.5 pro for most of the day (max mode in Cursor). I don’t want to get used to premium quality coming that easy, for some reason. I completely align with the concise, thoughtful code being worth it though. I’m having to do that myself using tier 2 models. I still use o3 periodically for getting clarity of thought or troubleshooting gnarly bugs that Claude gets caught looping on.
How would you compare Cursor to Claude Code? I’m yet to try the latter.
I'm surprised there isn't a VibeIDE yet that is purpose build to make it possible for your grandmother to execute code output by an LLM.
My impression (with Cursor) is that you need to practice some sort of LLM-first design to get the best out of it. Either vibe code your way from the start, or be brutal about limiting what changes the agent can make without your approval. It does force you to be very atomic about your requests, which isn't a bad thing, but writing a robust spec for the prompt is often slower than writing the code by hand and asking for a refactor. As soon as kipple, for lack of a better word, sneaks into the code, it's a reinforcing signal to the agent that it can add more.
It's definitely worth paying the $20 and playing with a few different clients. The rabbit hole is pretty deep and there's still a ton of prompt engineering suggestions from the community. It encourages a lot of creative guardrails, like using pre-commit to provide negative feedback when the model does something silly like try to write a 200 word commit message. I haven't tried JetBrains' agent yet (Junie), but that seems like it would be a good one to explore as well since it presumably integrates directly with the tooling.
It's mostly about the cost though. Things are far more affordable in the the various apps/subscriptions. Token-priced API's can get very expensive very quickly.
I used Cursor well over a year ago. It gave me a headache. It was very immature. Used cursor more recently: the headache intensity increased. It's not cursor it is the senseless loops hoping for the LLM to spit out something somewhat correct. Revisiting the prompt. Trying to become an elite in language protocols because we need that machine to understand us.
Leaving aside the headache, its side effects. It isn't clear we haven't already maxed out on the productivity tools efficiency. Auto complete. Indexed and searchable doc a second screen rather than having to turn the pages of some reference book. Etc etc.
I'm convinced at this stage that we've already started to trade too far. So far beyond the optimal balance that these aren't diminishing returns. It is absolute diminishing.
Engineers need to spend more time thinking.
I'm convinced that engineers, if they were to chose, would throw this thing out and make space for more drawing boards, would use a 5 minute Solitaire break every 1h. Or take a walk.
For some reason the constant pressure to go faster eventually makes its mark.
It feels right to see thousands of lines of code written up by this thing. It feels aligned with the inadequate way we've been measured.
Anyway. It can get expensive and this is by design.
You can obviously alleviate this by asking it to be more concise but even then it bleeds through sometimes.
I have to think of the guy posting that he fed his entire project codebase to an AI, it refactored everything, modularizing it but still reducing the file count from 20 to 12. "It was glorious to see. Nothing worked of course, but glorious nonetheless".
In the future I can certainly see it get better and better, especially because code is a hard science that reduces down to control flow logic which reduces down to math. It's a much more narrow problem space than, say, poetry or visuals.
I think my coding model ranking is something like Claude Code > Claude 4 raw > Gemini > big gap > o4-mini > o3
All subscription based, not per token pricing. I'm currently using Claude Max. Can't see myself exhausting its usage at this rate but who knows.