Your model rankings are spot on. I’m hesitant to make the jump to top-tier premium models as daily drivers, so I hang out with Sonnet 4 and/or Gemini 2.5 Pro for most of the day (max mode in Cursor). I don’t want to get used to premium quality coming that easily, for some reason. I completely agree that concise, thoughtful code is worth it, though. I’m having to do that myself using tier 2 models. I still use o3 periodically for getting clarity of thought or troubleshooting gnarly bugs that Claude gets caught looping on.
How would you compare Cursor to Claude Code? I have yet to try the latter.
I'm surprised there isn't a VibeIDE yet that is purpose-built to make it possible for your grandmother to execute code output by an LLM.
The major LLM chat interfaces often have code execution built in, so there kind of is, it just doesn't look like what an SWE thinks of as an IDE.
My impression (with Cursor) is that you need to practice some sort of LLM-first design to get the best out of it. Either vibe code your way from the start, or be brutal about limiting what changes the agent can make without your approval. It does force you to be very atomic about your requests, which isn't a bad thing, but writing a robust spec for the prompt is often slower than writing the code by hand and asking for a refactor. As soon as kipple, for lack of a better word, sneaks into the code, it's a reinforcing signal to the agent that it can add more.
It's definitely worth paying the $20 and playing with a few different clients. The rabbit hole is pretty deep and there are still a ton of prompt engineering suggestions from the community. It encourages a lot of creative guardrails, like using pre-commit to provide negative feedback when the model does something silly like trying to write a 200-word commit message. I haven't tried JetBrains' agent (Junie) yet, but that seems like it would be a good one to explore as well since it presumably integrates directly with the tooling.
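As a rough illustration of that kind of guardrail, here is a minimal sketch of a commit message check in Python. The 50-word limit and the function names are made up for the example, and how you wire it up (through pre-commit's commit-msg stage or directly as .git/hooks/commit-msg) depends on your setup.

```python
import sys

MAX_WORDS = 50  # hypothetical limit, chosen for illustration; tune to taste


def check_commit_message(text: str, max_words: int = MAX_WORDS) -> list[str]:
    """Return a list of complaints about a commit message (empty if it passes)."""
    # Git treats lines starting with '#' as comments by default, so skip them.
    lines = [ln for ln in text.splitlines() if not ln.startswith("#")]
    word_count = sum(len(ln.split()) for ln in lines)
    problems = []
    if word_count == 0:
        problems.append("commit message is empty")
    if word_count > max_words:
        problems.append(
            f"commit message has {word_count} words (limit {max_words}); summarize it"
        )
    return problems


def main(message_path: str) -> int:
    """Check the message file and return a process exit code (non-zero rejects)."""
    with open(message_path, encoding="utf-8") as fh:
        problems = check_commit_message(fh.read())
    for problem in problems:
        # The agent sees this on stderr, which is the "negative feedback".
        print(f"rejected: {problem}", file=sys.stderr)
    return 1 if problems else 0


# A commit-msg hook receives the path to the message file as its first argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1]))
```

A non-zero exit aborts the commit, so the agent gets the error text back and usually retries with a shorter message.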
It's mostly about the cost though. Things are far more affordable in the various apps/subscriptions. Token-priced APIs can get very expensive very quickly.
I used Cursor well over a year ago. It gave me a headache. It was very immature. I used Cursor again more recently: the headache intensity increased. It's not Cursor; it's the senseless loops of hoping for the LLM to spit out something somewhat correct. Revisiting the prompt. Trying to become an elite in language protocols because we need that machine to understand us.
Leaving aside the headache and its side effects, it isn't clear we haven't already maxed out on the efficiency of productivity tools. Autocomplete. Indexed and searchable docs on a second screen rather than having to turn the pages of some reference book. Etc., etc.
I'm convinced at this stage that we've already started to trade too far. So far beyond the optimal balance that these aren't diminishing returns. The returns are shrinking in absolute terms.
Engineers need to spend more time thinking.
I'm convinced that engineers, if they were to choose, would throw this thing out and make space for more drawing boards, would take a 5-minute Solitaire break every hour. Or take a walk.
For some reason the constant pressure to go faster eventually makes its mark.
It feels right to see thousands of lines of code written up by this thing. It feels aligned with the inadequate way we've been measured.
Anyway. It can get expensive and this is by design.
I have bipolar disorder. This makes programming incredibly difficult for me at times. Almost all the recent improvements to code generation tooling have been a tremendous boon for me. Coding is now no longer this test of how frustrated I can get over the most trivial of tasks. I just ask for what I want precisely and treat responses like a GitHub PR where mistakes may occur. In general (and for the trivial tasks I'm describing) Claude Code will generate correct, good code (I inform it very precisely of the style I want, and tell it to use linters/type-checkers/formatters after making changes) on the first attempt. No corrections needed.
tl;dr - It's been nothing but a boon for this particular mentally ill person.
Now, if AI assistance allows you to perform well, then that is a different story and I take my advice back, of course.
There are a lot of positive things to say about how LLMs enable people to perform tasks that would otherwise be impossible for them, whether due to handicaps or simply lacking the abilities or the opportunity to train.
My comment was on the impact on "healthy" individuals, who remain the majority of the population. And I only spoke for myself. I have no clue; maybe it is just me, or it is due to how I use the thing. Thanks for sharing your experience though. I had not considered that what might be a concern for the majority might very well be an enabler for others.
Language: Syntax errors rise, and a common form is the syntax of a more common language bleeding through.
Domain: Quality is controlled less by what humans deem complex and more by how much code and documentation there is for a domain. Interestingly, in a less common subdomain it will often revert to a more common approach (for example, working on shaders for a game that takes place in a cylinder geometry requires a lot more hand-holding than on a plane). It's usually not that they can't do it, but that they require much more involved prompting to get the context appropriately set up, and then you have to manage the drift back to default, more common patterns. Related are decisions with long-term consequences. LLMs are pretty weak at these. In humans this skill comes with experience, so it's rare and an instance of low coverage.
Dates: Related is reverting to obsolete API patterns.
Complexity: While not as dominant as domain coverage, complexity does play a role, with the likelihood of error rising with complexity.
This means if you're at the intersection of multiple of these (such as a low coverage problem in a functional language), agent mode will likely be too much of a waste for you. But interactive mode can still be highly productive.