
throwaway314155 9mo ago

It's interesting you say that because o3, while being a considerable improvement over OpenAI's other models, still doesn't match the performance of Opus 4 and Gemini 2.5 Pro by a long shot for me.

However, o3 resides in the ChatGPT app, which is still superior to the other chat apps in many ways; in particular, its internet search implementation works very well.


svachalek 9mo ago

If you're coding through chat apps you're really behind the times. Try an agent IDE or plugin.

joshmlewis 9mo ago

Yeah, exactly. For everyone who might not know, the chat apps add lots of complex system prompting to handle and shape personality, tone, general usability, etc. IDEs also do this (with Claude Code being one of the closest to a "bare" model that you can get), but they are at least guiding its behavior to be really good at coding tasks. Another reason is the Agent feature that IDEs have had for a few months now, which gives the model the ability to search/read/edit files across your codebase. You may not like the idea of this, and it can feel like losing control, but it's the future. After months of using it I've learned how to get it to do what I want, but I think a lot of people who try it once and stop get frustrated that it does something dumb and just assume it's not good. That's a practice and skill problem, not a model problem.

jona777than 9mo ago

This has been my experience. It has been something I’ve had to settle into. After some reps, it is becoming more difficult to imagine going back to regular old non-assisted coding sessions that aren’t purely for hobby.

Your model rankings are spot on. I’m hesitant to make the jump to top-tier premium models as daily drivers, so I hang out with Sonnet 4 and/or Gemini 2.5 Pro for most of the day (Max mode in Cursor). I don’t want to get used to premium quality coming that easily, for some reason. I completely agree that the concise, thoughtful code is worth it, though. I’m having to do that myself using tier 2 models. I still use o3 periodically for getting clarity of thought or troubleshooting gnarly bugs that Claude gets caught looping on.

How would you compare Cursor to Claude Code? I’m yet to try the latter.

Workaccount2 9mo ago

IDEs are intimidating to non-tech people.

I'm surprised there isn't a VibeIDE yet that is purpose-built to make it possible for your grandmother to execute code output by an LLM.


joshvm 9mo ago

An important caveat here is yes, for coding. Apps are fine for coming up with one-liners, or doing other research. I haven't found the quality of IDE-based code to be significantly better than what ChatGPT would suggest, but it's very useful to ask questions when the model both has access to the code and can prompt you to run tests which rely on local data (or even attached hardware). I really don't trust YOLO mode, so I manually approve terminal calls.

My impression (with Cursor) is that you need to practice some sort of LLM-first design to get the best out of it. Either vibe code your way from the start, or be brutal about limiting what changes the agent can make without your approval. It does force you to be very atomic about your requests, which isn't a bad thing, but writing a robust spec for the prompt is often slower than writing the code by hand and asking for a refactor. As soon as kipple, for lack of a better word, sneaks into the code, it's a reinforcing signal to the agent that it can add more.

It's definitely worth paying the $20 and playing with a few different clients. The rabbit hole is pretty deep and there are still a ton of prompt engineering suggestions from the community. It encourages a lot of creative guardrails, like using pre-commit to provide negative feedback when the model does something silly like trying to write a 200-word commit message. I haven't tried JetBrains' agent yet (Junie), but that seems like it would be a good one to explore as well since it presumably integrates directly with the tooling.
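
As a rough illustration of that kind of guardrail, here is a minimal sketch of a Git `commit-msg` hook script, which Git invokes with the path of the proposed message file as its first argument and which aborts the commit on a non-zero exit. The 50-word limit and the name `check_commit_message` are assumptions made up for this example, not anything from a particular tool.

```python
#!/usr/bin/env python3
"""Sketch of a commit-msg hook that rejects overly long commit messages."""
import sys

MAX_WORDS = 50  # hypothetical limit; tune to taste


def check_commit_message(text: str, max_words: int = MAX_WORDS) -> tuple[bool, str]:
    """Return (ok, feedback) for a commit message, ignoring '#' comment lines."""
    lines = [line for line in text.splitlines() if not line.lstrip().startswith("#")]
    word_count = sum(len(line.split()) for line in lines)
    if word_count > max_words:
        return False, f"Commit message has {word_count} words; the limit is {max_words}."
    return True, "ok"


def main(argv: list[str]) -> int:
    # Git passes the path of the file holding the proposed message as argv[1].
    with open(argv[1], encoding="utf-8") as fh:
        ok, feedback = check_commit_message(fh.read())
    if not ok:
        # Printed to stderr so an agent running `git commit` sees the rejection.
        print(feedback, file=sys.stderr)
    return 0 if ok else 1


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

Saved as an executable `.git/hooks/commit-msg`, a script along these lines would bounce the commit and hand the agent a short, concrete reason to try again.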

throwaway314155 (OP) 9mo ago

I think this is debatable. But I've used Cursor and various extensions for VS Code. They're all fine (but Cursor can fuck all the way off for stealing the `code` shell integration from VS Code), but you don't _need_ an IDE, as Claude Code has shown us (currently my primary method of vibe coding).

It's mostly about the cost though. Things are far more affordable in the various apps/subscriptions. Token-priced APIs can get very expensive very quickly.

hirako2000 9mo ago

We are trading tokens and mental health for time?

I used Cursor well over a year ago. It gave me a headache. It was very immature. I used Cursor again more recently: the headache intensity increased. It's not Cursor; it's the senseless loops of hoping the LLM will spit out something somewhat correct. Revisiting the prompt. Trying to become an elite in language protocols because we need that machine to understand us.

Leaving aside the headache and its side effects, it isn't clear we haven't already maxed out on the efficiency of productivity tools. Autocomplete. Indexed and searchable docs on a second screen rather than having to turn the pages of some reference book. Etc., etc.

I'm convinced at this stage that we've already started to trade too far. So far beyond the optimal balance that these aren't diminishing returns. It is absolute diminishing.

Engineers need to spend more time thinking.

I'm convinced that engineers, if they were to choose, would throw this thing out and make space for more drawing boards, and would take a 5-minute Solitaire break every hour. Or take a walk.

For some reason the constant pressure to go faster eventually makes its mark.

It feels right to see thousands of lines of code written up by this thing. It feels aligned with the inadequate way we've been measured.

Anyway. It can get expensive and this is by design.


baw-bag 9mo ago

I am really struggling with this. I tried Cline with both OpenAI and Claude, with very weird results: often burning through credits to get nowhere, or just running out of context. I just got Cursor to try, so I can't say anything on that yet.

joshmlewis 9mo ago

It's a skill that takes some persistence and trial and error. Happy to chat with you about it if you want to send me an email.


PeterStuer 9mo ago

Depends. For devops, chat is quite nice, as the exploration/understanding is key, not just writing out the configs.

jorvi 9mo ago

What's most annoying about Gemini 2.5 is that it is obnoxiously verbose compared to Opus 4, both in explaining the code it wrote and in the number of lines it writes and comments it adds, to the point where the output is often 2-3x longer than Opus 4's.

You can obviously alleviate this by asking it to be more concise but even then it bleeds through sometimes.

joshmlewis 9mo ago

Yes, this is what I mean by conciseness with o3. If prompted well, it can produce extremely high-quality code that blows me away at times. I've also had several instances now where I gave it slightly wrong context and other models just butchered a solution, proposing dozens of lines for a fix which I could tell wasn't right; then, after reverting and asking o3, it immediately went searching for another file I hadn't included and fixed the problem in one line. That kind of, dare I say, independent thinking is worth a lot when dealing with complex codebases.

jorvi 9mo ago

Personally I still am of the opinion current LLMs are more of a very advanced autocomplete.

I have to think of the guy posting that he fed his entire project codebase to an AI, it refactored everything, modularizing it but still reducing the file count from 20 to 12. "It was glorious to see. Nothing worked of course, but glorious nonetheless".

In the future I can certainly see it getting better and better, especially because code is a hard science that reduces down to control flow logic, which reduces down to math. It's a much narrower problem space than, say, poetry or visuals.

joshmlewis 9mo ago

What languages and IDE do you use it with? I use it in Cursor, mainly with Max reasoning on. I spent around $300 on token-based usage for o3 alone in May, while still only accepting around 33% of suggestions. I made a post on X about this the other day, but I expect the number of rejections will go down significantly by the end of this year at the rate things are going.

drawnwren 9mo ago

Very strange. I find reasoning has very narrow usefulness for me. It's great to get a project in context or to get oriented in the conversation, but on long conversations I find reasoning starts to add way too much extraneous stuff and get distracted from the task at hand.

I think my coding model ranking is something like Claude Code > Claude 4 raw > Gemini > big gap > o4-mini > o3

joshmlewis 9mo ago

Claude Code isn't a model in itself. By default it routes requests to either Opus 4 or Sonnet 4, mostly Sonnet 4, unless you explicitly set the model.


throwaway314155 (OP) 9mo ago

I'm using it with Python, VS Code (not integrated with Claude, just basic Copilot), and Claude Code. For Gemini I'm using AI Studio with repomix to package my code into a single file. I copy files over manually in that workflow.

All subscription-based, not per-token pricing. I'm currently using Claude Max. Can't see myself exhausting its usage at this rate, but who knows.
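
For anyone unfamiliar with the single-file workflow, the general idea of packing a codebase into one pasteable blob can be sketched roughly as below. This is not repomix's actual implementation or output format; the function name `pack_files`, the `=====` header style, and the default `.py` filter are all assumptions for illustration.

```python
from pathlib import Path


def pack_files(root: str, suffixes: tuple[str, ...] = (".py",)) -> str:
    """Concatenate matching files under root, each preceded by a path header."""
    root_path = Path(root)
    parts = []
    # Sort so the packed output is stable between runs.
    for path in sorted(root_path.rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            rel = path.relative_to(root_path).as_posix()
            body = path.read_text(encoding="utf-8")
            parts.append(f"===== {rel} =====\n{body}")
    return "\n\n".join(parts)
```

Calling `pack_files("path/to/project")` returns one string that can be pasted into a chat UI; the path headers give the model enough context to say which file an edit belongs in, which matters when the changes are copied back by hand.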
