
0 points · user43928 · 12d ago · 0 comments

It wasn't so much an eval; I really just wanted a small change moved out to another branch.

GPT 5.4 mini couldn't do it. Not even on the second attempt, where it went from obviously wrong to a subtly wrong copy.

In the end I had to manually copy and paste the 10-20 lines over.

If it can't even do that job, I seriously doubt it would be adequate for implementing a plan, which people often suggest as a way to save the output tokens of a better model.


1 comment · 1 top-level

pixlmint · 11d ago

Like I said, I never really used it for agentic work. I had previously evaluated locally runnable models with opencode (such as qwen3-coder), but found that it wasn't really feasible.

Since then I've adopted a different philosophy, and I actually prefer it this way.

I still very much enjoy doing most of the coding myself, but when I tried tools like Claude Code, I found it very difficult to return to the codebase after letting Claude make some changes. Maybe that's just poor AI-use discipline on my part, I don't know. With smaller models, that's not even an issue. I can't just let one do all the coding and thinking for me. However, if I can describe the function I want in great detail in plain English, Gemma can write it for me, and it will most likely work. It's perfect for boilerplate.

I also recently worked with a web framework I'd never used before, though I'm deeply familiar with others. So I asked it, "I know how to do this in Y framework; what's the best-practice approach in Z framework?" It was incredibly helpful, even pushing back on some of my 'bad' attempts at solving a problem.

I think GPT 5.4 mini might fall into a similar category: it probably performs best when it isn't overwhelmed with too many tools, skills, and MCPs, and is instead given clearly defined tasks by an orchestrator model. I call models like that my token burners, as they're super cheap to run and have high tokens/second.
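
To make the orchestrator idea concrete, here is a rough Python sketch of what I mean, not a real setup. Everything in it is hypothetical: `Task`, `build_tasks`, `run_plan`, `fake_worker`, the tool names, and the limit of two tools per task are all made up for illustration, and `fake_worker` only stands in for a call to a small, cheap model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Assumed limit, chosen only for illustration: keep each task's tool list short
# so the small model is not overwhelmed.
MAX_TOOLS_PER_TASK = 2


@dataclass
class Task:
    """One narrowly scoped instruction handed to a small model (hypothetical type)."""
    description: str
    tools: List[str] = field(default_factory=list)


def build_tasks(steps: List[str], tools_by_step: Dict[str, List[str]]) -> List[Task]:
    """Orchestrator side: turn plan steps into tasks with a trimmed tool list."""
    tasks = []
    for step in steps:
        allowed = tools_by_step.get(step, [])[:MAX_TOOLS_PER_TASK]
        tasks.append(Task(description=step, tools=allowed))
    return tasks


def run_plan(tasks: List[Task], worker: Callable[[Task], str]) -> List[str]:
    """Hand each task to the worker in order and collect the results."""
    return [worker(task) for task in tasks]


def fake_worker(task: Task) -> str:
    """Stand-in for the cheap model; a real worker would send the task to it."""
    tool_list = ", ".join(task.tools) or "none"
    return f"done: {task.description} (tools: {tool_list})"
```

The point of the design is that the worker only ever sees one clearly described task and a couple of tools at a time. Swapping `fake_worker` for a function that calls an actual small model would leave the orchestrator side unchanged.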