A coding hardness with just bash outperforms Codex, Claude Code, OpenCode, Pi, etc. The added features are just user-experience features.
Like the thing people attach a dog lead to so that their kids won't just go kamikaze into a car.
Coding harnesses are named by analogy to that.
They are not hard.
If kids run into a car, they will most probably just bounce and continue, perhaps inflicting some minor damage. But if a car mows down a kid, that could well be a fatal injury. Leashes for all the cars! ;)
The strapped kids are often normal, with no apparent disabilities (but it is possible they have an ADHD diagnosis).
Never thought about doing it to my own.
https://www.endorlabs.com/research/ai-code-security-benchmar...
There are a lot of ways to configure agents, and any implicit configuration in a harness may have a non-trivial effect.
> It's because they do things that is why they score differently.
That was my point. Regardless of how you feel about UX, it's a value-added set of features. The question initially posed stands: why would a company do any of these things?
> Coding hardness add features for user experience not for agent efficiency.
Pretending it was always about some metric you just decided was important is moving the goalposts. It's not compelling.
I think it makes more sense that it's freemium dominance, or that they act as low-cost marketing tools.
Building a good, working coding harness with smaller models is really hard. Everything revolves around the limited context size.
Tools must be specification-driven to reduce noise and high-temperature hallucinations. Tool-call shrinking needs to remove errors and the tryouts of different parameter formats (because LLMs always ignore descriptions in the JSON...). You also have to deal with long-running agents, because you can't afford them. With a planner/orchestrator architecture, agent-to-agent communication needs to be summarized, and then you have the messed-up scheduling parts: you need to prioritize short-running agents and give the planner a tool to wait for the outputs of spawned contractor agents.
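For illustration, tool-call shrinking could look roughly like the Python below. This is a minimal sketch under assumptions, not the actual exocomp code: the `ToolCall` record, its fields, and `shrink_tool_calls` are hypothetical names, and the rule used (drop a failed call once the same tool succeeds later) is only one crude heuristic.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """Hypothetical record of one tool invocation in the agent history."""
    tool: str    # tool name
    args: dict   # parameters the model sent
    ok: bool     # whether the call succeeded
    output: str  # result or error text


def shrink_tool_calls(history: list[ToolCall]) -> list[ToolCall]:
    """Drop failed calls that a later successful call to the same tool supersedes."""
    kept: list[ToolCall] = []
    succeeded_later: set[str] = set()
    # Walk backwards so that, for each call, we already know whether
    # the same tool succeeds at some later point in the history.
    for call in reversed(history):
        if call.ok:
            succeeded_later.add(call.tool)
            kept.append(call)
        elif call.tool not in succeeded_later:
            # No later success: keep the error so the model can still react to it.
            kept.append(call)
    kept.reverse()
    return kept


history = [
    ToolCall("read_file", {"file": "a.py"}, False, "unknown parameter 'file'"),
    ToolCall("read_file", {"path": "a.py"}, True, "print('hi')"),
    ToolCall("run_tests", {}, False, "2 failed"),
]
shrunk = shrink_tool_calls(history)
```

Here the first `read_file` attempt with the wrong parameter name is dropped, while the failing `run_tests` call stays because nothing later supersedes it. Matching on the tool name alone will also discard failures that had a different intent, so a real harness would probably compare arguments as well.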
And that's not even talking about sandbox vs playground read/write/access policies of tools.
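To make the policy part concrete, here is a small Python sketch of a per-zone access check. The comment doesn't define "sandbox" and "playground", so this assumes a sandbox is a throwaway directory tools may read and write, and a playground is the project tree tools may only read. The `POLICY` table, the zone paths, and `is_allowed` are all hypothetical.

```python
from pathlib import Path

# Hypothetical policy table: which operations each zone allows.
POLICY = {
    "sandbox": {"read", "write"},
    "playground": {"read"},
}


def is_allowed(zones: dict[str, Path], target: str, op: str) -> bool:
    """Return True if `op` on `target` is permitted by the zone containing it."""
    # resolve() normalizes ".." segments, so a path cannot escape its zone.
    resolved = Path(target).resolve()
    for name, root in zones.items():
        if resolved.is_relative_to(root.resolve()):  # Python 3.9+
            return op in POLICY[name]
    return False  # outside every zone: deny by default


zones = {
    "sandbox": Path("/tmp/agent/sandbox"),
    "playground": Path("/tmp/agent/project"),
}
can_write_scratch = is_allowed(zones, "/tmp/agent/sandbox/out.txt", "write")
can_write_project = is_allowed(zones, "/tmp/agent/project/main.py", "write")
```

Denying by default for paths outside every zone keeps a forgotten directory from silently becoming writable. The check assumes the zones don't nest; with nested roots the first match wins.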
Harness engineering, if done correctly, is quite hard.
And all of this works 60% of the time, every time.
Anyways, that was somewhat the summary of the last 6 months building my exocomp agentic environment. And it's still not satisfying to work with.
With those measures (which are actually quite interesting) it can at times perform at Sonnet level.