0 points · impulser_ · 14d ago · 0 comments

Because there is literally nothing special about coding hardnesses. The models are doing all the lifting. It's just user experience that separates them.

A coding hardness with just bash outperforms Codex, Claude Code, OpenCode, Pi, etc. The added features are just user experience features.

13 comments · 5 top-level

avadodin · 14d ago · 5 in thread

A harness (notice the lack of a 'd') is a strap system to gain control over something.

Like the thing people attach a dog lead to so that their kids won't just go kamikaze into a car.

Coding harnesses are named by analogy to that.

They are not hard.

TurdF3rguson · 14d ago

The reason I have a dog harness is to distribute weight so I don't choke her when she goes at the other dog that she doesn't like. I'm actually puzzling over kids kamikazeing into cars.

imp0cat · 13d ago

It's actually only a problem if it's the other way around, isn't it?

If kids run into a car, they will most probably just bounce and continue, perhaps inflicting some minor damage. But if a car mows down a kid, that could well be a fatal injury. Leashes for all the cars! ;)

avadodin · 14d ago

It is a common fear for parents. Obviously they are not fighting for the emperor but chasing or running away from something.

The strapped kids are often normal with no apparent disabilities (but it is possible they have an ADHD diagnosis).

Never thought about doing it to my own.

impulser_ (OP) · 14d ago

You got to miss spell these days or people assume your ai :)

IncRnd · 14d ago

That's very punnyy

Its like yuo're on fire!

Supermancho · 14d ago · 2 in thread

If harnesses are basically doing nothing, why would these metrics vary so widely?

https://www.endorlabs.com/research/ai-code-security-benchmar...

There are a lot of ways to configure agents, and any implicit configuration in a harness may have a non-trivial effect.

impulser_ (OP) · 14d ago

It's because they do things that is why they score differently. Coding hardness add features for user experience not for agent efficiency. If they did, all the coding hardnesses would be using bash and code mode and letting the agents write code to perform tasks, but this doesn't work because you want humans in the loop. You want users to be able to approve and deny writes. You want users to see edits. So you have to build tools for these. It's hard to show diffs when the agent is just using bash.
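For readers wondering what an approve/deny gate around writes might look like, here is a minimal Python sketch. It is not taken from any of the harnesses named in this thread: `propose_write` and the `approve` callback are hypothetical names, and the only library call used is the standard `difflib.unified_diff`. The idea is simply that the harness, not the agent, computes the diff and asks the human before anything touches disk.

```python
import difflib
from typing import Callable


def propose_write(
    old: str,
    new: str,
    path: str,
    approve: Callable[[str], bool],
) -> tuple[bool, str]:
    """Build a unified diff for a proposed edit and ask for approval.

    Returns (approved, diff). This function performs no file I/O itself;
    the caller writes `new` to `path` only if `approved` is True.
    """
    diff = "".join(
        difflib.unified_diff(
            old.splitlines(keepends=True),
            new.splitlines(keepends=True),
            fromfile=f"a/{path}",
            tofile=f"b/{path}",
        )
    )
    if not diff:
        # Nothing changed, so there is nothing to approve.
        return False, ""
    # `approve` is where a real harness would render the diff and prompt.
    return approve(diff), diff


if __name__ == "__main__":
    approved, diff = propose_write(
        "a\nb\n", "a\nc\n", "notes.txt", approve=lambda d: True
    )
    print(diff)
    print("approved:", approved)
```

When the agent edits files through arbitrary bash commands there is no single "old" and "new" text to hand to a function like this, which is the difficulty the comment above points at.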

Supermancho · 14d ago

> The added features are just user experience features.

> It's because they do things that is why they score differently.

That was my point. Regardless of how you feel about UX, it's a value-added set of features. The question initially posed stands. Why would a company do any of these things?

> Coding hardness add features for user experience not for agent efficiency.

Pretending it was always about some metric you just decided was important is moving the goalposts. It's not compelling.

I think it makes more sense that it's Freemium Dominance or they act as Low-Cost Marketing tools.

cookiengineer · 13d ago · 1 in thread

I would disagree here.

Building a good and working coding harness with smaller models is really hard. Everything revolves around the limited context size.

Tools must be specification-driven to reduce noise and high-temperature hallucinations, tool call shrinking needs to remove errors and tryouts of different parameter formats (because LLMs always ignore descriptions in the JSON...), and you have to deal with long-running agents because you can't afford them. In a planner/orchestrator architecture, agent-to-agent communication needs to be summarized, and then you have the messed-up scheduling parts, because you need to prioritize short-running agents and give the planner a tool to wait for the outputs of spawned contractor agents.

And that's not even talking about sandbox vs playground read/write/access policies of tools.

Harness engineering, if done correctly, is quite hard.

And all of this works 60% of the time, every time.

Anyways, that was somewhat the summary of the last 6 months building my exocomp agentic environment. And it's still not satisfying to work with.
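To make "tool call shrinking" a bit more concrete, here is one possible reading of it as a Python sketch. This is an assumption, not the exocomp implementation: `ToolCall` and `shrink_history` are hypothetical names, and the rule used (drop a failed attempt once a later call to the same tool has succeeded) is just one simple policy for keeping errors and parameter tryouts out of a small context window.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str   # tool name, e.g. "read_file"
    args: str   # raw arguments as the model emitted them
    ok: bool    # whether the call succeeded


def shrink_history(calls: list[ToolCall]) -> list[ToolCall]:
    """Drop failed attempts that a later successful call to the same tool supersedes.

    Failures with no later success are kept, because the model still
    needs to see them to try something different.
    """
    kept: list[ToolCall] = []
    succeeded_later: set[str] = set()
    # Walk backwards so we know, for each call, whether a success follows it.
    for call in reversed(calls):
        if call.ok:
            succeeded_later.add(call.tool)
            kept.append(call)
        elif call.tool not in succeeded_later:
            kept.append(call)
    kept.reverse()
    return kept


if __name__ == "__main__":
    history = [
        ToolCall("read_file", '{"file": "a.py"}', ok=False),
        ToolCall("read_file", '{"path": "a.py"}', ok=True),
        ToolCall("write_file", '{"path": "b.py"}', ok=False),
    ]
    for call in shrink_history(history):
        print(call)
```

A real harness would probably match on more than the tool name (for example the target path), but even this crude version shows why the shrinking step has to understand tool results rather than just truncate text.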

calgoo · 13d ago

In my limited experience, the smaller the model, the bigger the harness. Whereas with something like Claude or DeepSeek the context size etc. just lets you give it bash access and step back, small models tend to do better with simple action-response, with a new context each call. Context management becomes a continuous activity. It's a fun space, and I have found big models decent at building and improving these harnesses for the small ones, using /loop and just running a continuous test-build-test loop.

vidarh · 13d ago

Try Kimi in Kimi CLI and in Claude Code and try saying that again. Without the measures that are in their CLI but not in Claude Code, Kimi quickly collapses into tool-calling loops and is largely useless for any long-running task in a harness that doesn't take this into account.

With those measures (which are actually quite interesting) it can at times perform at Sonnet level.
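The comment does not say what those measures are, so the following is only a generic illustration of the problem, not a description of Kimi CLI. It sketches one common way a harness might notice a tool-calling loop: remember the last few calls and flag a call that keeps recurring with identical arguments. `LoopGuard`, `window`, and `max_repeats` are hypothetical names and values.

```python
import json
from collections import deque


class LoopGuard:
    """Flag a tool call that repeats too often within a sliding window."""

    def __init__(self, window: int = 6, max_repeats: int = 3) -> None:
        # Only the most recent `window` calls are remembered.
        self.recent: deque[tuple[str, str]] = deque(maxlen=window)
        self.max_repeats = max_repeats

    def check(self, tool: str, args: dict) -> bool:
        """Record a call and return True if it looks like a loop."""
        # Serialize with sorted keys so argument order does not matter.
        key = (tool, json.dumps(args, sort_keys=True))
        self.recent.append(key)
        return self.recent.count(key) >= self.max_repeats


if __name__ == "__main__":
    guard = LoopGuard()
    for step in range(4):
        looping = guard.check("list_dir", {"path": "."})
        print(step, "loop suspected" if looping else "ok")
```

What the harness does once the guard fires (inject a reminder, trim the context, or stop the run) is a separate design choice and is where harnesses are likely to differ.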

selcuka · 14d ago

Your reply doesn't answer the question: What is their motivation for any of it?
