
0 points · isege · 6d ago · 0 comments

> Claude Code is the best autonomous coding agent.

If you look at the terminal-bench@2.0 leaderboard, you'll quickly see it's actually one of the weakest agentic harnesses. Anthropic's own models score lower with Claude Code than with virtually any other harness.

So it's quite the opposite. Claude Code is arguably the worst harness to run models with.


DaanDL · 6d ago

Okay, but not all results on there are valid. ForgeCode, for instance, has cheated in the past:

https://debugml.github.io/cheating-agents/#sneaking-the-answ...

andxor · 6d ago

Then the benchmarks are wrong.

cpursley · 6d ago

Those benches are completely and totally meaningless when it comes to real-world work tasks, and everyone knows it.
