
0 points · svantana · 5mo ago · 0 comments

SWE-bench Verified is probably benchmaxxed at this stage. Claude isn't even the top performer; that honor goes to Doubao [1].

Also, the confidence interval for such a small dataset is about 3 percentage points, so these differences could just be down to chance.
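For a rough sanity check on that figure, here is a minimal sketch. It assumes the 500-task size of SWE-bench Verified, treats each task as an independent pass/fail trial, and uses the plain normal-approximation (Wald) interval. The function name and the example pass rates are illustrative, not taken from any leaderboard.

```python
import math


def binomial_ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation (Wald) interval for a pass rate.

    p: observed pass rate in [0, 1]
    n: number of benchmark tasks
    z: critical value (1.96 corresponds to roughly 95% coverage)
    """
    return z * math.sqrt(p * (1.0 - p) / n)


# Assumption: 500 tasks, each an independent Bernoulli trial.
N_TASKS = 500

if __name__ == "__main__":
    # Illustrative pass rates in the range discussed in this thread.
    for rate in (0.70, 0.75, 0.80):
        half_width = binomial_ci_half_width(rate, N_TASKS)
        print(f"pass rate {rate:.0%}: +/- {100 * half_width:.1f} points")
```

Under these assumptions the half-width comes out between roughly 3.5 and 4 points for pass rates in the 70 to 80% range, so gaps of a point or two between models are within noise. A Wilson interval would be a bit tighter near the extremes, but the conclusion is the same.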


Claude 4.5 gets 82% on their own highly customized scaffolding (parallel compute with a scoring function). That beats Doubao.
