0 points · verdverm · 6mo ago · 0 comments

We're also in benchmark saturation territory. I've heard it speculated that Anthropic emphasizes benchmarks less in its publications because, internally, it cares about them far less than about making a model that works well day to day.


5 comments · 5 top-level

stego-tech · 6mo ago

These models still consistently fail the only benchmark that matters: if I give you a task, can you complete it successfully without making shit up?

Thus far they all fail. Generated code doesn’t run, or variables aren’t captured correctly, or hallucinations are stated as fact rather than flagged as suspect or replaced with “I don’t know.”

It’s 2000s PC gaming all over again (“gotta game the benchmark!”).

quantumHazer · 6mo ago

Seems pretty false if you look at the model card and website of Opus 4.5 that is… (checks notes) their latest model.

Mistletoe · 6mo ago

How do you measure whether it works better day to day without benchmarks?

brokensegue · 6mo ago

How do you quantitatively measure day-to-day quality? The only thing I can think of is A/B tests, which take a while to evaluate.

HDThoreaun · 6mo ago

ARC-AGI is just an IQ test. I don’t see the problem with training a model to be good at IQ tests, because that’s a skill that translates well.
