0 points · wfme · 1y ago

My experience reflects this too. My hunch is that GPT-4o was trained to game the benchmarks rather than to output higher-quality content.

In theory, the benchmarks should be a pretty close proxy for quality, but that doesn't match my experience at all.

margorczynski · 1y ago

A problem with a lot of benchmarks is that they are out in the open, so the model basically trains to game them instead of actually acquiring the knowledge that would let it solve the problems. Private benchmarks that are not in these models' training sets should probably give better estimates of their general performance.
