
jo909 · 1y ago

> Couldn't LLM provider just fine-tune their model for these tasks specifically - since they are static - to get ad value?

They could, but they would easily be found out once they lose in real-world usage or on new, improved benchmarks.

If you were in charge of a large and well funded model, would you rather pay people to find and "cheat" on LLM benchmarks by training on them, or would you pay people to identify benchmarks and make reasonably sure they specifically get excluded from training data?

I would exclude them as well as possible so I get feedback on how "real" any model improvement is. I need to develop real-world improvements in the end, and any short-term gain in usage from cheating on benchmarks seems very foolish.


6 comments · 5 top-level

youoy · 1y ago

> If you were in charge of a large and well funded model, would you rather pay people to find and "cheat" on LLM benchmarks by training on them, or would you pay people to identify benchmarks and make reasonably sure they specifically get excluded from training data?

You should already know by now that economic incentives are not always aligned with science/knowledge...

This is the true alignment problem, not the AI alignment one hahaha

concordDance · 1y ago

The AI alignment problem and the people alignment problem are actually the same problem! :D

One is just a bit harder due to the less familiar mind "design".

gloosx · 1y ago

It sounds very nice, but at the same time very naive, sorry. Funding is not a gift, and they must make money. The more funding they get - the more pressure there is to make money.

When you're in charge of a billion-dollar valuation company which is expected to remain unprofitable until 2029, it's hard to find a topic more crucial and intriguing than growth and making more money.

And yes, it is a recurring theme for vendors to tune their products specifically for industry-standard benchmarks. I can't find any specific reason for them not to pay people to train their model to score 90% on these 113 Python tasks, as it directly drives profits up, whereas not doing it brings absolutely nothing to the table. Surely they have their own internal benchmarks which they can exclude from training data.

carschno · 1y ago

They cannot be found out as long as there is no better evaluation. Sure, they would be if they produced obvious nonsense, but the point of a systematic evaluation is exactly to replace subjective impressions based on individual examples as the notion of quality.

Also, you are right that excluding test data from the training data improves your model. However, given the insane amounts of training data, this requires significant effort. If that additionally leads to your model performing worse on existing leaderboards, I doubt that (commercial) organizations would pay for such an effort.

And again, as long as there is no better evaluation method, you still won't know how much it really helps.
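To make "excluding test data from the training data" concrete, here is a minimal sketch of one common approach: drop any training document that shares a long word n-gram with a benchmark item. The function names (`ngrams`, `build_benchmark_index`, `decontaminate`), the whitespace tokenization, and the default n-gram length are all assumptions for illustration, not a description of what any provider actually does.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int = 8) -> Set[NGram]:
    """Return the set of lowercase word n-grams in `text`.

    Whitespace tokenization is a simplification chosen for this sketch.
    Texts shorter than n tokens yield an empty set.
    """
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(items: Iterable[str], n: int = 8) -> Set[NGram]:
    """Collect every n-gram that appears in any benchmark item."""
    index: Set[NGram] = set()
    for item in items:
        index |= ngrams(item, n)
    return index


def decontaminate(docs: Iterable[str], index: Set[NGram], n: int = 8) -> List[str]:
    """Keep only documents that share no n-gram with the benchmark index."""
    return [doc for doc in docs if ngrams(doc, n).isdisjoint(index)]
```

Even this toy version hints at the cost: every training document has to be scanned against every known benchmark, and paraphrased or translated copies of a task slip straight through exact n-gram matching.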

KeplerBoy · 1y ago

This market is all about hype and mindshare. Proper testing is hard and not performed by individuals, so there are no incentives not to train a bit on the test set.

gershy · 1y ago

And if there is a board that will fire you if expected profits do not increase, do you still maintain this stance?
