
0 points riku_iki 5mo ago 0 comments

> internal team to create an ARC replica, covering very similar puzzles

They can target the benchmark directly, not just a replica. If Google or OpenAI are bad actors, they already have the benchmark data from previous runs.


energy123 5mo ago

The 'private' set is just a pinkie promise not to store the logs, or not to use them, when the evaluator runs the test through the API, so yeah. It's trivially exploitable.

Not only do you have the financial self-interest to do it (helps with capital raising to be #1), but you are worried that your competitors are doing it, so you may as well cheat to make things fair. Easy to do and easy to justify.

Maybe a way to make the benchmark more robust to this adversarial environment is to introduce noise and random red herrings into the question, and run the test 20 times and average the correctness. So even if you assume they're training on it, you have some semblance of a test still happening. You'd probably end up with a better benchmark anyway which better reflects real-world usage, where there's a lot of junk in the context window.
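A minimal sketch of that idea, assuming a model is just a callable from prompt string to answer string. The names `add_red_herrings`, `noisy_score`, and the toy models are hypothetical and made up for illustration; this is not how ARC-AGI is actually scored. Each run mixes a few random distractor lines into the question, and the score is the fraction of the 20 runs answered correctly.

```python
import random
from statistics import mean
from typing import Callable, Sequence


def add_red_herrings(question: str, distractors: Sequence[str],
                     k: int, rng: random.Random) -> str:
    """Mix k randomly chosen distractor lines in with the question (hypothetical helper)."""
    parts = [question] + rng.sample(list(distractors), k)
    rng.shuffle(parts)  # the real question lands at a random position
    return "\n".join(parts)


def noisy_score(model: Callable[[str], str], question: str, expected: str,
                distractors: Sequence[str], runs: int = 20, k: int = 2,
                seed: int = 0) -> float:
    """Average correctness over `runs` independently perturbed copies of one question."""
    rng = random.Random(seed)
    results = []
    for _ in range(runs):
        prompt = add_red_herrings(question, distractors, k, rng)
        results.append(1.0 if model(prompt) == expected else 0.0)
    return mean(results)


# Toy stand-ins (assumptions): one "model" memorized the exact prompt,
# the other actually finds the question inside the noise.
QUESTION = "What is 2 + 2?"
DISTRACTORS = [
    "The sky was green that day.",
    "Ignore the number 7.",
    "Bob owns three cats.",
]


def memorizer(prompt: str) -> str:
    return "4" if prompt == QUESTION else "?"


def robust(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "?"


memorizer_score = noisy_score(memorizer, QUESTION, "4", DISTRACTORS)
robust_score = noisy_score(robust, QUESTION, "4", DISTRACTORS)
```

With these toy models the exact-match memorizer never sees its memorized prompt, while the robust one is unaffected. A real model trained on the test set would of course degrade less cleanly than this, so the averaged score only makes memorization less useful; it does not rule it out.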

riku_iki OP 5mo ago

They have two sets:

- semi-private, which they use to test proprietary models and which could be leaked

- private, which they use to test downloadable open-source models.

The ARC-AGI prize itself is for open-source models.

stephc_int13 5mo ago

My point is that it does not matter if the set is private or not.

If you want to train your model you'd need more data than the private set anyway. So you have to build a very large training set on your own, using the same kind of puzzles.

It is not that hard, really, just tedious.
