Ultimately, most benchmarks can be gamed and their real utility is thus short-lived.
But I also think it's fair to use any means to beat it.
However, many people have worked hard to optimize tools specifically for ARC over many years, and it's proven to be a particularly hard test to optimize for. This is why I find it so interesting that LLMs can do it well at all, regardless of whether tests like it are included in training.
ARC does not provide this kind of dataset, only a small public set and a private one on which they run the official benchmarks.
Building your own large private ARC set does not seem too difficult if you have enough resources.
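To give a feel for the scale of that task: an ARC item is just a few small integer grids (colors 0-9) plus a hidden transformation rule, so generating one is a few lines of Python. A minimal sketch below, with a deliberately trivial mirror rule; a real private set would need a much richer, hand-audited rule library:

    import random

    def mirror_task(n_pairs=4, seed=0):
        """One ARC-like task: small 0-9 integer grids, hidden rule = mirror each row."""
        rng = random.Random(seed)
        pairs = []
        for _ in range(n_pairs):
            h, w = rng.randint(2, 6), rng.randint(2, 6)
            grid = [[rng.randint(0, 9) for _ in range(w)] for _ in range(h)]
            pairs.append({"input": grid, "output": [row[::-1] for row in grid]})
        # Hold out the last pair as the test example, like the real ARC JSON layout.
        return {"train": pairs[:-1], "test": pairs[-1:]}

    print(mirror_task(seed=42)["train"][0])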
There are in fact teams creating puzzles to RL against as training environments, since it's beneficial to RL training and in particular compute-efficient if you schedule the environment difficulty throughout training. There was a great recent paper on this. Creating environment data that generalizes outside the environment is a challenging engineering task and super valuable, whether it looks like ARC-AGI or not.
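For concreteness, a minimal sketch of what that scheduling can look like; `make_puzzle` and `update_policy` are hypothetical stand-ins for whatever environment generator and RL learner you actually have:

    def difficulty_at(step, total_steps, d_min=1, d_max=10):
        """Linear curriculum: ramp difficulty from d_min to d_max over training."""
        frac = step / max(1, total_steps - 1)
        return round(d_min + frac * (d_max - d_min))

    def train(total_steps, make_puzzle, update_policy):
        for step in range(total_steps):
            d = difficulty_at(step, total_steps)
            episode = make_puzzle(difficulty=d)  # sample an env at the current difficulty
            update_policy(episode)               # one RL update on that episode

    # Stand-in generator and learner, just to show the schedule in action.
    train(5, make_puzzle=lambda difficulty: f"puzzle(d={difficulty})",
          update_policy=print)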
Also, ARC-AGI is general enough that if you create similar data, you're just creating generic visual puzzle data. Should all visual puzzle data be off limits?
They can target the benchmark directly, not just a replica. If Google or OpenAI are bad actors, they already have benchmark data from previous runs.
Not only do you have a financial self-interest in doing it (being #1 helps with capital raising), but you worry that your competitors are doing it, so you may as well cheat to level the field. Easy to do and easy to justify.
Maybe a way to make the benchmark more robust to this adversarial environment is to introduce noise and random red herrings into the question, and run the test 20 times and average the correctness. So even if you assume they're training on it, you have some semblance of a test still happening. You'd probably end up with a better benchmark anyway which better reflects real-world usage, where there's a lot of junk in the context window.
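A minimal sketch of that harness, assuming `model` is any prompt-to-answer callable; the distractor strings here are made-up placeholders, and a real harness would also shuffle option order, rename entities, perturb numbers, and so on:

    import random
    import statistics

    DISTRACTORS = [
        "Note: the following may contain irrelevant details.",
        "Reminder: 7 is a prime number.",
        "The weather that day was unremarkable.",
    ]

    def perturb(prompt, rng):
        """Prepend a random red herring so memorized answers stop pattern-matching."""
        return f"{rng.choice(DISTRACTORS)}\n{prompt}"

    def robust_score(model, prompt, expected, runs=20, seed=0):
        """Average correctness over `runs` independently perturbed attempts."""
        rng = random.Random(seed)
        hits = [model(perturb(prompt, rng)) == expected for _ in range(runs)]
        return statistics.mean(hits)  # fraction correct in [0, 1]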
- semi-private: used to test proprietary models, and could be leaked
- private: used to test downloadable open-source models
The ARC-AGI prize itself is for open-source models.
As with humans [1], generalized reasoning ability lets you skip directly storing that solution, and many, many others, completely! You can just synthesize a solution when a problem is presented.
The trick is not to read more into the score than what it actually measures.
We have a global RL pipeline on our hands.
If there is something new an LLM/AI model can't solve today, plenty of humans can't solve it either.
But tomorrow every LLM/AI model will be able to solve it, while plenty of humans still can't.
Even if AGI is just the sum of companies adding more and more training data, as long as this learning pipeline becomes faster and easier to train on new scenarios, it will start to squeeze humans out of the loop.