Ultimately, most benchmarks can be gamed and their real utility is thus short-lived.
But I also think it's fair to use any means to beat it.
However, many people have worked hard to optimize tools specifically for ARC over many years, and it's proven to be a particularly hard test to optimize for. This is why I find it so interesting that LLMs can do it well at all, regardless of whether tests like it are included in training.
ARC does not provide this kind of dataset, only a small public set and a private one on which they run the official benchmarks.
Building your own large private ARC set does not seem too difficult if you have enough resources.
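To give a feel for the scale of that task: an ARC item is just a few small integer grids (colors 0-9) plus a hidden transformation rule, so generating one is a few lines of Python. A minimal sketch below, with a deliberately trivial mirror rule; a real private set would need a much richer, hand-audited rule library:

    import random

    def mirror_task(n_pairs=4, seed=0):
        """One ARC-like task: small 0-9 integer grids, hidden rule = mirror each row."""
        rng = random.Random(seed)
        pairs = []
        for _ in range(n_pairs):
            h, w = rng.randint(2, 6), rng.randint(2, 6)
            grid = [[rng.randint(0, 9) for _ in range(w)] for _ in range(h)]
            pairs.append({"input": grid, "output": [row[::-1] for row in grid]})
        # Hold out the last pair as the test example, like the real ARC JSON layout.
        return {"train": pairs[:-1], "test": pairs[-1:]}

    print(mirror_task(seed=42)["train"][0])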
There are in fact teams creating puzzles to RL against as training environments, since it's beneficial to RL training and in particular compute-efficient if you schedule the environment difficulty throughout training. There was a great recent paper on this. Creating environment data that generalizes outside the environment is a challenging engineering task and super valuable, whether it looks like ARC-AGI or not.
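For concreteness, a minimal sketch of what that scheduling can look like; `make_puzzle` and `update_policy` are hypothetical stand-ins for whatever environment generator and RL learner you actually have:

    def difficulty_at(step, total_steps, d_min=1, d_max=10):
        """Linear curriculum: ramp difficulty from d_min to d_max over training."""
        frac = step / max(1, total_steps - 1)
        return round(d_min + frac * (d_max - d_min))

    def train(total_steps, make_puzzle, update_policy):
        for step in range(total_steps):
            d = difficulty_at(step, total_steps)
            episode = make_puzzle(difficulty=d)  # sample an env at the current difficulty
            update_policy(episode)               # one RL update on that episode

    # Stand-in generator and learner, just to show the schedule in action.
    train(5, make_puzzle=lambda difficulty: f"puzzle(d={difficulty})",
          update_policy=print)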
Also, ARC-AGI is general enough that if you create similar data, you're just creating generic visual puzzle data. Should all visual puzzle data be off limits?
They can target the benchmark directly, not just a replica. If Google or OpenAI are bad actors, they already have benchmark data from previous runs.
Not only do you have a financial self-interest in doing it (being #1 helps with capital raising), but you worry that your competitors are doing it, so you may as well cheat to level the field. Easy to do and easy to justify.
Maybe a way to make the benchmark more robust to this adversarial environment is to introduce noise and random red herrings into the question, and run the test 20 times and average the correctness. So even if you assume they're training on it, you have some semblance of a test still happening. You'd probably end up with a better benchmark anyway which better reflects real-world usage, where there's a lot of junk in the context window.
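A minimal sketch of that harness, assuming `model` is any prompt-to-answer callable; the distractor strings here are made-up placeholders, and a real harness would also shuffle option order, rename entities, perturb numbers, and so on:

    import random
    import statistics

    DISTRACTORS = [
        "Note: the following may contain irrelevant details.",
        "Reminder: 7 is a prime number.",
        "The weather that day was unremarkable.",
    ]

    def perturb(prompt, rng):
        """Prepend a random red herring so memorized answers stop pattern-matching."""
        return f"{rng.choice(DISTRACTORS)}\n{prompt}"

    def robust_score(model, prompt, expected, runs=20, seed=0):
        """Average correctness over `runs` independently perturbed attempts."""
        rng = random.Random(seed)
        hits = [model(perturb(prompt, rng)) == expected for _ in range(runs)]
        return statistics.mean(hits)  # fraction correct in [0, 1]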
- semi-private: used to test proprietary models, and could be leaked
- private: used to test downloadable open-source models
The ARC-AGI prize itself is for open-source models.
As with humans [1], generalized reasoning ability lets you skip directly storing that solution, and many, many others, completely! You can just synthesize a solution when a problem is presented.
The trick is not to read more into the score than what it actually measures.
We have a global RL pipeline on our hands.
If there is something new an LLM/AI model can't solve today, plenty of humans can't solve it either.
But tomorrow every LLM/AI model will be able to solve it, while plenty of humans still can't.
Even if AGI is just the sum of companies adding more and more training data, as long as this learning pipeline becomes faster and easier to train on new scenarios, it will start to squeeze humans out of the loop.