> The benchmark is designed to test for AGI and intelligence, specifically the ability to solve novel problems.
It's a bunch of visual puzzles. It isn't a test for AGI because it's not general. If models (or any other system, for that matter) could solve it, we'd be saying "this is a stupid puzzle, it has no practical significance". It's a test of some specific kind of intelligence. On top of that, the vast majority of blind people would fail it - are they not generally intelligent?
The name is marketing hype.
The benchmark could be called "random puzzles LLMs are not good at because they haven't been optimized for them, because it's not a valuable benchmark". Sure, it wasn't designed for LLMs, but throwing LLMs at it and saying "see?" is dumb. We could throw in benchmarks for tennis playing, chess playing, video game playing, car driving and a bajillion other things while we're at it.