If you look at the ARC tasks failed by o3, they're really not well suited to humans. They lack the living context humans thrive on, and have relatively simple, analytical outcomes that are readily processed by simple structures. We're unlikely to see AI as "smart" until it can be asked to accomplish useful units of productive professional work at a "seasoned apprentice" level. Right now they're consuming ungodly amounts of power just to pass some irritating, sterile SAT questions. Train a human for a few hours a day over a couple weeks and they'll ace this no problem.
It works the same with humans: the more time they spend on a puzzle, the more likely they are to solve it.
While beyond current models, that would be the final test of AGI capability.
Though to be clear, this wasn't a one-shot thing - iirc it was a few months of back-and-forth chats, with plenty of wrong turns too.
If you disagree with me, state why instead of just downvoting.