> "I'm not sure Berkeley Function-Calling represents tasks that are easy for average humans. Maybe programmers could perform well on it."
Functions in this context are not programming function calls. "Function calling" is a now-deprecated LLM API name for "parse input into this JSON template." No programming experience needed. It's entity extraction by another name, except entity extraction would be harder: here, you're told up front exactly which entities to identify. :)
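For concreteness, here's a rough sketch of what an OpenAI-style "function" definition and the model's reply look like. The `get_weather` schema and the sample model output are illustrative, hand-written for this comment, not taken from the benchmark:

```python
import json

# An OpenAI-style "function" is just a JSON Schema the model is asked
# to fill in. No code is ever executed by the model; this is entity
# extraction with the target entities spelled out up front.
get_weather = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

# For a prompt like "What's the weather in Paris, in celsius?" the model
# returns a JSON string of arguments matching that schema (sample output,
# hand-written here for illustration):
model_output = '{"city": "Paris", "unit": "celsius"}'
args = json.loads(model_output)
print(args["city"], args["unit"])
```

So "calling" the function amounts to emitting a JSON object whose keys were handed to the model in advance.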
> "Moravec's paradox isn't a benchmark per se."
Yup! It's a paradox :)
> "Of course it is much easier to design a test specifically to thwart LLMs now that we have them"
Yes.
Though, I'm concerned a simple yes might be insufficient for illumination here.
It's a tautology (it's easier to design a test that X fails when you have access to X), and it's unlikely you meant to share only a tautology.
One possible unstated-but-intended point is "it was hard to come up with ARC before LLMs existed" --- but LLMs existed in 2019 :)
Even if they hadn't, a hacky way to come up with a test that's hard for the top AIs of the time (the BERT era) would be to use a single type of visual puzzle.
If, for conversation's sake, we ignore that it is exactly one type of visual puzzle, and that it wasn't designed to be easy for humans, then we can engage with "it's the only benchmark that's easy for humans but hard for LLMs" --- and that was demonstrated to be untrue as well.
I don't think I have much to contribute past that. Once we're at "it is a singular example of a benchmark that's easy for humans but nigh-impossible for LLMs, at least in 2019, and this required singular insight," there's just too much that's not even wrong, in the Pauli sense, and it's in a different universe from the original claims:
- "Congratulations to Francois Chollet on making the most interesting and challenging LLM benchmark so far."
- "A lot of people have criticized ARC as not being relevant or indicative of true reasoning...The fact that [o-series models] show progress on ARC proves that what it measures really is relevant and important for reasoning."
- "...nobody could quantify exactly the ways the models were deficient..."
- "What we need right now are "easy" benchmarks that these models nevertheless fail."