Unless you have strong prior beliefs (like "computers can't be AGI") or something else that's problem-specific ("these problems can be solved by these techniques, which don't count as AGI"). So I guess my real questions are:
* How likely you think AGI is in general.
* How solvable you think the problem is, independently of what's solving it.
In the cases you've brought up, that latter probability is very high, which means they are extremely weak evidence that computers are AGI. So we agree!
In this case the latter probability seems to be quite low - attempts to solve it with computers have largely failed so far!
In real life, when people say "A is evidence of B" they mean strong evidence, or even overwhelming evidence. You just backpedalled by redefining evidence to mean anything and nothing, so you can salvage an obviously false claim.
Nobody in the real world says "rain is evidence of aliens" with the implicit assumption that it's just extremely weak evidence. The way English is used by people makes that sentence simply false, as is yours that anything previously not solved is evidence of AGI.
Edit: I think maybe the disagreement here is about the nature of evidence. I think there can be evidence that something is AGI even if it isn't, in fact, AGI. You seem to believe that if there's any evidence that something is AGI, it must be AGI, I think?
No.
Because there might be undiscovered ways to solve these problems that no one would claim are AGI.
The definition of AGI is notoriously fuzzy, but nonetheless, if there were a 10-line Python program (with no external dependencies or data) that could solve it, then few would argue that it was AGI.
So perhaps there is an algorithm that solves these puzzles 100% of the time and can be easily expressed.
So I agree that only being able to solve these problems doesn't define AGI.
1. Only humans are known to have solved problem X, and we've spent no time looking for alternative solutions.
2. Only humans are known to have solved problem X, and we've spent hundreds of thousands of hours looking for alternative solutions and failed.
Now suppose something solves the problem. I feel like in case 2 we are justified in saying there's evidence that something is a human-like AGI. In case 1 we probably aren't justified in saying that.
To me this seems true regardless of what the problem actually is! If it's hard enough that thousands of human hours can't find a simple/algorithmic solution, it's probably something like an "AGI-complete" problem?
To be clear, I think we have AGI (LLMs with tool use are generalized enough) and we are currently finding edge cases that they fail at.
I guess the underlying issue with my argument is that we really have no idea how large the search space is for finding AGI, so applying something like Bayes' theorem (which is basically my argument) tells you more about my priors than about reality.
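To make that concrete, here's a rough Python sketch of the update I'm gesturing at - the numbers are entirely made up for illustration, not real estimates of anything:

```python
# Made-up numbers, purely to illustrate the shape of the update.
def posterior_agi(prior_agi, p_solve_if_agi=0.9, p_solve_if_not=0.1):
    """P(AGI | it solved the problem), by Bayes' theorem."""
    p_solve = p_solve_if_agi * prior_agi + p_solve_if_not * (1 - prior_agi)
    return p_solve_if_agi * prior_agi / p_solve

for prior in (0.5, 0.05, 0.001):
    print(f"prior={prior:<6} posterior={posterior_agi(prior):.3f}")
# prior=0.5 -> ~0.9, prior=0.05 -> ~0.32, prior=0.001 -> ~0.009:
# same observation, same likelihoods - the conclusion is mostly the prior.
```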
That said, we know that human AGI was a result of an optimisation process (natural selection), and we have rudimentary generic optimisers these days (deep neural nets), so you could argue we've narrowed the search space a lot since the days of symbolic/tree search AI.
That seems a pretty extreme position!
What's your definition of AGI?
Testing whether an AI can play chess or solve Chollet's ARC problems, or some other set of narrow skills, doesn't prove generality. If you want to test for generality, then you either have to:
1) Have a huge and very broad test suite, covering as many diverse human-level skills as possible.
and/or,
2) Reductively understand what human intelligence is, and what combination of capabilities it provides, then test for all of those capabilities both individually and in combination.
As Chollet notes, a crucial part of any AGI test is solving novel problems that are not just templated versions (or shallow combinations) of things the wanna-be AGI has been trained on, so for both of the above tests this is key.
An AGI can add 1+1 correctly, but the ability to do so is not a test for AGI.
"Absence of evidence is evidence of absence."
Presumably you would call this a simple logical fallacy for the same reason, but a little reflection shows that in many cases such a statement is true! It depends on context - in this case, on your estimate of how well your search covered the possible search space.
Evidence is a continuous variable - things can be weak evidence, strong evidence... There's a whole spectrum. I just take issue with statements like "X is zero evidence of Y" because often you can do a lot better than that with the information at hand.
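A toy model makes the point (the numbers, and the assumption that we'd have found a solution if it sat in the searched part, are invented just to show the shape):

```python
# Toy model with invented numbers - the point is the shape, not the values.
def p_solution_exists(prior, coverage):
    """P(a solution exists | we searched `coverage` of the space and found
    nothing), assuming we'd have found it if it were in the searched part."""
    p_miss_if_exists = 1 - coverage   # P(found nothing | it exists)
    p_miss_if_absent = 1.0            # P(found nothing | it doesn't exist)
    return (p_miss_if_exists * prior /
            (p_miss_if_exists * prior + p_miss_if_absent * (1 - prior)))

for coverage in (0.0, 0.5, 0.99):
    print(coverage, round(p_solution_exists(0.5, coverage), 3))
# 0.0 -> 0.5 (no evidence), 0.5 -> 0.333 (weak), 0.99 -> 0.01 (strong):
# the same "we found nothing" spans the whole spectrum, depending on coverage.
```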
So, just because a human can't do something, or struggles to do it, doesn't mean that the task requires a huge IQ or generality - it may just require a lot of compute/memory, such as DeepBlue playing chess.
In the case in point of these ARC puzzles, they are easy for a human, so "absence of evidence" doesn't even apply. It's also worth noting that one could brute-force solve them by trying all applicable solution techniques (as indicated by the examples and challenge description) in combinatorial fashion, or just (as Chollet notes) generate a massive training set and train an LLM on it, solving them via recall rather than active inference - which again proves nothing about AGI.
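As a rough sketch of the kind of brute force I mean (the primitives here are made-up stand-ins, not ARC's actual transformation vocabulary):

```python
from itertools import product

# Hypothetical primitive grid transformations - stand-ins, not ARC's real DSL.
def identity(grid): return grid
def mirror(grid):   return [row[::-1] for row in grid]
def rotate(grid):   return [list(row) for row in zip(*grid[::-1])]

PRIMITIVES = [identity, mirror, rotate]

def solve_by_search(train_pairs, max_depth=3):
    """Try every composition of primitives up to max_depth and return the
    first one that maps every training input to its training output."""
    for depth in range(1, max_depth + 1):
        for combo in product(PRIMITIVES, repeat=depth):
            def candidate(grid, combo=combo):
                for f in combo:
                    grid = f(grid)
                return grid
            if all(candidate(x) == y for x, y in train_pairs):
                return candidate  # found by exhaustive search, not insight
    return None

# e.g. a task whose hidden rule is "mirror the grid horizontally":
pairs = [([[1, 0], [0, 0]], [[0, 1], [0, 0]])]
rule = solve_by_search(pairs)
```

If something like that cracked a puzzle, it would be the hand-picked primitive set doing the work, not intelligence.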
The point of the ARC challenge is to encourage advances in active inference (i.e. reasoning/problem solving), which is what LLMs lack. It's HOW you solve them that matters if you want to show general intelligence. Even in the realm of static inference, which is what they are built for, LLMs are really closer to DeepBlue than to something intelligent - they brute-force extract the training-set rules using gradient descent. The interesting thing is that they have any learning ability at all (in-context learning) at inference time, but it's clearly no match for a human, and they are also architecturally missing all the machinery, such as working memory and looping/iteration, needed to perform any meaningful try/fail/backtrack/try-again (while learning the whole time) active inference.
It'll be interesting to see to what extent pre-trained transformers can be combined with other components (maybe some sort of DeepBlue/AlphaGo MCTS?) to get closer to human-level problem-solving ability, but IMO it's really the wrong architecture. We need to stop using gradient descent and find a learning algorithm that can also be used at inference time.
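For what such a combination might even look like, here's a bare-bones, generic UCT sketch (my own illustration, not any published transformer+search architecture) with the policy/value part left as a stub, which is exactly the hard bit for text:

```python
import math, random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value_sum = [], 0, 0.0

def ucb(node, c=1.4):
    # Standard UCT score: average value plus an exploration bonus.
    if node.visits == 0:
        return float("inf")
    return (node.value_sum / node.visits
            + c * math.sqrt(math.log(node.parent.visits) / node.visits))

def policy_value_net(state):
    """Stub for the learned guidance - the part that's hard to train for text,
    since there's no game engine to roll arbitrary positions out of."""
    return random.random()  # placeholder value estimate

def mcts(root_state, expand, is_terminal, iterations=100):
    root = Node(root_state)
    for _ in range(iterations):
        # 1. Selection: follow the UCT-best child down to a leaf.
        node = root
        while node.children:
            node = max(node.children, key=ucb)
        # 2. Expansion: grow the leaf unless it's terminal.
        if not is_terminal(node.state):
            node.children = [Node(s, node) for s in expand(node.state)]
            if node.children:
                node = random.choice(node.children)
        # 3. Evaluation: use the (stub) value net instead of a random rollout.
        value = policy_value_net(node.state)
        # 4. Backpropagation: update visit counts and values up to the root.
        while node is not None:
            node.visits += 1
            node.value_sum += value
            node = node.parent
    return max(root.children, key=lambda n: n.visits) if root.children else root

# Tiny demo: "expand" a string by appending a character, stop at length 3.
best = mcts("", lambda s: [s + c for c in "ab"], lambda s: len(s) >= 3)
```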
But in general I agree about active inference. Clearly there is something missing there.
Doing AlphaGo-style MCTS would be interesting, but how would you approach training the policy and value nets? It's not like we can take snapshots of people's thought processes as they read text in the same way you can perform arbitrary rollouts of a game engine.