As for the other post, the degraded performances are still highly non-trivial. Some aren't actually poor, just worse.
Even the authors admit humans would see degraded performance on counterfactuals unless given "enough time to reason and revise", something they don't try to do with GPT-4.
Think about it. Do you genuinely believe you would score as accurately on a multiplication test taken in base 8?
No, but I believe this is a different question. I think the more relevant question is whether a human can (even with the caveat of needing more time to reason about it). The larger question for an LLM is whether it can answer it at all and interpret why, without additional training data.
The paper seems to suggest that an LLM's ability to transfer is related to proximity to the default case. E.g., if the default is base 10, it is better at base 9 than at base 2. I would interpret that as indicating simple pattern recognition more than deductive reasoning. The implication being that real transference is more dependent on the latter.
Arithmetic in a different base is one of the counterfactual examples in the paper. That's why I mentioned it, and yes it can answer them with worse performance.
You can juice arithmetic performance as is with algorithmic instructions: https://arxiv.org/abs/2211.09066. I see no reason the same wouldn't work for other bases.
Even if you gave a human substantial time for it (say a week of study), I believe he/she almost certainly wouldn't reach the same accuracy unless he/she had access to specific instructions for working in base 8 that he/she could call upon when taking the test.
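For concreteness, here is roughly what "specific instructions for working in base 8" could look like when written out as code. This is only a sketch of the ordinary schoolbook long-multiplication procedure with carries taken modulo the base; it is not the prompting method from the linked paper, and the function name `multiply_in_base` is made up for illustration. It assumes both inputs are valid digit strings for a base between 2 and 10.

```python
def multiply_in_base(a: str, b: str, base: int = 8) -> str:
    """Schoolbook long multiplication on digit strings, carrying in `base`.

    Hypothetical helper for illustration; assumes 2 <= base <= 10 and that
    every character of `a` and `b` is a valid digit in that base.
    """
    # Work with least-significant digit first, as you would on paper.
    x = [int(ch) for ch in reversed(a)]
    y = [int(ch) for ch in reversed(b)]
    result = [0] * (len(x) + len(y))

    for i, dx in enumerate(x):
        carry = 0
        for j, dy in enumerate(y):
            # Single-digit product plus what is already in this column.
            total = result[i + j] + dx * dy + carry
            # The only base-specific step: carry at `base` instead of 10.
            carry, result[i + j] = divmod(total, base)
        result[i + len(y)] += carry

    # Drop leading zeros, keeping at least one digit.
    while len(result) > 1 and result[-1] == 0:
        result.pop()
    return "".join(str(d) for d in reversed(result))
```

A quick way to check it is against Python's built-in base conversion, e.g. `int(multiply_in_base("27", "15"), 8) == int("27", 8) * int("15", 8)`. Note that the procedure is identical for every base; only the point at which you carry changes.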
I know, that's why I referenced the proximity of bases seemingly being important to the LLM. I think this is what differentiates it.
>and yes it can answer them with worse performance.
Its accuracy is dependent on proximity to its training set (going back to my original point). I think that points to a different mechanism than humans and that's what my last post was focusing on.
I think we agree that humans would do less well in most other bases than base-10. But that side-steps the point I was making. Will humans do worse in base-3 than base-9? I doubt it, but according to the article, it's reasonable to assume the LLM would be progressively worse. That, IMO, is an indicator that something different is going on. I.e., humans are deriving principles to work from rather than just pattern recognition. Those principles can be modified on-the-fly to adjust to novel circumstances without needing additional training data. Humans are using reasoning in addition to pattern recognition.
This is probably a clunky example, but I'll try. Suppose an autonomous vehicle is trained to recognize that when a ball rolls into the street, it needs to slow down or stop because a child may not be far behind. A human can infer that seeing a kite blow into the street may signal the same response, even though they've never witnessed a kite blow into the street. The question is: can the autonomous vehicle infer the same? (This shouldn't be conflated with the general case of "see object obstructing the street and slow down/stop." The case I'm drawing here specifically adjusts the risk by the nature of the object being a child's toy. So, can the AV not only recognize the object as a kite but also adjust the risk accordingly?) I think one of the possible pitfalls is that we solve a simpler problem like image/pattern recognition and conflate it with solving a more difficult problem set.
Circling back to the original point, one guess is that it's not understanding context as much as merely matching patterns really, really well. That can be incredibly useful but it may be something different than what's going on in our heads and maybe we should be careful not to conflate the two. Or, it's possible that all we're doing is also matching patterns in context, and eventually LLMs will get there too.
I genuinely don't see how that would be a reasonable assumption.
>Will humans do worse in base-3 than base-9?
Why not? If you haven't learnt base 3 but you have learnt base 9, you'll do worse on it.
>That, IMO, is an indicator that something different is going on.
Whether something different is going on is about as relevant as the question of whether submarines swim or planes fly or cars run.
>I.e., humans are deriving principles to work from rather than just pattern recognition.
Not really. Nearly all your brain does with sense data is predict what it should be and adjust your perception to fit. You can mold these predictions implicitly with your experiences but you're not deriving anything from first principles.
>This is probably a clunky example, but I'll try. Suppose an autonomous vehicle is trained to recognize that when a ball rolls into the street, it needs to slow down or stop because a child may not be far behind. A human can infer that seeing a kite blow into the street may signal the same response, even though they've never witnessed a kite blow into the street. The question is: can the autonomous vehicle infer the same? (This shouldn't be conflated with the general case of "see object obstructing the street and slow down/stop." The case I'm drawing here specifically adjusts the risk by the nature of the object being a child's toy. So, can the AV not only recognize the object as a kite but also adjust the risk accordingly?) I think one of the possible pitfalls is that we solve a simpler problem like image/pattern recognition and conflate it with solving a more difficult problem set.
Causal reasoning? All evidence points to LLMs being more than capable of that: https://arxiv.org/abs/2305.00050