With those structured numbers will the LLMs be 100% accurate on new prompts or will they just be better than chance (even significantly better than chance)?
It's one thing for it to learn the structure and then form probabilities based on the data, but does that mean it's actually learning the underlying algorithm for addition, for example, or is it just getting better probabilities because the options have been narrowed? If it can indeed learn underlying algorithms like this, that's super interesting. This also matters because if it _can't_ learn them, you can never trust an answer unless you check it, but that's sort of a side point.
As it happens, conditioning even people to actually apply rules properly tends to take a lot of repetition. How many individual examples of working through basic math problems step by step do you think there have been in their training data?
Prompt them the way you'd prompt a sloppy child, to work step by step and explain how they apply the rules they have learned, and they will tend to do better. While tokenization might not help, I don't think there's an inherent problem there beyond feeding them enough training data. Whether that's worthwhile vs. having them resort to tools is another matter.
However, I'm not sure why being bad at math (if that's even true, based on other comments here) is a legitimate criticism. We've already got a lot of machinery that is very good at math. So, use the tools that make sense for the domain. Or (what I'm really waiting for) better yet, find novel ways to stitch the tools that we do have together. Can LLMs turn a problem statement into a MATLAB program? If so, then it doesn't really matter how bad at math they are.
https://gwern.net/scaling-hypothesis And specifically the part where it discusses addition under the heading Blessings of Scale.
And GPT4 indeed does just fine with things like 32 * 64 in multiple different ways that humans can also easily memorise the rules for. When I asked it to calculate it step by step, and use easily memorisable shortcuts, it first suggested the "doubling and halving" method (though it stupidly started by doubling 32 and halving 64...), and got it right.
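For anyone who hasn't seen the doubling and halving trick, here's a minimal sketch of it. The function name is mine, and this is the general version that also handles odd numbers, not necessarily the exact steps GPT4 wrote out:

```python
def double_and_halve(a: int, b: int) -> int:
    """Multiply a * b by repeatedly halving a and doubling b.

    When a is odd, the current b is added to the total before halving,
    so the result stays exact for any non-negative a.
    """
    if a < 0:
        raise ValueError("a must be non-negative")
    total = 0
    while a > 0:
        if a % 2 == 1:
            total += b
        a //= 2   # halve one factor
        b *= 2    # double the other
    return total


print(double_and_halve(32, 64))  # 2048
```

Halving 32 and doubling 64 goes 16 x 128, 8 x 256, 4 x 512, 2 x 1024, 1 x 2048, which is why halving the smaller power of two is the natural direction.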
I then told it I know the powers of two up to 2^24 by heart, and asked if that changed things.
It then reasonably pointed out that this means I know 2^5 = 32 and 2^6 = 64, and that 2^5 * 2^6 = 2^(5+6) = 2^11 = 2048, and got the rules right (that was exactly what I intended when I pointed out I remember the powers of two).
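The rule in play is the product rule for exponents, 2^m * 2^n = 2^(m+n). A small sketch of that shortcut, with helper names I made up for illustration:

```python
def log2_exact(n: int) -> int:
    """Return k such that 2**k == n, or raise if n is not a power of two."""
    if n <= 0 or n & (n - 1) != 0:
        raise ValueError(f"{n} is not a power of two")
    return n.bit_length() - 1


def multiply_powers_of_two(a: int, b: int) -> int:
    # 2**m * 2**n == 2**(m + n): add the exponents, then recall the result.
    return 2 ** (log2_exact(a) + log2_exact(b))


print(multiply_powers_of_two(32, 64))  # 2048
```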
So it's not all that awful at these things. It does badly when you effectively try to get it to do maths by blind recall and without nudging it to work step by step, sure.
Where it then falls down tends to be when you ask it to do calculations which involve repetitively applying the same rules many times over, where it will tend to start out well, but occasionally make stupid little mistakes.
If anything the type of mistakes it makes are scarily close to the same kind of lapses in focus humans get when doing the same, where we just get sloppy and fail to add two numbers < 10 correctly for no good reason in the middle of doing it correctly many times, and fail to go back and verify each step.
Where some see LLMs struggling with math, I see LLMs trying to do math in a way that is disturbingly close to how a human school child would, and making the same types of mistakes.
You can ask GPT-4 arbitrary arithmetic it could never have seen in its training set. Even when it's not completely correct, it's extremely close. It is clearly computing algorithms even if those algorithms are not quite right.
Since the thing is a computer, why can’t it answer queries by writing and executing a program? Does a transformer-based AI have to be 100% transformer?
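To make the idea concrete, here's a rough sketch of the tool side. It assumes the model has been prompted to emit a plain arithmetic expression (the `model_output` string below is a made-up stand-in, not real model output), and it only allows a handful of integer operators instead of calling `eval()` on whatever the model produces:

```python
import ast
import operator

# Only these binary operators are accepted; anything else is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.FloorDiv: operator.floordiv,
}


def evaluate_arithmetic(expression: str) -> int:
    """Evaluate an integer arithmetic expression without calling eval()."""

    def walk(node: ast.AST) -> int:
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")

    return walk(ast.parse(expression, mode="eval").body)


# Hypothetical model output for "what is 123456789 times 987654321?"
model_output = "123456789 * 987654321"
print(evaluate_arithmetic(model_output))
```

The transformer only has to get the translation into an expression right; the exact arithmetic is done by ordinary code.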
A better answer: every time the model needs to predict a digit of a sum, it needs to solve the entire addition to know the carries, so a bigger sum requires a bigger model.
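A quick illustration of the carry problem, using plain schoolbook addition (the function name is just for this sketch). The leading digit of the answer can flip depending on the very last digit of the inputs, so a model emitting digits left to right has to have resolved the whole carry chain before it writes the first one:

```python
def add_digits(a: str, b: str) -> str:
    """Schoolbook addition on decimal strings, right to left with a carry."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    carry = 0
    digits = []
    for da, db in zip(reversed(a), reversed(b)):
        carry, digit = divmod(int(da) + int(db) + carry, 10)
        digits.append(str(digit))
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))


print(add_digits("4999", "5000"))  # 9999
print(add_digits("4999", "5001"))  # 10000
```

Changing only the last input digit changes every digit of the output, including its length.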
This paper uses GPT-2 transformer scale, on sinusoidal data:
> We trained a decoder-only Transformer [7] model of GPT-2 scale implemented in the Jax based machine learning framework, Pax with 12 layers, 8 attention heads, and a 256-dimensional embedding space (9.5M parameters) as our base configuration [4].
> Building on previous work, we investigate this question in a controlled setting, where we study transformer models trained on sequences of (x,f(x)) pairs rather than natural language.
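For readers unfamiliar with this setup, here's a rough sketch of what one such training sequence might look like. The parameter ranges, the sampling of x, and the function name are all my assumptions; the quoted text doesn't specify how the paper generates its data:

```python
import math
import random


def build_in_context_example(n_context: int, rng: random.Random):
    """Return ([x1, f(x1), ..., xk, f(xk), x_query], f(x_query)) for one random sinusoid."""
    amplitude = rng.uniform(0.5, 2.0)   # assumed range
    frequency = rng.uniform(0.5, 3.0)   # assumed range
    phase = rng.uniform(0.0, 2.0 * math.pi)

    def f(x: float) -> float:
        return amplitude * math.sin(frequency * x + phase)

    xs = [rng.uniform(-5.0, 5.0) for _ in range(n_context + 1)]
    sequence = []
    for x in xs[:-1]:
        sequence += [x, f(x)]       # context pairs, flattened
    sequence.append(xs[-1])         # query point; the model must predict f of it
    return sequence, f(xs[-1])


sequence, target = build_in_context_example(4, random.Random(0))
print(len(sequence))  # 9
```

The model never sees text, only numbers like these, which is worth keeping in mind when reading claims about what the result implies for language models.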
Nowhere near definitive or conclusive.
Not sure why this is news outside of the Twitter-techno-pseudo-academic-influencer bubble.
I read an interesting paper recently that had a great take on this: If you add enough data, nothing is outside training data. Thus solving the generalization problem.
Wasn't the main point of that paper, but it made me go "Huh yeah … I guess … technically correct?". It raises an interesting thought that yes, if you just train your neural network on everything, then nothing falls outside its domain. Problem solved … now if only compute was cheap.
Having said that, the real question is what percentage of the learned representations do generalize. For a perfect model, it would learn only representations that generalize and none that overfit. But, that's unreasonable to expect for a machine *and* even for a human.
Maybe we just don't know. We are staring at a black box and doing some statistical tests, but actually don't know whether the current AI architecture is capable enough to get to some kind of human intelligence equivalent.
They haven't read the other papers either. It's really striking to me to watch people retweet this and it get written up in pseudo-media like Business Insider when other meta-learning papers on the distributional hypothesis of inducing meta-learning & generalization, which are at least as relevant, can't even make a peep on specialized research subreddits - like, "Pretraining task diversity and the emergence of non-Bayesian in-context learning for regression", Raventós et al 2023 https://arxiv.org/abs/2306.15063 (or https://arxiv.org/abs/2310.08391 ) both explains & obsoletes OP, and it was published months before! OP is a highly limited result which doesn't actually show anything that you wouldn't expect on ordinary Bayesian meta-reinforcement-learning grounds, but there's so much appetite for someone claiming that this time, for real, DL will 'hit the wall' that any random paper appears to be definitive to critics.
The paper is making the rounds despite being a weak result because it confirms what people want, for non-technical reasons, to be true. You see this kind of thing all the time in other fields: for decades, the media has elevated p-hacked psychology studies on three undergrads into the canon of pop psychology because these studies provide a fig leaf of objective backing for pre-determined conclusions.
But every single day I am using OpenAI GPT4 to handle novel tasks. I am working on a traditional SaaS vertical, except with a pure chatbot. The model works, is able to understand which function to call, to extract which parameters, and to know when the inputs will not work. Sure, if you ask it to do some extraneous task, it fails.
Google/DeepMind need to start showing up with some working results.
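Roughly, the application side of that pattern looks like the sketch below. Everything here is hypothetical (the `book_appointment` function, the registry, and the JSON shape with `name` and `arguments` keys are invented for illustration and are not any vendor's actual API): the model picks a function and fills in parameters, and plain code validates the call before running it.

```python
import json


def book_appointment(date: str, customer_id: int) -> str:
    return f"booked {customer_id} on {date}"


# Hypothetical registry: function name -> (callable, required parameter types)
REGISTRY = {
    "book_appointment": (book_appointment, {"date": str, "customer_id": int}),
}


def dispatch(model_output: str) -> str:
    """Validate a JSON function call and run it, or say why it was refused."""
    try:
        call = json.loads(model_output)
        func, schema = REGISTRY[call["name"]]
        args = call["arguments"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return "refused: unknown or malformed call"
    if not isinstance(args, dict) or set(args) != set(schema):
        return "refused: bad parameters"
    if any(not isinstance(args[name], kind) for name, kind in schema.items()):
        return "refused: bad parameters"
    return func(**args)


print(dispatch('{"name": "book_appointment", '
               '"arguments": {"date": "2024-01-05", "customer_id": 7}}'))
```

An extraneous request ends up as an unknown function name or mismatched arguments, which the dispatcher refuses rather than guessing.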
Where. are. the. models. google.
That said, even GPT4 certainly has pretty significant limitations on what it manages to reason about. But without comparing their capabilities in other aspects, arguably so do most humans. We tend to force our way past those limitations by learning incrementally by doing over and over. Current models don't get that luxury without complicated fine-tuning steps, so if anything what should surprise us is how well they do with the limitation of only context to act as short-term memory.
Another important thing to keep in mind is one paper (wish I could remember which one it was) that showed even larger-scale LLMs have trouble understanding that A=B is the same as B=A if they have not seen A or B before.
1. The test mechanism is to use prediction of sinusoidal series. While it's certainly possible to train transformers on mathematical functions, it's not clear why findings from a model trained on sinusoidal functions would generalize into the domain of written human language (which is ironic, given the paper's topic).
2. Even if it were true that these models don't generalize beyond their training, large LLMs' training corpus is basically all of written human knowledge. So then the goalpost has been moved to "well, they won't push the frontier of human knowledge forward," which seems to be a much diminished claim, since the vast majority of humans are also not pushing the frontier of human knowledge forward and instead use existing human knowledge to accomplish their daily goals.
LLMs are trained on a tiny subset of the written human knowledge which we've proven to probably not be garbage and which is nicely formatted in simple text formats and which was published without too many paywalls on the web and on and on and on. It's a lot, and it definitely includes enough facts that no one person knows all the things the LLM "knows", but the average child knows plenty of things which never made it into that sort of a corpus. Yes, it's probably true that the vast majority of humans are also not pushing the frontier of human knowledge forward, but the vast majority of humans are working with a slightly different (partially overlapping) set of information than what the LLMs see.
https://arxiv.org/abs/2110.09485
"Supercharged Interpolation" is not a real thing.
They generalize fine when the data incentivizes that.
So - and I say this as someone who writes NLP papers too - who cares?
If you trained it on one function class, of course that's all it learned to do. That's all it ever saw!
If you want to learn arbitrary function classes to some degree, the solution is simple. Train it on many different function classes.
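As a sketch of what "train it on many different function classes" could mean for the data pipeline (the three classes and their parameter ranges below are arbitrary choices of mine, not anything from the paper):

```python
import math
import random


# Hypothetical function classes; each factory returns one randomly parameterised f.
def make_linear(rng):
    w, b = rng.uniform(-2, 2), rng.uniform(-1, 1)
    return lambda x: w * x + b


def make_sine(rng):
    a, freq = rng.uniform(0.5, 2), rng.uniform(0.5, 3)
    return lambda x: a * math.sin(freq * x)


def make_quadratic(rng):
    a, c = rng.uniform(-1, 1), rng.uniform(-1, 1)
    return lambda x: a * x * x + c


FUNCTION_CLASSES = {
    "linear": make_linear,
    "sine": make_sine,
    "quadratic": make_quadratic,
}


def sample_training_sequence(rng, n_points=8):
    """Pick a function class uniformly, then return (class_name, [(x, f(x)), ...])."""
    name = rng.choice(sorted(FUNCTION_CLASSES))
    f = FUNCTION_CLASSES[name](rng)
    xs = [rng.uniform(-3, 3) for _ in range(n_points)]
    return name, [(x, f(x)) for x in xs]


rng = random.Random(0)
name, pairs = sample_training_sequence(rng)
print(name, len(pairs))
```

Each training sequence comes from a different class, so the only way to do well across the mix is to infer the class from the context pairs.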
Untrained models are as blank a slate as you could possibly imagine. They're not even comparable to newborn humans with millions of years of evolution baked in. The data you feed them is their world. Their only world.