I just started out with mapping to be systematic. Mapping is ground zero, then interpolation, i.e. any smooth fitting function or basis, then combinatorial, where different bases are recognized and then projected according to their relevance to a new input.
Each of those increases modeling efficiency and power, but even combinatorial doesn't scale to problems like language.
I may be doing a poor job communicating. A formal breakdown of the scaling issues with lower-order modeling, scaled up to make up for it, would be a great paper.
To prove me wrong (as a thought experiment), choose a lower order model, any kind you can imagine that would qualify as modeling without understanding. Demonstrate it can do anything close. That it could possibly scale to the human corpus with just a trillion parameters.
If the number of parameters goes up far too fast, then that can't be the way deep learning solves the problem with a trillion, or a few billion, either.
And consider the other side. We have no idea how our own brains are lifting up what is relevant vs. what is not. We are used to it happening. We call it "understanding". But we don't know how it works, how we work. Despite experiencing it.
What we do know, because combinatorial is too resource intensive, is we are not just combinatorial either.
The way an LLM works is by creating a space of N dimensions, N being the number of tokens. This space contains all the possible combinations. The LLM will find the best combination, but will not scan the whole space. To find the best combination, it minimizes the loss function, which is low when the output corresponds to the target. By doing so, it does not explore the combinations that "go in the wrong direction", and therefore it is not true that increasing the space by a factor S increases the difficulty of running the model by a factor S.
Because of that, while the combination space scales like combinatorial, the model does not. A model with 2 weights (or rather tokens, but the number of weights should be at least the number of tokens) corresponds to 4 combinations (AA, AB, BA, BB can indeed be described by 2 binary weights of value "A" or "B"). A model with 3 weights corresponds to 27 combinations. A model with 4 weights corresponds to 256 combinations. ... A model with N weights corresponds to N to the power N combinations. The number of combinations increases enormously, and yet the number of weights increases only linearly.
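To make the counting concrete, here is a minimal sketch that takes the "N to the power N" rule above at face value (N symbols, sequences of length N). The function names are made up for illustration, and this only counts combinations; it says nothing about how a real LLM is parameterized.

```python
from itertools import product


def combination_count(n: int) -> int:
    """Number of length-n sequences over n symbols: n ** n."""
    return n ** n


def enumerate_combinations(symbols: str) -> list[str]:
    """Brute-force listing of every sequence; only feasible for tiny alphabets."""
    return ["".join(p) for p in product(symbols, repeat=len(symbols))]


# Two symbols, length two: the AA, AB, BA, BB case from the comment.
print(enumerate_combinations("AB"))

# The count explodes while the number of "weights" n grows by one each step.
for n in (2, 3, 4, 10, 20):
    print(f"weights={n:>2}  combinations={combination_count(n)}")
```

Enumeration stops being practical almost immediately, which is the point: the count is easy to compute, but the space itself cannot be walked through.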
In SOTA, we have billions of weights. That is a model that contains a very very very very big number of combinations, something so big that it is difficult for a human to grasp. It will not try all of these combinations one by one; gradient descent helps it find the best combination without having to do so.
So, yes, SOTA are finding "the best combination" amongst an impressively huge number of combinations, yet without having to "scale like combinatorial".
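A toy sketch of that claim, under strong assumptions: the loss here is a simple convex bowl (squared distance to a hypothetical `target`), which is far kinder than any real LLM loss, and the names `loss`, `grad` and `gradient_descent` are illustrative only. It compares the work gradient descent does with the size of an exhaustive grid over the same space.

```python
def loss(w: list[float], target: list[float]) -> float:
    """Squared distance to the target; zero exactly at the target."""
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))


def grad(w: list[float], target: list[float]) -> list[float]:
    """Gradient of the loss with respect to each weight."""
    return [2.0 * (wi - ti) for wi, ti in zip(w, target)]


def gradient_descent(target: list[float], lr: float = 0.1, steps: int = 100) -> list[float]:
    """Follow the negative gradient from the origin for a fixed number of steps."""
    w = [0.0] * len(target)
    for _ in range(steps):
        g = grad(w, target)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w


# Hypothetical 10-dimensional target, chosen only for illustration.
target = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5, 0.0, 1.0, -2.0, 0.1]
steps = 100

w = gradient_descent(target, steps=steps)
gradient_evaluations = steps            # one gradient per step
grid_points = 21 ** len(target)         # exhaustive grid, 21 values per dimension

print("final loss:", loss(w, target))
print("gradient evaluations:", gradient_evaluations)
print("grid points an exhaustive scan would visit:", grid_points)
```

The descent visits one point per step along a single path, while the grid grows exponentially with the number of dimensions. Whether the same holds for a non-convex loss with billions of weights is exactly what the thread is arguing about; this only shows that the two costs are not the same thing.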
> To prove me wrong (as a thought experiment), choose a lower order model, any kind you can imagine that would qualify as modeling without understanding. Demonstrate it can do anything close. That it could possibly scale to the human corpus with just a trillion parameters.
Yes. Easy. A SOTA LLM does that. It is modeling without understanding. It does not understand, it finds the best patterns. And when you put it in a new situation, it uses these patterns to create a new text, without truly understanding the content of the text. And if you ask an additional question, it will use the previous text as context, and create a new text that, as it has been trained to, will be consistent with the output that has already been given.
Your assertion "you can prove me wrong" is circular reasoning: you start by saying "if a model can produce a text that looks realistic to me, then it means it has understanding. To prove me wrong, give me a text that looks realistic to me and has no understanding". Well, I cannot do that, because for you, if it looks realistic, it has to have understanding.
> If the number of parameters goes up far too fast, then that can't be the way deep learning solves the problem with a trillion, or a few billion, either.
The combination space grows as N to the power N. So a trillion parameters is not "just 1000 times bigger" than a billion parameters, but more than 1000 to the power of one billion times bigger (the exact ratio is even bigger than that). Do you realise the size of the combination space? That lower bound alone is 1 followed by 3 times one billion zeroes.
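The arithmetic can be checked without ever writing the numbers out, by working with base-10 logarithms. This is only a sketch of the N-to-the-power-N rule as stated in this comment; `log10_combinations` is a made-up helper name.

```python
import math


def log10_combinations(n: int) -> float:
    """log10 of n ** n, i.e. roughly how many digits the count has."""
    return n * math.log10(n)


billion = 10 ** 9
trillion = 10 ** 12

small = log10_combinations(billion)    # log10 of (10^9)^(10^9)
large = log10_combinations(trillion)   # log10 of (10^12)^(10^12)
ratio_exponent = large - small         # log10 of how many times bigger

# The comment's lower bound: 1000^(10^9) = 10^(3 * 10^9).
lower_bound_exponent = billion * math.log10(1000)

print(f"billion-weight space:  10^{small:.3e}")
print(f"trillion-weight space: 10^{large:.3e}")
print(f"ratio:                 10^{ratio_exponent:.3e}")
print(f"quoted lower bound:    10^{lower_bound_exponent:.3e}")
```

Under this rule the ratio is a power of ten with an exponent around 1.2e13, so the "1 followed by three billion zeroes" figure is indeed a very loose lower bound.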
> What we do know, because combinatorial is too resource intensive, is we are not just combinatorial either.
I think you don't understand how LLMs work: they find the best combinations in an incredibly huge parameter space, but don't need to explore the whole space, just the 1-dimensional manifold that is the curve followed by gradient descent within this huge combination space.
There are plenty of clues that SOTA doesn't "understand". For example, did you notice that SOTA happens to understand what humans understand, and doesn't understand what humans don't understand? If SOTA really worked by "discovering the true mechanism", it would discover with equal probability mechanisms that humans have already noticed and mechanisms that humans have not noticed yet. For example, humans know that the Standard Model of particle physics is incomplete, and there are plenty of texts and books about it that SOTA has learnt from. Yet SOTA has not "understood" the underlying mechanism that explains particle physics. It does not really know what an electron is by "making sense of what this object does"; it only knows it as "a language word that can be used in some contexts in a specific way".
And, sure, SOTA is helping with new discoveries, but the way it does so is by using a "reasoning" approach. If SOTA really created its own understanding when learning human language, then it should already have the new discovery after training, without using any "reasoning" approach, because it would be something that it had already understood.
Yes, if it consistently produces good output for highly varied stimuli that can be intentionally picked to be unlikely to have ever had obvious representation in the training set, then yes it understands.
I think we are talking past each other a bit.
A series of increasingly challenging datasets, used to capture scaling efficiencies, would ground our discussion.
But the level of performance for models is simply too good vs. the number of parameters to be doing anything trivial.
Deep learning models do something combinatorial models do not. The linear tensor + non-linear transforms do two special things:
1. The tensor itself just projects a linear space into higher dimensions, but it's still the same information space. Project a 2D surface into higher dimensions linearly, and there can be more parameters, but it is not more information, since there is an expansion of linear dependence to match.
2a. But then the nonlinearity thresholds, squashes or otherwise alters the linear results, in a way that removes linear dependencies, increasing the useful dimensionality of the representation.
2b. And the squashing also allows dimensions to be folded down.
So by both expanding and flattening representational dimensions, deep learning models are able to model higher-order relationships directly, where any less expressive modeling would require cobbling together many patches of fitting.
Another way to put this is that deep learning models are able to learn higher-order relationships directly, not by memorizing and interpolating across learned points or regions.
So a dramatically greater ability to "understand" is why deep learning models are so much better. They are not doing simple combinatorial fitting.
"Understanding" or not, combinatorial relationships are the low bar for deep learning models; they are inherently great at learning much higher-order relationships.
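The expand-then-fold point (1, 2a, 2b above) can be illustrated on the smallest classic case, XOR. This is a hand-built sketch, not a trained model: the weights in `xor_net` are picked by hand, and `affine` stands in for any purely linear model. A linear map alone cannot represent XOR, but a linear step followed by a ReLU and a linear readout can.

```python
def relu(x: float) -> float:
    """Threshold nonlinearity: passes positives, zeroes out negatives."""
    return max(0.0, x)


def xor_net(x1: float, x2: float) -> float:
    """Linear step -> ReLU expansion into two hidden units -> linear readout."""
    s = x1 + x2             # linear projection of the inputs
    h1 = relu(s)            # hidden unit 1
    h2 = relu(s - 1.0)      # hidden unit 2, only active when both inputs are on
    return h1 - 2.0 * h2    # readout folds the two units back to one value


def affine(x1: float, x2: float, a: float, b: float, c: float) -> float:
    """Any purely linear (affine) model of two inputs."""
    return a * x1 + b * x2 + c


inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
print([xor_net(x1, x2) for x1, x2 in inputs])

# For every affine model, f(0,0) + f(1,1) equals f(0,1) + f(1,0).
# XOR would need 0 + 0 on the left and 1 + 1 on the right, so no choice
# of a, b, c can fit it. Checked here on a few integer coefficient triples.
for a, b, c in [(1, 1, 0), (2, -3, 5), (-4, 7, 1)]:
    left = affine(0, 0, a, b, c) + affine(1, 1, a, b, c)
    right = affine(0, 1, a, b, c) + affine(1, 0, a, b, c)
    print((a, b, c), left == right)
```

Stacking more linear layers would not help, since a composition of affine maps is still affine; it is the thresholding step that removes the linear dependence.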
I am falling asleep at this point. I feel like we need a blackboard and a computer. You are saying a lot of things that make me think, and make sense to me.
You keep saying "what I observe with GenAI can only be the result of 'understanding'" without providing any proof at all. Just a few beliefs.
You just say "look at this behavior, that's the proof". I truly don't think it is: nothing proves that this behavior requires 'understanding'. And nothing you provided helps: all you provided are impressive behaviors and then the unsubstantiated conclusions "and this behavior can only be done with understanding".
At the same time, there are too many clues showing that such behavior does not require understanding, even if it _looks_ incredibly clever:
1. GenAI does not understand (after the training phase) things that humans don't understand. If GenAI had the capacity to build an understanding during training, then there is no reason this understanding would coincide with human understanding.
2. Optimisation does not always lead to "understanding". Human brains choose to optimise "learning the multiplication table by heart" rather than building a pocket calculator inside the neurons.
3. Human brains, which have "understanding", work fundamentally differently from GenAI (flow of thoughts, intrinsically intertwined memory and compute, optimised for world-model treatment rather than token treatment, ...). It is an unsubstantiated jump to simply conclude that AI has "understanding", when its behavior can be the result of these fundamental differences.
4. "Basic" LLMs are surprisingly good at creating convincing sentences, and yet there are situations where it is blatantly clear they did not understand anything. More advanced SOTA models are based on refinements of the "basic LLM", and therefore the "sentence construction that is done without understanding" is still used, and keeps the SOTA model from building a full understanding.
> Another way to put this is that deep learning models are able to learn higher-order relationships directly, not by memorizing and interpolating across learned points or regions.
It's exactly what I'm saying: deep learning models are very good at learning complex relationships. Such as "I don't know what 'Paris' is, I don't have any understanding of what a city is in reality, but when the token Paris is associated with these other tokens in this complex order, even if I never saw it before, I have learnt the complex relationships and therefore I'm able to build a series of tokens".
They are very good at learning complex relationships that allow them to choose the correct combination even if they did not "understand" the content of the correct combination.
I understand that it is impressive: those relationships are very complex and very numerous (there are billions of them). It is easy to anthropomorphize and conclude that the AI has "understood".
But again, the main problem is that you just claim, without any proof, "no, I cannot believe that, I refuse to believe that".
(and, by the way, I personally think that AI (SOTA but also even "basic LLM") does have 'rules' that correspond to some kind of understanding of basic mechanisms. I think they have basic "world models". But these world models are optimised "to write text" rather than to "understand the world", and therefore the large majority of AI output is just not-understood token chains)