[0]: https://www.scientificamerican.com/article/new-estimate-boos...
...but that's exactly what OP said, no?
I remember attending an ML presentation where the speaker shared a quote I can't find anymore (speaking of memory and generalization :)), which said something like: "To learn is to forget"
If we memorized everything perfectly, we would not learn anything: instead of remembering the concept of a "chair", you would remember thousands of separate instances of things you've seen that have a certain combination of colors and shapes etc
It's the fact that we forget certain details (small differences between all these chairs) that makes us learn what a "chair" is.
Likewise, if you remembered every single word in a book, you would not understand its meaning; understanding its meaning = being able to "summarize" (compress) this long list of words into something more essential: storyline, characters, feelings, etc.
Aside from having to eventually experience the death of all stars and light and the decay of most of the universe's baryonic matter and then face an eternity of darkness with nothing to touch, it's yet another reason I don't think immortality (as opposed to just a very long lifespan) is actually desirable.
Or is it always running at the same pace regardless of whether it's empty or not?
I guess the brain doesn't really work like that… but I'm curious :-)
But having so much of the past being so accessible is tough. There are lots of memories I'd rather not have, that are vivid and easily called up. And still, I think it's only a fraction of what her memory seems to be like.
While the upper bound is technically "infinity", there is a tradeoff between the number of concepts stored and the fundamental amount of information storable per concept, similar to how other tradeoff principles, like the uncertainty principle, work.
We don’t know if the animal brain works the same way, but I suspect it mostly runs compression algorithms designed to predict things, and doesn’t store much raw data at all.
Geometry is good for training in this way—and often very helpful for physics proofs too!
this 'compression' is what 'understanding' something really entails; at first... but then there's more.
when knowledge becomes understood it enables perception (e.g. we perceive meaning in words once we learn to read).
when we get really good at this understanding-perception we may start to 'manipulate' the abstractions we 'perceive'. an example would be to 'understand a cube' and then being able to rotate it around so as to predict what would happen without really needing the cube. but this is an overly simplistic example
[1] https://en.wikipedia.org/wiki/Synaptic_pruning [2] https://en.wikipedia.org/wiki/Pruning_(artificial_neural_net...
L1 induces sparsity. Weight decay explicitly _does not_, as it is L2. This is a common misconception.
Something a lot of people don't know is that weight decay works because when applied as regularization it causes the network to approach the MDL, which reduces regret during training.
Pruning in the brain is somewhat related, but because the brain uses sparsity to (fundamentally, IIRC) induce representations instead of compression, it's basically a different motif entirely.
If you need a hint here on this one, think about the implicit biases of different representations and the downstream impacts that they can have on the learned (or learnable) representations of whatever system is in question.
I hope this answers your question.
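For intuition, here's a toy sketch of the L1-vs-L2 distinction (made-up data and hyperparameters, numpy only): the soft-thresholding step of proximal-gradient L1 sets weights to exactly zero, while plain gradient descent with an L2 penalty only shrinks them.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.5, 1.0]          # only 3 of 20 features matter
y = X @ w_true + 0.01 * rng.normal(size=n)

lam, lr, steps = 0.1, 0.01, 5000

# L2 "weight decay": plain gradient descent on MSE + lam * ||w||^2
w_l2 = np.zeros(d)
for _ in range(steps):
    w_l2 -= lr * (2 * X.T @ (X @ w_l2 - y) / n + 2 * lam * w_l2)

# L1: proximal gradient (ISTA); the soft-threshold step zeroes weights exactly
w_l1 = np.zeros(d)
for _ in range(steps):
    w_l1 -= lr * (2 * X.T @ (X @ w_l1 - y) / n)
    w_l1 = np.sign(w_l1) * np.maximum(np.abs(w_l1) - lr * lam, 0.0)

print("exactly-zero weights with L2:", int(np.sum(w_l2 == 0.0)))
print("exactly-zero weights with L1:", int(np.sum(w_l1 == 0.0)))
```

L2 shrinks everything toward zero but generically never lands on it; the L1 penalty has a kink at zero, and that kink is what produces exact sparsity.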
But the best cure for overfitting is to make the dataset larger and ensure data diversity. LLMs have datasets so large they usually train for just one epoch.
Just by trying to make the dataset diverse you could skew things to not reflect reality. I just don't think enough attention has been paid to the data, and too much to the model. But I could be very wrong.
There is a natural temporality to the data humans receive. You can't relive the same moment twice. That said, human intelligence is on a scale too and may be affected in the same way.
This is also good life advice.
Note that L1 regularisation produces much more sparsity but it doesn't perform as well.
It's kind of amazing to watch this from the sidelines, a process of engineers getting ridiculously impressive results from some combo of sheer hackery and ingenuity, great data pipelining and engineering, extremely large datasets, extremely fast hardware, and computational methods that scale very well, but at the same time, gradually relearning lessons and re-inventing techniques that were perfected by statisticians over half a century ago.
It means roughly 'to understand completely, fully'.
To use the same term to describe generalization... just shows you didn't grok grokking.
[1] https://www.lesswrong.com/posts/GpSzShaaf8po4rcmA/qapr-5-gro...
Neural network training [edit: on a fixed-point task, as is often the case {such as image->label}] is always (always) necessarily biphasic, so there is no "eventual recovery from overfitting". In my experience, it is just people newer to the field, or just noodling around, fundamentally misunderstanding what is happening as their network goes through a very delayed phase change. Unfortunately there is significant amplification of these kinds of posts, as people like chasing the new shiny of some fad-or-another-that-does-not-actually-exist instead of the much more 'boring' (which I find fascinating) math underneath it all.
To me, as someone who specializes in optimizing network training speeds, it just indicates poor engineering on the part of the person running the experiments. It is not a new or strange phenomenon; it is a literal consequence of the information theory underlying neural network training.
Why throw away the context and nuance?
That decision only further leans into the 'AI is magic' attitude.
“Grok” was Valentine Michael Smith’s rendering for human ears and vocal cords of a Martian word with a precise denotational semantic of “to drink”. The connotational semantics range from literally or figuratively “drinking deeply” all the way up to consuming the absented carcass of a cherished one.
I highly recommend Stranger in a Strange Land (and make sure to get the unabridged re-issue, 1990 IIRC).
And what is the indicator for a machine understanding something?
If anyone wants to come up with their own definition, read Robert Heinlein's 'Stranger in a Strange Land'. There is no definition in there, but you build an intuition of the meaning by its use.
One of the issues I have w/ the use in AI is that using the word 'grok' suggests that the machine understands (that's a common interpretation of the word grok, that it is an understanding greater than normal understanding).
By using an alien word, we are both suggesting something that probably isn't technically true, while simultaneously giving ourselves a slimy out. If you are going to suggest that AI understands, just have the courage to say it in plain English, and be ready for an argument.
Redefining a word that already exists to make the argument technical feels dishonest.
So the AI folks are just borrowing something that had already been co-opted 30+ years ago.
I also have a couple of little libraries for things like annotations, interleaving svg/canvas and making d3 a bit less verbose.
- https://github.com/PAIR-code/ai-explorables/tree/master/sour...
- https://1wheel.github.io/swoopy-drag/
Second, the article correctly states that typically L2 weight decay is used, leading to a lot of weights with small magnitudes. For models that generalize better, would it then be better to always use L1 weight decay to promote sparsity in combination with longer training?
I wonder whether deep learning models that only use sparse fourier features rather than dense linear layers would work better...
Longer answer: deep learning models are usually trying to find the best nonlinear basis in which to represent inputs; if the inputs are well-represented (read that as: can be sparsely represented) in some basis known a-priori, it usually helps to just put them in that basis, e.g., by FFT’ing RF signals.
The challenge is that the overall-optimal basis might not be the same as those of any local minima, so you’ve got to do some tricks to nudge the network closer.
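The RF example is easy to sketch: a two-tone signal is dense in the time domain but sparse after an FFT (toy parameters, chosen so the tones land exactly on DFT bins):

```python
import numpy as np

# two-tone "RF" signal: 256 samples over one second, tones at 5 Hz and 20 Hz
# (integer numbers of cycles, so each tone lands exactly on one DFT bin)
t = np.arange(256) / 256
x = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)

spectrum = np.abs(np.fft.rfft(x))
print("time samples with |x| > 0.01:", int(np.sum(np.abs(x) > 0.01)))
print("frequency bins with energy:  ", int(np.sum(spectrum > 1.0)))  # just the two tones
```

In the time-sample basis nearly every coefficient is "active"; in the Fourier basis only two are, which is the sense in which the right a-priori basis makes the representation sparse.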
Put another way, what matters isn't just how simple this task seems to be in the number of terms; isn't it also a rather dense function?
Probably better question to ask is how sensitive are models that are looking at less dense functions to this? (Or more dense.). I'm not trying to disavow the ideas.
https://en.wikipedia.org/wiki/Grid_cell
If you plot a heat map of a neuron in the hidden layer on a 2D chart where one axis is $a$ and the other is $b$, I think you might get a triangular lattice. If it's doing what I think it is, then looking at another hidden neuron would give a different lattice with another orientation + scale.
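As a sketch of why a triangular lattice would show up (this is the standard three-cosine grid-cell model, not anything recovered from an actual network): summing three plane waves whose directions are 60° apart produces a triangular-lattice activity map, and a different scale/orientation models a different neuron's lattice.

```python
import numpy as np

def grid_activity(a, b, scale=1.0, theta0=0.0):
    """Sum of three plane waves 60 degrees apart -> triangular lattice."""
    angles = theta0 + np.deg2rad([0.0, 60.0, 120.0])
    return sum(np.cos(scale * (np.cos(th) * a + np.sin(th) * b)) for th in angles)

a, b = np.meshgrid(np.linspace(0, 20, 200), np.linspace(0, 20, 200))
Z = grid_activity(a, b)           # peaks of Z form a triangular lattice
# a different (scale, theta0) gives another neuron's lattice
print(Z.shape, float(Z.max()))    # max of 3.0 where all three waves align
```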
Also you could make a base 67 adding machine by chaining these together.
I also can't help the gut feeling that the relationship between W_in-proj's neurons, compared to the relationship between W_out-proj's neurons, looks like the same mapping as the one between the semitone circle and the circle of fifths.
https://upload.wikimedia.org/wikipedia/commons/thumb/6/6f/Pi...
On generalization - it's still memorization. I think there has been some proof that ChatGPT does 'try' to perform some higher-level thinking but still has problems due to the dictionary-type lookup table it uses. The higher-level thinking or AGI that people are excited about is a form of generalization so impressive that we don't really think of it as memorization. But I actually question whether what we want to call original thought is truly separate from what we're currently seeing.
Generalization doesn't require learning representations outside of the training set. It requires learning reusable representations that compose in ways that enable solving unseen problems.
> On generalization - its still memorization
Not sure what you mean by this. This statement sounds self-contradictory to me. Generalization requires abstraction / compression. Not sure if that's what you mean by memorization.
Overparameterized models are able to generalize (and tend to, when trained appropriately) because there are far more parameterizations that minimize loss by compressing knowledge than there are parameterizations that minimize loss without compression.
This is fairly easy to see. Imagine a dataset and model such that the model has barely enough capacity to learn the dataset without compression. The only degrees of freedom would be through changes in basis. In contrast, if the model uses compression, that would increase the degrees of freedom. The more compression, the more degrees of freedom, and the more parameterizations that would minimize the loss.
If stochastic gradient descent is roughly as likely to find any given compressed minimum as any given uncompressed one, then the fact that there are exponentially more compressed minima than uncompressed minima means it will tend to find a compressed minimum.
Of course this is only a probabilistic argument, and doesn't guarantee compression / generalization. And in fact we know that there are ways to train a model such that it will not generalize, such as training for many epochs on a small dataset without augmentation.
But, like all complexity, it is reducible to component parts.
(In fact, we know this because we evolved to have this ability. )
I read Language in Our Brain [1] recently and I was amazed by what we've learned about the neurological basis of language, but I was even more astounded at how profoundly little we know.
> But, like all complexity, it is reduceable to component parts.
This is just false, no? Sometimes horrendously complicated systems are made of simple parts that interact in ways that are intractable to predict or that defy reduction.
[1] https://mitpress.mit.edu/9780262036924/language-in-our-brain
In the case of NNs we have a "modal knn" (memorising) going to a "mean knn" ('generalising') under the right sort of training.
I'd call both of these memorising, but the latter is a kind of weighted recall.
Generalisation as a property of statistical models (ie., models of conditional freqs) is not the same property as generalisation in the case of scientific models.
In the latter a scientific model is general because it models causally necessary effects from causes -- so, necessarily if X then Y.
Whereas generalisation in associative stats is just about whether you're drawing data from the empirical freq. distribution or whether you've modelled first. In all automated stats the only diff between the "model" and "the data" is some sort of weighted averaging operation.
So in automated stats (ie., ML,AI) it's really just whether the model uses a mean.
you can look at it by results: I give these models inputs they've never seen before, but they give me outputs that are correct / acceptable.
you can look at it in terms of data: we took petabytes of data, and with an 8GB model (Stable Diffusion) we can output an image of anything. That's an unheard-of compression ratio, only possible if it's generalizing - not memorizing.
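Back-of-envelope on that compression claim (the dataset size is an assumed round number, roughly LAION-5B scale):

```python
model_bytes = 8e9   # ~8 GB model, per the comment above
n_images = 5e9      # assumed LAION-5B-scale training set
print(f"{model_bytes / n_images:.1f} bytes of model capacity per training image")
```

Under those round numbers, that's on the order of 1.6 bytes per training image, nowhere near enough to store them verbatim.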
What they demonstrate is a neural network learning an algorithm that approximates modular addition. The exact workings of this algorithm are explained in the footnotes. The learned algorithm is general -- it is just as valid on unseen inputs as seen inputs.
There's no memorization going on in this case. It's actually approximating the process used to generate the data, which just isn't possible using k nearest neighbors.
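For reference, the algorithm reported in the grokking interpretability write-ups can be sketched directly. The specific "key frequencies" below are illustrative (for prime p, any frequencies not divisible by p work):

```python
import numpy as np

p = 67                      # modulus (prime), as in the toy grokking setups
ks = np.array([1, 5, 11])   # illustrative "key frequencies"

def mod_add_via_fourier(a, b):
    # logits[c] = sum_k cos(2*pi*k*(a + b - c) / p); every term hits its
    # maximum of 1 exactly when c == (a + b) mod p, so argmax recovers the sum
    c = np.arange(p)
    logits = np.cos(2 * np.pi * ks[:, None] * (a + b - c)[None, :] / p).sum(axis=0)
    return int(np.argmax(logits))

# the rule is exact on every pair, seen or unseen -- that's the generalization
assert all(mod_add_via_fourier(a, b) == (a + b) % p
           for a in range(p) for b in range(p))
```

Nothing here looks up neighbors in the training set; the trig identities compute the answer for any pair, which is why kNN can't reproduce it.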
We have suspected that neural nets are a kind of kNN. Here's a paper:
Every Model Learned by Gradient Descent Is Approximately a Kernel Machine
And, in particular, how to interpret the fact that different hyperparameters determined whether runs, obtaining equally high accuracy on the training data, got good or bad scores on the test data, in terms of the "view it as a kernel machine/interpolation" lens?
My understanding is that the behavior in at least one of those "models learned by gradient descent are equivalent to [some other model]" papers, works by constructing something which is based on the entire training history of the network. Is that the kernel machines one, or some other one?
There's some fox and hedgehog analogy I've never understood.
https://en.wikipedia.org/wiki/Percolation_theory
A relevant, recent paper I found from a quick search: The semantic landscape paradigm for neural networks (https://arxiv.org/abs/2307.09550)
It generalized splendidly - its conclusion was that you always need to press "forward" and do nothing else, no matter what happens :)
From what I gather they're talking about double descent, which afaik is the consequence of overparameterization leading to a smooth interpolation between the training data, as opposed to what happens in traditional overfitting. Imagine a polynomial fit with as many parameters as data points (swinging up and down wildly away from the data) compared with a much higher-degree fit that could smoothly interpolate between the points while still landing right on them.
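That picture is easy to reproduce with min-norm least squares standing in for the overparameterized model (toy data; `lstsq` returns the minimum-norm interpolant when there are more coefficients than points):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10
x = np.linspace(-1, 1, n)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=n)

def min_norm_polyfit(degree):
    # with more coefficients than points, lstsq returns the minimum-norm
    # interpolant -- the analogue of the "smooth" overparameterized fit
    V = np.vander(x, degree + 1, increasing=True)
    coef, *_ = np.linalg.lstsq(V, y, rcond=None)
    return coef

results = {}
for degree in (n - 1, 60):   # exactly determined vs heavily overparameterized
    coef = min_norm_polyfit(degree)
    V = np.vander(x, degree + 1, increasing=True)
    resid = float(np.max(np.abs(V @ coef - y)))
    results[degree] = (resid, float(np.linalg.norm(coef)))
    print(f"degree {degree:2d}: max residual {resid:.1e}, "
          f"coef norm {results[degree][1]:.2f}")
```

Both fits land exactly on the data, but the degree-60 min-norm solution cannot have a larger coefficient norm than the degree-9 one (the degree-9 interpolant padded with zeros is already a feasible candidate), which is the mechanism behind the smoother curve.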
None of this is what I would call generalization; it's good interpolation, which is what deep learning does in a very high dimensional space. It's notoriously awful at extrapolating, i.e. generalizing to anything without support in the training data.
Scientists are also pretty lousy at making new discoveries without labs. They just need training data.
It's a description of a behavior, not a mechanism. Which may or may not be appropriate depending on whether you are talking about *what* the model does or *how* it achieves it.
"generalize" means going from specific examples to general cases not seen before, which is a perfectly good description of the phenomenon. Why try to invent a new word?
It's not true: if you look at a deep CNN, the lower layers show lines, the higher layers complex stuff like eyes or football players, etc. Hierarchisation of information actually emerges naturally in NNs.
Generalization often implies extrapolation on new data, which is just not the case most of the time with NNs, and why I didn't like the word.
You can train a classical ML model on the known orbits of the planets in the past, but it can presumably never predict orbits given unseen n-body gravity events like another dense mass moving through the solar system because of classical insufficiency to model quantum problems, for example.
Church-Turing-Deutsch doesn't say there could not exist a Classical / Quantum correspondence; but a classical model on a classical computer cannot be sufficient for quantum-hard problems. (e.g. Quantum Discord says that there are entanglement and non-entanglement nonlocal relations in the data.)
Regardless of whether they sufficiently generalize, [LLMs, ML Models, and AutoMLs] don't yet Critically Think and it's dangerous to take action without critical thought.
Critical Thinking; Logic, Rationality: https://en.wikipedia.org/wiki/Critical_thinking#Logic_and_ra...
Anyone who has so much as taken a class on this knows that even the simplest of perceptron networks, decision trees, or any form of machine learning model generalizes. That's why we use them. If they don't, it's called overfitting[1], where the model is so accurate on the training data that its inferential ability on new data suffers.
I know that the article might be talking about a higher form of generalization with LLMs or whatever, but I don't see why the same principle of "don't overfit the data" wouldn't apply to that situation.
No, really: what part of their base argument is novel?
Simple models predicting simple things will generally slowly overfit, and regularization keeps that overfitting in check.
This "grokking" phenomenon is when a model first starts by aggressively overfitting, then gradually prunes unnecessary weights until it suddenly converges on the one generalizable combination of weights (as it's the only one that both solves the training data and minimizes weights).
Why is this interesting? Because you could argue that this justifies using overparametrized models with high levels of regularization; e.g. models that will tend to aggressively overfit, but over time might converge to a better solution by gradual pruning of weights. The traditional approach is not to do this, but rather to use a simpler model (which would initially generalize better, but due to its simplicity might not be able to learn the underlying mechanism and reach higher accuracy).
tldr: don't oversimplify things, or you'll underfit
P.S. please don't fucking review. Your complaints aren't critiques.