Although excellent, the article could start off with the practical implications (maybe it is faster GPT training?) so that there is at least some motivation.
1. Gradient descent is path-dependent and doesn't forget the initial conditions. Intuitively reasonable - the method can only make local decisions, and figures out 'correct' by looking at the size of its steps. There's no 'right answer' to discover, and each initial condition follows a subtly different path to 'slow enough'...
because...
2. With enough simplification the path taken by each optimization process can be modeled using a matrix (their covariance matrix, K) with defined properties. This acts as a curvature of the mathematical space, and has some side-effects like being able to use eigen-magic to justify why the optimization process locks some parameters in place quickly, but others take a long time to settle.
which is fine, but doesn't help explain why wild over-fitting doesn't plague high-dimensional models (would you even notice if it did?). Enter implicit regularization, stage left. And mostly passing me by on the way in, but:
3. Because they decided to use random noise to generate the functions they combined to solve their optimization problem, there is an additional layer of interpretation that they put on the properties of the aforementioned matrix, which implies the result will only use each constituent function 'as necessary' (i.e. regularized, rather than wildly amplifying pairs of coefficients).
And then something something Bayesian, which I'm happy to admit I'm not across.
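A toy numerical check of point 2, for anyone who wants to see the eigen-magic rather than take it on faith. This is only a sketch under my own assumptions: a plain quadratic loss, a made-up 3x3 matrix `K` with hand-picked eigenvalues, and a fixed step size `lr`. None of these come from the article. It shows that, in the eigenbasis of `K`, each direction of the error shrinks by a factor of `1 - lr * eigenvalue` per step, so large-eigenvalue directions settle almost immediately while small-eigenvalue directions linger.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a symmetric positive-definite K with widely spread
# eigenvalues, standing in for the article's covariance matrix.
eigvals = np.array([1.0, 0.1, 0.01])
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random orthonormal eigenvectors
K = Q @ np.diag(eigvals) @ Q.T

target = rng.normal(size=3)  # minimizer of the quadratic loss
x0 = np.zeros(3)             # initial condition
lr = 0.5                     # step size; must stay below 2 / max(eigvals)
steps = 20

# Gradient descent on L(x) = 0.5 * (x - target)^T K (x - target).
x = x0.copy()
for _ in range(steps):
    x = x - lr * K @ (x - target)

# Errors expressed in the eigenbasis of K.
err0 = Q.T @ (x0 - target)
err = Q.T @ (x - target)

# Closed form: fraction of the initial error left in each eigen-direction.
predicted = (1 - lr * eigvals) ** steps

print("fraction remaining (measured): ", err / err0)
print("fraction remaining (predicted):", predicted)
```

The direction with the smallest eigenvalue still carries most of its initial error after 20 steps, which is also where the dependence on the initial condition survives longest.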
https://towardsdatascience.com/gradient-kernel-regression-e4...
Not out of vanity (ok, a little) but because I think the idea has importance that has not been fully explored. The article's Bayesian perspective may be the whole story but somehow I don't think so. Unlike the article's author, my work left me feeling model architecture was the most important thing (behind training data) whereas they seem to feel it is ancillary.
Can someone help?
Is it basically about using stats to improve on linear models?
It's amusing how facts that look surprising and mysterious to the rest of the world are just table stakes at the right sort of math department. As a researcher I feel the pressure to make things that are "my own", but there's so much that already exists just waiting to be grokked and plugged in!
I'm interested especially in the lessons we can learn about the success of overparametrization. As mentioned at the beginning of the article:
> To use the picturesque idea of a "loss landscape" over parameter space, our problem will have a ridge of equally performing parameters rather than just a single optimal peak.
It has always been my intuition that overparametrization makes this ridge an overwhelming statistical majority of the parameter space, which would explain the success in training. What is less clear, as mentioned at the end, is why it hedges against overfitting. Could it be that "simple" function combinations are also overwhelmingly statistically likely vs more complicated ones? I'm imagining a hypersphere-in-many-dimensions kind of situation, where the "corners" are just too sharp to stay in for long before descending back into the "bulk".
Interested to hear others' perspectives or pointers to research on this in the context of a kernel-based interpretation. I hope understanding overparametrization may also go some way toward explaining the unreasonable effectiveness of analog-based learning systems such as human brains.
> Since kerΠ can be described as the orthogonal complement to the set {Kti}, the orthogonal complement to kerΠ is exactly the closure of the span of the vectors Kti.
{Kti} is going to be very large in the overparametrized case, so the orthogonal complement will be small. Note also this part:
> Because v is chosen with minimal norm [in the context of the corresponding RKHS], it cannot be made smaller by adjusting it by an element of kerΠ...
So it sounds like all the "capacity" is taken up by representing the function itself and, seemingly paradoxically, the parameters λi are more constrained by the implicit regularization imposed by gradient descent (hypothetically enforcing the minimal-norm constraint). So the parameter space of functions that can possibly fit is tiny. The rub in practical applications is that many combinations of NN parameters can correspond to one set of parameters in this kernel space, so the connection between p and λ (via f?) seems key to understanding the core of the issue.
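The minimal-norm point can be illustrated in a finite-dimensional linear analogue. This is a sketch under stated assumptions, not the article's RKHS construction: a random matrix `A` (5 data points, 50 parameters) plays the role of Π, and all names and sizes are hypothetical. Gradient descent started from zero never leaves the row space of `A`, so it ends up at the minimum-norm interpolant, and any other interpolant differs from it by an element of the kernel of `A` and has strictly larger norm.

```python
import numpy as np

rng = np.random.default_rng(1)

n_points, n_params = 5, 50  # far more parameters than data points
A = rng.normal(size=(n_points, n_params))
y = rng.normal(size=n_points)

# Gradient descent on 0.5 * ||A w - y||^2, started from zero.
w = np.zeros(n_params)
lr = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / (largest singular value)^2
for _ in range(2000):
    w = w - lr * A.T @ (A @ w - y)

# Minimum-norm interpolating solution via the pseudoinverse.
w_min_norm = np.linalg.pinv(A) @ y

# Build another interpolant by adding a vector from ker(A).
v = rng.normal(size=n_params)
null_dir = v - np.linalg.pinv(A) @ (A @ v)  # project v onto ker(A)
w_other = w_min_norm + null_dir

print("GD matches min-norm solution:", np.allclose(w, w_min_norm))
print("norm of min-norm solution:   ", np.linalg.norm(w_min_norm))
print("norm of other interpolant:   ", np.linalg.norm(w_other))
```

Starting from a nonzero initial point would leave that point's kernel component untouched, which is one concrete way to see how the initial conditions are never forgotten.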
There is literature on approximating exact GP inference with (something like) these objects when m << N (variational inference).
However, I’m not aware of anyone drawing a clear picture of the other direction, starting from the optimization picture and explaining it in terms of inference, similar to what TFA does.
In TFA the number of functions is large, so the system is underdetermined. In variational inference the system is overdetermined, and I wonder what inference, if any, gradient descent does.
Caveat: 1am and a few drinks deep so if I’m not making sense that’s ok
Is that my Water.css color scheme from so many years ago I see? :)
My own post basically covers the simplest possible case where gradient descent does something related to kernels. The problem is that the "tangent kernel" driving the evolution of the model over training is typically not constant. (In my case it is constant because my model is linear.)
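To make the "constant because my model is linear" remark concrete, here is a small sketch. Everything in it is an assumption for illustration: the feature map, the two parameter vectors, and the use of finite differences to get gradients. The tangent kernel is taken as K(x, x') = ∇f(x) · ∇f(x'), with gradients with respect to the parameters. For a model that is linear in its parameters the kernel comes out the same at any parameter value; wrapping the same model in a `tanh` makes it move.

```python
import numpy as np

def features(x):
    # Hypothetical fixed feature map; any fixed set of functions would do.
    return np.stack([np.ones_like(x), x, x ** 2], axis=-1)

def linear_model(w, x):
    return features(x) @ w

def nonlinear_model(w, x):
    return np.tanh(features(x) @ w)

def tangent_kernel(model, w, x, eps=1e-6):
    # Jacobian of the outputs with respect to the parameters, by central
    # differences; column k holds d f(x_i) / d w_k.
    jac = np.stack(
        [(model(w + eps * e, x) - model(w - eps * e, x)) / (2 * eps)
         for e in np.eye(len(w))],
        axis=-1,
    )
    return jac @ jac.T  # K[i, j] = grad f(x_i) . grad f(x_j)

x = np.linspace(-1.0, 1.0, 4)
w_a = np.array([0.1, -0.3, 0.5])   # two arbitrary points in parameter space
w_b = np.array([1.0, 2.0, -1.5])

lin_change = np.abs(
    tangent_kernel(linear_model, w_a, x) - tangent_kernel(linear_model, w_b, x)
).max()
nonlin_change = np.abs(
    tangent_kernel(nonlinear_model, w_a, x) - tangent_kernel(nonlinear_model, w_b, x)
).max()

print("max kernel change, linear model:   ", lin_change)
print("max kernel change, nonlinear model:", nonlin_change)
```

The linear case changes only at the level of finite-difference rounding, while the nonlinear case changes by an amount comparable to the kernel entries themselves.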
Domingos' solution seems to be: in general, just integrate the tangent kernel over the path taken by your optimization and call it the path kernel. Then your resulting model can always be viewed as a kernel machine, with the subtlety that the kernel now depends on the trajectory you took during training. So in that sense, kernels are everywhere! I'll take another look on Monday :)
(Although I don't have time to respond meaningfully today, I really appreciate the comments pointing out other relationships by mvcalder, Joschkabraun, yagyu and others.)