There are many known universal approximations. Deep networks are one. SVMs are one. Heck, cubic splines are one, and they've been in use for nearly a hundred years IIRC.
The problem has never been one of finding a sufficiently powerful approximator. It has been training that approximator. My understanding of the significant advancement made by deep learning is that we finally figured out how to train a specific kind of universal approximator in a way such that it finds very good separation surfaces for what used to be impossible-to-solve classification problems.
But it should be no surprise to anyone that there exist, in theory, other universal approximations that approximately reproduce the same separation surfaces, should it? I'd expect any universal approximator to be powerful enough to reproduce the separation surfaces, hence the meaning of the word "universal". The problem was always finding the right weights, not finding the right approximator architecture.
Am I missing something?
The point of the paper is that if you train a deep learning model with gradient descent then the resulting model is effectively a kernel machine, regardless of model architecture.
The nice thing about a kernel machine is that it is simple (just one hidden layer) and we are able to use to analyze a kernel machine more effectively and conveniently.
So, I think the contribution here isn't "these sets of universal aproximators are equivalent" but rather "We have this effective but opauge deep learning thing, turns it it's actual a kernel machine in retropsect so we can bring 'kernel tooling' to analyze the deep learning mode"
This may have colossal practical implications, as long as the approximation stays good enough.
Kernels have been studied thoroughly from a theoretical perspective and people have proven things about them.
The goal of papers such as these is to find ways in which deep neural networks and kernel methods are similar, so that theoretical results and tools found for kernel methods may be adapted to deep neural networks.
You should be careful with the meaning you ascribe to the word 'universal'. The list of universal approximators is massive, and the sub-list of universal approximators that can be trained with OLS is still substantial. Still these models can differ significantly:
- How efficient are they (in #parameters required for a certain error) for specific tasks? There is a known 'maximum efficiency' for general tasks, but in high dimensions this efficiency is terrible, such that many models will fail terribly on high-dimensional data. Hence, you should pick a model that is exceptionally good for a specific task, although it might be less efficient for other tasks.
- How well can the model cope with noise? If your dependent variable is severely distorted (think financial data) then you need a model that can balance between interpolating datapoints and averaging out the noise.
Just to name my two favorite properties. The first one is _kind of_ related to learnability, since an inefficient model is often pretty much impossible to learn.
Similar to showing a problem falls in NP, you can reduce the problem down to another problem in NP and be done with it.
There's a number of problems with svms; complexity for training and inference scales with the amount of training data, which is pretty sad panda for complex problems.
Extremely spicy/cynical take: it's not cool to say "you all should go look at all these possible applications" when the thrust is the paper is to prop up the relevance of an obsolete approach. You gotta do the actual work to close the gap if you still want your PhD to be worth something...
That said, I haven't read the paper terribly closely, and am always happy to be proven wrong!
Is this true though, or does network architecture only matter in terms of efficiency? This is non rhetorical, I really don't know much about deep learning. :) I guess i'm asking if with enough data and compute, is architecture still relevant?
What are the leading theories for why this seems to be the case? Less nodes to capture and direct decisions?
Also it is representation of approximator itself and data compression. For cubic splines to approximate some NN you likely would need enormous number of intervals covering input space.
A kernel SVM always finds (one-of) the global best fit lines in the kernel space.
A gradient descent model explicitly converges to one of the nearest local minimas by definition.
Does this paper conclude that the local minimas that neural networks converge to are one of the many equivalent global maxima ? Won't this be a major revelation by itself ?
Example:https://towardsdatascience.com/support-vector-machine-simply...
I'd assume that a lot of the binary decision tree / chip level optimizations are similar: almost no slope worth analyzing.
The "SVM step" would still find a global optima within the kernel space, but the qualifications of the previous step mean that the kernel space generated might be useless.
I didn't have time to read the paper, but reading the abstract I don't see a claim that the gradient descent model approximated by a kernel machine is equivalent to an optimal fit obtained by SVM maximum margin hyperplane fitting.
I assume one likely ends up with different hyperplane fits from converting a NN/gradient-desc-learned model to kernel machine vs learning a kernel machine directly via SVM learning.
It also only applies to the continuous limit of non-stochastic GD, far from the real training methods used.
We don't gain any understanding either; understanding implies predictive power about some new situation, and I don't see any- and nor does the paper suggest them.
Looks like yet another attempt to attract attention by "understanding" NNs. Look, humans can't explain or understand how we drive, speak, translate, play chess, etc, so why should we expect to understand how models that do these work? Of course, we can understand the principles of the training process, and in fact we already do- the theory of SGD is well understood.
This implies there's no point in pursuing explainability, but many domains involve inferences where the significant predictors are much easier to abstract at a useful level.
For example, if a DLNN could make suggestions as to how to tune a greenhouse given certain yield objectives, then it's reasonable to pursue heuristic techniques aiming to explain what about the parameters most significantly led to the given suggestions.
I agree with you, but also it's amazing how much deepmind has achieved by putting neuroscientists and machine learning experts in the same room, and trying to make systems that work inside the human brain work efficiently on metal.
If you look at this talk for 2010, Demis was already listing attention as an example (which was responsible for the recent improvement in protein folding prediction as an example):
It may not have been the intention, but associative memory is the one of the only mechanisms that computational neuroscientists can agree on broadly. There's been recent work on energy-based models that suggest biologically plausible methods adjacent to attention. [0]
Even in cases like attention, the modern version (that actually works in GPT-3, AlphaFold2, etc), has little in common with both the english word and what we think of as attention. Its a formula with two matmuls and a softmax: softmax(AB)C. In particular, it doesn't necessarily look anywhere at all- just a weighted sum of the inputs. Nothing like the hard attention used by the human visual cortex. Its not even that different from a convolution where you allow the weights to be a function of the input.
So the inspiration might have come from humans, but the actual architectures have largely come from pure trial and error, with limited, difficult to explain intuition on what tends to work.
https://openreview.net/pdf?id=HJlnC1rKPB
,,This work provides evidence that attention layers can perform convolution and, indeed, they often learn to do so in practice. Specifically, we prove that a multi-head self-attention layer with sufficient number of heads is at least as expressive as any convolutional layer. Our numerical experiments then show that self-attention layers attend to pixel-grid patterns similarly to CNN layers, corroborating our analysis.''
On the other hand there are GPU accelerated SVM training such as: https://github.com/Xtra-Computing/thundersvm
A github or Google search will reveal other GPU accelerated SVM training.
A visual proof that neural networks can approximate any function
As defined in the paper, a "path kernel" measures, for any two data points, how similarly a model varies (specifically, how similarly the model's gradients change) at those two data points during training via SGD. This isn't exactly your usual, plain-vanilla, radial-basis type of kernel.
We've known for a long time that SVMs are universal approximators, i.e., in theory they can approximate any target function. The importance of this work is that it has found a new, surprising, deep connection between any model trained via SGD and SVMs, which are well understood :-)
One minor technical correction, the proof relief on the continuous model gradient flow not SGD. So it’s proven for GD and likely true for SGD and your intuitive explanation likely still holds, but it’s not obvious.
The assertion is known by the community at least since 2018, if not even well before.
I find this article and the buzz around a little awkward.
The math here is pretty much first-year undergraduate level calculus, and it's worth going through section 2 since it's quite clearly written (thanks Prof Domingos).
Essentially, what the author does is show that any model trained with "infinitely-small step size full-batch gradient descent" (i.e. a model following a gradient flow) can be written in a "kernelized" form
y(x) = \sum_{(x_i, y_i) \in L} a_i K(x, x_i) + b.
The intuition most people have for SVMs is that the constants a_i are, well, constant, that the a_i are sparse, and that the kernel function K(x, x_i) is cheap to compute (partly why it's called the 'kernel trick').However, none of those properties apply here, which is why this isn't particularly exciting computationally. The "trick" is that both a_i and K(x, x_i) involve path integrals along the gradient flow for the input x, and so doing a single inference is approximately as expensive as training the model from scratch (if I understand this correctly).
HN feature request: parse inline latex math please
The result is interesting insofar as the path kernel is interesting, which requires some more thought.
I can't tell if this paper is a useful insight or not.
Contrary to many related works that compare wide neural networks to kernel methods, our recent work shows that one can study a feature learning infinite-width limit with realistic learning rate.
https://arxiv.org/abs/2011.14522
We identified what separates the kernel regime (e.g., NTK) and the feature learning regime. In the infinite-width limit, OP's work could belong to either regime depending on the parametrization, i.e. the path kernel is either equal to the NTK or performing feature learning.
It's an incredibly interesting research topic. Please feel free to comment with thoughts on our work :)
Huh. So the implication here is that a deep network can never generalize results to inputs classes that were not explicit in the training set? And would this result apply for networks trained with something other than gradient based methods?
In other words, even though these two types of model are in a way equivalent, one can be much easier to train than the other for certain concepts (no free lunch).
No! Really !
They are learned by interaction with a network selection process of varying degrees of hokeyness that involves decisions being taken by humans about the characteristics of the network, training data, training process and testing processes. At the end -> the model! Gradient descent is v.critical to this, but it's not the only thing going on by a long way.
Also dropout.
It would be neat if this paper was accompanied by minimalist code to demonstrate the approximate equivalence on some toy "deep" networks, so any misconceptions could be avoided in the peanut gallery here (like every inference / evaluation requiring to integrate the path kernel from scratch, which is clearly not proposed, it just describes how training the original deep network is in some sense equivalent to integrating the path kernel)
Hah! Fancy seeing that here! Predicate invention is a main line of my PhD research.
Briefly, predicate invention is the ability of Inductive Logic Programming (ILP) systems to learn their own inductive bias. It is in a sense similar to feature learning or learning-to-learn. ILP systems learn logic programs from examples usually by searching the space of programs defined by a set of sub-programs, called the background knowledge (BK), and a language bias that determines the structure of learned programs. Predicate invention then means learning new BK and language bias to change the program search space while searching it.
The reference in Domingo's article is the first description of the concept which was for a long time more theoretical than practical: ILP approaches could only perform a limited form of predicate invention, e.g. could only invent BK programs of fixed structure or could not invent recursive programs etc. Things changed in 2013 with a new approach, Meta-Interpretive Learning (MIL). MIL systems are for the first time capable of unconstrained predicate invention, including the invention of mutually recursive programs. Full discolosure: my PhD research is on MIL.
Here are some more recent references on predicate invention in MIL:
Meta-interpretive learning of higher-order dyadic datalog: Predicate invention revisited (IJCAI 2013):
https://www.ijcai.org/Proceedings/13/Papers/231.pdf
Bias reformulation for one-shot function induction (ECAI 2014):
http://www.doc.ic.ac.uk/~shm/Papers/metabias.pdf
Logical minimisation of meta-rules within meta-interpretive learning (ICLP 2015):
https://www.doc.ic.ac.uk/~shm/Papers/minmeta.pdf
Like I say this is my field of study and as you can probably tell I'm very excited about it so I'm happy to answer questions- email in my profile.
> A key property of path kernels is that they combat the curse of dimensionality by incorporating derivatives into the kernel: two data points are similar if the candidate function’s derivatives at them are similar, rather than if they are close in the input space. This can greatly improve kernel machines’ ability to approximate highly variable functions (Bengio et al., 2005). It also means that points that are far in Euclidean space can be close in gradient space, potentially improving the ability to model complex functions. (For example, the maxima of a sine wave are all close in gradient space, even though they can be arbitrarily far apart in the input space.)