Every Model Learned by Gradient Descent Is Approximately a Kernel Machine

(arxiv.org)

406 points · scottlocklin · 5y ago · 107 comments

86 comments · 27 top-level

GlenTheMachine5y ago· 13 in thread

I'm confused by findings like this one, and I'm hoping someone here can educate me.

There are many known universal approximators. Deep networks are one. SVMs are one. Heck, cubic splines are one, and they've been in use for nearly a hundred years IIRC.

The problem has never been one of finding a sufficiently powerful approximator. It has been training that approximator. My understanding of the significant advancement made by deep learning is that we finally figured out how to train a specific kind of universal approximator in a way such that it finds very good separation surfaces for what used to be impossible-to-solve classification problems.

But it should be no surprise to anyone that there exist, in theory, other universal approximators that approximately reproduce the same separation surfaces, should it? I'd expect any universal approximator to be powerful enough to reproduce the separation surfaces, hence the meaning of the word "universal". The problem was always finding the right weights, not finding the right approximator architecture.

Am I missing something?

talolard5y ago

50/50 I can shed some light or am way over my head:

The point of the paper is that if you train a deep learning model with gradient descent then the resulting model is effectively a kernel machine, regardless of model architecture.

The nice thing about a kernel machine is that it is simple (just one hidden layer), so we are able to analyze it more effectively and conveniently.

So, I think the contribution here isn't "these sets of universal approximators are equivalent" but rather "We have this effective but opaque deep learning thing; it turns out it's actually a kernel machine in retrospect, so we can bring 'kernel tooling' to analyze the deep learning model".

nine_k5y ago

Does this essentially mean that any multi-layer RNN can be reasonably approximated by a 1-layer network (something like a perceptron) for the "playback" purposes, that is, for recognition / transformation, not learning?

This may have colossal practical implications, as long as the approximation stays good enough.

pbhjpbhj5y ago

Does that mean in theory we can uncover an underlying model, a theorem effectively, that the model is effectively approximating and so remove some of the uncertainty?

make35y ago

An issue with deep learning is that it is very hard to analyse from a theoretical mathematical perspective, and to prove things about.

Kernels have been studied thoroughly from a theoretical perspective and people have proven things about them.

The goal of papers such as these is to find ways in which deep neural networks and kernel methods are similar, so that theoretical results and tools found for kernel methods may be adapted to deep neural networks.

sebasv_5y ago

Finding the right architecture, or more in general the right model, is very much still the main problem.

You should be careful with the meaning you ascribe to the word 'universal'. The list of universal approximators is massive, and the sub-list of universal approximators that can be trained with OLS is still substantial. Still these models can differ significantly:

- How efficient are they (in #parameters required for a certain error) for specific tasks? There is a known 'maximum efficiency' for general tasks, but in high dimensions this efficiency is terrible, such that many models will fail terribly on high-dimensional data. Hence, you should pick a model that is exceptionally good for a specific task, although it might be less efficient for other tasks.

- How well can the model cope with noise? If your dependent variable is severely distorted (think financial data) then you need a model that can balance between interpolating datapoints and averaging out the noise.

Just to name my two favorite properties. The first one is _kind of_ related to learnability, since an inefficient model is often pretty much impossible to learn.

xksteven5y ago

My understanding of the read was to show "how" they're equivalent as opposed to how to actually construct such an approximator or learn it.

Similar to showing a problem falls in NP, you can reduce the problem down to another problem in NP and be done with it.

sdenton45y ago

Agree, but also think the result may be too general to be useful. Proving that you can rewrite any network learned with gradient descent this way kinda suggests that the architecture doesn't matter, but we know that's not true. Eg, why are networks with skip connections SO much better than networks without? What about batch normalization? This makes me suspicious that it's a nice theoretical result a bit too high level to be useful. Yes, it was proved years ago that you can train an arbitrary function with a wide enough two-layer net, but it's not a terribly practical way to approach the world. Now we have architectures much better than two-layer networks, and, for that matter, SVMs.

There's a number of problems with svms; complexity for training and inference scales with the amount of training data, which is pretty sad panda for complex problems.

Extremely spicy/cynical take: it's not cool to say "you all should go look at all these possible applications" when the thrust of the paper is to prop up the relevance of an obsolete approach. You gotta do the actual work to close the gap if you still want your PhD to be worth something...

That said, I haven't read the paper terribly closely, and am always happy to be proven wrong!

nightski5y ago

I'd be curious if re-framing a trained neural network model as an SVM gives you insight into its support vectors and maybe a little understanding of why the NN works the way it does?

runT1ME5y ago

>suggests that the architecture doesn't matter, but we know that's not true. Eg, why are networks with skip connections SO much better than networks without? What about batch normalization?

Is this true though, or does network architecture only matter in terms of efficiency? This is non-rhetorical, I really don't know much about deep learning. :) I guess I'm asking: with enough data and compute, is architecture still relevant?

mycall5y ago

> why are networks with skip connections SO much better than networks without?

What are the leading theories for why this seems to be the case? Less nodes to capture and direct decisions?

riku_iki5y ago

> It has been training that approximator.

It is also about the representation of the approximator itself, and data compression. For cubic splines to approximate some NN, you would likely need an enormous number of intervals covering the input space.

VHRanger5y ago

Also, cubic splines don't extrapolate like the other mentioned models

natn5y ago

An even simpler universal approximator that the authors overlooked for some reason is just to use a very large hash map. There are some practical issues around generalization and storage but it has very predictable precision.

screye5y ago· 10 in thread

I do have a tangential technical question for someone who knows more math than I do.

A kernel SVM always finds (one-of) the global best fit lines in the kernel space.

A gradient descent model explicitly converges to one of the nearest local minimas by definition.

Does this paper conclude that the local minimas that neural networks converge to are one of the many equivalent global maxima ? Won't this be a major revelation by itself ?

dumb12245y ago

I don't have a strong math background but I think during the optimisation process of finding the hyperplane, the solver (the algorithm that attempts to find the best separating hyperplane) uses a soft margin to allow misclassified instances. Its tolerance is controlled by a hyper-parameter, so it will compromise to find the best fit within the set parameter. So it is a 'best solution' with a condition. However, there are many variations of implementations from different solvers to handle it.

Example: https://towardsdatascience.com/support-vector-machine-simply...

andreareina5y ago

Isn't one of the features of high-dimensional spaces that local minima are rare and there's usually some direction that slopes towards a lower loss?

dragontamer5y ago

I mean, obviously not in the general case. Cryptography is designed to be a high dimensional space with pretty much no slope.

I'd assume that a lot of the binary decision tree / chip level optimizations are similar: almost no slope worth analyzing.

sp3325y ago

Sure, and unbreakable crypto is notoriously difficult to make. I wouldn't expect a situation like that to come up in a real-world problem.

wxnx5y ago

In the context of this paper, you can think of the "gradient descent step" as optimizing the parameters of the kernel (i.e. the parameters that generate the kernel space). There are no explicit optimality guarantees beyond those of standard gradient descent.

The "SVM step" would still find a global optimum within the kernel space, but the qualifications of the previous step mean that the kernel space generated might be useless.

altarius5y ago

I don't think the output of the conversion is guaranteed to be equivalent to a hyperplane learned by an SVM.

I didn't have time to read the paper, but reading the abstract I don't see a claim that the gradient descent model approximated by a kernel machine is equivalent to an optimal fit obtained by SVM maximum margin hyperplane fitting.

I assume one likely ends up with different hyperplane fits from converting a NN/gradient-desc-learned model to kernel machine vs learning a kernel machine directly via SVM learning.

edjrage5y ago

Nitpick: Minima is already plural.

kaczordon5y ago

Doesn’t gradient descent use a convex cost function so that it always generates a global minimum?

wxnx5y ago

No, not necessarily. The objective functions used to train neural networks are generally non-convex (the nets themselves being non-convex as well), but are traditionally trained using stochastic gradient descent (and its variants).
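A minimal sketch of the non-convexity point above, using a made-up one-dimensional objective (not anything from the paper): plain gradient descent ends up in different minima depending on where it starts, and only one of them is the global minimum.

```python
def loss(w):
    """Toy non-convex objective with two minima (hypothetical example)."""
    return w ** 4 - 3.0 * w ** 2 + w

def loss_grad(w):
    """Derivative of the toy objective."""
    return 4.0 * w ** 3 - 6.0 * w + 1.0

def gradient_descent(w, lr=0.01, steps=500):
    """Plain (non-stochastic) gradient descent from starting point w."""
    for _ in range(steps):
        w -= lr * loss_grad(w)
    return w

from_left = gradient_descent(-2.0)   # settles in the deeper minimum (w < 0)
from_right = gradient_descent(2.0)   # settles in the shallower minimum (w > 0)

print(from_left, loss(from_left))
print(from_right, loss(from_right))
```

Both runs stop at a point with zero gradient, but the run started from the right has a strictly higher loss than the run started from the left.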

mrtranscendence5y ago

This really threw me off when I first started learning about ANNs, coming from a traditional econometrics background. B-but ... the parameters aren’t identified!

burlesona5y ago· 10 in thread

This sounds interesting, but a little over my head. Can anyone offer an explanation for a software engineer with no AI/ML background?

hansvm5y ago

Part of the reason this is interesting is that deep learning has to have some kind of inductive bias to perform as well as we've seen (given a learning problem, it's biased toward learning in particular ways). In general though, a neural network can approximate _any_ function, so reining in that complexity and uncovering which functions are actually learnable (or efficiently learnable) by deep learning is an important research direction. This paper says that the functions uncovered by deep learning (with caveats) are precisely those which are close to functions represented by a different learning technique, which is notable because this new class of functions is not "all functions" and because the characterization is explainable in some sense, giving insight into how deep learning works. That connection also winds up having a bunch of other interesting implications that somebody else can cover if they'd like.

peteretep5y ago

“You don’t need a fancy model, you can just find a similar example directly from the training data and get similar results”

ironSkillet5y ago

If you look at how they actually "translate" the fancy model to the simple one, it requires fully fitting the original model (and keeping track of the evolution of gradients over the training). So it wouldn't make training more efficient, but perhaps it would be useful in inference or probing the characteristics of the original model.

lumost5y ago

This has always anecdotally appeared to be the case when investigating the predictions of neural nets. Particularly when it comes time to answer the question “what does this model not handle”

smallnamespace5y ago

Defining ‘similar’ robustly is the meat of the problem, and what we’re finding deep NNs to do well.

moralestapia5y ago

Not an explanation, but a benefit could be that SVMs can be evaluated much faster and are more explainable (* Citation needed, I know).

eugenhotaj5y ago

Don’t kernel SVMs need a full pass through the data they were trained on to make predictions? How is that faster?

cscheid5y ago

No, they require a full pass over the support vectors, which are potentially a much smaller set. (That’s part of why everyone was so excited about SVMs when they were invented) The support vectors are the training values with nonzero hinge loss, or alternatively, training values sufficiently close to the decision boundary.
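A small sketch of what "a full pass over the support vectors" looks like at prediction time. The support vectors, coefficients, and bias below are hand-picked, hypothetical values rather than the output of any solver, and the RBF kernel is just one common choice.

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel between two points given as lists of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def svm_decision(x, support_vectors, dual_coefs, bias, gamma=1.0):
    """Decision value f(x) = sum_i coef_i * K(x, sv_i) + bias.

    Only the support vectors appear in the sum; training points whose
    dual coefficient is zero contribute nothing and need not be stored.
    """
    return sum(c * rbf_kernel(x, sv, gamma)
               for sv, c in zip(support_vectors, dual_coefs)) + bias

# Hypothetical values chosen by hand for illustration.
support_vectors = [[0.0, 0.0], [2.0, 2.0]]
dual_coefs = [-1.0, 1.0]   # (dual weight * label) for each support vector
bias = 0.0

near_negative = svm_decision([0.1, 0.0], support_vectors, dual_coefs, bias)
near_positive = svm_decision([1.9, 2.0], support_vectors, dual_coefs, bias)
print(near_negative, near_positive)
```

The cost of one prediction is proportional to the number of support vectors, which is why it matters whether that set is small.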

alexilliamson5y ago

You only need the "Support Vectors" to make predictions, not the whole dataset.

somurzakov5y ago

Neural nets at the same time require multiple passes through the data (epochs). If we can train a model in one epoch instead of 10000 epochs, that's a breakthrough!

Straw5y ago· 7 in thread

The kernel they find is a function of the gradient descent path, which is a function of the data. So no, it's nothing at all like a normal kernel machine, where we pick the kernel before seeing the data.

It also only applies to the continuous limit of non-stochastic GD, far from the real training methods used.

We don't gain any understanding either; understanding implies predictive power about some new situation, and I don't see any- and nor does the paper suggest them.

Looks like yet another attempt to attract attention by "understanding" NNs. Look, humans can't explain or understand how we drive, speak, translate, play chess, etc, so why should we expect to understand how models that do these work? Of course, we can understand the principles of the training process, and in fact we already do- the theory of SGD is well understood.

scalablenotions5y ago

> humans can't explain or understand how we drive, speak, translate, play chess, etc, so why should we expect to understand how models that do these work?

This implies there's no point in pursuing explainability, but many domains involve inferences where the significant predictors are much easier to abstract at a useful level.

For example, if a DLNN could make suggestions as to how to tune a greenhouse given certain yield objectives, then it's reasonable to pursue heuristic techniques aiming to explain what about the parameters most significantly led to the given suggestions.

xiphias25y ago

,,Look, humans can't explain or understand how we drive, speak, translate, play chess, etc, so why should we expect to understand how models that do these work?''

I agree with you, but also it's amazing how much deepmind has achieved by putting neuroscientists and machine learning experts in the same room, and trying to make systems that work inside the human brain work efficiently on metal.

If you look at this talk from 2010, Demis was already listing attention as an example (which was responsible for the recent improvement in protein folding prediction):

https://www.youtube.com/watch?v=F5PSyu7booU

Isinlor5y ago

As far as I'm aware, attention does not even attempt biological plausibility, nor was it in any way inspired by biology. The issues attention addresses are very specific to the sequential nature of so-called Recurrent Neural Networks. The first issue is known as exploding / vanishing gradients: basically, as you keep multiplying some vector with matrices you will either explode that vector to infinity or squeeze it to zero, and the same happens with derivatives. The second issue is that you cannot parallelize a sequential operation. Attention addresses these issues by removing recurrence and using a specific invented mathematical structure. There was no name for it, but attention gives good intuition for what that mathematical structure is trying to do. Kind of like quantum chromodynamics uses the term "colors" in a way that has nothing to do with light, photons or even the electromagnetic force.
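The "keep multiplying some vector with matrices" point can be sketched with a toy example. The matrix here is a scaled 90-degree rotation chosen purely for illustration (an assumption, not taken from any real RNN), so each multiplication changes the vector's length by exactly the scale factor.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]

def norm(v):
    """Euclidean length of a vector."""
    return sum(x * x for x in v) ** 0.5

def repeated_product_norm(scale, steps=50):
    """Length of a unit vector after `steps` multiplications by a scaled rotation."""
    m = [[0.0, -scale], [scale, 0.0]]   # rotation by 90 degrees, times `scale`
    v = [1.0, 0.0]
    for _ in range(steps):
        v = matvec(m, v)
    return norm(v)

shrinking = repeated_product_norm(0.9)   # length behaves like 0.9 ** 50
growing = repeated_product_norm(1.1)     # length behaves like 1.1 ** 50
print(shrinking, growing)
```

A factor only 10% below or above 1 is enough to shrink the vector by more than two orders of magnitude, or grow it by more than two, over 50 steps.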

whymauri5y ago

>As far as I'm aware, attention does not even attempt biological plausibility, nor was it in any way inspired by biology.

It may not have been the intention, but associative memory is one of the only mechanisms that computational neuroscientists can agree on broadly. There's been recent work on energy-based models that suggests biologically plausible methods adjacent to attention. [0]

[0] https://arxiv.org/abs/2008.06996

Straw5y ago

Absolutely, modern NN architectures have been inspired by biological ones- despite their massive differences.

Even in cases like attention, the modern version (that actually works in GPT-3, AlphaFold2, etc.) has little in common with both the English word and what we think of as attention. It's a formula with two matmuls and a softmax: softmax(AB)C. In particular, it doesn't necessarily look anywhere at all; it's just a weighted sum of the inputs. Nothing like the hard attention used by the human visual cortex. It's not even that different from a convolution where you allow the weights to be a function of the input.

So the inspiration might have come from humans, but the actual architectures have largely come from pure trial and error, with limited, difficult to explain intuition on what tends to work.
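A bare-bones sketch of the softmax(AB)C formula mentioned above, in plain Python. The names `queries`, `keys`, and `values` are the conventional ones and are an assumption here (the comment only writes A, B, C); the usual scaling by the square root of the dimension and any learned projections are left out.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def attention(queries, keys, values):
    """softmax(Q K^T) V: two matmuls and a row-wise softmax."""
    scores = matmul(queries, transpose(keys))
    weights = [softmax(row) for row in scores]
    return matmul(weights, values), weights

# Tiny made-up inputs: two positions, two dimensions.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]

output, weights = attention(queries, keys, values)
print(weights)
print(output)
```

Every weight is strictly positive, so each output row is a soft mixture of all value rows rather than a hard selection of one of them.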

xiphias25y ago

Actually self attention is a generalization of convolution:

https://openreview.net/pdf?id=HJlnC1rKPB

,,This work provides evidence that attention layers can perform convolution and, indeed, they often learn to do so in practice. Specifically, we prove that a multi-head self-attention layer with sufficient number of heads is at least as expressive as any convolutional layer. Our numerical experiments then show that self-attention layers attend to pixel-grid patterns similarly to CNN layers, corroborating our analysis.''

smallnamespace5y ago

How do we really know that brains use hard attention?

bicepjai5y ago· 4 in thread

Watch your stock NVIDIA :)

rlili5y ago

Why, aren't GPUs pretty useful in training SVMs as well?

xksteven5y ago

SVMs are solved via convex optimization methods which have taken more time to get on the GPU train.

On the other hand there are GPU accelerated SVM training such as: https://github.com/Xtra-Computing/thundersvm

A github or Google search will reveal other GPU accelerated SVM training.

knuthsat5y ago

SVMs can also be trained using gradient descent.
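One way to read this is subgradient descent on the L2-regularized hinge loss of a linear SVM. The sketch below assumes that formulation; the data, learning rate, and regularization strength are made up for illustration.

```python
def train_linear_svm(points, labels, lr=0.1, reg=0.01, epochs=200):
    """Full-batch subgradient descent on mean hinge loss + (reg/2)*||w||^2."""
    dim = len(points[0])
    n = len(points)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [reg * wj for wj in w]   # gradient of the regularizer
        grad_b = 0.0
        for x, y in zip(points, labels):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1.0:              # hinge loss is active for this point
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Predicted label in {-1, +1}."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Linearly separable toy data (hypothetical).
points = [[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
          [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]]
labels = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(points, labels)
print(w, b, [predict(w, b, x) for x in points])
```

This covers only the linear case; a kernel SVM trained this way would need the kernel expansion in place of the dot product.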

abeppu5y ago

Well, and though the author shows an approximate equivalence, which helps us understand a class of models better, it's not obvious that it's preferable to use SVMs of the type described. In particular, it seems like often it would be preferable to deal with model weights (even if they are mathematically a "superposition" of datapoints) than to ship around and revisit the whole dataset.

scottlocklinOP5y ago· 4 in thread

Looking at you, deep learning.

wongarsu5y ago

A linear SVM can in turn be expressed as a very shallow neural network. The main difference is that with SVMs you put all your effort into transforming inputs for the model (e.g. all the popular kernels) while with neural networks usually most of the effort goes into clever model architectures.

api5y ago

There is probably a ton of isomorphism between different models. It may come down to what is easiest to understand and fastest to implement in code.

segfaultbuserr5y ago

See also:

A visual proof that neural networks can approximate any function

https://news.ycombinator.com/item?id=19708620

api5y ago

So a "neural network" is actually a type of parameterized mathematical function that can be fit to any curve including higher dimensional surfaces, etc.?

cs7025y ago· 3 in thread

"Here we show that every model learned by this method [SGD], regardless of architecture, is approximately equivalent to a kernel machine [i.e., a support vector machine or SVM] with a particular type of kernel" -- a type of kernel which Domingos, the author, calls a "path kernel."

As defined in the paper, a "path kernel" measures, for any two data points, how similarly a model varies (specifically, how similarly the model's gradients change) at those two data points during training via SGD. This isn't exactly your usual, plain-vanilla, radial-basis type of kernel.

We've known for a long time that SVMs are universal approximators, i.e., in theory they can approximate any target function. The importance of this work is that it has found a new, surprising, deep connection between any model trained via SGD and SVMs, which are well understood :-)

nuclearnice15y ago

Great explanation of the intuitive understanding of the path kernel which seems to be the main takeaway from this paper.

One minor technical correction: the proof relies on the continuous model of gradient flow, not SGD. So it's proven for GD and likely true for SGD, and your intuitive explanation likely still holds, but it's not obvious.

cs7025y ago

You're absolutely right. I substituted SGD for GD without giving it any thought because everyone uses SGD!

QuesnayJr5y ago

They sketch the argument for SGD, but they don't know if it actually holds (see Remark 5 in the paper).

FRGabriel5y ago· 3 in thread

So in the end, it rephrases a statement from "Neural Tangent Kernel: Convergence and Generalization in Neural Networks" [https://arxiv.org/abs/1806.07572], and in a way which is kind of misleading.

The assertion is known by the community at least since 2018, if not even well before.

I find this article and the buzz around it a little awkward.

QuesnayJr5y ago

To be fair, the article you cite is from 2018, and the OP is from 2012 (if the arXiv dates are right).

ajtulloch5y ago

The OP article was submitted on Mon, 30 Nov 2020 23:02:47 - the arXiv identifier 2012.xyz means it was published in the 12th month of 2020, not 2012.
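The YYMM reading of new-style arXiv identifiers can be sketched as a small parser. `parse_arxiv_id` is a hypothetical helper written for this note; it assumes identifiers of the form YYMM.NNNN or YYMM.NNNNN (optionally followed by a version suffix) and maps the two-digit year into the 2000s. The second identifier below is made up.

```python
import re

def parse_arxiv_id(identifier):
    """Split a new-style arXiv identifier into (year, month, sequence number)."""
    match = re.fullmatch(r"(\d{2})(\d{2})\.(\d{4,5})(?:v\d+)?", identifier)
    if match is None:
        raise ValueError(f"not a new-style arXiv identifier: {identifier!r}")
    yy, mm, number = match.groups()
    month = int(mm)
    if not 1 <= month <= 12:
        raise ValueError(f"invalid month in identifier: {identifier!r}")
    return 2000 + int(yy), month, number

print(parse_arxiv_id("1806.07572"))   # the NTK paper linked in this thread
print(parse_arxiv_id("2012.01234"))   # made-up identifier in the same style as the OP's
```

So an identifier starting with 2012 denotes December 2020, not the year 2012.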

QuesnayJr5y ago

Oh God that's embarrassing. I really thought that's how arXiv identifiers worked, and have been under the wrong impression for some time.

ajtulloch5y ago· 1 in thread

For everyone saying "oh wow, we can go back to SVMs now, huge speedups, etc" - that's more than a little premature. This is purely a formal equivalence, but not very useful computationally.

The math here is pretty much first-year undergraduate level calculus, and it's worth going through section 2 since it's quite clearly written (thanks Prof Domingos).

Essentially, what the author does is show that any model trained with "infinitely-small step size full-batch gradient descent" (i.e. a model following a gradient flow) can be written in a "kernelized" form

   y(x) = \sum_{(x_i, y_i) \in L} a_i K(x, x_i) + b.

The intuition most people have for SVMs is that the constants a_i are, well, constant, that the a_i are sparse, and that the kernel function K(x, x_i) is cheap to compute (partly why it's called the 'kernel trick').

However, none of those properties apply here, which is why this isn't particularly exciting computationally. The "trick" is that both a_i and K(x, x_i) involve path integrals along the gradient flow for the input x, and so doing a single inference is approximately as expensive as training the model from scratch (if I understand this correctly).
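A toy numerical check of the kernelized form, under assumptions that are mine and not the paper's: a two-weight model y = w2 * tanh(w1 * x), squared loss, made-up data, and small discrete steps standing in for the continuous gradient flow. The integrals become running sums over training steps, b is the initial model's output, and each a_i is a kernel-weighted average of the negative loss derivative.

```python
import math

def model(w, x):
    """Tiny nonlinear model y = w[1] * tanh(w[0] * x) with scalar input x."""
    return w[1] * math.tanh(w[0] * x)

def grad_w(w, x):
    """Gradient of the model output with respect to the two weights."""
    t = math.tanh(w[0] * x)
    return [w[1] * (1.0 - t * t) * x, t]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy data (made up): positive inputs, targets sin(x).
xs = [0.5, 1.0, 1.5, 2.0]
ys = [math.sin(x) for x in xs]
query = 1.2

w = [0.5, 0.5]
step = 1e-3
initial_prediction = model(w, query)   # plays the role of b

path_kernel = [0.0] * len(xs)     # running sum approximating K(query, x_i)
weighted_loss = [0.0] * len(xs)   # running sum of -L'_i * tangent kernel

for _ in range(2000):
    g_query = grad_w(w, query)
    grads = [grad_w(w, x) for x in xs]
    loss_derivs = [model(w, x) - y for x, y in zip(xs, ys)]  # squared loss: L' = y - y*
    for i in range(len(xs)):
        k = dot(g_query, grads[i])    # tangent kernel at the current weights
        path_kernel[i] += step * k
        weighted_loss[i] -= step * loss_derivs[i] * k
    for j in range(2):                # full-batch gradient descent step
        w[j] -= step * sum(d * g[j] for d, g in zip(loss_derivs, grads))

# a_i depends on the query point, unlike the constant coefficients of an SVM.
a = [wl / pk for wl, pk in zip(weighted_loss, path_kernel)]
kernel_prediction = sum(ai * pk for ai, pk in zip(a, path_kernel)) + initial_prediction
final_prediction = model(w, query)
print(kernel_prediction, final_prediction)
```

Note that the sums for a single query point are accumulated across the entire training run, which is the cost concern raised above: evaluating the kernel form for a new x means replaying the path.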

blueblisters5y ago

> y(x) = \sum_{(x_i, y_i) \in L} a_i K(x, x_i) + b.

HN feature request: parse inline latex math please

how_strange5y ago· 1 in thread

If, as in this paper, we allow ourselves to set the kernel after seeing the data, then the statement in the title is trivial: if my learning algorithm outputs function f, I can take the kernel K(x,x')=f(x)*f(x').

The result is interesting insofar as the path kernel is interesting, which requires some more thought.
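The triviality argument can be made concrete. In this sketch `f` is a stand-in for whatever function a learner produced (a made-up formula here), and a single anchor point with f(anchor) != 0 is enough to rewrite it as a one-term "kernel machine".

```python
import math

def f(x):
    """Stand-in for any learned function (hypothetical example)."""
    return math.sin(3.0 * x) + 0.5 * x * x + 2.0

def trivial_kernel(x, z):
    """Rank-one kernel chosen after the fact: K(x, z) = f(x) * f(z)."""
    return f(x) * f(z)

anchor = 0.7              # one "training point"; requires f(anchor) != 0
coef = 1.0 / f(anchor)    # the single coefficient a_1

def kernel_machine(x):
    """f rewritten as a_1 * K(x, anchor)."""
    return coef * trivial_kernel(x, anchor)

for x in (-1.0, 0.0, 2.5):
    print(x, f(x), kernel_machine(x))
```

The reconstruction is exact up to floating-point rounding, which is why the interest of the paper's result rests on the specific structure of the path kernel rather than on the existence of some kernel.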

moultano5y ago

If I'm understanding correctly, it doesn't just set the kernel after seeing the data, but also after training the entire model, because the path kernel can't be defined without the optimization process to define the path.

I can't tell if this paper is a useful insight or not.

abeppu5y ago· 1 in thread

From a skim, I think one asterisk which may be a gap between the claim in the title and what's actually shown in the paper is that the theorem focuses on gradient descent-trained models which minimize a loss function which is the sum of loss L(y_i, y*_i) on points from a given dataset. While that's clearly very broad, I think it _doesn't_ include things like GANs, where parts of the model produce fake data to train against.

Flashtoo5y ago

The claim also applies to GANs as you can simply use a masking function to indicate which inputs were used for each optimization timestep, like the author suggests for stochastic gradient descent in remark 5.

hobofan5y ago· 1 in thread

Is this a new finding? I'm not an expert on the deeper mathematical side of ML, but I remember a friend of mine already telling me about something that sounded exactly like this (ca. 2016-17).

ironSkillet5y ago

The fact that kernel machines can approximate other models isn't new. I think the novel idea is that they have explicitly constructed the "translation" between the arbitrary model to the kernel machine, and it's quite clean.

stosto885y ago· 1 in thread

Pretty sure it means a complex algorithm can be approximated by a super simple model.

Iv5y ago

Thing is, gradient descent is not really a complex algorithm.

sarosh5y ago

See also W. Brendel and M. Bethge. "Approximating CNNs with bag-of-local-features models works surprisingly well on ImageNet." https://arxiv.org/abs/1904.00760 (which is referenced in this paper) and explains "[t]his suggests that the improvements of DNNs over previous bag-of-feature classifiers in the last few years is mostly achieved by better fine-tuning rather than by qualitatively different decision strategies."

edwardjhu5y ago

The claim here is a bit misleading, as already pointed out by other comments, since the kernel is an evolving one that is essentially learned after seeing the data.

Contrary to many related works that compare wide neural networks to kernel methods, our recent work shows that one can study a feature learning infinite-width limit with realistic learning rate.

https://arxiv.org/abs/2011.14522

We identified what separates the kernel regime (e.g., NTK) and the feature learning regime. In the infinite-width limit, OP's work could belong to either regime depending on the parametrization, i.e. the path kernel is either equal to the NTK or performing feature learning.

It's an incredibly interesting research topic. Please feel free to comment with thoughts on our work :)

penguintester5y ago

"Perhaps the most significant implication of our result for deep learning is that it casts doubt on the common view that it works by automatically discovering new representations of the data, in contrast with other machine learning methods, which rely on predefined features (Bengio et al., 2013). As it turns out, deep learning also relies on such features, namely the gradients of a predefined function, and uses them for prediction via dot products in feature space, like other kernel machines."

Huh. So the implication here is that a deep network can never generalize results to input classes that were not explicit in the training set? And would this result apply for networks trained with something other than gradient-based methods?

blackbear_5y ago

Interesting to compare this with [1], which shows that for some concept classes neural networks trained with SGD are much easier to train (i.e., require less data).

In other words, even though these two types of model are in a way equivalent, one can be much easier to train than the other for certain concepts (no free lunch).

[1] https://arxiv.org/abs/2001.04413

sgt1015y ago

Interestingly deep networks aren't ever really learned by gradient descent.

No! Really !

They are learned by interaction with a network selection process of varying degrees of hokeyness that involves decisions being taken by humans about the characteristics of the network, training data, training process and testing processes. At the end -> the model! Gradient descent is v.critical to this, but it's not the only thing going on by a long way.

Also dropout.

DoctorOetker5y ago

Given that the discussion here seems to stray into conventional kernels and SVMs instead of the generalized version dubbed "path kernel", it seems a lot of comments or observations here on vanilla SVMs may not be applicable to what's being described in the paper.

It would be neat if this paper was accompanied by minimalist code to demonstrate the approximate equivalence on some toy "deep" networks, so any misconceptions could be avoided in the peanut gallery here (like every inference / evaluation requiring to integrate the path kernel from scratch, which is clearly not proposed, it just describes how training the original deep network is in some sense equivalent to integrating the path kernel)

rahimiali5y ago

The observation is that every deep net f(x) trained on a dataset of (xi,yi) pairs using descent can be written as f(x)=sum_i a_i K(x,xi) + b, which looks like a kernel machine. But in fact a and b depend on x, and K depends on the entire dataset. So the paper is in fact saying f(x)=sum_i a_i(x) K(x, x1,...,xn, y1,...,yn, xi) + b(x). If you were thinking “my deep net is just a kernel SVM with flavor,” you’d need a LOT of flavor for the equivalence to hold.

api5y ago

Another thought: are neural nets just a weird analog / floating point kind of probabilistic data structure or lossy compression algorithm?

leonry5y ago

Since it hasn't been mentioned, Stan Kriventsov has very nicely summed up the paper on his blog: https://www.dl.reviews/2020/12/14/neural-networks-kernel-mac...

pietroppeter5y ago

It is worth noting that the author also wrote a very nice book popularizing machine learning, “The Master Algorithm”.

https://homes.cs.washington.edu/~pedrod/

YeGoblynQueenne5y ago

>> If gradient descent is limited in its ability to learn representations, better methods for this purpose are a key research direction. Current nonlinear alternatives include predicate invention (Muggleton and Buntine, 1988) and latent variable discovery in graphical models (Elidan et al., 2000).

Hah! Fancy seeing that here! Predicate invention is a main line of my PhD research.

Briefly, predicate invention is the ability of Inductive Logic Programming (ILP) systems to learn their own inductive bias. It is in a sense similar to feature learning or learning-to-learn. ILP systems learn logic programs from examples usually by searching the space of programs defined by a set of sub-programs, called the background knowledge (BK), and a language bias that determines the structure of learned programs. Predicate invention then means learning new BK and language bias to change the program search space while searching it.

The reference in Domingos's article is the first description of the concept, which was for a long time more theoretical than practical: ILP approaches could only perform a limited form of predicate invention, e.g. could only invent BK programs of fixed structure or could not invent recursive programs etc. Things changed in 2013 with a new approach, Meta-Interpretive Learning (MIL). MIL systems are for the first time capable of unconstrained predicate invention, including the invention of mutually recursive programs. Full disclosure: my PhD research is on MIL.

Here are some more recent references on predicate invention in MIL:

Meta-interpretive learning of higher-order dyadic datalog: Predicate invention revisited (IJCAI 2013):

https://www.ijcai.org/Proceedings/13/Papers/231.pdf

Bias reformulation for one-shot function induction (ECAI 2014):

http://www.doc.ic.ac.uk/~shm/Papers/metabias.pdf

Logical minimisation of meta-rules within meta-interpretive learning (ICLP 2015):

https://www.doc.ic.ac.uk/~shm/Papers/minmeta.pdf

Like I say, this is my field of study and, as you can probably tell, I'm very excited about it, so I'm happy to answer questions. My email is in my profile.

nutanc5y ago

This can have very good real world consequences. If this is true, then it makes sense to first attack a problem with SVMs and then if the results are encouraging, try to go the deep learning route if needed.

6gvONxR4sf7o5y ago

This is a really cool new bit of intuition:

> A key property of path kernels is that they combat the curse of dimensionality by incorporating derivatives into the kernel: two data points are similar if the candidate function’s derivatives at them are similar, rather than if they are close in the input space. This can greatly improve kernel machines’ ability to approximate highly variable functions (Bengio et al., 2005). It also means that points that are far in Euclidean space can be close in gradient space, potentially improving the ability to model complex functions. (For example, the maxima of a sine wave are all close in gradient space, even though they can be arbitrarily far apart in the input space.)
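To make the sine-wave remark concrete, here is a minimal sketch. It assumes a toy model f_w(x) = w0 * sin(w1 * x + w2) with hypothetical weights, and it only evaluates the tangent kernel (the dot product of weight gradients) at a single weight setting; the paper's path kernel integrates that quantity along the whole training path, which this sketch does not attempt.

```python
import math

def grad_w(x, w):
    """Gradient of f_w(x) = w0 * sin(w1 * x + w2) with respect to (w0, w1, w2)."""
    w0, w1, w2 = w
    phase = w1 * x + w2
    return (math.sin(phase), w0 * x * math.cos(phase), w0 * math.cos(phase))

def tangent_kernel(x1, x2, w):
    """Dot product of the weight gradients at x1 and x2 (one point on the path)."""
    return sum(a * b for a, b in zip(grad_w(x1, w), grad_w(x2, w)))

# Hypothetical weights chosen so that f_w(x) = sin(x).
w = (1.0, 1.0, 0.0)

peak_a = math.pi / 2                   # a maximum of sin(x)
peak_b = math.pi / 2 + 20 * math.pi    # another maximum, ten periods away
zero = 0.0                             # a zero crossing, close to peak_a in input space

k_peaks = tangent_kernel(peak_a, peak_b, w)
k_peak_zero = tangent_kernel(peak_a, zero, w)

print("kernel between two distant maxima:", k_peaks)
print("kernel between a maximum and a nearby zero crossing:", k_peak_zero)
```

Under these assumptions the two maxima get a kernel value close to 1 even though they are far apart in input space, while the maximum and the much closer zero crossing get a value close to 0.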

mrfusion5y ago

Does this include transformers?

GlenTheMachine5y ago· 13 in thread

I'm confused by findings like this one, and I'm hoping someone here can educate me.

There are many known universal approximators. Deep networks are one. SVMs are one. Heck, cubic splines are one, and they've been in use for nearly a hundred years IIRC.

Am I missing something?

talolard5y ago

50/50 I can shed some light or am way over my head:

The point of the paper is that if you train a deep learning model with gradient descent then the resulting model is effectively a kernel machine, regardless of model architecture.

The nice thing about a kernel machine is that it is simple (just one hidden layer), so we are able to analyze it more effectively and conveniently.
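For reference, the form being discussed can be written out as follows. This is my paraphrase of the paper's notation, so treat the symbols as assumptions: m training points x_i, coefficients a_i, offset b, an optional output nonlinearity g, model f_w, and c(t) the path the weights take during gradient descent.

```latex
% Generic kernel machine: a weighted sum of kernel evaluations against the training points x_i
y = g\Big( \sum_{i=1}^{m} a_i \, K(x, x_i) + b \Big)

% Tangent kernel at weights w: similarity of the model's weight gradients at x and x'
K^{g}_{f,w}(x, x') = \nabla_w f_w(x) \cdot \nabla_w f_w(x')

% Path kernel: the tangent kernel integrated along the training path c(t)
K^{p}_{f,c}(x, x') = \int_{c(t)} K^{g}_{f,w(t)}(x, x') \, dt
```

As far as I can tell, the claim is that the trained model takes the first form with K being the path kernel. One caveat, if I recall the paper correctly, is that the a_i there depend on the query point x, unlike in a standard SVM.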

nine_k5y ago

This may have colossal practical implications, as long as the approximation stays good enough.

pbhjpbhj5y ago

Does that mean in theory we can uncover an underlying model, a theorem effectively, that the model is effectively approximating and so remove some of the uncertainty?

make35y ago

An issue with deep learning models is that they are very hard to analyse from a theoretical, mathematical perspective, i.e. to prove things about them.

Kernels have been studied thoroughly from a theoretical perspective and people have proven things about them.

sebasv_5y ago

Finding the right architecture, or more generally the right model, is very much still the main problem.

Just to name my two favorite properties. The first one is _kind of_ related to learnability, since an inefficient model is often pretty much impossible to learn.

xksteven5y ago

My understanding of the read was to show "how" they're equivalent as opposed to how to actually construct such an approximator or learn it.

Similar to showing a problem falls in NP, you can reduce the problem down to another problem in NP and be done with it.

sdenton45y ago

There are a number of problems with SVMs: complexity for training and inference scales with the amount of training data, which is pretty sad panda for complex problems.

That said, I haven't read the paper terribly closely, and am always happy to be proven wrong!

nightski5y ago

I'd be curious whether re-framing a trained neural network model as an SVM gives you insight into its support vectors, and maybe a little understanding of why the NN works the way it does?

runT1ME5y ago

>suggests that the architecture doesn't matter, but we know that's not true. Eg, why are networks with skip connections SO much better than networks without? What about batch normalization?

mycall5y ago

> why are networks with skip connections SO much better than networks without?

What are the leading theories for why this seems to be the case? Fewer nodes to capture and direct decisions?

riku_iki5y ago

> It has been training that approximator.

It is also about the representation of the approximator itself, and data compression. For cubic splines to approximate some NN you would likely need an enormous number of intervals covering the input space.
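A back-of-the-envelope sketch of why the interval count blows up, assuming a regular grid with the same number of intervals along every input dimension (the function name and the choice of 10 intervals are just for illustration):

```python
def grid_cells(intervals_per_dim: int, dims: int) -> int:
    """Number of cells in a regular grid with the same number of
    intervals along every input dimension."""
    return intervals_per_dim ** dims

# With only 10 intervals per dimension, the cell count grows
# exponentially with the input dimension.
for dims in (1, 2, 10, 100):
    print(dims, grid_cells(10, dims))
```

A spline fit on such a grid needs parameters for every cell, so the count quickly becomes impractical for inputs like images.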

VHRanger5y ago

Also, cubic splines don't extrapolate like the other mentioned models

natn5y ago

screye5y ago· 10 in thread

I do have a tangential technical question for someone who knows more math than I do.

A kernel SVM always finds one of the globally optimal fits in the kernel space.

A gradient descent model explicitly converges to one of the nearest local minimas by definition.

Does this paper conclude that the local minimas that neural networks converge to are among the many equivalent global minima? Wouldn't this be a major revelation by itself?
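The local-versus-global distinction in the question can be seen on a toy problem. This is a minimal sketch, assuming a hypothetical one-dimensional non-convex loss f(x) = x^4 - 3x^2 + x and plain gradient descent with a fixed step size; it says nothing about what happens in real high-dimensional networks.

```python
def loss(x):
    """Hypothetical non-convex loss with two separate minima."""
    return x ** 4 - 3 * x ** 2 + x

def grad(x):
    """Derivative of the loss above."""
    return 4 * x ** 3 - 6 * x + 1

def gradient_descent(x0, lr=0.01, steps=2000):
    """Plain (non-stochastic) gradient descent from starting point x0."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Two different initialisations end up in two different minima.
x_left = gradient_descent(-2.0)
x_right = gradient_descent(2.0)

print("start -2.0 ->", x_left, "loss", loss(x_left))
print("start +2.0 ->", x_right, "loss", loss(x_right))
```

On this loss the run started on the left reaches the lower of the two minima, while the run started on the right gets stuck in the higher one, which is the behaviour a convex SVM objective rules out.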

dumb12245y ago

Example: https://towardsdatascience.com/support-vector-machine-simply...

andreareina5y ago

Isn't one of the features of high-dimensional spaces that local minima are rare and there's usually some direction that slopes towards a lower loss?

dragontamer5y ago

I mean, obviously not in the general case. Cryptography is designed to be a high dimensional space with pretty much no slope.

I'd assume that a lot of the binary decision tree / chip level optimizations are similar: almost no slope worth analyzing.

sp3325y ago

Sure, and unbreakable crypto is notoriously difficult to make. I wouldn't expect a situation like that to come up in a real-world problem.

wxnx5y ago

The "SVM step" would still find a global optimum within the kernel space, but the qualifications of the previous step mean that the kernel space generated might be useless.

altarius5y ago

I don't think the output of the conversion is guaranteed to be equivalent to a hyperplane learned by an SVM.

I assume one likely ends up with different hyperplane fits from converting a NN/gradient-desc-learned model to kernel machine vs learning a kernel machine directly via SVM learning.

edjrage5y ago

Nitpick: Minima is already plural.

kaczordon5y ago

Doesn’t gradient descent use a convex cost function so that it always generates a global minimum?

wxnx5y ago

mrtranscendence5y ago

This really threw me off when I first started learning about ANNs, coming from a traditional econometrics background. B-but ... the parameters aren’t identified!

burlesona5y ago· 10 in thread

This sounds interesting, but a little over my head. Can anyone offer an explanation for a software engineer with no AI/ML background?

hansvm5y ago

peteretep5y ago

“You don’t need a fancy model, you can just find a similar example directly from the training data and get similar results”

ironSkillet5y ago

lumost5y ago

This has always anecdotally appeared to be the case when investigating the predictions of neural nets. Particularly when it comes time to answer the question “what does this model not handle”

smallnamespace5y ago

Defining ‘similar’ robustly is the meat of the problem, and what we’re finding deep NNs to do well.

moralestapia5y ago

Not an explanation, but a benefit could be that SVMs can be evaluated much faster and are more explainable (* Citation needed, I know).

eugenhotaj5y ago

Don’t kernel SVMs need a full pass through the data they were trained on to make predictions? How is that faster?

cscheid5y ago

alexilliamson5y ago

You only need the "Support Vectors" to make predictions, not the whole dataset.
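A minimal sketch of what that looks like at prediction time, assuming an RBF kernel and made-up support vectors, coefficients and bias (none of these values come from a trained model or from the paper):

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel between two points given as tuples."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def decision_function(x, support_vectors, dual_coefs, bias, gamma=1.0):
    """Kernel machine score: a weighted sum of kernel evaluations
    against the support vectors only, plus a bias."""
    return sum(
        coef * rbf_kernel(x, sv, gamma)
        for sv, coef in zip(support_vectors, dual_coefs)
    ) + bias

# Hypothetical "trained" model: two support vectors with opposite signs.
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
dual_coefs = [1.0, -1.0]
bias = 0.0

score_near_first = decision_function((0.1, 0.0), support_vectors, dual_coefs, bias)
score_near_second = decision_function((1.9, 2.0), support_vectors, dual_coefs, bias)

print(score_near_first, score_near_second)
```

The cost of one prediction is one kernel evaluation per support vector, so it depends on how many training points end up as support vectors rather than on the size of the full dataset.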

somurzakov5y ago

Neural nets, on the other hand, require multiple passes through the data (epochs). If we can train a model in one epoch instead of 10000 epochs, that's a breakthrough!

Straw5y ago· 7 in thread

It also only applies to the continuous limit of non-stochastic GD, far from the real training methods used.

We don't gain any understanding either; understanding implies predictive power about some new situation, and I don't see any, nor does the paper suggest any.

scalablenotions5y ago

> humans can't explain or understand how we drive, speak, translate, play chess, etc, so why should we expect to understand how models that do these work?

This implies there's no point in pursuing explainability, but many domains involve inferences where the significant predictors are much easier to abstract at a useful level.

xiphias25y ago

“Look, humans can't explain or understand how we drive, speak, translate, play chess, etc, so why should we expect to understand how models that do these work?”

If you look at this talk from 2010, Demis was already listing attention as an example (which was responsible for the recent improvement in protein folding prediction):

https://www.youtube.com/watch?v=F5PSyu7booU

Isinlor5y ago

whymauri5y ago

>As far as I'm aware, attention does not even attempt biological plausibility, nor was it in any way inspired by biology.

[0] https://arxiv.org/abs/2008.06996

Straw5y ago

Absolutely, modern NN architectures have been inspired by biological ones- despite their massive differences.

So the inspiration might have come from humans, but the actual architectures have largely come from pure trial and error, with limited, difficult to explain intuition on what tends to work.

xiphias25y ago

Actually self attention is a generalization of convolution:

https://openreview.net/pdf?id=HJlnC1rKPB

smallnamespace5y ago

How do we really know that brains use hard attention?

bicepjai5y ago· 4 in thread

Watch your stock NVIDIA :)

rlili5y ago

Why, aren't GPUs pretty useful in training SVMs as well?

xksteven5y ago

SVMs are solved via convex optimization methods which have taken more time to get on the GPU train.