A Quick Look at Support Vector Machines

(generalabstractnonsense.com)

207 points · irpapakons · 9y ago · 37 comments

syntaxing · 9y ago · 14 in thread

One thing I discovered recently that surprised me (while taking the Udacity SDC) is how effective and resilient these "older" ML algorithms can be. Neural networks were always my go-to method for most of the classification or regression problems in my small side projects. But now I have learned that with the minimal dataset I have (<5K samples), linear regression, SVMs, or decision trees are the way to go. I got higher accuracy, and it's about 10X faster in terms of computational time!

akyu · 9y ago

Yes, SVMs are still great models. The advantage neural nets have over them is that they can do automatic feature extraction. By the time you get to the last layer of a neural net, you are basically just doing a simple logistic classification, but the features coming in have been learned from all of the previous layers.

I've even seen people use pretrained ImageNet classifiers, chop off the last layer and use an SVM as the actual classifier, and it works very well for some problems.

bunderbunder · 9y ago

> they can do automatic feature extraction

Sort of. They can readily do really basic feature engineering along the lines of doing nonlinear transformations of the input. With a bit more doing, they can do spatial feature engineering (e.g., convolutional nets), and with a bit more foresight and planning they can learn the kinds of complex "hidden Markov process" style features you typically use in natural language processing.

But, as far as I'm aware, anyway, they can't necessarily do a great job with things like irregular time series (which is a huge chunk of big data), so you're still stuck doing some of that basic feature engineering. And I hesitate to say that some of the fancier architectures like LSTMs can be characterized as a turnkey solution for feature engineering, considering how much thought and effort and pre-existing knowledge and theory about what the engineered features should look like in the first place needed to go into designing them. So I feel like the "they can learn their own features" thing is a bit overhyped.

platz · 9y ago

> automatic feature extraction

Hope you have a ton of data, otherwise it's not gonna happen.

inlineint · 9y ago

On the other hand SVM doesn't scale as well as neural networks do because it has computational complexity between O(n^2) and O(n^3) [1] where n is the number of samples in the training set. So if you plan to add more data later you may eventually encounter scaling problems with SVM.

[1] http://scikit-learn.org/stable/modules/svm.html#complexity
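
To put the O(n^2) to O(n^3) range in perspective, here is a rough back-of-the-envelope sketch. It assumes training cost is exactly proportional to n^p and ignores constants, feature count, and solver details, so treat the numbers as illustrative only. The sample sizes and the `cost_ratio` helper are hypothetical.

```python
def cost_ratio(n_old, n_new, power):
    """Relative training cost if cost grows like n ** power (an assumption)."""
    return (n_new / n_old) ** power


# Hypothetical growth: from 5,000 to 50,000 training samples.
n_old, n_new = 5_000, 50_000

best_case = cost_ratio(n_old, n_new, 2)   # quadratic end of the range
worst_case = cost_ratio(n_old, n_new, 3)  # cubic end of the range

print(f"10x more data -> roughly {best_case:.0f}x to {worst_case:.0f}x more training time")
```

Under these assumptions, ten times more data costs somewhere between a hundred and a thousand times more training time, which is why a model that trains in seconds today can become impractical later.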

samuell · 9y ago

We found great success with the LIBLINEAR SVM implementation [1], though: it is extremely fast, to the point that it affects scalability too, with predictive performance acceptably close to libSVM with the RBF kernel, for a large cheminformatics dataset:

Paper (open access): http://dx.doi.org/10.1186/s13321-016-0151-5

As can be seen in fig 5 [2] in the paper, a dataset size that took ~1 week with libSVM (actually, the parallel piSVM implementation) on 64 cores, took less than a minute with LIBLINEAR, which runs on just one core.

[1] https://www.csie.ntu.edu.tw/~cjlin/liblinear

[2] http://jcheminf.springeropen.com/articles/10.1186/s13321-016...

syntaxing · 9y ago

Good to know, I did not know that! I kind of wish scikit had some sort of CUDA capabilities to speed things up.

radarsat1 · 9y ago

I'm curious where the idea that SVMs are "older" than neural networks comes from. The SVM Wikipedia page claims that they were published by Vapnik & Chervonenkis in 1963, while neural networks date back at least to Rosenblatt's work in 1958, if not before.

sgt101 · 9y ago

I think that SVMs were seen in the late 1990s as replacements for three-layer networks. This was because the kernel trick allowed the creation of high-dimensional decision surfaces over large (for those days) training sets by optimisation. Because of the restrictions of computing power and data collection in those days, the idea of very large neural networks was underexplored, and most people believed that a very broad network was required to capture detailed learned classifiers and that it was impractical to train such classifiers. The idea of deep networks was not widely considered because it was thought that these would be infeasible to train, and they seemed (to me at least) to be until we found out about stochastic gradient descent, initialization, transfer learning, distributed computing and GPUs. So SVMs became very fashionable, and many people said that they were basically the end state of supervised machine learning. This made people look more at unsupervised learning, apart from some people in Canada and Scotland (and various others too!). Now people think SVMs are old because the old people that they know used to do things with SVMs. Neural networks are new because now you can do things with them that are quite unexpected.

argonaut · 9y ago

Rosenblatt's perceptron has little to do with neural nets. Geoffrey Hinton regrets coining the name "multi-layer perceptron" precisely because they're really unrelated.

emcq · 9y ago

If you have good features there is little advantage to a complex model.

In production ML there are still many applications for random forests, linear models or SVMs. I prefer random forests, though, because they require less preprocessing, are super fast to train, and make feature importances easy to explain.

Scea91 · 9y ago

In addition, random forests often work very well out-of-the-box with 'default' hyperparameter settings.

nl · 9y ago

Boosted trees will nearly always beat neural nets for structured data. Maybe if you have multiple billion rows, but even then..

Large numbers of features are where it gets challenging.

hdespiritu · 9y ago

> But now I learned with the minimal dataset I have (<5K samples), linear regression, SVM, or decision trees is the way to go.

Decision trees are prone to overfitting, especially on small datasets. Random forest is a good substitute that has become standard practice.

badminton1 · 9y ago

If you start adding dimensions then neural networks perform better.

nafizh · 9y ago · 3 in thread

Aaah, I was hoping for an explanation of the kernel trick. I think that is the hardest concept in support vector machines.

jwr · 9y ago

I think I can help with that.

The article nicely explains the data transformation so that it becomes linearly separable. But the trick to the kernel trick is not to transform the data at all.

What you do is use a learning algorithm that doesn't need individual input vectors, but instead only needs their dot products. You then imagine a magical high-dimensional space where your data is (you suppose) linearly separable. The trick is that you never actually transform your data to that magical space — you don't need the input vectors, remember? You only need their dot products. So you define a function that given two vectors in your normal input space returns a scalar. Assuming your function behaves in a sane way (go read about the required properties if you need to), you can think of this function as a dot product. In some kind of magical space — you don't actually care much. You will never transform your data, it might not even be possible to: the most common gaussian kernel is defined over an infinite-dimensional space. But hey, who cares? You take your SVM, give it your kernel function and input data, and off it goes, working as usual, except your dot products are no longer computed in your input space, but in your magical infinite-dimensional space.

It's both really clever and really simple.
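
A minimal sketch of the "you only need dot products" point, assuming a degree-2 polynomial kernel on 2D inputs, chosen because its feature map is small enough to write out by hand. The names `phi`, `poly_kernel`, and `rbf_kernel` and the two sample points are made up for illustration.

```python
import math


def dot(a, b):
    """Ordinary dot product of two equal-length tuples."""
    return sum(p * q for p, q in zip(a, b))


def phi(v):
    """Explicit feature map for the kernel (a . b) ** 2 on 2D inputs."""
    x1, x2 = v
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def poly_kernel(a, b):
    """Same value as dot(phi(a), phi(b)), computed without ever building phi."""
    return dot(a, b) ** 2


def rbf_kernel(a, b, gamma=1.0):
    """Gaussian kernel: its feature space is infinite-dimensional, so phi cannot be written out."""
    squared_distance = sum((p - q) ** 2 for p, q in zip(a, b))
    return math.exp(-gamma * squared_distance)


x, y = (1.0, 2.0), (3.0, -1.0)

explicit = dot(phi(x), phi(y))   # transform first, then take the dot product
implicit = poly_kernel(x, y)     # kernel trick: stay in the input space

print(explicit, implicit)        # the two agree up to floating-point rounding
print(rbf_kernel(x, y))          # still just a scalar from two input vectors
```

For the polynomial kernel the transformed space is only three-dimensional, so both routes are cheap. The Gaussian kernel shows why the shortcut matters: only the `implicit` route is available there.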

contravariant · 9y ago

The way I understand the kernel trick is as follows:

Basically SVMs are all about inner products. If you have some vector 'k' and a constant 'c' then you can divide your data set into those points x where k·x > c and those where k·x < c. The points defined by k·x = c are usually called the separating line / plane / hyperplane.

Now if you know the inner product between all your samples then you can also calculate the inner product between any sample and any weighted sum of samples. So if you have some vector 'k' which is a weighted sum of your samples then you can find the inner product between k and your samples without calculating any more inner products. Even better, it turns out that even if 'k' isn't a weighted sum of samples, there exists a different vector 'p' such that k·x = p·x for all samples x, and where 'p' is a weighted sum of samples. So the restriction that 'k' is a weighted sum of samples doesn't have any effect on the performance of an SVM.

The kernel trick then turns this around by simply declaring the inner products between your samples to have a certain value (e.g. x·y = exp(-|x - y|^2)). You can then let your SVM find the weighted sum of samples which best separates your samples.
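
The "weighted sum of samples" idea can be sketched with a kernel perceptron on XOR-style data. Note the swap: this is a perceptron update rule, not an SVM solver, so it finds some separating weighted sum rather than the maximum-margin one. The data, the variable names, and the choice of exp(-|x - y|^2) as the declared inner product are assumptions for illustration.

```python
import math


def rbf(a, b):
    """Declared inner product between two points: exp(-|a - b|^2)."""
    return math.exp(-sum((p - q) ** 2 for p, q in zip(a, b)))


# XOR layout: no straight line in the plane separates the two classes.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [-1, 1, 1, -1]

# All the learner ever sees: pairwise inner products between samples.
gram = [[rbf(a, b) for b in samples] for a in samples]

# One weight per training sample; the separating vector is their weighted sum.
weights = [0.0] * len(samples)


def train_score(i):
    """Inner product between the weighted sum of samples and sample i."""
    return sum(w * y * gram[j][i] for j, (w, y) in enumerate(zip(weights, labels)))


for _ in range(10):                      # a few passes are plenty here
    mistakes = 0
    for i, y in enumerate(labels):
        if y * train_score(i) <= 0:      # misclassified: give this sample more weight
            weights[i] += 1.0
            mistakes += 1
    if mistakes == 0:
        break


def predict(x):
    """Classify a new point using only kernel evaluations against the samples."""
    score = sum(w * y * rbf(s, x) for w, y, s in zip(weights, labels, samples))
    return 1 if score > 0 else -1


predictions = [predict(x) for x in samples]
print(weights, predictions)
```

Nothing in the loop touches coordinates in the transformed space; it only reads entries of `gram`, which is the property the comment above describes.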

sixo · 9y ago

Here's a simple analogy:

You want to classify data in a 2D plane by drawing a straight line and saying everything on one side is in one class. But there's not likely to be a line that does this in general.

So you assign a z coordinate to all your points (even randomly). And it's now much more likely that a plane divides them the way you want than a line did before, as there are many ways to slip a plane in between the groups that wouldn't have been possible with a single line in the 2D plane.

Swapping the inner product for another inner-product-y kernel is similar, but with many / infinite dimensions coming into the picture.
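
A small sketch of the z-coordinate analogy, with one deliberate change: instead of a random z it uses z = x^2 + y^2, which is an assumption picked so the result is easy to verify by hand. The points, the `lift` helper, and the threshold are hypothetical.

```python
# Inner class sits near the origin, outer class surrounds it, so no single
# straight line in the 2D plane puts each class on its own side.
inner = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.3)]             # class -1
outer = [(2.0, 0.0), (-2.0, 0.0), (0.0, 2.0), (0.0, -2.0)]  # class +1


def lift(point):
    """Add a third coordinate: squared distance from the origin."""
    x, y = point
    return (x, y, x * x + y * y)


# In 3D, the horizontal plane z = 1 separates the two groups.
threshold = 1.0


def classify(point):
    """+1 if the lifted point lies above the plane z = threshold, else -1."""
    return 1 if lift(point)[2] > threshold else -1


inner_predictions = [classify(p) for p in inner]
outer_predictions = [classify(p) for p in outer]
print(inner_predictions, outer_predictions)
```

A random z would make a separating plane more likely without guaranteeing one; choosing z from the data's geometry is what a well-chosen kernel does implicitly.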

curiousgal · 9y ago · 1 in thread

Many aspects of Machine Learning boil down to optimization problems.

rs86 · 9y ago

Well, all of them do. In ML we always try to select the best description for a dataset, and that involves minimizing some function that represents some kind of goodness of fit.
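
As one concrete instance of "minimizing some function", here is a minimal sketch of a linear soft-margin SVM objective (regularized hinge loss) minimized by plain subgradient descent. The 1D toy data, the regularization strength `lam`, the learning rate `lr`, and the iteration count are all assumptions; real solvers such as LIBLINEAR use more refined methods.

```python
# Toy 1D data: negatives on the left, positives on the right.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [-1, -1, 1, 1]

w, b = 0.0, 0.0      # parameters of the separating rule sign(w * x + b)
lam, lr = 0.01, 0.1  # regularization strength and step size (assumed values)


def objective(w, b):
    """Regularized mean hinge loss: the function being minimized."""
    hinge = sum(max(0.0, 1 - y * (w * x + b)) for x, y in zip(xs, ys)) / len(xs)
    return 0.5 * lam * w * w + hinge


start = objective(w, b)

for _ in range(200):
    grad_w, grad_b = lam * w, 0.0
    for x, y in zip(xs, ys):
        if y * (w * x + b) < 1:          # point is inside the margin: hinge is active
            grad_w -= y * x / len(xs)
            grad_b -= y / len(xs)
    w -= lr * grad_w
    b -= lr * grad_b

end = objective(w, b)
predictions = [1 if w * x + b > 0 else -1 for x in xs]

print(f"objective {start:.3f} -> {end:.3f}, w = {w:.3f}, b = {b:.3f}")
print(predictions)
```

Only the points inside the margin contribute to the gradient, which is the same reason only a subset of the training samples ends up mattering in a trained SVM.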

shas3 · 9y ago

Very cool! However, I think the author should have spent a few more words and figures to distinguish support vector machines from standard perceptrons. Maximum margin classification and the definition of 'support vectors,' in my experience, help demystify the algorithm.

lallysingh · 9y ago

This is great! Any follow-ups describing kernels?

rs86 · 9y ago

Amazingly well written. Short and to the point, humbly sharing something cool!

LeanderK · 9y ago

well, that really was a quick look. Any reading-recommendations about the kernel-functions? How do they work and why are they fast?

rmchugh · 9y ago

Best name for a blog ever?
