I've even seen people use pretrained ImageNet classifiers, chop off the last layer and use an SVM as the actual classifier, and it works very well for some problems.
Sort of. They can readily do really basic feature engineering along the lines of doing nonlinear transformations of the input. With a bit more doing, they can do spatial feature engineering (e.g., convolutional nets), and with a bit more foresight and planning they can learn the kinds of complex "hidden Markov process" style features you typically use in natural language processing.
But, as far as I'm aware, anyway, they can't necessarily do a great job with things like irregular time series (which is a huge chunk of big data), so you're still stuck doing some of that basic feature engineering. And I hesitate to say that some of the fancier architectures like LSTMs can be characterized as a turnkey solution for feature engineering, considering how much thought and effort and pre-existing knowledge and theory about what the engineered features should look like in the first place needed to go into designing them. So I feel like the "they can learn their own features" thing is a bit overhyped.
Hope you have a *ton* of data, otherwise it's not gonna happen.
[1] http://scikit-learn.org/stable/modules/svm.html#complexity
Paper (open access): http://dx.doi.org/10.1186/s13321-016-0151-5
As can be seen in fig 5 [2] in the paper, a dataset size that took ~1 week with libSVM (actually, the parallel piSVM implementation) on 64 cores, took less than a minute with LIBLINEAR, which runs on just one core.
[1] https://www.csie.ntu.edu.tw/~cjlin/liblinear
[2] http://jcheminf.springeropen.com/articles/10.1186/s13321-016...
In production ML there are still many applications for random forests, linear models, or SVMs. I prefer random forests, though, because they require less preprocessing, are super fast to train, and make feature importances easy to explain.
Large numbers of features are where it gets challenging.
Decision trees are prone to overfitting, especially on small datasets. Random forests are a good substitute that has become standard practice.
The article nicely explains the data transformation so that it becomes linearly separable. But the trick to the kernel trick is not to transform the data at all.
What you do is use a learning algorithm that doesn't need individual input vectors, but instead only needs their dot products. You then imagine a magical high-dimensional space where your data is (you suppose) linearly separable. The trick is that you never actually transform your data to that magical space — you don't need the input vectors, remember? You only need their dot products. So you define a function that given two vectors in your normal input space returns a scalar. Assuming your function behaves in a sane way (go read about the required properties if you need to), you can think of this function as a dot product. In some kind of magical space — you don't actually care much. You will never transform your data, it might not even be possible to: the most common gaussian kernel is defined over an infinite-dimensional space. But hey, who cares? You take your SVM, give it your kernel function and input data, and off it goes, working as usual, except your dot products are no longer computed in your input space, but in your magical infinite-dimensional space.
It's both really clever and really simple.
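To make the "you only need dot products" point concrete, here is a minimal sketch. It uses a degree-2 polynomial kernel, which is my choice for illustration because its feature space is small enough to write out by hand; the names `phi`, `poly_kernel` and `gaussian_kernel` are made up for this example. The kernel evaluated in the 2-D input space gives the same number as the dot product of the explicitly transformed vectors, and for the Gaussian kernel there is no finite transform to write out at all.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector (illustrative choice)."""
    x1, x2 = v
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(a, b):
    """k(a, b) = (a . b)^2, computed without ever leaving the input space."""
    return dot(a, b) ** 2

def gaussian_kernel(a, b):
    """k(a, b) = exp(-|a - b|^2); its feature space is infinite-dimensional,
    so there is no finite phi to write down, yet the kernel is cheap to evaluate."""
    return math.exp(-sum((p - q) ** 2 for p, q in zip(a, b)))

a, b = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(a), phi(b))   # transform first, then take the dot product
implicit = poly_kernel(a, b)     # never transform anything
```

Both `explicit` and `implicit` come out the same up to floating-point rounding, which is the whole point: the algorithm can't tell whether you transformed the data or just swapped the dot product.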
Basically SVMs are all about inner products. If you have some vector 'k' and a constant 'c' then you can divide your data set into those points x where k·x > c and those where k·x < c. The points defined by k·x = c are usually called the separating line / plane / hyperplane.
Now if you know the inner product between all your samples then you can also calculate the inner product between any sample and any weighted sum of samples. So if you have some vector 'k' which is a weighted sum of your samples then you can find the inner product between k and your samples without calculating any more inner products. Even better it turns out that, even if 'k' isn't a weighted sum of samples, there exists a different vector 'p' such that k·x = p·x for all samples x, and where 'p' is a weighted sum of samples. So the restriction that 'k' is a weighted sum of samples doesn't have any effect on the performance of an SVM.
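A small sketch of the weighted-sum claim, with made-up samples and weights (none of these numbers come from anywhere, they are just for illustration). Once you have the table of pairwise inner products, k·x for every sample x is a weighted sum of table entries, and you never need to build 'k' itself.

```python
def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))

# Hypothetical samples and weights, chosen only for the example.
samples = [(1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
weights = [0.5, -1.0, 2.0]

# Table of all pairwise inner products between samples (the Gram matrix).
gram = [[dot(xi, xj) for xj in samples] for xi in samples]

# Route 1: build k explicitly as the weighted sum of samples, then take k . x.
k = tuple(sum(w * x[d] for w, x in zip(weights, samples)) for d in range(2))
direct = [dot(k, x) for x in samples]

# Route 2: never build k; combine entries of the Gram matrix instead.
via_gram = [
    sum(w * gram[i][j] for i, w in enumerate(weights))
    for j in range(len(samples))
]
```

`direct` and `via_gram` agree, and route 2 only ever touches the inner products, which is what lets you replace them with something else later.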
The kernel trick then turns this around by simply declaring the inner products between your samples to have a certain value (e.g. x·y = exp(-(x - y)^2)). You can then let your SVM find the weighted sum of samples which best separates your samples.
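Here is a rough sketch of "declare the inner product, then find the weighted sum of samples". To keep it short I'm using a kernel perceptron rather than an actual SVM: it finds some separating weighted sum, not the maximum-margin one an SVM would find. The XOR-style data, the `alpha` weights and the function names are assumptions for the example; the kernel is the exp(-(x - y)^2) from above.

```python
import math

def gaussian_kernel(a, b):
    """Declared inner product: exp(-|a - b|^2)."""
    sq_dist = sum((p - q) ** 2 for p, q in zip(a, b))
    return math.exp(-sq_dist)

# XOR-style toy data: no straight line in the plane separates the two classes.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
y = [-1, 1, 1, -1]

# One weight per sample; the separating vector is the implicit weighted
# sum of samples, and it is never constructed explicitly.
alpha = [0] * len(X)

def decision(x):
    """Inner product between the implicit weighted sum and x, via the kernel only."""
    return sum(a * yi * gaussian_kernel(xi, x) for a, yi, xi in zip(alpha, y, X))

def predict(x):
    return 1 if decision(x) > 0 else -1

# Kernel perceptron training (NOT an SVM): bump a sample's weight
# every time that sample is misclassified, until a pass has no mistakes.
for epoch in range(100):
    mistakes = 0
    for j, (xj, yj) in enumerate(zip(X, y)):
        if yj * decision(xj) <= 0:
            alpha[j] += 1
            mistakes += 1
    if mistakes == 0:
        break

predictions = [predict(x) for x in X]
```

Note that the training loop only ever calls `gaussian_kernel` on pairs of input points. Swap in a different kernel function and nothing else has to change.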
You want to classify data in a 2D plane by drawing a straight line and saying everything on one side is in one class. But there's not likely to be a line that does this in general.
So you assign a z coordinate to all your points (even randomly). And it's now much more likely that a plane divides them the way you want than a line did before, as there are many ways to slip a plane in between the groups that wouldn't have been possible with a single line in the 2D plane.
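A toy version of the lifting idea, under some assumptions of mine: the data is an inner cluster surrounded by an outer ring (so no line works in 2D), and instead of a random z I use z = x^2 + y^2, which happens to suit this particular shape. The names `lift` and `side_of_plane` are made up for the sketch.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))

# Hypothetical data: a small inner circle (class -1) inside a larger ring (class +1).
angles = [2 * math.pi * i / 8 for i in range(8)]
inner = [(0.3 * math.cos(t), 0.3 * math.sin(t)) for t in angles]
outer = [(2.0 * math.cos(t), 2.0 * math.sin(t)) for t in angles]

def lift(p):
    """Assign a z coordinate; z = x^2 + y^2 is an illustrative choice."""
    x, y = p
    return (x, y, x * x + y * y)

def side_of_plane(p3, normal=(0.0, 0.0, 1.0), offset=1.0):
    """+1 if the 3-D point lies above the plane normal . p = offset, else -1."""
    return 1 if dot(normal, p3) > offset else -1

inner_labels = [side_of_plane(lift(p)) for p in inner]
outer_labels = [side_of_plane(lift(p)) for p in outer]
```

With this lift the flat plane z = 1 puts every inner point on one side and every outer point on the other, even though no line in the original 2D plane could do that.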
Swapping the inner product for another inner-product-y kernel is similar, but with many / infinite dimensions coming into the picture.