More recently, people, especially the younger generation of deep learning researchers, tend to be skeptical of how much pretraining helps.
Advocates for pretraining now tend to argue that it helps you find better local minima, rather than that it mitigates the vanishing gradient problem. For example, see this paper: http://www.jmlr.org/papers/volume11/erhan10a/erhan10a.pdf .
As I'm sure Michael will address in coming chapters, there are a bunch of tricks you can use that make training deep neural networks a lot easier. People now tend to prefer to just use those, plus a lot of computing power, rather than mess around with pretraining.
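For anyone unfamiliar with the vanishing gradient problem mentioned above, here is a rough sketch of why it shows up. It assumes a toy chain of single-neuron sigmoid layers with unit weights and every pre-activation at zero, which is the best case for the sigmoid derivative (its maximum is 0.25). The function names are made up for illustration and don't come from any library.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    # Derivative of the sigmoid; it peaks at 0.25 when z == 0.
    s = sigmoid(z)
    return s * (1.0 - s)


def gradient_scale(depth, z=0.0, weight=1.0):
    # Backpropagating through `depth` layers multiplies the gradient by
    # weight * sigmoid'(z) once per layer (toy assumption: same z and
    # weight in every layer).
    return (weight * sigmoid_prime(z)) ** depth


for depth in (1, 5, 10, 20):
    print(depth, gradient_scale(depth))
```

Even in this best case the factor shrinks geometrically with depth, which is the effect pretraining was originally supposed to help with.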
But the thing is, first consider that being divided by a plane in a feature space is simply a convenient quality that many patterns have. It's similar to data you can draw a line through to extrapolate further values. However, unlike with that approximately linear data, you can't say "why" your complex data is separated by a particular plane in the feature space. The reason is that your neural network or SVM data is more or less trapped in the model: it's not going to be processed further, only used through that model for that particular pattern.
Second of all, SVM does not create any feature space (i.e., embeddings). It just finds a good separator with a maximal margin. Deep NNs, on the other hand, do create features in their hidden layers.
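A small sketch of what "creating features in hidden layers" can mean, using XOR, which no single line separates in the original input space. The hidden-layer weights here are hand-picked for illustration rather than learned, and all names are hypothetical.

```python
def relu(z):
    return max(0.0, z)


def hidden_features(x1, x2):
    # One hidden layer with two ReLU units (hand-picked weights, not trained).
    h1 = relu(x1 + x2)
    h2 = relu(x1 + x2 - 1.0)
    return h1, h2


def linear_classifier(h1, h2):
    # A single separating line in the hidden space: h1 - 2*h2 - 0.5 = 0.
    return 1 if (h1 - 2.0 * h2 - 0.5) > 0 else 0


xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
predictions = [
    linear_classifier(*hidden_features(x1, x2)) for (x1, x2), _ in xor_data
]
print(predictions)
```

The separator itself is still just a line; what changed is the space it lives in, and that new space is what the hidden layer supplies.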
Anyway, even ignoring these issues, I'm not sure I understood your main point.
There are many methods. The first thing to tackle is getting your data in the right format. Plotting software like Matplotlib can be really helpful when you're trying to debug.
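As a rough idea of what a "right format" check might look like before any plotting, here is a minimal sketch. It assumes the data is a list of numeric feature rows plus a list of labels; the specific checks and the `check_dataset` name are my own assumptions, not a standard API.

```python
import math


def check_dataset(features, labels):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if len(features) != len(labels):
        problems.append(f"{len(features)} feature rows but {len(labels)} labels")
    widths = {len(row) for row in features}
    if len(widths) > 1:
        problems.append(f"inconsistent row lengths: {sorted(widths)}")
    for i, row in enumerate(features):
        for j, value in enumerate(row):
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.append(f"row {i}, column {j}: non-numeric value {value!r}")
            elif math.isnan(value) or math.isinf(value):
                problems.append(f"row {i}, column {j}: non-finite value {value!r}")
    return problems


good = check_dataset([[0.0, 1.0], [1.0, 0.0]], [1, 0])
bad = check_dataset([[0.0, float("nan")], [1.0]], [1])
print(good)
for problem in bad:
    print(problem)
```

Once checks like these pass, plotting a few columns against the labels is usually the quickest way to spot anything they missed.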
https://www.reddit.com/r/MachineLearning/comments/2oeg5t/bac...
(Disclaimer: I'm just a beginner ML/DL enthusiast).