Why are deep neural networks hard to train?

(neuralnetworksanddeeplearning.com)

83 points · wxs · 11y ago · 12 comments

What happens when you, instead of training the entire network at once, train for a while with a single layer, then add a second layer and train with both layers, then add a third layer and train with all three layers, and so on?

colah3 11y ago

Good intuition! What you are describing sounds like a technique called pretraining (in particular, greedy, layer-wise pretraining). Five years ago, pretraining was how everyone attacked this problem, although they usually did a different kind of pretraining (basically, we train a different kind of model, and then perform surgery, cutting it apart and using some layers from it for the earlier layers of our model).
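For concreteness, here is a minimal sketch of the "train, add a layer, train again" scheme from the question. It is only an illustration: every phase is supervised and all layers keep training, which is not the unsupervised pretraining people usually did. The XOR data, the layer width, the learning rate, and all function names (`init_layer`, `train`, `add_hidden_layer`, and so on) are assumptions made up for this example.

```python
import math
import random


def sigmoid(s):
    # Clamp the argument so math.exp cannot overflow.
    s = max(-60.0, min(60.0, s))
    return 1.0 / (1.0 + math.exp(-s))


def init_layer(n_in, n_out, rng):
    # One row per unit; the last entry of each row is the bias weight.
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in + 1)] for _ in range(n_out)]


def forward(layers, x):
    # Returns the activations of every layer, starting with the input itself.
    acts = [list(x)]
    for i, layer in enumerate(layers):
        inp = acts[-1] + [1.0]
        z = [sum(w * v for w, v in zip(unit, inp)) for unit in layer]
        is_output = i == len(layers) - 1
        acts.append([sigmoid(s) if is_output else math.tanh(s) for s in z])
    return acts


def train(layers, data, epochs, lr):
    # Plain SGD with backpropagation; every layer currently in the net is updated.
    for _ in range(epochs):
        for x, y in data:
            acts = forward(layers, x)
            delta = [acts[-1][0] - y]  # sigmoid output + cross-entropy loss
            for i in range(len(layers) - 1, -1, -1):
                layer, inp = layers[i], acts[i] + [1.0]
                below = None
                if i > 0:
                    # Error for the tanh layer below, computed before the update.
                    below = [(1.0 - a * a) * sum(d * unit[j] for d, unit in zip(delta, layer))
                             for j, a in enumerate(acts[i])]
                for d, unit in zip(delta, layer):
                    for j, v in enumerate(inp):
                        unit[j] -= lr * d * v
                delta = below


def mean_loss(layers, data):
    total = 0.0
    for x, y in data:
        p = min(max(forward(layers, x)[-1][0], 1e-12), 1.0 - 1e-12)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(data)


def add_hidden_layer(layers, width, rng):
    # Keep the trained hidden layers, add a fresh hidden layer on top of them,
    # and replace the old output layer with a freshly initialised one.
    hidden = layers[:-1]
    return hidden + [init_layer(len(hidden[-1]), width, rng), init_layer(width, 1, rng)]


def main():
    rng = random.Random(0)
    xor = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]
    layers = [init_layer(2, 4, rng), init_layer(4, 1, rng)]
    train(layers, xor, 500, 0.5)
    print(len(layers) - 1, "hidden layer(s), loss:", mean_loss(layers, xor))
    for _ in range(2):
        layers = add_hidden_layer(layers, 4, rng)
        train(layers, xor, 500, 0.5)
        print(len(layers) - 1, "hidden layer(s), loss:", mean_loss(layers, xor))


if __name__ == "__main__":
    main()
```

Freezing the earlier layers during each new phase, or training each new layer as an autoencoder first, would be closer to the classic greedy recipes; this sketch does neither.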

More recently, people, especially the younger generation of deep learning researchers, tend to be skeptical of how much pretraining helps.

Advocates for pretraining now tend to argue that it helps you find better local minima, rather than focusing on how it helps with the vanishing gradient problem. For example, see this paper: http://www.jmlr.org/papers/volume11/erhan10a/erhan10a.pdf .

As I'm sure Michael will address in coming chapters, there's a bunch of tricks you can use that make training deep neural networks a lot easier. People tend to prefer, now, to just use those and a lot of computing power, rather than mess around with pretraining.

xtacy 11y ago

Could you post a few pointers about the bunch of tricks to make deep training a lot easier?

joe_the_user 11y ago

So neural networks and support vector machines are essentially equivalent [1]. Thus both approaches effectively project the input into a high-level feature space and then draw a hyperplane between two different point sets. Whether this is clever depends on how effectively the algorithm creates the feature space. The article's comments could be interpreted as saying that deep neural networks allow feature spaces that would otherwise require many more neurons.

But the thing is, first consider that being divided by a plane in a feature space is simply a convenient quality that many patterns have. It's similar to data you can draw a line along to extrapolate further values. However, unlike that approximately linear data, you can't say "why" your complex pattern is separated by a particular plane in the feature space, and the reason is that your neural network or SVM data is more or less trapped in the model: it's not going to be further processed except by using that model for that particular pattern.

[1] http://www.scm.keele.ac.uk/staff/p_andras/PAnpl2002.pdf

warsheep 11y ago

This comment is very confusing. First of all, the linked paper doesn't state what you claim it states. The authors show equivalence between two specific frameworks of neural networks: SVM-NN and Regularized-NN, and not equivalence between SVM and NN. Generally, SVM and NN are equivalent only in the sense that all discriminative models are equivalent. The kernel trick in SVM requires your embedding to have an "easily" calculable inner product. I'm not an expert, but I think this places strong constraints on the embeddings you can use.
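To make the kernel-trick remark concrete, here is a small sketch of what an "easily" calculable inner product looks like. It uses the degree-2 polynomial kernel on 2-D inputs, picked purely for illustration; the function names are hypothetical and nothing here comes from the linked paper.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(x, y):
    # Degree-2 homogeneous polynomial kernel: k(x, y) = (x . y)^2.
    # Computed in the original 2-D space, no embedding needed.
    return dot(x, y) ** 2


def poly2_features(x):
    # The explicit embedding that this kernel corresponds to for 2-D inputs:
    # phi(x) = (x1^2, sqrt(2) * x1 * x2, x2^2)
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]


x, y = (1.0, 2.0), (3.0, -1.0)
kernel_value = poly2_kernel(x, y)
explicit_value = dot(poly2_features(x), poly2_features(y))

# Both routes give the same inner product, up to floating-point rounding.
print(kernel_value, explicit_value, math.isclose(kernel_value, explicit_value))
```

The kernel route never builds `phi(x)`, which is the whole point: it only works for embeddings whose inner product collapses to a cheap formula like this one.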

Second of all, SVM does not create any feature space (i.e., embeddings). It just finds a good separator with a maximal margin. Deep NNs, on the other hand, do create features in their hidden layers.

Anyway, even ignoring these issues, I'm not sure I understood your main point.

vonnik 11y ago

We've tried to consolidate some training tips here: http://deeplearning4j.org/debug.html http://deeplearning4j.org/troubleshootingneuralnets.html http://deeplearning4j.org/trainingtricks.html

There are many methods. The first to tackle is getting your data in the right format. Plotting software like Matplotlib can be really helpful when you're trying to debug.

yudlejoza 11y ago

my recent comment on reddit might be relevant to this:

https://www.reddit.com/r/MachineLearning/comments/2oeg5t/bac...

(Disclaimer: I'm just a beginner ML/DL enthusiast).
