0 points · xtacy · 11y ago · 0 comments

Could you post a few pointers on the bunch of tricks that make training deep networks a lot easier?

5 comments · 2 top-level

dave_sullivan · 11y ago · 2 in thread

Just to add a couple others:

rmsprop is a great technique that I don't hear talked about as much. Example implementation here: https://github.com/BRML/climin/blob/master/climin/rmsprop.py
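For anyone who wants the gist without reading the linked file, here is a minimal NumPy sketch of the basic rmsprop update. It is not the climin code; the function name `rmsprop_step`, the default hyperparameters, and the toy quadratic objective are all assumptions made for illustration.

```python
import numpy as np

def rmsprop_step(params, grad, mean_square, lr=0.001, decay=0.9, eps=1e-8):
    """One basic rmsprop update (illustrative sketch, hypothetical helper)."""
    # Exponential moving average of the squared gradient, per coordinate.
    mean_square = decay * mean_square + (1.0 - decay) * grad ** 2
    # Divide the step by the root of that average so every coordinate
    # moves at a roughly comparable scale.
    params = params - lr * grad / (np.sqrt(mean_square) + eps)
    return params, mean_square

# Toy problem (assumption): minimize f(w) = sum(w ** 2), gradient 2 * w.
w = np.array([1.0, -2.0])
mean_square = np.zeros_like(w)
for _ in range(1000):
    w, mean_square = rmsprop_step(w, 2.0 * w, mean_square, lr=0.01)
```

Because the step is normalized by the gradient's recent magnitude, the iterate ends up hovering near the minimum at a distance on the order of `lr` rather than converging exactly, which is one reason people pair it with a decaying learning rate.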

Use Nesterov momentum and a "sparse" weight initialization scheme rather than a uniform one: https://www.cs.toronto.edu/~hinton/absps/momentum.pdf
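A rough sketch of both ideas, as I understand them from the linked paper: the gradient is evaluated at the "lookahead" point, and each unit starts with only a few nonzero incoming weights. The helper names, the default of 15 connections per unit, and the toy objective are assumptions for illustration, so check the paper before relying on the exact numbers.

```python
import numpy as np

def sparse_init(n_in, n_out, n_connections=15, scale=1.0, rng=None):
    """Give each output unit a few nonzero Gaussian incoming weights (sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    weights = np.zeros((n_in, n_out))
    k = min(n_connections, n_in)
    for j in range(n_out):
        # Pick k distinct input units for output unit j.
        idx = rng.choice(n_in, size=k, replace=False)
        weights[idx, j] = scale * rng.standard_normal(k)
    return weights

def nesterov_step(theta, velocity, grad_fn, lr, momentum):
    """One Nesterov momentum update (sketch)."""
    # Evaluate the gradient where the momentum is about to carry us.
    lookahead = theta + momentum * velocity
    velocity = momentum * velocity - lr * grad_fn(lookahead)
    return theta + velocity, velocity

# Toy usage (assumption): minimize f(theta) = sum(theta ** 2).
W = sparse_init(100, 20, rng=np.random.default_rng(0))
theta = np.array([1.0, -2.0])
velocity = np.zeros_like(theta)
for _ in range(200):
    theta, velocity = nesterov_step(theta, velocity, lambda t: 2.0 * t,
                                    lr=0.1, momentum=0.9)
```

The only difference from classical momentum is the `lookahead` line: classical momentum would call `grad_fn(theta)` instead.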

Reduce the learning rate exponentially and increase the momentum linearly over the course of training: learning rate from .5 to .0001, momentum from .7 to .995. I've seen variations on this, like adjusting along a sigmoid curve.
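One way to read that schedule, as a sketch: interpolate the learning rate geometrically and the momentum linearly between the endpoints quoted above. The function names and the choice to index by epoch are assumptions; the same formulas work per update step.

```python
def learning_rate_at(epoch, n_epochs, start=0.5, end=0.0001):
    """Exponential (geometric) decay from start to end; needs n_epochs >= 2."""
    frac = epoch / (n_epochs - 1)
    return start * (end / start) ** frac

def momentum_at(epoch, n_epochs, start=0.7, end=0.995):
    """Linear ramp from start to end; needs n_epochs >= 2."""
    frac = epoch / (n_epochs - 1)
    return start + (end - start) * frac

# Example: a 100-epoch run (the epoch count is an arbitrary assumption).
n_epochs = 100
schedule = [(learning_rate_at(e, n_epochs), momentum_at(e, n_epochs))
            for e in range(n_epochs)]
```

Each entry of `schedule` is the `(learning rate, momentum)` pair to use for that epoch, starting at `(0.5, 0.7)` and ending at roughly `(0.0001, 0.995)`.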

Dropout may or may not help, and adjusting the dropout rate (the fraction of activations that are discarded) may or may not help either.
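To make "dropout rate" concrete, here is a small sketch of dropout applied to one layer's activations. It uses the "inverted" variant, which rescales the surviving activations during training so nothing has to change at test time; that choice, the function name, and the toy input are assumptions, not something from this thread.

```python
import numpy as np

def dropout(activations, drop_rate, rng, training=True):
    """Inverted dropout on an array of activations (illustrative sketch)."""
    if not training or drop_rate == 0.0:
        # At test time the layer is left untouched.
        return activations
    keep_prob = 1.0 - drop_rate
    # Each activation survives independently with probability keep_prob.
    mask = rng.random(activations.shape) < keep_prob
    # Rescale survivors so the expected activation is unchanged.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones((1000, 100))
dropped = dropout(hidden, drop_rate=0.5, rng=rng)
```

With `drop_rate=0.5`, about half the entries of `dropped` are zero and the rest are 2.0, so the mean stays close to the original 1.0.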

Mini-batch size can make a difference. Somewhere between 2 and 200?

You can use Bayesian optimization to intelligently search hyperparameters: https://github.com/JasperSnoek/spearmint

Try rmsprop though, I've heard good things.

benanne · 11y ago

I haven't had any luck so far with rmsprop, adagrad and adadelta. SGD + Nesterov momentum has served me best.

xtacy (OP) · 11y ago

Great, thanks for the pointers! I've tried the momentum trick before and it has helped. I'll try rmsprop.

colah3 · 11y ago · 1 in thread

Certainly!

* The biggest one is probably just to train for a very long time. Competitive neural nets for many tasks are trained on GPUs, clusters, or GPU clusters, for days or weeks.

* Using convolutional layers really helps. Roughly, convolutional layers have multiple copies of the same neuron, applied to different inputs. This results in them needing to learn much less. It also leads to them kind of concentrating the gradient on just a few neurons. Because of this, if the first few layers of a network are convolutional layers, they are much easier to train. For a long time, these were the only kind of remotely deep neural networks we could train.

(I wrote a blog post on conv nets, which you can read here: http://colah.github.io/posts/2014-07-Conv-Nets-Modular/)

* So far, Michael's book has only talked about sigmoid neurons (I think). But you can use neurons with other activation functions. They still multiply their inputs by different weights and add a bias, but instead of applying sigmoid they apply a different function. Using a different kind of neuron, ReLU neurons, tends to help a lot. Unlike sigmoid neurons, whose derivative is at most 1/4 and usually much smaller, ReLU neurons have a derivative of 1 whenever their input is positive. I've had mixed experiences, but most people swear by them.

* Using higher learning rates for early layers may be helpful.
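To put rough numbers on the sigmoid vs. ReLU point above, here is a small sketch comparing the two derivatives. The 10-layer depth and the helper names are assumptions for illustration, and the back-of-the-envelope products ignore the weights, which also scale the gradient in a real network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: s * (1 - s), which peaks at 0.25 when z = 0.
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    # Derivative of max(0, z): 1 for positive inputs, 0 otherwise.
    return (z > 0).astype(float)

z = np.linspace(-5.0, 5.0, 11)
sig_d = sigmoid_grad(z)
relu_d = relu_grad(z)

# Backprop multiplies one activation derivative per layer. Even in the
# best case a chain of sigmoids shrinks the signal quickly, while a chain
# of active ReLUs passes it through unchanged.
depth = 10  # hypothetical network depth
best_case_sigmoid = 0.25 ** depth
active_relu = 1.0 ** depth
```

Here `best_case_sigmoid` comes out below one in a million, which is one intuition for why early layers in deep sigmoid nets learn so slowly.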

xtacy (OP) · 11y ago

Thank you colah. Your blog posts are inspiring! It's a lot of hard work and effort; keep it up!
