rmsprop is a great technique that I don't hear talked about much. Example implementation here: https://github.com/BRML/climin/blob/master/climin/rmsprop.py
Using Nesterov momentum and a "sparse" weight initialization scheme rather than uniform initialization: https://www.cs.toronto.edu/~hinton/absps/momentum.pdf
Reducing the learning rate exponentially and increasing the momentum linearly over the course of training, e.g. learning rate from 0.5 down to 0.0001 and momentum from 0.7 up to 0.995. I've seen variations on this, like adjusting along a sigmoid curve.
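For what it's worth, here is a rough sketch of that kind of schedule. The function name `schedules` and the choice to step once per epoch are my own assumptions, not something from the linked paper. It just interpolates the learning rate in log space (exponential decay) and the momentum in a straight line between the endpoints mentioned above.

```python
def schedules(step, total_steps, lr_start=0.5, lr_end=1e-4,
              mom_start=0.7, mom_end=0.995):
    """Return (learning_rate, momentum) for a given step (hypothetical helper)."""
    # Fraction of training completed: 0.0 at the first step, 1.0 at the last.
    t = step / max(total_steps - 1, 1)
    # Exponential decay: the ratio between consecutive learning rates is constant.
    lr = lr_start * (lr_end / lr_start) ** t
    # Linear ramp: momentum grows by the same amount every step.
    momentum = mom_start + (mom_end - mom_start) * t
    return lr, momentum


# Print the schedule at a few points of an assumed 100-epoch run.
for epoch in (0, 25, 50, 75, 99):
    lr, momentum = schedules(epoch, 100)
    print(f"epoch {epoch:3d}  lr={lr:.5f}  momentum={momentum:.4f}")
```

A sigmoid-shaped variant would only change how `t` is mapped before the two interpolation lines.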
Dropout may or may not help, and tuning the dropout rate (the fraction of activations that are discarded) may or may not help either.
Mini-batch size can make a difference. Somewhere between 2 and 200?
You can use Bayesian optimization to search hyperparameters intelligently: https://github.com/JasperSnoek/spearmint
Try rmsprop though, I've heard good things.
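In case it helps, here is a minimal sketch of the basic rmsprop update in plain Python. This is not the climin implementation linked above, and the function name `rmsprop_step`, the toy objective, and the hyperparameter values are all just assumptions for illustration. The idea is to divide each gradient by a running root-mean-square of its recent values, so every parameter takes steps of roughly similar size regardless of how steep its direction is.

```python
import math


def rmsprop_step(params, grads, cache, lr=0.001, decay=0.9, eps=1e-8):
    """One rmsprop update over lists of floats (hypothetical helper)."""
    new_params, new_cache = [], []
    for p, g, c in zip(params, grads, cache):
        # Running average of the squared gradient for this parameter.
        c = decay * c + (1.0 - decay) * g * g
        # Scale the step by the root of that average.
        p = p - lr * g / (math.sqrt(c) + eps)
        new_params.append(p)
        new_cache.append(c)
    return new_params, new_cache


# Toy problem (an assumption): minimise f(x, y) = x^2 + 100 * y^2,
# where the y direction is 100 times steeper than the x direction.
params = [1.0, 1.0]
cache = [0.0, 0.0]
for _ in range(2000):
    grads = [2.0 * params[0], 200.0 * params[1]]
    params, cache = rmsprop_step(params, grads, cache, lr=0.01)

print(params)  # both coordinates end up close to zero
```

With a constant learning rate the parameters keep jittering around the minimum at about the scale of `lr`, which is one reason people combine this with a decaying learning rate.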
* The biggest one is probably just to train for a very long time. Competitive neural nets for many tasks are trained on GPUs, clusters, or GPU clusters, for days or weeks.
* Using convolutional layers really helps. Roughly, convolutional layers have multiple copies of the same neuron, applied to different inputs. Because the copies share their weights, there are far fewer parameters to learn. It also kind of concentrates the gradient on just a few neurons. Because of this, if the first few layers of a network are convolutional layers, they are much easier to train. For a long time, these were the only kind of remotely deep neural networks we could train.
(I wrote a blog post on conv nets, which you can read here: http://colah.github.io/posts/2014-07-Conv-Nets-Modular/)
* So far, Michael's book has only talked about sigmoid neurons (I think). But you can use neurons with other activation functions. They still multiply their inputs by different weights and add a bias, but instead of applying a sigmoid they apply a different function. Using a different kind of neuron, ReLU neurons, tends to help a lot. Unlike sigmoid neurons, whose derivative is never larger than 0.25 and is often much smaller, ReLU neurons have a derivative of exactly 1 whenever their input is positive. I've had mixed experiences, but most people swear by them.
* Using higher learning rates for early layers may be helpful, since those layers tend to receive the smallest gradients.
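To make the ReLU point concrete, here is a small sketch comparing the two derivatives. It is a deliberately crude simplification: `gradient_through_layers` is a made-up helper that ignores the weights entirely and just multiplies the activation derivative once per layer, assuming every layer sees the same pre-activation `z`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    # Peaks at 0.25 when z == 0 and shrinks towards 0 as |z| grows.
    s = sigmoid(z)
    return s * (1.0 - s)


def relu_prime(z):
    # 1 for positive input, 0 for negative input (0 at z == 0 by convention).
    return 1.0 if z > 0 else 0.0


def gradient_through_layers(derivative, z, depth):
    """Activation-derivative factor after `depth` layers (hypothetical helper)."""
    return derivative(z) ** depth


# Best case for the sigmoid (z = 0) versus any active ReLU, over 10 layers.
print(gradient_through_layers(sigmoid_prime, 0.0, 10))
print(gradient_through_layers(relu_prime, 2.0, 10))
```

Even in its best case the sigmoid factor shrinks by a factor of 4 per layer, while an active ReLU passes the gradient through unchanged. A ReLU with negative input passes nothing at all, which may be part of why experiences with them are mixed.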