In 2006, Hinton introduced greedy layer-wise pretraining, which was intended to solve the problem of backpropagation getting stuck in poor local optima. The theory was that you'd pretrain to find a good initial set of connection weights, then apply backprop to "fine-tune" discriminatively. And the theory seemed correct since the experimental results were good: http://www.cs.toronto.edu/~hinton/absps/fastnc.pdf http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS20...
Does pretraining truly help solve the problem of poor local optima? In 2010, some empirical studies suggested the answer was yes: http://machinelearning.wustl.edu/mlpapers/paper_files/AISTAT...
But that same year, a student in Geoff Hinton's lab discovered that if you added information about the second derivatives of the loss function to backpropagation ("Hessian-free optimization"), you could skip pretraining and get the same or better results: http://machinelearning.wustl.edu/mlpapers/paper_files/icml20...
And around 2012, a bunch of researchers reported that you don't even need second-derivative information. You just have to initialize the neural net properly. Apparently, all the most recent results in speech recognition just use standard backpropagation with no unsupervised pretraining. (Although people are still trying more complex variants of unsupervised pretraining algorithms, often involving multiple types of layers in the neural network.)
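To make "initialize the neural net properly" a bit more concrete, here is a rough sketch of one widely used scheme, Glorot (Xavier) uniform scaling. This is my assumption about the kind of initialization meant, not necessarily what any of those papers used; the function names and the layer sizes are made up for illustration.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    """Sample a (fan_in, fan_out) weight matrix with Glorot/Xavier uniform scaling.

    Assumption: this is one common "proper" initialization, chosen for
    illustration; the comment above does not say which scheme was used.
    """
    # The bound shrinks as layers get wider, which keeps the variance of
    # activations and gradients roughly constant from layer to layer.
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def init_mlp(layer_sizes, seed=0):
    """Return a list of (weights, biases) pairs for a fully connected net."""
    rng = np.random.default_rng(seed)
    return [
        (glorot_uniform(m, n, rng), np.zeros(n))
        for m, n in zip(layer_sizes[:-1], layer_sizes[1:])
    ]

# Hypothetical layer sizes, picked only for the example.
params = init_mlp([784, 512, 512, 10])
```

The point is just that the scale of the initial weights depends on layer width, rather than being one fixed small constant for every layer.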
So now, after seven years of work, we're back where we started: the plain ol' backpropagation algorithm from 1974 worked all along.
This whole topic is really interesting to me from a history of science perspective. What other old, discarded ideas might be ripe for a second look, now that we have millions of times more data and computation?