The key idea is that if you train each layer in an unsupervised manner and then feed its outputs as features to the next layer, the network performs better when you go on to train it in a supervised way. That is, back-propagation on the pre-trained neural net learns a far more robust set of weights than it would without pretraining. Stochastic gradient descent (SGD) is a very simple optimization technique that is useful when you are working with massive data.
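To make the SGD part concrete, here is a minimal sketch on a made-up toy problem: fitting a single weight by looking at one random example at a time instead of the whole dataset. The data, learning rate, and step count are all illustrative assumptions, not anything from the papers discussed here.

```python
import numpy as np

# Toy data: y is roughly 3 * x plus a little noise (made up for illustration).
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 3.0 * x + rng.normal(scale=0.1, size=1000)

w = 0.0
lr = 0.01
for step in range(5000):
    i = rng.integers(len(x))             # look at ONE random example
    grad = 2 * (w * x[i] - y[i]) * x[i]  # gradient of (w*x - y)^2 w.r.t. w
    w -= lr * grad                       # take a small step downhill
```

The point is that each update touches a single sample, so the cost per step is constant no matter how big the dataset is - which is exactly why SGD scales to massive data.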
Dahl's architecture used RBM layers (very similar to autoencoders) to seed a regular ole, but many-layered, feedforward network. SGD is then used for back-propagation. The RBMs themselves are trained with a generative technique - see contrastive divergence for more.
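For a rough feel of what training an RBM with contrastive divergence looks like, here is a hypothetical CD-1 sketch on a tiny binary RBM (biases omitted, sizes and data made up): sample the hiddens from the data, reconstruct the visibles, resample the hiddens, and nudge the weights toward the data statistics and away from the reconstruction statistics.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 6, 3, 0.1
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))

# Toy dataset: two complementary binary patterns (made up).
data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 0, 0, 1, 1, 1]], dtype=float)

for epoch in range(200):
    for v0 in data:
        h0_prob = sigmoid(v0 @ W)                          # positive phase
        h0 = (rng.random(n_hidden) < h0_prob).astype(float)
        v1_prob = sigmoid(h0 @ W.T)                        # reconstruction
        v1 = (rng.random(n_visible) < v1_prob).astype(float)
        h1_prob = sigmoid(v1 @ W)                          # negative phase
        # CD-1 update: data correlations minus reconstruction correlations.
        W += lr * (np.outer(v0, h0_prob) - np.outer(v1, h1_prob))

# Reconstruction error on the training patterns should drop during training.
recon = sigmoid(sigmoid(data @ W) @ W.T)
mse = np.mean((recon - data) ** 2)
```

In the pretraining scheme described above, the learned hidden activations (`sigmoid(v @ W)`) would become the input features for the next layer's RBM.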
The Google architecture is more complex and based on biological models. It is not trying to learn an explicit classifier; instead, they train a many-layered autoencoder network to learn features. I only skimmed the paper, but they have multiple layers specialized for particular types of processing (think Photoshop, not Intel), and using SGD they optimize an objective that essentially learns an effective decomposition of the data.
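This is emphatically not Google's architecture, but the core idea of learning features by reconstruction can be sketched with a hypothetical one-hidden-layer autoencoder trained by plain SGD: squeeze the input through a narrow bottleneck and penalize reconstruction error. All sizes, data, and hyperparameters below are made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hidden, lr = 4, 2, 0.05

# Toy data that really lives in 2 dimensions: noisy copies of two patterns.
base = np.array([[1.0, 1.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 1.0]])
X = base[rng.integers(2, size=500)] + rng.normal(scale=0.05, size=(500, 4))

W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))   # encoder
W2 = rng.normal(scale=0.1, size=(n_hidden, n_in))   # decoder

for step in range(5000):
    x = X[rng.integers(len(X))]
    h = np.tanh(x @ W1)            # encode: 4 inputs -> 2 features
    x_hat = h @ W2                 # linear decode back to 4 values
    err = x_hat - x
    # Backprop through both layers, one sample at a time (SGD).
    W2 -= lr * np.outer(h, err)
    W1 -= lr * np.outer(x, (err @ W2.T) * (1 - h ** 2))

mse = np.mean((np.tanh(X @ W1) @ W2 - X) ** 2)
```

Because the bottleneck has fewer units than the input, the only way to reconstruct well is to discover the underlying structure of the data - which is the "effective decomposition" in the objective mentioned above.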
The main takeaway is that if you can find an effective way to build layered abstractions, you will learn robustly.