That would suggest that one-hidden-layer neural nets would work fine, since they are also universal function approximators. But no -- when people talk about "deep learning", the word "deep" refers to having lots of hidden layers.
I'm not an expert, but the motivation seems more like this:
- Linear regression and SVMs sometimes work. But on their own they can only express linear relationships in the input features, so they apply to very few problems.
- We can fit those models using gradient descent (see the first sketch after this list). Alternatives to gradient descent do exist -- the normal equations for linear regression, quadratic programming for SVMs -- but they become less useful as the above models get varied and generalised.
- Empirically, if we compose these models with some simple non-linearities, we get very good results on otherwise seemingly intractable problems like OCR. See kernel SVMs and kriging.
- Initially, one might choose this non-linearity from a known list, then fit the model using specialised optimisation algorithms. But gradient descent still works fine (see the second sketch after this list).
- To further improve results, the choice of non-linearity must itself be optimised. Call the non-linearity F. We break F into three parts, F' ∘ L ∘ F'', where L is linear and F' and F'' are "simpler" non-linearities, then recursively factorise F' and F'' in the same way. Eventually we get a deep feedforward neural network (see the third sketch after this list). At this point we can no longer use fancy specialised algorithms to fit the model.
- Somehow, gradient descent, despite being a very generic optimisation algorithm, works much better than expected at fitting the above model. We have derived Deep Learning.
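
To make the second bullet concrete, here is a minimal sketch of fitting linear regression by gradient descent instead of the normal equations, assuming plain numpy. The data, learning rate, and step count are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                # 100 samples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)  # noisy linear targets

w = np.zeros(3)
lr = 0.1
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y)    # gradient of the mean squared error
    w -= lr * grad

print(w)  # ends up close to true_w
```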
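For the fixed-non-linearity bullet, a sketch in the same spirit: tanh applied to a fixed random projection plays the role of the chosen non-linearity (loosely, a random-features stand-in for a kernel method), and only the linear readout on top is trained, again by gradient descent. The target function and all constants are invented.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(x).ravel()                         # a target no linear model can fit

P = rng.normal(size=(1, 20))                  # fixed random projection
b = rng.uniform(-1, 1, size=20)
phi = np.tanh(x @ P + b)                      # fixed non-linear features of x

w = np.zeros(20)
lr = 0.01
for _ in range(5000):
    grad = 2 * phi.T @ (phi @ w - y) / len(y)
    w -= lr * grad

print(np.mean((phi @ w - y) ** 2))  # small: the features fit sin where a line cannot
```

Note the model is still linear in its trainable parameters, which is why it is easy to optimise; the next step is letting the non-linearity itself have parameters.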
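And for the last two bullets: recursing the F' ∘ L ∘ F'' factorisation twice gives a small feedforward net, tanh(L2(tanh(L1(x)))) followed by a linear readout, whose linear maps are trained end-to-end by plain gradient descent. Backpropagation below is just the chain rule applied through each factor; the architecture and hyperparameters are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(x)

# parameters of the linear maps inside the composition
W1 = rng.normal(scale=0.5, size=(1, 32));  b1 = np.zeros(32)
W2 = rng.normal(scale=0.5, size=(32, 32)); b2 = np.zeros(32)
W3 = rng.normal(scale=0.5, size=(32, 1));  b3 = np.zeros(1)

lr = 0.02
for _ in range(5000):
    # forward: x -> tanh(L1) -> tanh(L2) -> L3
    h1 = np.tanh(x @ W1 + b1)
    h2 = np.tanh(h1 @ W2 + b2)
    out = h2 @ W3 + b3
    # backward: chain rule through each factor of the composition
    d_out = 2 * (out - y) / len(y)            # d(MSE)/d(out)
    dW3 = h2.T @ d_out;  db3 = d_out.sum(0)
    d_h2 = (d_out @ W3.T) * (1 - h2 ** 2)     # tanh' = 1 - tanh^2
    dW2 = h1.T @ d_h2;   db2 = d_h2.sum(0)
    d_h1 = (d_h2 @ W2.T) * (1 - h1 ** 2)
    dW1 = x.T @ d_h1;    db1 = d_h1.sum(0)
    # one generic gradient-descent step on every parameter
    for p, g in ((W1, dW1), (b1, db1), (W2, dW2),
                 (b2, db2), (W3, dW3), (b3, db3)):
        p -= lr * g

print(np.mean((out - y) ** 2))  # the loss drops: GD fits the composed model
```

No closed form or specialised solver applies here; the surprise in the last bullet is that this generic procedure works at all.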