Here is the abstract:
This note provides a family of classification problems, indexed by a positive integer k, where all shallow networks with fewer than exponentially (in k) many nodes exhibit error at least 1/6, whereas a deep network with 2 nodes in each of 2k layers achieves zero error, as does a recurrent network with 3 distinct nodes iterated k times. The proof is elementary, and the networks are standard feedforward networks with ReLU (Rectified Linear Unit) nonlinearities.
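For intuition, the deep construction in papers of this kind is typically built from a "tent map" computed by two ReLU nodes per layer; composing it k times produces a function with exponentially many oscillations, which is what a small shallow net cannot match. This is a minimal sketch of that idea (the exact weights here are my illustration, not necessarily the paper's):

```python
def relu(x):
    """Rectified Linear Unit: max(0, x)."""
    return max(0.0, x)

def tent(x):
    """One layer with 2 ReLU nodes computing the tent map on [0, 1]:
    rises from 0 to 1 on [0, 1/2], falls back to 0 on [1/2, 1]."""
    return 2 * relu(x) - 4 * relu(x - 0.5)

def deep_net(x, k):
    """k stacked tent layers (2 ReLU nodes each). The composition
    oscillates 2^(k-1) times on [0, 1], so matching it with one
    hidden layer requires exponentially many nodes."""
    for _ in range(k):
        x = tent(x)
    return x

print(deep_net(0.25, 2))  # 0.25 -> 0.5 -> 1.0
print(deep_net(0.5, 2))   # 0.5 -> 1.0 -> 0.0
```

Labeling points by whether this sawtooth is above or below 1/2 then gives a classification problem that the deep net solves exactly while shallow nets incur constant error.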
What evidence exists that the 'multiple levels of representation', which I understand to generally be multiple hidden layers of a neural network, actually correspond to 'levels of abstraction'?
2) I'm further confused by: "Deep learning is a kind of representation learning in which there are multiple levels of features. These features are automatically discovered and they are composed together in the various levels to produce the output. Each level represents abstract features that are discovered from the features represented in the previous level."
This implies to me that this is "unsupervised learning". Are deep learning nets all unsupervised? Most traditional neural nets are supervised.
2) Deep learning is really a term that denotes machine learning using models that attempt to abstract the data via multiple layers (popularly in artificial neural networks). Not all deep neural nets are unsupervised, but unsupervised pre-training [2] was a very popular approach [3] until dropout [4,5] (and its variations) appeared. See, for instance, some of the standard datasets [6] of the field, on some of which deep neural nets achieved state-of-the-art accuracy using supervised learning.
[0]: http://www.rsipvision.com/wp-content/uploads/2015/04/Slide6....
[1]: http://www.rsipvision.com/exploring-deep-learning/
[2]: https://www.youtube.com/watch?v=Oq38pINmddk
[3]: http://fastml.com/deep-learning-these-days/
[4]: http://arxiv.org/pdf/1207.0580.pdf
[5]: http://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf
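As a rough illustration of the dropout idea cited in [4,5]: during training each unit is zeroed with probability p, and in the common "inverted dropout" variant survivors are rescaled by 1/(1-p) so that no change is needed at test time. A minimal sketch (plain Python lists, not any particular framework's API):

```python
import random

def dropout(activations, p=0.5, training=True):
    """Inverted dropout on a list of activations.

    At training time, zero each unit with probability p and scale
    the survivors by 1/(1-p); at test time, pass values through
    unchanged."""
    if not training:
        return list(activations)
    return [0.0 if random.random() < p else a / (1 - p)
            for a in activations]
```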
I think the presentations by Yann LeCun and Léon Bottou are more interesting - and tend to involve more uncertainty and fewer pronouncements.
The problem is that a computer comes in without knowing anything about tangential phenomena. So it needs lots of data to catch up to me and my years of forming associative connections from other handwriting I've seen.
If I showed you alien (i.e., not human) handwriting samples, you'd probably struggle too.
It's because we use much better algorithms in our brains (compared to the ones we currently have in DL). Having "lots of data" allows us to get good results even while using inferior algorithms.
(didn't read it yet though, will do when I have time)