Otherwise, how precisely do you "skip" a layer without corrupting the training of lower layers?
Edit: the answer is in the definition of "skip layers", introduced in a previous paper (http://arxiv.org/abs/1512.03385), which adds identity functions to the layer equation. I guess I have more reading to do on this topic.
Deep Residual Networks are similar but different. There, you add the identity to the layer's output, something like g(x) = x + f(x).
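To make the g(x) = x + f(x) form concrete, here is a minimal sketch in plain Python. The branch f (one linear map plus ReLU), the function names, and the `keep` flag are all assumptions made for illustration, not anything taken from either paper. The point is only that when the branch is dropped, the block reduces to the identity, so the input still passes through to the next layer unchanged.

```python
def f(x, weights):
    # Hypothetical residual branch: one linear map followed by ReLU.
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]

def residual_block(x, weights, keep=True):
    # g(x) = x + f(x). With keep=False the branch is skipped, so g(x) = x.
    if not keep:
        return list(x)
    return [xi + fi for xi, fi in zip(x, f(x, weights))]

x = [1.0, -2.0]
weights = [[0.5, 0.0], [0.0, 0.5]]  # illustrative values only

print(residual_block(x, weights))              # branch active
print(residual_block(x, weights, keep=False))  # branch skipped, identity only
```

Because the identity path is always present, skipping the branch does not cut the signal to or from the lower layers; it only removes that block's contribution.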
Instead, the depth seems to be giving something like a progressive unwinding of the feature space.
It would be interesting to compare the trained networks to networks trained in the usual way, to see if they're coming up with similar coefficients in spite of the different training methods, or if this is producing something completely different.
It's a nice - if somewhat controversial - summary.
40% speedup on DNN training with state-of-the-art results.
Right now I basically run N architectures on N GPUs at the same time to speed things up. And that's a luxury.