Deep Networks with Stochastic Depth

(arxiv.org)

70 points · nicklo · 10y ago · 8 comments

8 comments · 5 top-level

radarsat1 · 10y ago · 2 in thread

Having not read the paper, something I find unclear: is it only the feedback path that is skipped, or is the feedforward path also skipped? The abstract mentions replacing the layer with an identity function. I'm not sure how this would work: wouldn't it change the result (i.e., corrupt the encoding used by the following layer) if you just multiply the inputs by 1 and add them?

Otherwise, how precisely do you "skip" a layer without corrupting the training of lower layers?

Edit: the answer is in the definition of "skip layers", introduced in a previous paper (http://arxiv.org/abs/1512.03385), which introduces identity functions into the layer equation. I guess I have more reading to do on this topic.

albertzeyer · 10y ago

The full layer is skipped, i.e. replaced by the identity. Something like g(x) = switch(prob, x, f(x)).

Deep Residual Networks are similar but different. There, you add the identity. Something like g(x) = x + f(x).
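To make the two formulas above concrete, here is a minimal sketch in plain Python. It assumes, as in the paper, that the skipped unit is a residual block, so skipping it leaves only the identity shortcut, and that at test time every block is kept with f scaled by its survival probability. The names `residual_block`, `stochastic_depth_block`, `survival_prob` and `toy_f` are made up for illustration, and the ReLU the paper applies after the addition is left out to keep it short.

```python
import random


def residual_block(x, f):
    """Deep Residual Network style block: g(x) = x + f(x)."""
    return [xi + fi for xi, fi in zip(x, f(x))]


def stochastic_depth_block(x, f, survival_prob, training, rng=random):
    """Residual block that is randomly replaced by the identity in training.

    survival_prob is the probability that the block is kept (assumed name).
    """
    if training:
        if rng.random() < survival_prob:
            return residual_block(x, f)  # block kept: g(x) = x + f(x)
        return list(x)                   # block skipped: g(x) = x
    # Test time: no randomness, f is scaled by its survival probability.
    return [xi + survival_prob * fi for xi, fi in zip(x, f(x))]


def toy_f(x):
    """Hypothetical stand-in for the block's conv/BN/ReLU layers."""
    return [0.1 * xi for xi in x]


if __name__ == "__main__":
    x = [1.0, 2.0]
    print(stochastic_depth_block(x, toy_f, survival_prob=0.8, training=True))
    print(stochastic_depth_block(x, toy_f, survival_prob=0.8, training=False))
```

Note that in this form the next layer never receives "nothing": it gets either x or x plus a correction, which is part of why skipping is less disruptive than it first sounds.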

radarsat1 · 10y ago

Yes, but my question was more: when the layer is "skipped", what happens to the input for the next layer? Clearly it is designed such that the identity function still somehow provides useful information to the next layer (i.e., it doesn't significantly transform its domain and range). I was wondering how this could work. Intuitively, I would think that the next layer is being trained on a specific transformation performed by the skipped layer, so I still don't fully understand how replacing a whole layer with the identity function doesn't completely mess up the training of all subsequent layers. But maybe the secret is that it only lasts for a small number of iterations, and perhaps this short-lived deviation actually helps inject some minima-escaping trajectory. (I have read that injecting random noise can have similar effects. Is this just a different kind of random noise?)

sdenton4 · 10y ago · 1 in thread

It's kind of bonkers that this works. It suggests that the whole belief that layers are learning different representations is completely wrong: if layer three is expecting a certain kind of intermediate representation from layer two, and is then given the raw input, one would expect layer three to choke.

Instead, the depth seems to be giving something like a progressive unwinding of the feature space.

It would be interesting to compare the trained networks to networks trained in the usual way, to see if they're coming up with similar coefficients in spite of the different training methods, or if this is producing something completely different.

albertzeyer · 10y ago

Note that this was done for 100-1000 layer depth, so each individual layer only slightly increases the "high-levelness" of the features. In the same sense, that is why Deep Residual Networks work: initially, all layers are close to the identity.
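A toy way to see the "close to identity" point, as a rough sketch only: stack many scalar residual blocks that each add a small correction, then drop one and compare. This is not a trained network; the block `x + scale * tanh(x)`, the value of `scale`, and the helper name `run_stack` are all assumptions chosen for illustration.

```python
import math


def run_stack(x, num_blocks, scale, skipped=()):
    """Apply num_blocks toy residual blocks x <- x + scale * tanh(x).

    Blocks whose index is in `skipped` are replaced by the identity.
    """
    for i in range(num_blocks):
        if i in skipped:
            continue  # skipped block: x passes through unchanged
        x = x + scale * math.tanh(x)
    return x


if __name__ == "__main__":
    full = run_stack(0.5, num_blocks=100, scale=0.001)
    one_dropped = run_stack(0.5, num_blocks=100, scale=0.001, skipped={50})
    # Relative change in the final value caused by dropping a single block.
    print(abs(full - one_dropped) / abs(full))
```

When each block only nudges its input, the layer after a skipped block sees almost the same value it would have seen anyway, so it has little to "choke" on.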

nl · 10y ago

I suspect this was posted because of Delip Rao's write-up[1] (which I suggest might be a better link).

It's a nice - if somewhat controversial - summary.

40% speedup on DNN training with state-of-the-art results.

[1] http://deliprao.com/archives/134

romaniv · 10y ago

I'm reading Delip's follow-up post[1] and it reminds me how much of the ANN stuff is still pretty much alchemy.

[1] http://deliprao.com/archives/137

karterk · 10y ago

This is literally one of the most exciting papers I have read recently, and it will have quite some impact on deep learning models. The major drawback of deep architectures today is training time, and any improvement to that will have a drastic effect on my productivity.

Right now I basically run N architectures on N GPUs at the same time to speed things up. And that's a luxury.
