Understanding deep learning requires rethinking generalization

(arxiv.org)

131 points · visionscaper · 9y ago · 18 comments

bmh_ca · 9y ago · 3 in thread

I remember an AI professor I had once asked the class to define "the number 3".

The answer he chose, which stuck with me (if I recall the nuance correctly), is that the number three is: the set of all things in the universe of which there are three; three is that which they have in common.

Where it became interesting for me is observing our children growing up, especially learning colours and shapes. They exhibited a pattern of learning based upon observations of common patterns in communication by vocalization.

For example, children decided things were "red" based upon that trait being in-common with other things we called red. Circles based upon other things we call circles.

It's really quite a fascinating phenomenon to observe in children, and I expect there is a key atomicity of association from which more complex patterns - up to consciousness - can be created. Too fine grained and the patterns will be noise; too large and certain higher order structures will never form - a "Goldilocks" zone for the complex system of interpreting reality by observational exposure and initially arbitrary relation.

TheOtherHobbes · 9y ago

The prof's definition begs the question.

Humans can be observed recognising patterns, but that doesn't say anything useful about the process of recognition.

It seems we have experiences first, and generalise from them later in terms of our experience. So we have common experiences of threeness, circleness, and redness, and from them we generalise what "three", "circle", and "red" means.

But what defines an experience of threeness? Is it really atomic, or is it made of further component parts/relationships? Is it learned, or innate? How does the labelling process influence the experience/generalisation process? (I'm not sure if it's an anthropological myth, but supposedly there are primitive tribes where counting goes "One, two, many..." What's their experience of threeness?)

The exact mechanics of this pattern recognition remain mysterious. It turns out that when we build NNs to recognise patterns, the process is still mysterious. Which seems mysterious in itself. It's remarkable that we have these tools and they seem to work to an extent, but really no one understands why.

alimw · 9y ago

> The prof's definition begs the question.

It was good enough for Russell so I doubt that. See https://en.wikipedia.org/wiki/Set-theoretic_definition_of_na...

nl · 9y ago

There is some evidence (e.g. in Tversky & Kahneman) that human judgement is mostly about the differences between sets, not the similarities.

We seem to learn to generalise, and then compare differences.

dkarapetyan · 9y ago · 2 in thread

> Brute-force memorization is typically not thought of as an effective form of learning. At the same time, it’s possible that sheer memorization can in part be an effective problem-solving strategy for natural tasks.

I like the conclusion. Basically neural nets are just beasts with too many parameters, and the authors even show you don't need that many parameters to fit any data set of size n. This is one reason I think neural nets are kinda a dead end. People don't understand them, it is impossible to get any explanatory results from them, and based on these results that kinda makes sense. Neural nets don't learn, they just memorize.

hiddencost · 9y ago

https://arxiv.org/abs/1311.2901

Check out figure 2. The network learns composition from fundamental shapes and gradients to compositional ones. It's kinda awe inspiring.

Not NN specific, but some more work on explanations: https://homes.cs.washington.edu/~marcotcr/blog/lime/

Visualization of attention mechanisms is pretty cool for explanations: https://arxiv.org/pdf/1502.03044.pdf http://torch.ch/blog/2015/09/21/rmva.html

dkarapetyan · 9y ago

The local linear approximation is a cool idea in this context, even though most non-linear systems are modeled in exactly this way. You take a complicated thing and linearize it to understand it. I'll have to look further into lime.
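A rough sketch of the "linearize a complicated thing around one point" idea from the comment above. This is plain finite-difference linearization, not LIME's actual procedure (LIME fits a weighted linear model to perturbed samples), and `black_box` is a made-up stand-in for a trained model, chosen only for illustration.

```python
import math

def numerical_gradient(f, x0, h=1e-5):
    """Central-difference estimate of the gradient of f at the point x0."""
    grad = []
    for i in range(len(x0)):
        up, down = list(x0), list(x0)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

def linearize(f, x0, h=1e-5):
    """Return the local model x -> f(x0) + grad . (x - x0) and the gradient."""
    f0 = f(x0)
    grad = numerical_gradient(f, x0, h)

    def linear_model(x):
        return f0 + sum(g * (xi - ai) for g, xi, ai in zip(grad, x, x0))

    return linear_model, grad

# Hypothetical "complicated" model (assumption, not from the thread).
def black_box(x):
    return math.sin(x[0]) + x[1] ** 2

x0 = [0.0, 1.0]
local_model, grad = linearize(black_box, x0)

# The gradient entries act as per-feature "explanations" near x0.
nearby = [0.05, 1.05]
approx_error = abs(local_model(nearby) - black_box(nearby))
```

The approximation is only trustworthy close to `x0`; further away the error grows, which is why this kind of explanation is local.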

gambler · 9y ago · 1 in thread

Good to see someone testing the limits of neural nets, rather than just squeezing a few percent of performance on an artificial benchmark.

That said, is this result really all that surprising? Especially given the results demonstrated in that paper on fooling DNNs from 2015 and visualization experiments a-la Deep Dream.

Unless you believe in networks "painting" stuff, Deep Dream demonstrated that neural networks capture and store certain chunks of their training data and you can get those back out if you're clever enough.

That other paper[1] demonstrated that a trained DNN can classify noise as a particular label with very high confidence, as long as you construct that noise carefully enough. This hints at the fact that DNNs may do matching by applying some complex transformation that usually results in the correct answer, but does not necessarily capture the underlying patterns. (Kind of like guessing about the weather by telltale signs, without knowing anything about air pressure, currents and so on.)

[1] - http://www.evolvingai.org/fooling

dbecker · 9y ago

Adversarial noise isn't specific to deep learning models. Most models working on high-dimensional input will misclassify noise with high confidence if you construct the noise right.

The original "adversarial pixel" paper demonstrates this with logistic regression.

https://arxiv.org/pdf/1412.6572v3.pdf
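A minimal sketch of why a plain logistic regression is vulnerable in high dimensions, in the spirit of the sign-of-gradient perturbation discussed in the linked paper. The weights, the input, `eps`, and the dimension are all made up for illustration; nothing here is trained.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(w, b, x):
    """Probability of class 1 under logistic regression."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def sign(v):
    return (v > 0) - (v < 0)

def sign_gradient_attack(w, x, y, eps):
    """Shift every coordinate by eps in the direction that raises the
    cross-entropy loss for label y. For logistic regression the input
    gradient is (p - y) * w; (p - y) is negative for y = 1 and positive
    for y = 0, so only sign(w) is needed."""
    direction = -1 if y == 1 else 1
    return [xi + eps * direction * sign(wi) for xi, wi in zip(x, w)]

rng = random.Random(0)
dim = 1000                                      # assumption: high-dimensional input
w = [rng.gauss(0.0, 1.0) for _ in range(dim)]   # made-up weights, not trained
b = 0.0
x = [rng.random() for _ in range(dim)]          # noise "image", values in [0, 1)

p_clean = predict_proba(w, b, x)
label = 1 if p_clean > 0.5 else 0               # attack the model's own prediction
x_adv = sign_gradient_attack(w, x, label, eps=0.1)  # values are not clipped to [0, 1]
p_adv = predict_proba(w, b, x_adv)
max_change = max(abs(a - c) for a, c in zip(x_adv, x))
```

Each coordinate moves by at most `eps`, but the logit moves by `eps * sum(|w_i|)`, which grows with the input dimension. That is the sense in which the effect is about high-dimensional inputs rather than about depth.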

AlexCoventry · 9y ago · 1 in thread

We discussed this paper in our reading group last week[0]. I think the key to understanding what's going on here is figure 1(a). The fastest learning happens with true labels, and the slowest with random labels. Shuffled pixels is the second fastest. I believe the reason this is happening is that given training data composed of structured images, the convolutional architecture heavily favors learning filters which reflect geometric features, as opposed to random filters which can memorize the data. This results in fastest learning with the true labels because the geometric features correspond to the learning target, but for memorizing random labels, geometric features have lower capacity than random filters. On the other hand, it learns shuffled pixels pretty fast because the convolutional architecture makes it easy to capture a color histogram and learn off that.

[0] This week we discussed the Alpha Go paper. URL for that, although we don't generally advertise our meetings unless we think there's going to be broad interest: https://www.meetup.com/Cambridge-Artificial-Intelligence-Mee...
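One small check of the color-histogram point above: permuting pixel positions destroys the spatial layout but leaves the color histogram untouched, so any signal carried by the histogram survives shuffling. The tiny random "image" below is a hypothetical stand-in, not data from the paper.

```python
import random
from collections import Counter

def color_histogram(pixels):
    """Count how often each (r, g, b) value occurs, ignoring position."""
    return Counter(pixels)

rng = random.Random(0)

# Hypothetical flattened 8x8 image with 4 intensity levels per channel.
image = [(rng.randrange(4), rng.randrange(4), rng.randrange(4)) for _ in range(64)]

# One fixed permutation of pixel positions, as in a "shuffled pixels" setup.
permutation = list(range(len(image)))
rng.shuffle(permutation)
shuffled = [image[i] for i in permutation]

same_histogram = color_histogram(image) == color_histogram(shuffled)
```

This only shows that the histogram is available as a feature after shuffling. Whether the network actually relies on it is the part that would still need testing.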

argonaut · 9y ago

> I believe the reason this is happening is that given training data composed of structured images, the convolutional architecture heavily favors learning filters which reflect geometric features, as opposed to random filters which can memorize the data

An interesting idea, but you really need to test this out before assigning any particular confidence to this actually being what is happening.

maxander · 9y ago · 1 in thread

My halfway informed interpretation, just from the abstract- it turns out that modern image-recognition networks are capable of learning labels randomly assigned to sets of random images, which means that it's still mysterious why they learn labels with intelligible meaning when given non-random images (rather than just memorizing the training set via some nonsense model.)

I'd guess the resolution would have to involve an ordering over possible models, where (for well-designed networks) intelligible models are preferred over unintelligible ones. Filing this away to read later.

sgt101 · 9y ago

I think that it depends what you mean by "learning". Creating a mapping to a training set where you have a sufficiently expressive representation is trivial - simply create a list, and then find the most efficient representation of the list; zip it for example. The point of learning is to generalize from the training set to unseen examples via the representation and figure 1 of the paper shows (part c) that this is not what is claimed for corrupted or randomized examples in this paper. How the non generalised claims then extend to the discussions in section 2 leaves me flailing, but that's not unusual! However, my thought is that the difficulty is measurement of the structural risk of a deep network where network weights very subtly encode information, or don't depending on the network. Perhaps sweeping the networks and then setting weights below a threshold to 0 and measuring the generalisation error impact would be a way to measure what in the network is useful encoding and what isn't? The rest of the network could be easily "measured" as bits to encode?
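A toy sketch of the measurement proposed above: zero out weights below a threshold and compare held-out error before and after. The linear "model", its hard-coded weights, the target function, and the threshold are all assumptions made for illustration; a real experiment would sweep thresholds on an actual trained network.

```python
import random

def predict(weights, x):
    return sum(w * xi for w, xi in zip(weights, x))

def mean_squared_error(weights, xs, ys):
    return sum((predict(weights, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def prune(weights, threshold):
    """Zero out every weight whose magnitude is below the threshold."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]

rng = random.Random(0)

# Held-out data from a made-up target: y = 3*x0 - 2*x1, other inputs irrelevant.
held_out_x = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(200)]
held_out_y = [3 * x[0] - 2 * x[1] for x in held_out_x]

# Hard-coded stand-in for trained weights (not produced by real training).
trained = [2.9, -2.1, 0.01, -0.02, 0.005]
pruned = prune(trained, threshold=0.1)

error_before = mean_squared_error(trained, held_out_x, held_out_y)
error_after = mean_squared_error(pruned, held_out_x, held_out_y)
kept = sum(1 for w in pruned if w != 0.0)
```

If held-out error barely moves after pruning, the removed weights were not carrying much useful information, and `kept` gives a crude count of what remains to be "measured".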

yazr · 9y ago · 1 in thread

> number of parameters exceeds the number of data points as it usually does in practice

I don't get this part.

In reality, isn't the dataset much larger than the parameters of the neural net?

hiddencost · 9y ago

They're not saying "size of the data set in bits", they're saying "number of items in the dataset". In speech and image recognition, it's normal to have more parameters than data points. This is a bit old, although it's still a very good architecture, but: GoogLeNet [0] has around 10M parameters, and was trained on 1.2M images.

In fact, the 2016 winner of a bunch of the ILSVRC challenges [1,2] was topologically basically the same as GoogLeNet.

EDIT: There's a perspective on machine learning which is basically just: "what if your model learns a hash-map". Check out Vapnik-Chervonenkis dimension.

[0] https://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf [1] https://arxiv.org/pdf/1601.05150v2.pdf [2] http://image-net.org/challenges/LSVRC/2016/results (CUImage)
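To make the "what if your model learns a hash-map" perspective concrete, here is a minimal sketch of a model that is literally a lookup table. The data is random and the labels carry no signal, so this is an illustration of pure memorization, not a reproduction of anything in the paper; the sizes and the `default` fallback are arbitrary assumptions.

```python
import random

def fit_lookup(xs, ys):
    """Memorize every (input, label) pair; one table entry per data point."""
    return dict(zip(xs, ys))

def predict(table, x, default):
    """Return the memorized label, or a fixed fallback for unseen inputs."""
    return table.get(x, default)

def accuracy(table, xs, ys, default=0):
    hits = sum(predict(table, x, default) == y for x, y in zip(xs, ys))
    return hits / len(xs)

rng = random.Random(0)

def make_points(n, dim=8):
    return [tuple(rng.random() for _ in range(dim)) for _ in range(n)]

def random_labels(n, classes=10):
    return [rng.randrange(classes) for _ in range(n)]

train_x, train_y = make_points(500), random_labels(500)
test_x, test_y = make_points(500), random_labels(500)

table = fit_lookup(train_x, train_y)
train_acc = accuracy(table, train_x, train_y)
test_acc = accuracy(table, test_x, test_y)
```

The table fits the random training labels perfectly while having nothing to say about unseen points, which is the gap between fitting and generalizing that capacity measures such as VC dimension try to bound.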

miles7 · 9y ago · 1 in thread

Is it possible that although neural nets can overfit as this paper shows, practitioners just stop training early before this happens? And/or they use a validation set? Would that be enough to explain the good generalization despite the huge number of parameters?
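For reference, the early-stopping rule asked about above can be sketched as follows. The `patience` parameter and the list of validation losses are hypothetical; in practice the losses would come from evaluating a held-out validation set after each training epoch.

```python
def early_stopping(val_losses, patience):
    """Scan per-epoch validation losses and stop once `patience` epochs
    have passed without a new best. Returns (best_epoch, stop_epoch).
    Assumes a non-empty list of finite losses."""
    best_epoch, best_loss = None, float("inf")
    for epoch, loss in enumerate(val_losses):
        if best_epoch is None or loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Made-up validation curve: improves, then starts to overfit.
losses = [1.0, 0.8, 0.6, 0.65, 0.7, 0.75, 0.5]
best_epoch, stop_epoch = early_stopping(losses, patience=2)
```

With this curve, training stops at epoch 4 and the weights from epoch 2 would be kept. The later drop to 0.5 is never seen, which shows how sensitive the rule is to the choice of `patience`.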

argonaut · 9y ago

It doesn't explain why it learns anything in the first place (e.g. why it doesn't just overfit from the start; or learn some weak signals, then start overfitting).
