The answer he chose, which stuck with me (if I recall the nuance correctly), is that the number three is the set of all things in the universe of which there are three; three is that which they have in common.
Where it became interesting for me was observing our children growing up, especially learning colours and shapes. They exhibited a pattern of learning based upon observations of common patterns in communication by vocalization.
For example, children decided things were "red" based upon that trait being in common with other things we called red, and "circles" based upon other things we call circles.
It's really quite a fascinating phenomenon to observe in children, and I expect there is a key atomicity of association from which more complex patterns - up to consciousness - can be created. Too fine-grained and the patterns will be noise; too coarse and certain higher-order structures will never form - a "Goldilocks" zone for the complex system of interpreting reality by observational exposure and initially arbitrary relation.
Humans can be observed recognising patterns, but that doesn't say anything useful about the process of recognition.
It seems we have experiences first, and generalise from them later. So we have common experiences of threeness, circleness, and redness, and from them we generalise what "three", "circle", and "red" mean.
But what defines an experience of threeness? Is it really atomic, or is it made of further component parts/relationships? Is it learned, or innate? How does the labelling process influence the experience/generalisation process? (I'm not sure if it's an anthropological myth, but supposedly there are primitive tribes where counting goes "One, two, many..." What's their experience of threeness?)
The exact mechanics of this pattern recognition remain mysterious. It turns out that when we build NNs to recognise patterns, the process is still mysterious. Which seems mysterious in itself. It's remarkable that we have these tools and they seem to work to an extent, but really no one understands why.
It was good enough for Russell so I doubt that. See https://en.wikipedia.org/wiki/Set-theoretic_definition_of_na...
We seem to learn to generalise, and then compare differences.
I like the conclusion. Basically neural nets are just beasts with too many parameters, and the authors even show you don't need that many parameters to fit any data set of size n. This is one reason I think neural nets are kinda a dead end. People don't understand them, it is impossible to get any explanatory results from them, and given these results that kinda makes sense. Neural nets don't learn, they just memorize.
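To make the "you don't need that many parameters" point concrete, here is a minimal sketch of the kind of construction I believe is being referred to: a two-layer ReLU network with 2n + d weights that fits n arbitrary labels in d dimensions exactly. The function names are mine, and this is an illustration under the assumption that the inputs are distinct, not the authors' code.

```python
import random


def fit_two_layer_relu(Z, y, seed=0):
    """Fit n arbitrary labels exactly with a width-n ReLU hidden layer.

    Parameters used: a (d weights), b (n biases), w (n output weights),
    i.e. 2n + d numbers in total.
    """
    rng = random.Random(seed)
    d = len(Z[0])
    # Random projection direction. Distinct inputs get distinct
    # projections with probability 1 (assumed here, not checked).
    a = [rng.gauss(0.0, 1.0) for _ in range(d)]
    proj = [sum(ai * zi for ai, zi in zip(a, z)) for z in Z]

    # Sort the samples by their projection.
    order = sorted(range(len(Z)), key=lambda i: proj[i])
    xs = [proj[i] for i in order]
    ys = [y[i] for i in order]

    # Interleave the biases: b[0] < xs[0] < b[1] < xs[1] < ...
    b = [xs[0] - 1.0] + [(lo + hi) / 2.0 for lo, hi in zip(xs, xs[1:])]

    # relu(xs[i] - b[j]) is zero for j > i, so the linear system for w
    # is lower triangular and forward substitution solves it.
    w = []
    for i in range(len(xs)):
        partial = sum(w[j] * (xs[i] - b[j]) for j in range(i))
        w.append((ys[i] - partial) / (xs[i] - b[i]))
    return a, b, w


def predict(z, a, b, w):
    """Network output: sum_j w[j] * relu(<a, z> - b[j])."""
    p = sum(ai * zi for ai, zi in zip(a, z))
    return sum(wj * max(p - bj, 0.0) for wj, bj in zip(w, b))
```

Feeding it random inputs with coin-flip labels reproduces every training label, which shows only that the network can represent any labeling. It says nothing about what gradient descent actually finds on real data.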
Check out figure 2. The network learns composition, going from fundamental shapes and gradients to compositional ones. It's kinda awe-inspiring.
Not NN specific, but some more work on explanations: https://homes.cs.washington.edu/~marcotcr/blog/lime/
Visualization of attention mechanisms is pretty cool for explanations: https://arxiv.org/pdf/1502.03044.pdf http://torch.ch/blog/2015/09/21/rmva.html
That said, is this result really all that surprising? Especially given the results demonstrated in that paper on fooling DNNs from 2015 and visualization experiments à la Deep Dream.
Unless you believe in networks "painting" stuff, Deep Dream demonstrated that neural networks capture and store certain chunks of their training data and you can get those back out if you're clever enough.
That other paper[1] demonstrated that a trained DNN can classify noise as a particular label with very high confidence, as long as you construct that noise carefully enough. This hints at the fact that DNNs may do matching by applying some complex transformation that usually results in the correct answer, but does not necessarily capture the underlying patterns. (Kind of like guessing about the weather by telltale signs, without knowing anything about air pressure, currents and so on.)
The original "adversarial pixel" paper demonstrates this with logistic regression.
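A rough sketch of why a plain linear model is already vulnerable: nudging every input coordinate by the same small amount in the direction of the sign of its weight shifts the logit by eps times the L1 norm of the weights, which grows with dimension. The helper names are made up for illustration and this is not code from the paper.

```python
import math
import random


def logit(w, b, x):
    """Linear score of a logistic regression model."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict_proba(w, b, x):
    """Probability of the positive class."""
    return 1.0 / (1.0 + math.exp(-logit(w, b, x)))


def sign(v):
    return (v > 0) - (v < 0)


def flip_with_sign_perturbation(w, b, x, margin=2.0):
    """Return (x_adv, eps): the smallest uniform per-coordinate step
    along sign(w) that moves the logit to `margin` on the other side
    of the decision boundary."""
    current = logit(w, b, x)
    direction = -1.0 if current > 0 else 1.0
    l1 = sum(abs(wi) for wi in w)
    # Each coordinate moves by eps, so the logit moves by eps * l1.
    eps = (abs(current) + margin) / l1
    x_adv = [xi + direction * eps * sign(wi) for xi, wi in zip(x, w)]
    return x_adv, eps


if __name__ == "__main__":
    rng = random.Random(0)
    w = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    x = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    x_adv, eps = flip_with_sign_perturbation(w, 0.0, x)
    print("per-coordinate change:", eps)
    print("before:", predict_proba(w, 0.0, x))
    print("after: ", predict_proba(w, 0.0, x_adv))
```

The weights here are random rather than trained, so this only shows the mechanism: with many input dimensions, a per-coordinate change that is small relative to the input scale is enough to flip the label.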
[0] This week we discussed the AlphaGo paper. URL for that, although we don't generally advertise our meetings unless we think there's going to be broad interest: https://www.meetup.com/Cambridge-Artificial-Intelligence-Mee...
An interesting idea, but you really need to test this out before assigning any particular confidence to this actually being what is happening.
I'd guess the resolution would have to involve an ordering over possible models, where (for well-designed networks) intelligible models are preferred over unintelligible ones. Filing this away to read later.
I don't get this part.
In reality, isn't the dataset much larger than the number of parameters of the neural net?
In fact, the 2016 winner of a bunch of the ILSVRC challenges [1,2] was topologically basically the same as GoogLeNet.
EDIT: There's a perspective on machine learning which is basically just: "what if your model learns a hash-map". Check out Vapnik-Chervonenkis dimension.
[0] https://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf
[1] https://arxiv.org/pdf/1601.05150v2.pdf
[2] http://image-net.org/challenges/LSVRC/2016/results (CUImage)
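The "what if your model learns a hash-map" perspective from the EDIT above can be sketched quite literally. The class below is a hypothetical toy, not any library's API: it memorizes every training pair, so it fits any labeling of distinct inputs (which is why the VC dimension of such a lookup table is unbounded), and it falls back to a fixed default on anything it has not seen.

```python
import random


class LookupTableClassifier:
    """Toy 'model' that is nothing but a hash-map from inputs to labels."""

    def __init__(self, default_label=0):
        self.table = {}
        self.default_label = default_label

    def fit(self, X, y):
        # Memorize every (input, label) pair verbatim.
        for x, label in zip(X, y):
            self.table[tuple(x)] = label
        return self

    def predict(self, X):
        # Exact-match lookup; unseen inputs get the default label.
        return [self.table.get(tuple(x), self.default_label) for x in X]


if __name__ == "__main__":
    rng = random.Random(0)
    X_train = [(rng.random(), rng.random()) for _ in range(100)]
    y_train = [rng.randint(0, 1) for _ in range(100)]  # pure noise labels
    model = LookupTableClassifier().fit(X_train, y_train)
    print("train labels reproduced:", model.predict(X_train) == y_train)
    print("unseen point:", model.predict([(2.5, 0.5)]))
```

Zero training error on random labels is trivial for this model, so training error alone cannot distinguish it from something that has learned a pattern. Only behaviour on unseen inputs does.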